Is there a service/feature to read text content from URL?

Sara Albashtawi 0 Reputation points
2025-01-20T14:33:10.3433333+00:00

I have an existing RAG solution that takes data from a blob container and indexes it; when the index is queried, the results are passed to an OpenAI model to ground its responses. Now I want a low-code/no-code way to pass in a webpage URL, get the textual content of that page (basically scrape it), pass that text to the OpenAI model to produce a summary/report about the page, and then save the report into the index I already have, so that it becomes additional information in my RAG solution.

I have tried the Use Your Data feature for URLs, but it apparently takes the URL, reads the data, and creates a new AI Search index. This is not what I'm aiming for: my goal is to read the data, create a document from it, and then add that document to an existing index.

Is there any service/feature on Azure that would help me achieve this?

Azure OpenAI Service

1 answer

  1. Saideep Anchuri 1,370 Reputation points Microsoft Vendor
    2025-01-21T05:46:10.7133333+00:00

    Hi Sara Albashtawi

    Welcome to Microsoft Q&A Forum, thank you for posting your query here!

    You can also index content from an external website using Azure Cognitive Search (now named Azure AI Search).

    There are also several solutions available in the Azure Marketplace that can help you scrape content from a website URL. One such solution is the "Web Data Connector" by DataChant, which is a Power BI custom connector that allows you to extract data from web pages and import it into Power BI.

    Another is the "Web Scraping" offering by Scrapinghub, a cloud-based web scraping platform that lets you extract data from websites at scale.

    Once the scraped content is saved to blob storage, you can build indexes from it, but make sure to preprocess the scraped data (for example, reduce the HTML to plain text) so that it can be indexed properly.

    Reference thread: external-webs

    Thank You
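
For readers who end up scripting this rather than using a marketplace tool, the preprocessing and "add a document to an existing index" steps discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions, not an official Azure sample: the field names `id`, `content`, and `source_url` are hypothetical and must match your own index schema, and fetching the page and calling the OpenAI model are left out. The payload shape (`value` list with `@search.action`) follows the Azure AI Search "index documents" REST operation; it would be sent as a POST to your service's `/indexes/<index-name>/docs/index` endpoint with an `api-key` header and the API version you use.

```python
import hashlib
import json
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores script/style/noscript bodies."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind by the markup.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def html_to_text(html):
    """Reduce an HTML string to plain text suitable for indexing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_upload_payload(url, summary):
    """Build a request body that merges or uploads one document.

    The key is a SHA-256 hex digest of the URL, so it contains only
    characters that are valid in a document key, and re-running the
    same URL updates the existing document instead of duplicating it.
    """
    doc_id = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                "id": doc_id,        # assumed key field name
                "content": summary,  # assumed searchable field name
                "source_url": url,   # assumed field name
            }
        ]
    }


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{}</style></head><body>"
        "<h1>Title</h1><p>Hello <b>world</b></p>"
        "<script>var x = 1;</script></body></html>"
    )
    page_text = html_to_text(sample)
    # In the real flow, page_text would go to the model and the
    # model's summary would be stored instead of the raw text.
    payload = build_upload_payload("https://example.com/page", page_text)
    print(json.dumps(payload, indent=2))
```

Using `mergeOrUpload` with a key derived from the URL is a design choice for this sketch: it keeps the existing index and its other documents untouched while letting you refresh a page's report later.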

