Use Azure AI services to automate document identification, classification, and natural language processing search

Azure Functions
Azure OpenAI Service
Azure AI services
Azure AI Search
Azure AI Document Intelligence

This article describes an architecture that you can use to process various documents. The architecture uses the durable functions feature of Azure Functions to implement pipelines. The pipelines process documents via Azure AI Document Intelligence for document splitting, named entity recognition (NER), and classification. Document content and metadata are then used for Retrieval-Augmented Generation (RAG)-based natural language processing.

Architecture

Diagram that shows an architecture to identify, classify, and search documents.

Download a Visio file of this architecture.

Workflow

  1. A user uploads a document file to a web app. The file contains multiple embedded documents of various types, such as PDF or multipage Tagged Image File Format (TIFF) files. The document file is stored in Azure Blob Storage (1a). To initiate pipeline processing, the web app adds a command message to a Service Bus queue (1b).

  2. The command message triggers the durable functions orchestration. The message contains metadata that identifies the Blob Storage location of the document file to be processed. Each durable functions instance processes only one document file.

  3. The analyze activity function calls the Document Intelligence Analyze Document API, which passes the storage location of the document file to be processed. The analyze function reads and identifies each document within the document file. This function returns the name, type, page ranges, and content of each embedded document to the orchestration.

  4. The metadata store activity function saves the document type, location, and page range information for each document in an Azure Cosmos DB store.

  5. The embedding activity function uses Semantic Kernel to chunk each document and create embeddings for each chunk. Embeddings and associated content are sent to Azure AI Search and stored in a vector-enabled index. A correlation ID is also added to the search document so that the search results can be matched with the corresponding document metadata from Azure Cosmos DB.

  6. Semantic Kernel retrieves embeddings from the Azure AI Search vector store for natural language processing (NLP).

  7. Users chat with their data in natural language, with answers grounded in content retrieved from the vector store. To look up the corresponding document records in Azure Cosmos DB, the app uses the correlation IDs included in the search result set. The records include links to the original document file in Blob Storage.
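The chunking, embedding, and correlation logic from steps 5 through 7 can be sketched in plain Python. This is a stand-in, not the Semantic Kernel API: the fixed-size chunker, the hash-derived embedding, and the field names in the search document are all illustrative assumptions.

```python
import hashlib
import uuid

def chunk_text(text, size=200, overlap=20):
    # Split document content into fixed-size, overlapping chunks.
    # (Semantic Kernel provides its own text splitters; this is a stand-in.)
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

def embed(chunk):
    # Placeholder embedding: a real pipeline would call an Azure OpenAI
    # embedding model. A hash-derived vector keeps the sketch self-contained.
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:8]]

def index_document(doc_content, correlation_id):
    # Build the search documents that would be pushed to the
    # vector-enabled Azure AI Search index (step 5).
    return [
        {
            "id": str(uuid.uuid4()),
            "correlationId": correlation_id,  # joins back to the Azure Cosmos DB record
            "content": chunk,
            "contentVector": embed(chunk),
        }
        for chunk in chunk_text(doc_content)
    ]

docs = index_document("Lease agreement between ..." * 50, correlation_id="doc-001")
```

Because every chunk carries the same `correlationId`, a search hit on any chunk can be joined back to the single metadata record for its source document.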

Components

  • Durable functions is a feature of Azure Functions that you can use to write stateful functions in a serverless compute environment. In this architecture, a message in a Service Bus queue triggers a durable functions instance, which initiates and orchestrates the document-processing pipeline.

  • Azure Cosmos DB is a globally distributed, multi-model database that you can use in your solutions to scale throughput and storage capacity across any number of geographic regions. Comprehensive service-level agreements (SLAs) guarantee throughput, latency, availability, and consistency. This architecture uses Azure Cosmos DB as the metadata store for the document classification information.

  • Azure Storage is a set of massively scalable and secure cloud services for data, apps, and workloads. It includes Blob Storage, Azure Files, Azure Table Storage, and Azure Queue Storage. This architecture uses Blob Storage to store the document files that the user uploads and that the durable functions pipeline processes.

  • Azure Service Bus is a fully managed enterprise message broker with message queues and publish-subscribe topics. This architecture uses Service Bus to trigger durable functions instances.

  • Azure App Service provides a framework to build, deploy, and scale web apps. The Web Apps feature of App Service is an HTTP-based tool that you can use to host web applications, REST APIs, and mobile back ends. Use Web Apps to develop in .NET, .NET Core, Java, Ruby, Node.js, PHP, or Python. Applications can easily run and scale in Windows and Linux-based environments. In this architecture, users interact with the document-processing system through an App Service-hosted web app.

  • Azure AI Document Intelligence is a service that you can use to extract insights from your documents, forms, and images. This architecture uses AI Document Intelligence to analyze the document files and extract the embedded documents along with content and metadata information.

  • Azure AI Search provides a rich search experience for private, diverse content in web, mobile, and enterprise applications. This architecture uses AI Search Vector Storage to index embeddings of the extracted document content and metadata information so that users can search and retrieve documents using natural language processing.

  • Semantic Kernel is a framework that you can use to integrate large language models (LLMs) into your applications. This architecture uses Semantic Kernel to create embeddings for the document content and metadata information, which are stored in Azure AI Search.

  • Azure OpenAI Service provides access to OpenAI's powerful models. This architecture uses Azure OpenAI Service to provide a natural language interface for users to interact with the document-processing system.
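The command message that the web app enqueues to Service Bus (step 1b) only needs to tell the orchestration where the uploaded file lives. A minimal sketch of such a payload follows; the field names are illustrative assumptions, not a documented contract.

```python
import json
import uuid

# Hypothetical command-message schema. The web app serializes this to the
# Service Bus queue, and the durable functions trigger (step 2) reads the
# Blob Storage location from it. Field names are assumptions for this sketch.
command = {
    "messageId": str(uuid.uuid4()),
    "container": "uploads",
    "blobPath": "2024/06/claims-batch-0042.pdf",
    "uploadedBy": "webapp",
}
payload = json.dumps(command)
```

Keeping the message to a pointer plus metadata, rather than the file content itself, stays within Service Bus message-size limits regardless of document size.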

Alternatives

  • To facilitate global distribution, this solution stores metadata in Azure Cosmos DB. Azure SQL Database is another persistent storage option for document metadata and information.

  • To trigger durable functions instances, you can use other messaging platforms, including Azure Event Grid.

  • Semantic Kernel is one of several options for creating embeddings. You can also use Azure Machine Learning or Azure AI services to create embeddings.

  • To provide a natural language interface for users, you can use other large language models (LLMs) within Azure AI Foundry. The platform supports a variety of models from different providers, including Mistral, Meta, Cohere, and Hugging Face.

Scenario details

In this architecture, the pipelines identify the documents in a document file, classify them by type, and store information to use in subsequent processing.
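For illustration, the classification information stored per embedded document (step 4 of the workflow) might look like the following record. The property names are assumptions for this sketch, not a published schema.

```python
# Illustrative Azure Cosmos DB metadata record for one embedded document.
# Property names are assumed for this sketch, not a published schema.
metadata_record = {
    "id": "doc-001",                  # also used as the correlation ID in AI Search
    "sourceBlobUrl": "https://<account>.blob.core.windows.net/uploads/claims-batch-0042.pdf",
    "documentType": "invoice",        # classification from AI Document Intelligence
    "pageRange": {"first": 3, "last": 5},
}
```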

Many companies need to manage and process documents that they scan in bulk and that contain several different document types, such as PDFs or multi-page TIFF images. These documents might originate from outside the organization, and the receiving company doesn't control the format.

Given these constraints, organizations must build their own document-parsing solutions that can include custom technology and manual processes. For example, someone might manually separate individual document types and add classification qualifiers for each document.

Many of these custom solutions are based on the state machine workflow pattern. The solutions use database systems to persist workflow state and use polling services that check for the states that they need to process. Maintaining and enhancing these solutions can increase complexity and effort.
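The legacy pattern described above, a database-backed state machine advanced by a polling worker, can be sketched in a few lines. The state names and transition order are illustrative; the point is that each service scans for work rather than being driven by an orchestrator.

```python
# Minimal sketch of the database-backed state machine pattern. A polling
# service repeatedly scans for records in a state it owns and advances them;
# durable functions replaces this scan-and-advance loop with orchestration.
TRANSITIONS = {"received": "analyzed", "analyzed": "classified", "classified": "done"}

def poll_once(workflow_db):
    for record in workflow_db:
        if record["state"] in TRANSITIONS:
            record["state"] = TRANSITIONS[record["state"]]

db = [{"id": 1, "state": "received"}, {"id": 2, "state": "classified"}]
poll_once(db)
```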

Organizations need reliable, scalable, and resilient solutions to process and manage document identification and classification for their document types. This solution can process millions of documents each day with full observability into the success or failure of the processing pipeline.

The use of natural language processing (NLP) allows users to interact with the system in a conversational manner. Users can ask questions about the documents and receive answers based on the content of the documents.
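The grounded question answering described above reduces to assembling a prompt from the retrieved chunks before calling the model. The template wording below is an illustrative assumption, not a prescribed format.

```python
def build_grounded_prompt(question, retrieved_chunks):
    # Assemble a RAG prompt: the model is instructed to answer only from the
    # retrieved document content (the grounding data). The template wording
    # is an illustrative assumption.
    context = "\n\n".join(
        f"[{c['correlationId']}] {c['content']}" for c in retrieved_chunks
    )
    return (
        "Answer the question using only the sources below. "
        "Cite the source IDs you used.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What is the lease term?",
    [{"correlationId": "doc-001", "content": "The lease term is 24 months."}],
)
```

Including the correlation IDs in the prompt lets answers cite sources that the app can resolve back to Azure Cosmos DB records and the original files in Blob Storage.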

Potential use cases

You can use this solution to:

  • Report titles. Many government agencies and municipalities manage paper records that don't have a digital form. An effective automated solution can generate a file that contains all the documents that you need to satisfy a document request.

  • Manage maintenance records. You might need to scan and send paper records, such as aircraft, locomotive, and machinery maintenance records, to outside organizations.

  • Process permits. City and county permitting departments maintain paper documents that they generate for permit inspection reporting. You can take a picture of several inspection documents and automatically identify, classify, and search across these records.

  • Planogram analysis. Retail and consumer goods companies manage inventory and compliance through store shelf planogram analysis. You can take a picture of a store shelf and extract label information from varying products and automatically identify, classify, and quantify the product information.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Reliability

Reliability ensures your application can meet the commitments you make to your customers. For more information, see Design review checklist for Reliability.

A reliable workload has both resiliency and availability. Resiliency is the ability of the system to recover from failures and continue to function. The goal of resiliency is to return the application to a fully functioning state after a failure occurs. Availability measures whether your users can access your workload when they need to.

To ensure the reliability and availability of Azure OpenAI Service endpoints, consider using a generative AI gateway in front of multiple Azure OpenAI deployments or instances. The gateway's backend load balancer supports round-robin, weighted, and priority-based strategies, which gives you the flexibility to define a load distribution approach that meets your specific requirements.
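The priority and weighted strategies mentioned above can be sketched as a small selection function. The endpoint names, weights, and priority tiers below are illustrative assumptions, not a specific gateway's configuration.

```python
import random

# Sketch of priority-plus-weighted backend selection across Azure OpenAI
# deployments. Endpoint names, weights, and priorities are illustrative.
BACKENDS = [
    {"url": "https://aoai-east.example", "priority": 1, "weight": 3, "healthy": True},
    {"url": "https://aoai-west.example", "priority": 1, "weight": 1, "healthy": True},
    {"url": "https://aoai-ptu.example",  "priority": 2, "weight": 1, "healthy": True},
]

def pick_backend(backends):
    # Use only healthy backends in the best (lowest) priority tier, then
    # choose among them in proportion to their weights. Lower tiers act as
    # failover capacity when the preferred tier is unavailable.
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy Azure OpenAI backends")
    top = min(b["priority"] for b in healthy)
    tier = [b for b in healthy if b["priority"] == top]
    return random.choices(tier, weights=[b["weight"] for b in tier])[0]

choice = pick_backend(BACKENDS)
```

With this shape, a PTU-backed deployment could sit in the preferred tier for guaranteed throughput while pay-as-you-go deployments absorb spillover, or vice versa.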

For reliability information about solution components, see SLA information for Azure online services.

Cost Optimization

Cost Optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Design review checklist for Cost Optimization.

The most significant costs for this architecture are the Azure OpenAI model token usage, Azure AI Document Intelligence image processing, and index capacity requirements in Azure AI Search.
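A back-of-envelope model of these three drivers can help with budgeting. All unit prices below are placeholders, not current Azure pricing; substitute figures from the Azure pricing calculator for your region and tiers.

```python
# Back-of-envelope monthly cost model for the three main cost drivers.
# Every unit price here is an assumed placeholder, not current Azure pricing.
PRICE_PER_1K_TOKENS = 0.01    # assumed Azure OpenAI price per 1,000 tokens
PRICE_PER_DI_PAGE = 0.01      # assumed Document Intelligence price per page
SEARCH_UNIT_MONTHLY = 250.00  # assumed AI Search capacity unit per month

def monthly_cost(tokens, pages, search_units):
    return (
        (tokens / 1000) * PRICE_PER_1K_TOKENS
        + pages * PRICE_PER_DI_PAGE
        + search_units * SEARCH_UNIT_MONTHLY
    )

estimate = monthly_cost(tokens=5_000_000, pages=100_000, search_units=2)
```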

To optimize costs, monitor Azure OpenAI token consumption, choose a Document Intelligence tier that matches your document volume, and size the AI Search service to fit your index size and query load.

Performance Efficiency

Performance Efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. For more information, see Design review checklist for Performance Efficiency.

This solution can expose performance bottlenecks when you process high volumes of data. To ensure proper performance efficiency for your solution, make sure that you understand and plan for Azure Functions scaling options, Azure AI services autoscaling, and Azure Cosmos DB partitioning.

Azure OpenAI provisioned throughput units (PTUs) offer guaranteed performance and availability. Global deployments use Azure's global infrastructure to dynamically route customer traffic to the datacenter that has the best availability for each inference request.

Contributors

This article is maintained by Microsoft.
