Azure AI Document Intelligence prebuilt-layout model cannot extract "role" from DOCX files

Hongqian Li 20 Reputation points
2025-02-20T06:20:06.0433333+00:00

I am using the Azure Document Intelligence SDK (version 1.0.0) with the prebuilt-layout model to extract paragraph roles from DOCX documents (the same code works well with PDFs). However, I am not receiving any roles for the paragraphs, even though the documentation says that the "sectionHeading" role is supported in version 4.0: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code#paragraph-roles

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
import azure.ai.documentintelligence

def analyze_word_document(endpoint, key, document_path):
    print(f"Azure Document Intelligence SDK Version: {azure.ai.documentintelligence.__version__}")

    client = DocumentIntelligenceClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key)
    )

    with open(document_path, "rb") as f:
        document = f.read()

    poller = client.begin_analyze_document(
        "prebuilt-layout",
        document,
        content_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    
    result = poller.result()

    for i, paragraph in enumerate(result.paragraphs):
        print(f"\nParagraph {i}:")
        print(f"Content: {paragraph.content}")
        print(f"Role: {paragraph.role}")

    # Collect all roles
    roles = set(p.role for p in result.paragraphs if p.role is not None)
    print(f"\nAll roles found in document: {roles}")

    # Extract section headings
    section_headings = []
    for paragraph in result.paragraphs:
        if paragraph.role == "sectionHeading":
            section_headings.append({
                'content': paragraph.content,
                # DocumentParagraph exposes bounding_regions rather than bounding_box
                'bounding_regions': paragraph.bounding_regions
            })
    
    return section_headings
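
As a diagnostic, it may help to tally how many paragraphs carry each role, including those with no role at all, and compare the tallies for the DOCX and PDF versions of the same file. The sketch below is a hypothetical helper (`count_roles` is not part of the SDK). It assumes only that each paragraph object has a `role` attribute, as in the loop above, and uses `SimpleNamespace` stand-ins in place of a real `result.paragraphs`.

```python
from collections import Counter
from types import SimpleNamespace


def count_roles(paragraphs):
    """Count paragraphs per role; paragraphs without a role are keyed as None."""
    # getattr with a default tolerates objects that lack a role attribute.
    return Counter(getattr(p, "role", None) for p in paragraphs)


# Stand-in data (not real service output), shaped like result.paragraphs.
sample_paragraphs = [
    SimpleNamespace(content="Introduction", role="sectionHeading"),
    SimpleNamespace(content="Some body text.", role=None),
    SimpleNamespace(content="More body text.", role=None),
]

role_counts = count_roles(sample_paragraphs)
for role, count in role_counts.items():
    print(f"{role}: {count}")
```

If every paragraph in the DOCX result lands under `None` while the PDF result shows other roles, the difference is in what the service returned rather than in how the roles are compared.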

Azure AI Document Intelligence

Accepted answer
  1. santoshkc 12,990 Reputation points Microsoft Vendor
    2025-02-21T11:49:13.6233333+00:00

    Hi @Hongqian Li,

    The prebuilt-layout model does support extracting paragraph roles, including "sectionHeading", from DOCX files with SDK version 1.0.0. You mentioned that roles are returned for PDFs but not for DOCX files; in our testing, the "sectionHeading" role is recognized in DOCX files as well. The issue might be due to checking roles as raw strings ("sectionHeading") instead of using ParagraphRole.SECTION_HEADING. Below is a working example that extracts section headings from DOCX files.

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.documentintelligence.models import ParagraphRole
    import azure.ai.documentintelligence
    
    def analyze_word_document(endpoint, key, document_path):
        print(f"Azure Document Intelligence SDK Version: {azure.ai.documentintelligence.__version__}")
    
        client = DocumentIntelligenceClient(
            endpoint=endpoint,
            credential=AzureKeyCredential(key)
        )
    
        with open(document_path, "rb") as f:
            document = f.read()
    
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            document,
            content_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
        )
        
        result = poller.result()
    
        section_headings = []
    
        for i, paragraph in enumerate(result.paragraphs):
            print(f"\nParagraph {i}:")
            print(f"Content: {paragraph.content}")
            print(f"Role: {paragraph.role}")
    
            # Extract section headings
            if paragraph.role == ParagraphRole.SECTION_HEADING:
                section_headings.append(paragraph.content)
    
        print(f"\nAll roles found in document: {set(p.role for p in result.paragraphs if p.role is not None)}")
    
        print("\nExtracted Section Headings:")
        for heading in section_headings:
            print(f"• {heading}")
    
        return section_headings
    
    if __name__ == "__main__":
        endpoint = "https://XXXXXXX.cognitiveservices.azure.com/"
        key = "KEY"
        document_path = "FILE_PATH"
    
        analyze_word_document(endpoint, key, document_path)
    

    To resolve this, ensure you are using Azure AI Document Intelligence SDK v1.0.0 or later, and update your code to check roles using ParagraphRole.SECTION_HEADING. This approach correctly extracts section headings from DOCX files, aligning with the expected behaviour of the prebuilt-layout model.
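
If it is unclear whether `paragraph.role` arrives as an enum member or as a plain string, one option is to normalize it before comparing. The sketch below is a hypothetical helper, not part of the SDK: `role_value` and `is_section_heading` are illustrative names, and the local `ParagraphRole` class is a stand-in defined here so the example runs without the SDK installed. It assumes the real enum's value for section headings is the string "sectionHeading", as the role name in the documentation suggests.

```python
from enum import Enum
from types import SimpleNamespace


class ParagraphRole(str, Enum):
    """Local stand-in for the SDK enum (assumption: same string value)."""
    SECTION_HEADING = "sectionHeading"


def role_value(role):
    """Return the role as a plain string, or None when no role is set."""
    if role is None:
        return None
    # Enum members carry the string in .value; plain strings pass through.
    return role.value if isinstance(role, Enum) else str(role)


def is_section_heading(paragraph):
    """True when the paragraph's role is a section heading, in either form."""
    return role_value(getattr(paragraph, "role", None)) == "sectionHeading"


# Stand-in paragraphs (not real service output).
paragraphs = [
    SimpleNamespace(content="Overview", role=ParagraphRole.SECTION_HEADING),
    SimpleNamespace(content="Details", role="sectionHeading"),
    SimpleNamespace(content="Body text", role=None),
]

headings = [p.content for p in paragraphs if is_section_heading(p)]
print(headings)
```

With a real result, the same check could be applied as `[p.content for p in result.paragraphs if is_section_heading(p)]`.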

    Hope this helps. Let me know if you need further assistance!


    If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful".

    1 person found this answer helpful.
