Azure AI Document Intelligence prebuilt-layout model cannot extract "role" from DOCX files

Question

I am using the Azure Document Intelligence SDK (version 1.0.0) with the prebuilt-layout model to extract paragraph roles from DOCX documents (the same code works well with PDFs). However, no roles are returned for the paragraphs, even though the documentation says that the "sectionHeading" role is supported in version 4.0: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code#paragraph-roles

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
import azure.ai.documentintelligence

def analyze_word_document(endpoint, key, document_path):
    print(f"Azure Document Intelligence SDK Version: {azure.ai.documentintelligence.__version__}")

    client = DocumentIntelligenceClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key)
    )

    with open(document_path, "rb") as f:
        document = f.read()

    poller = client.begin_analyze_document(
        "prebuilt-layout",
        document,
        content_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    
    result = poller.result()

    for i, paragraph in enumerate(result.paragraphs):
        print(f"\nParagraph {i}:")
        print(f"Content: {paragraph.content}")
        print(f"Role: {paragraph.role}")

    # Collect all roles
    roles = set(p.role for p in result.paragraphs if p.role is not None)
    print(f"\nAll roles found in document: {roles}")

    # Extract section headings
    section_headings = []
    for paragraph in result.paragraphs:
        if paragraph.role == "sectionHeading":
            section_headings.append({
                'content': paragraph.content,
                'bounding_regions': paragraph.bounding_regions
            })
    
    return section_headings

Accepted Answer

Hi @Hongqian Li,

The prebuilt-layout model does support extracting paragraph roles, including "sectionHeading", from DOCX files with SDK version 1.0.0. You mentioned that roles are returned for PDFs but not for DOCX files; in our testing, the "sectionHeading" role was recognized in DOCX files as well. The issue might be caused by comparing roles against the raw string "sectionHeading" instead of ParagraphRole.SECTION_HEADING. Below is a working example that extracts section headings from DOCX files.

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence.models import ParagraphRole
import azure.ai.documentintelligence

def analyze_word_document(endpoint, key, document_path):
    print(f"Azure Document Intelligence SDK Version: {azure.ai.documentintelligence.__version__}")

    client = DocumentIntelligenceClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key)
    )

    with open(document_path, "rb") as f:
        document = f.read()

    poller = client.begin_analyze_document(
        "prebuilt-layout",
        document,
        content_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    
    result = poller.result()

    section_headings = []

    for i, paragraph in enumerate(result.paragraphs):
        print(f"\nParagraph {i}:")
        print(f"Content: {paragraph.content}")
        print(f"Role: {paragraph.role}")

        # Extract section headings
        if paragraph.role == ParagraphRole.SECTION_HEADING:
            section_headings.append(paragraph.content)

    print(f"\nAll roles found in document: {set(p.role for p in result.paragraphs if p.role is not None)}")

    print("\nExtracted Section Headings:")
    for heading in section_headings:
        print(f"• {heading}")

    return section_headings

if __name__ == "__main__":
    endpoint = "https://XXXXXXX.cognitiveservices.azure.com/"
    key = "KEY"
    document_path = "FILE_PATH"

    analyze_word_document(endpoint, key, document_path)

To resolve this, make sure you are using Azure AI Document Intelligence SDK v1.0.0 or later, and update your code to check roles using ParagraphRole.SECTION_HEADING. In our testing this approach extracted section headings from DOCX files, which matches the expected behaviour of the prebuilt-layout model.
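
If you are unsure whether the role comes back as an enum member or as a plain string, a small helper can normalize it before comparing, and a per-role count makes it easy to see whether any roles were returned at all. The following is only a sketch: StandInRole, role_name, summarize_roles and section_headings are hypothetical names, and the sample paragraphs are stand-in objects with content and role attributes rather than real SDK results.

```python
from collections import Counter
from enum import Enum
from types import SimpleNamespace


class StandInRole(str, Enum):
    """Hypothetical stand-in for the SDK's ParagraphRole enum (assumption)."""
    TITLE = "title"
    SECTION_HEADING = "sectionHeading"


def role_name(role):
    """Return the role as a plain string, or None when no role was assigned."""
    if role is None:
        return None
    if isinstance(role, Enum):
        # Use the underlying value so enum members and raw strings compare equal.
        return str(role.value)
    return str(role)


def summarize_roles(paragraphs):
    """Count paragraphs per role; paragraphs without a role are counted under None."""
    return Counter(role_name(p.role) for p in paragraphs)


def section_headings(paragraphs):
    """Collect the content of every paragraph whose role is 'sectionHeading'."""
    return [p.content for p in paragraphs if role_name(p.role) == "sectionHeading"]


# Stand-in data shaped like result.paragraphs (objects with .content and .role).
sample = [
    SimpleNamespace(content="Report", role=StandInRole.TITLE),
    SimpleNamespace(content="1. Introduction", role=StandInRole.SECTION_HEADING),
    SimpleNamespace(content="2. Method", role="sectionHeading"),
    SimpleNamespace(content="Body text.", role=None),
]

print(summarize_roles(sample))
print(section_headings(sample))
```

With a real response you would pass result.paragraphs instead of sample. If the summary only contains the None key, the service returned no roles for that file, so the comparison style is not the cause.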

Hope this helps. Let me know if you need further assistance!

