Hi @Hongqian Li,
The Azure AI Document Intelligence prebuilt-layout model does support extracting paragraph roles, including "sectionHeading", from DOCX files with SDK version 1.0.0. You mentioned that roles were extracted for PDFs but not for DOCX files; in our testing, the "sectionHeading" role was recognized in DOCX files as well. The issue might be caused by checking roles against raw strings ("sectionHeading") instead of using ParagraphRole.SECTION_HEADING. Below is a working example that extracts section headings from DOCX files.
import azure.ai.documentintelligence
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import ParagraphRole
from azure.core.credentials import AzureKeyCredential


def analyze_word_document(endpoint, key, document_path):
    print(f"Azure Document Intelligence SDK Version: {azure.ai.documentintelligence.__version__}")

    client = DocumentIntelligenceClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key)
    )

    # Read the DOCX file as raw bytes
    with open(document_path, "rb") as f:
        document = f.read()

    poller = client.begin_analyze_document(
        "prebuilt-layout",
        document,
        content_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    result = poller.result()

    section_headings = []
    for i, paragraph in enumerate(result.paragraphs):
        print(f"\nParagraph {i}:")
        print(f"Content: {paragraph.content}")
        print(f"Role: {paragraph.role}")

        # Extract section headings
        if paragraph.role == ParagraphRole.SECTION_HEADING:
            section_headings.append(paragraph.content)

    print(f"\nAll roles found in document: {set(p.role for p in result.paragraphs if p.role is not None)}")

    print("\nExtracted Section Headings:")
    for heading in section_headings:
        print(f"• {heading}")

    return section_headings


if __name__ == "__main__":
    endpoint = "https://XXXXXXX.cognitiveservices.azure.com/"
    key = "KEY"
    document_path = "FILE_PATH"
    analyze_word_document(endpoint, key, document_path)
To resolve this, ensure you are using Azure AI Document Intelligence SDK v1.0.0 or later, and update your code to check roles using ParagraphRole.SECTION_HEADING. This approach extracts section headings from DOCX files, which matches the expected behaviour of the prebuilt-layout model.
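If you are unsure whether your code receives the role as a raw string or as an enum member, a small normalizing helper can make the check tolerant of both. The following is only a sketch: ParagraphRoleStandIn and is_section_heading are hypothetical names used for illustration so the snippet runs without the SDK installed, and it assumes the role value is "sectionHeading" as described above. In real code you would import ParagraphRole from azure.ai.documentintelligence.models instead of defining the stand-in.

```python
from enum import Enum


class ParagraphRoleStandIn(str, Enum):
    """Hypothetical stand-in for the SDK's ParagraphRole enum (illustration only)."""
    TITLE = "title"
    SECTION_HEADING = "sectionHeading"


def is_section_heading(role):
    """Return True if `role` denotes a section heading.

    Accepts None, a raw string, or an enum member, so the check does not
    depend on which form the caller happens to have.
    """
    if role is None:
        return False
    # Enum members expose the underlying string via .value; plain strings are used as-is.
    value = role.value if isinstance(role, Enum) else str(role)
    return value == "sectionHeading"
```

With a helper like this, the loop in the example could filter paragraphs with is_section_heading(paragraph.role) rather than comparing directly.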
Hope this helps. Let me know if you need further assistance!
If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".