Hi Tommy He,
Welcome to Microsoft Q&A forum! Thanks for your question. I'll address your queries regarding Azure Document Intelligence and provide references to the documentation where applicable.
Question 1: Is this the only main documentation site?
Yes, the primary documentation for Azure Document Intelligence is: 🔗 Azure Document Intelligence Documentation
For API and SDK references, check: 🔗 Azure SDK for Document Intelligence (GitHub)
While this covers most functionalities, if you're looking for deeper interface details, consider exploring the Azure AI Services Blog for additional insights.
Question 2: Are all bounding regions from the same page when extracted from the same paragraph, table, or figure?
Not necessarily. Paragraphs and most tables sit on a single page, but a table or figure that spans multiple pages can have bounding regions on different pages. You can verify the page by checking the pageNumber attribute of each entry in boundingRegions in the response JSON.
Example:
{
  "paragraphs": [
    {
      "content": "Sample text",
      "boundingRegions": [
        {
          "pageNumber": 1,
          "polygon": [ ... ]
        }
      ]
    }
  ]
}
Reference: 🔗 Azure Document Intelligence Layout Model
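To check whether a given element crosses a page boundary, you can collect the distinct pageNumber values from its boundingRegions. The following is only a minimal sketch: it assumes you are working with the response as a plain Python dict, and the pages_spanned helper and the sample table are hypothetical, made up for illustration.

```python
def pages_spanned(element):
    """Return the sorted page numbers that an element's bounding regions touch."""
    regions = element.get("boundingRegions", [])
    return sorted({region["pageNumber"] for region in regions})


# Hypothetical table whose rows continue onto a second page (illustrative data only).
table = {
    "boundingRegions": [
        {"pageNumber": 1, "polygon": [1.0, 9.0, 7.5, 9.0, 7.5, 10.5, 1.0, 10.5]},
        {"pageNumber": 2, "polygon": [1.0, 1.0, 7.5, 1.0, 7.5, 3.0, 1.0, 3.0]},
    ]
}

pages = pages_spanned(table)
print(pages)           # [1, 2]
print(len(pages) > 1)  # True: the element crosses a page boundary
```

If the list has more than one entry, treat each bounding region separately when you draw or crop it.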
Question 3: Are polygons expressed with alternating X and Y coordinates?
Yes, the polygon array contains alternating X and Y coordinates that define the bounding region of an extracted element. Example:
"polygon": [ 100, 200, 150, 200, 150, 250, 100, 250 ]
This represents four points, (100, 200), (150, 200), (150, 250), and (100, 250), which are the corners of the bounding box.
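If it helps, the flat array can be paired up into (x, y) points before you use it. This is just a sketch, assuming the polygon arrives as a flat list of numbers as in the example above; polygon_to_points and bounding_box are hypothetical helper names, not part of any SDK.

```python
def polygon_to_points(polygon):
    """Pair up a flat [x1, y1, x2, y2, ...] list into (x, y) tuples."""
    if len(polygon) % 2 != 0:
        raise ValueError("polygon must hold an even number of values")
    return list(zip(polygon[0::2], polygon[1::2]))


def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) enclosing all polygon points."""
    points = polygon_to_points(polygon)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


polygon = [100, 200, 150, 200, 150, 250, 100, 250]
print(polygon_to_points(polygon))  # [(100, 200), (150, 200), (150, 250), (100, 250)]
print(bounding_box(polygon))       # (100, 200, 150, 250)
```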
Question 4: How can I extract all text and figures in reading order?
The prebuilt layout model structures text in reading order. However, headers and footers may not always be captured. Your best approach is:
- Loop through all paragraphs, tables, and figures in the response.
- Sort by boundingRegions.pageNumber to maintain order.
Example Python code:
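# "paragraphs" and "tables" are top-level keys of the analyze result,
# not children of each entry in "pages".
for paragraph in result.get("paragraphs", []):
    print(paragraph["content"])
for table in result.get("tables", []):
    for cell in table["cells"]:
        print(cell["rowIndex"], cell["columnIndex"], cell["content"])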
Reference: 🔗 Extracting Text Using Azure Document Intelligence
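The loop-and-sort approach can be sketched roughly as follows. This is not a definitive implementation: it assumes the response is a plain dict with paragraphs, tables, and figures at the top level, and that each element carries a spans list whose offset gives its position in the document content. It uses that offset as a tie-breaker within a page, which is my own assumption and not something the service guarantees. The helper names and sample data are hypothetical.

```python
def first_region_page(element):
    """Page number of the element's first bounding region (0 if missing)."""
    regions = element.get("boundingRegions", [])
    return regions[0]["pageNumber"] if regions else 0


def first_offset(element):
    """Offset of the element's first span in the document content (0 if missing)."""
    spans = element.get("spans", [])
    return spans[0]["offset"] if spans else 0


def elements_in_reading_order(result):
    """Merge paragraphs, tables, and figures, sorted by page and then by span offset."""
    items = []
    for kind in ("paragraphs", "tables", "figures"):
        for element in result.get(kind, []):
            items.append((first_region_page(element), first_offset(element), kind, element))
    # Sort on the numeric keys only, so the element dicts are never compared.
    items.sort(key=lambda item: (item[0], item[1]))
    return [(kind, element) for _, _, kind, element in items]


# Illustrative data only, deliberately listed out of order.
result = {
    "paragraphs": [
        {"content": "Closing remarks", "spans": [{"offset": 120, "length": 15}],
         "boundingRegions": [{"pageNumber": 2}]},
        {"content": "Introduction", "spans": [{"offset": 0, "length": 12}],
         "boundingRegions": [{"pageNumber": 1}]},
    ],
    "tables": [
        {"cells": [], "spans": [{"offset": 40, "length": 60}],
         "boundingRegions": [{"pageNumber": 1}]},
    ],
}

ordered = elements_in_reading_order(result)
print([kind for kind, _ in ordered])  # ['paragraphs', 'tables', 'paragraphs']
```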
Question 5: Regex to match section types
Your regex /([^/]+)/(\d+)/ should work fine for extracting section types and indices. The expected document structure aligns with:
class DocumentSectionType:
    PARAGRAPHS = "paragraphs"
    TABLES = "tables"
    FIGURES = "figures"
    SECTIONS = "sections"
If you're missing elements, consider using debugging logs to validate section mappings.
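As a quick way to validate the mappings, here is a small sketch that parses element references. It assumes the references look like "/paragraphs/12" (a leading slash, the section type, then a numeric index) and uses an anchored Python variant of your pattern. The parse_element_ref helper and the KNOWN_TYPES set are hypothetical names for illustration.

```python
import re

# Anchored variant of the pattern: "/<type>/<index>" and nothing else.
ELEMENT_REF = re.compile(r"^/([^/]+)/(\d+)$")
KNOWN_TYPES = {"paragraphs", "tables", "figures", "sections"}


def parse_element_ref(ref):
    """Return (section_type, index) for a valid reference, otherwise None."""
    match = ELEMENT_REF.match(ref)
    if match is None:
        return None
    section_type, index = match.group(1), int(match.group(2))
    if section_type not in KNOWN_TYPES:
        return None
    return section_type, index


print(parse_element_ref("/paragraphs/12"))  # ('paragraphs', 12)
print(parse_element_ref("/unknown/3"))      # None
print(parse_element_ref("paragraphs/12"))   # None
```

Logging every reference that comes back as None should show quickly which elements your mapping is skipping.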
Hope this helps! Please try these suggestions and let me know if you need further assistance.
Regards,
Chakravarthi Rangarajan Bhargavi
- Please accept the answer and vote 'Yes' if you find it helpful. This helps support the community. Thanks!