Issue with Extracting Table with Merged Cells in Azure Document Intelligence Custom Model

Question

Hi Community
I have trained a Custom AI Model in Azure Document Intelligence to extract tables from PDFs. The model works well for most tables, but it's failing to extract one specific table that contains:

Merged cells in the header
Multi-line text in some columns
Arrows and phase indicators above the table that I don't need

When I test the model using Power Automate, I don’t get any JSON output for this table. Other tables in the same document are extracted correctly.
Here is a sample of the table I need to extract (first 4 columns):

[Image: sample table]

Troubleshooting Steps I Tried:

✔ Trained the model with multiple variations of the table.
✔ Enabled "Advanced Table Extraction" mode.
✔ Ensured proper labeling during model training.
✔ Checked whether the issue is related to Power Automate by testing in the Azure AI Studio directly.

Question:

How can I improve table extraction for merged cells?
Is there a way to filter out non-table elements (like arrows) automatically before AI processing?
Should I preprocess the document using OCR in Power Automate to extract clean text first?

Any insights or suggestions would be greatly appreciated! 🚀

Answer

Hi Udit Sati,

Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!

It sounds like you've put in a lot of effort to train your custom AI model in Azure Document Intelligence. Here are some suggestions to address the issues you're facing:

How can I improve table extraction for merged cells?

Ensure that your training set includes diverse samples of tables with merged cells. Explicitly annotate row and column boundaries to cover edge cases. Consider using fixed table fields for structured layouts, as they provide stricter column mapping.

Train your model with diverse table variations, including merged and non-merged headers.
Try the Prebuilt Layout Model (prebuilt-layout), which often handles merged cells better than a custom model because it reports row and column spans for each table cell.
Post-process the extracted cells in Python (Pandas) to reconstruct the table if needed.
Microsoft Docs: Prebuilt Layout Model
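The post-processing step above could look roughly like the sketch below. It assumes the table arrives as a dictionary shaped like the Layout model's JSON output (`rowCount`, `columnCount`, and `cells` with `rowIndex`, `columnIndex`, optional `rowSpan`/`columnSpan`, and `content`). The helper names `cells_to_grid` and `flatten_header` and the sample data are hypothetical, made up for illustration only.

```python
from typing import Any


def cells_to_grid(table: dict[str, Any]) -> list[list[str]]:
    """Expand table cells into a rectangular grid, repeating merged cells."""
    rows, cols = table["rowCount"], table["columnCount"]
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in table["cells"]:
        r0, c0 = cell["rowIndex"], cell["columnIndex"]
        # Assumption: spans are left out of the JSON when they equal 1.
        row_span = cell.get("rowSpan", 1)
        col_span = cell.get("columnSpan", 1)
        # Multi-line cell text is joined into a single line.
        text = cell.get("content", "").replace("\n", " ")
        for r in range(r0, min(r0 + row_span, rows)):
            for c in range(c0, min(c0 + col_span, cols)):
                grid[r][c] = text
    return grid


def flatten_header(grid: list[list[str]], header_rows: int):
    """Join stacked header rows into one label per column."""
    labels = []
    for c in range(len(grid[0])):
        parts: list[str] = []
        for r in range(header_rows):
            text = grid[r][c]
            if text and text not in parts:
                parts.append(text)
        labels.append(" / ".join(parts))
    return labels, grid[header_rows:]


# Hypothetical sample: "Task" spans two header rows, "Schedule" spans two columns.
sample_table = {
    "rowCount": 3,
    "columnCount": 3,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "rowSpan": 2, "content": "Task"},
        {"rowIndex": 0, "columnIndex": 1, "columnSpan": 2, "content": "Schedule"},
        {"rowIndex": 1, "columnIndex": 1, "content": "Start"},
        {"rowIndex": 1, "columnIndex": 2, "content": "End"},
        {"rowIndex": 2, "columnIndex": 0, "content": "Design\nreview"},
        {"rowIndex": 2, "columnIndex": 1, "content": "Jan"},
        {"rowIndex": 2, "columnIndex": 2, "content": "Mar"},
    ],
}

labels, body = flatten_header(cells_to_grid(sample_table), header_rows=2)
print(labels)
print(body)
```

The result is kept as plain lists so it does not depend on any library; if Pandas is available, `pd.DataFrame(body, columns=labels)` should turn it into a DataFrame. The number of header rows is passed in by hand here, which is a simplification.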

Is there a way to filter out non-table elements (like arrows) automatically before AI processing?

You can preprocess the document to remove non-table elements like arrows and phase indicators. This can be done using custom scripts or tools that clean up the document before feeding it into the AI model.

Preprocess documents with Azure AI Vision or OpenCV to remove non-table elements before processing.
If the unwanted elements sit outside the table and the document is a standard type, consider a matching prebuilt model (for example prebuilt-invoice or prebuilt-receipt) for structured data extraction.
Microsoft Docs: Optimize Document Preprocessing
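One way to approach the preprocessing idea is to crop each page to the table region so that arrows and phase indicators above it never reach the model. The sketch below only computes the crop rectangle. It assumes you already have the table's bounding polygon as a flat list `[x1, y1, x2, y2, ...]` (for example from a first pass with prebuilt-layout), and that PDF coordinates are in inches and need to be scaled by the DPI of the rendered page image. The function name `crop_box` and all numbers are hypothetical.

```python
import math


def crop_box(polygon, page_width, page_height, scale=1.0, margin=0.0):
    """Return (left, top, right, bottom) in pixels for a table polygon.

    polygon: flat list [x1, y1, x2, y2, ...] in page units.
    scale:   pixels per page unit (the render DPI when the unit is inches).
    margin:  extra padding around the table, in page units.
    """
    xs = polygon[0::2]
    ys = polygon[1::2]
    # Pad the box, then clamp it to the page so the crop stays valid.
    left = max(min(xs) - margin, 0.0)
    top = max(min(ys) - margin, 0.0)
    right = min(max(xs) + margin, page_width)
    bottom = min(max(ys) + margin, page_height)
    return (
        math.floor(left * scale),
        math.floor(top * scale),
        math.ceil(right * scale),
        math.ceil(bottom * scale),
    )


# Hypothetical table polygon on a Letter page (inches), rendered at 200 DPI.
table_polygon = [1.0, 3.0, 7.5, 3.0, 7.5, 6.0, 1.0, 6.0]
left, top, right, bottom = crop_box(
    table_polygon, page_width=8.5, page_height=11.0, scale=200, margin=0.25
)
print(left, top, right, bottom)

# With an OpenCV/NumPy image, the crop itself would be a slice:
#   table_only = image[top:bottom, left:right]
```

Cropping is a blunt tool compared with detecting and erasing the arrows themselves, but it needs no image analysis of its own and keeps the table pixels untouched.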

Should I preprocess the document using OCR in Power Automate to extract clean text first?

Preprocessing the document using OCR in Power Automate can help extract clean text and improve the accuracy of table extraction. This step can ensure that the OCR quality is high, which is crucial for scanned PDFs.

While Azure Document Intelligence includes OCR, using Azure AI Vision OCR in Power Automate may enhance text clarity before extraction.
Compare results from the Document Intelligence OCR Model (prebuilt-read) and your custom model to identify the best approach.
Microsoft Docs: Azure AI Vision OCR
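For the comparison, a rough first check is to diff the text each model returns for the same page. The sketch below uses Python's standard `difflib` and assumes you have already pulled the text out of both results as plain strings. The function name `compare_text` and the sample strings are hypothetical.

```python
import difflib


def compare_text(reference: str, candidate: str):
    """Return (similarity ratio, reference words missing from candidate)."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" mark reference words the candidate lacks.
        if tag in ("delete", "replace"):
            missing.extend(ref_words[i1:i2])
    return matcher.ratio(), missing


# Hypothetical text for the same table row from the two models.
read_text = "Phase 1 Design review Jan Mar"
custom_text = "Phase 1 Design Jan Mar"

ratio, missing = compare_text(read_text, custom_text)
print(round(ratio, 3), missing)
```

A low ratio or a long list of missing words for the problem table, next to a high ratio for the tables that extract correctly, would suggest the text is being read but not assigned to the table, which points at labeling or structure rather than OCR quality.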

I hope these suggestions help improve your model's performance. If you have any further questions or need more assistance, feel free to ask!

If the response helped, please click "Accept Answer" and select "Yes" for "Was this answer helpful?".

Doing so helps other community members with a similar issue find the solution. I highly appreciate your contribution to the community.

Thank You
