Unstable Table Layout with Merged Column Headers

Question

Hello Everyone,

We are currently using Azure Document Intelligence to extract table data from PDF documents. By default, we use the layout model for data extraction. However, we've encountered some issues with specific documents, particularly when:

There are merged cells in the header.
The text in the header cells is middle-aligned.
The text in the data rows of the first column, which involves merged cells, does not overlap with the header text.

For example, instead of extracting the table as intended, the layout model produces an unexpected result. However, if there is a row with long text in the first column, the table is processed correctly. Expected result:

	Brand		Shares	Price	Profit and loss	BS Amount	Investment	Note
51	Fringilla Mi Ltd	Nullam Corp.	12,762	565,151,813	201,690,436	3,539,076	0

Actual result:

	Brand	Shares	Price	Profit and loss	BS Amount	Investment	Note
51	Fringilla Mi Ltd Nullam Corp.	12,762	565,151,813	201,690,436	3,539,076	0

At the same time, if there is a row with long text in the first column, the table is processed properly.

While we understand that using a custom model can resolve this issue, it is challenging to create a custom model for documents containing different tables with similar initial three columns but differing subsequent columns.

Could you please advise if there is a way to automatically preprocess PDFs to help the layout model provide consistent results in such cases? We have attempted some approaches, such as updating the font and altering table borders, but unfortunately, these did not resolve the issue. If you have any other suggestions or solutions, we would greatly appreciate your guidance.

Attached are an example of a mock document where the layout model extracts data incorrectly and an updated version of the same document with a revised value in a single cell that results in correct extraction. Unfortunately, the latter cannot be used as a permanent solution.

table_extracted_wrong.pdf

table_extracted_properly.pdf

Thank you in advance for your assistance and support.

Answer

Hello Aleksandr Kodiakov,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you are seeing unstable table layout extraction when column headers are merged.

This issue with Azure Document Intelligence (formerly Form Recognizer) often arises because the layout model struggles with merged cells or uneven alignment, leading to inconsistencies in table extraction. The full solution may be too long for this page, so please read the points below carefully and follow the links for further steps and configuration.

You may need to use alternative tools to preprocess the PDFs for consistent input, such as the following:

Camelot Library: https://camelot-py.readthedocs.io
Tabula-py: https://github.com/tabulapdf/tabula-py
pdfplumber: https://github.com/jsvine/pdfplumber
PyMuPDF (fitz): https://pymupdf.readthedocs.io
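
As one illustration of what these libraries make possible, the sketch below assigns words to columns by their horizontal position instead of relying on detected cell borders. It assumes word dictionaries shaped like those returned by pdfplumber's `page.extract_words()` (keys `text`, `x0`, `x1`, `top`). The column boundaries, the row tolerance, the sample coordinates, and the function name `words_to_rows` are all hypothetical and would need tuning for each document layout.

```python
from bisect import bisect_right


def words_to_rows(words, boundaries, row_tolerance=3.0):
    """Group words into rows by vertical position, then into columns by x-center.

    words: dicts with "text", "x0", "x1", "top" (pdfplumber-style, an assumption).
    boundaries: sorted x-coordinates separating adjacent columns (hypothetical).
    """
    rows = []  # each entry: (top of first word in the row, {column: [(x0, text)]})
    for word in sorted(words, key=lambda w: w["top"]):
        center = (word["x0"] + word["x1"]) / 2
        col = bisect_right(boundaries, center)  # index of the column holding the center
        if not rows or abs(word["top"] - rows[-1][0]) > row_tolerance:
            rows.append((word["top"], {}))  # start a new row
        rows[-1][1].setdefault(col, []).append((word["x0"], word["text"]))
    n_cols = len(boundaries) + 1
    return [
        # Sort each cell's words left to right before joining them.
        [" ".join(text for _, text in sorted(cells.get(c, []))) for c in range(n_cols)]
        for _, cells in rows
    ]


# Hand-made coordinates for the problematic row (not taken from the real PDF).
sample_words = [
    {"text": "51", "x0": 10, "x1": 25, "top": 100.0},
    {"text": "Fringilla", "x0": 45, "x1": 85, "top": 100.0},
    {"text": "Mi", "x0": 88, "x1": 100, "top": 100.4},
    {"text": "Ltd", "x0": 103, "x1": 120, "top": 100.0},
    {"text": "Nullam", "x0": 145, "x1": 180, "top": 100.0},
    {"text": "Corp.", "x0": 183, "x1": 210, "top": 100.0},
    {"text": "12,762", "x0": 235, "x1": 270, "top": 100.0},
]
sample_boundaries = [40, 140, 230]  # hypothetical column separators

print(words_to_rows(sample_words, sample_boundaries))
# [['51', 'Fringilla Mi Ltd', 'Nullam Corp.', '12,762']]

# With pdfplumber installed, real words could come from:
#   with pdfplumber.open("table_extracted_wrong.pdf") as pdf:
#       words = pdf.pages[0].extract_words()
```

Because the split depends only on where each word sits on the page, the two company names stay in separate columns even when the header above them is a single merged cell. The trade-off is that the boundaries have to be known or estimated for every table layout.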

The following resources may also help:

Azure Document Intelligence layout model documentation, which covers table extraction with the layout model, including table structure and cell alignment: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout
Merging cross-page tables with Document Intelligence, with detailed techniques for reconstructing tables that span multiple pages: https://techcommunity.microsoft.com/t5/ai-blog/a-heuristic-method-of-merging-cross-page-tables-based-on/ba-p/4118126
Enhanced table extraction with custom models, with strategies for handling complex tables and customizing extraction: https://techcommunity.microsoft.com/t5/ai-blog/enhanced-table-extraction-from-documents-with-form-recognizer/ba-p/2058011
Unstructured guide to processing PDFs in Python, for processing and extracting data from PDFs with custom Python logic: https://unstructured.io/blog/how-to-process-pdf-in-python
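
Whichever approach you choose, it may help to detect when the layout model has collapsed the two sub-columns before the data reaches downstream code. The sketch below expands a table from the layout result's JSON (cells carrying `rowIndex`, `columnIndex`, and optional `rowSpan`/`columnSpan`) into a rectangular grid and compares the column count with an expected value. The sample dictionary is hand-written to mirror the "actual result" in the question, not real service output, and `expected_columns` is a hypothetical parameter you would supply per table type.

```python
def table_to_grid(table):
    """Expand layout-style cells into a row-major grid; spanned cells repeat their content."""
    grid = [[""] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        row_end = cell["rowIndex"] + cell.get("rowSpan", 1)
        col_end = cell["columnIndex"] + cell.get("columnSpan", 1)
        for r in range(cell["rowIndex"], row_end):
            for c in range(cell["columnIndex"], col_end):
                grid[r][c] = cell.get("content", "")
    return grid


def has_expected_columns(table, expected_columns):
    """Return True when the extracted table has the column count we expect."""
    return table["columnCount"] == expected_columns


# Hand-written stand-in for the collapsed extraction (8 columns instead of 9).
header = ["", "Brand", "Shares", "Price", "Profit and loss", "BS Amount", "Investment", "Note"]
data = ["51", "Fringilla Mi Ltd Nullam Corp.", "12,762", "565,151,813",
        "201,690,436", "3,539,076", "0", ""]
sample_table = {
    "rowCount": 2,
    "columnCount": 8,
    "cells": [
        {"rowIndex": r, "columnIndex": c, "content": text}
        for r, row in enumerate([header, data])
        for c, text in enumerate(row)
    ],
}

if not has_expected_columns(sample_table, expected_columns=9):
    # Flag the table for a fallback path, such as position-based splitting.
    print("Column count mismatch:", table_to_grid(sample_table)[1][1])
```

A check like this does not repair the table, but it lets a pipeline route only the affected documents to a slower fallback instead of post-processing every file.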

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.
