Unstable Table Layout with Merged Column Headers

Aleksandr Kodiakov 0 Reputation points
2025-01-26T13:44:54.73+00:00

Hello Everyone,

We are currently using Azure Document Intelligence to extract table data from PDF documents. By default, we use the layout model for data extraction. However, we've encountered some issues with specific documents, particularly when:

  • There are merged cells in the header.
  • The text in the header cells is middle-aligned.
  • The text in the data rows of the first column, which involves merged cells, does not overlap with the header text.

For example, instead of extracting the table as intended, the layout model produces an unexpected result.

Expected result:

Brand Shares Price Profit and loss BS Amount Investment Note
51 Fringilla Mi Ltd Nullam Corp. 12,762 565,151,813 201,690,436 3,539,076 0

Actual result:

Brand Shares Price Profit and loss BS Amount Investment Note
51 Fringilla Mi Ltd Nullam Corp. 12,762 565,151,813 201,690,436 3,539,076 0

At the same time, if there is a row with long text in the first column, the table is processed properly.

While we understand that using a custom model can resolve this issue, it is challenging to create a custom model for documents containing different tables with similar initial three columns but differing subsequent columns.

Could you please advise if there is a way to automatically preprocess PDFs to help the layout model provide consistent results in such cases? We have attempted some approaches, such as updating the font and altering table borders, but unfortunately, these did not resolve the issue. If you have any other suggestions or solutions, we would greatly appreciate your guidance.

Attached are two mock documents: one where the layout model extracts the data incorrectly, and an updated version with a revised value in a single cell that results in correct extraction. Unfortunately, the latter cannot be used as a permanent solution.

table_extracted_wrong.pdf

table_extracted_properly.pdf

Thank you in advance for your assistance and support.

Azure AI Document Intelligence

1 answer

  1. Sina Salam 16,536 Reputation points
    2025-01-26T15:45:29.22+00:00

    Hello Aleksandr Kodiakov,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that the layout model is giving you unstable table extraction when the column headers are merged.

    This issue with Azure Document Intelligence (formerly Form Recognizer) often arises because the layout model struggles with merged cells or uneven alignment, leading to inconsistencies in table extraction. The full solution may be too long for this page, so please read the points below carefully and follow the links for further steps and configuration.

    You will need to try alternative methods of preprocessing the PDFs to get consistent input, such as the following:
    1. Azure Document Intelligence layout model documentation: covers using the layout model for table extraction, including table structure and cell alignment. https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout
    2. Merging cross-page tables with Document Intelligence: detailed techniques for reconstructing tables that span multiple pages. https://techcommunity.microsoft.com/t5/ai-blog/a-heuristic-method-of-merging-cross-page-tables-based-on/ba-p/4118126
    3. Enhanced table extraction with custom models: strategies for handling complex tables and customizing extraction. https://techcommunity.microsoft.com/t5/ai-blog/enhanced-table-extraction-from-documents-with-form-recognizer/ba-p/2058011
    4. Unstructured's guide to processing PDFs in Python: processing and extracting data from PDFs with custom Python logic. https://unstructured.io/blog/how-to-process-pdf-in-python
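
As one illustration of the "custom Python logic" mentioned in the last item, here is a minimal sketch of a post-processing step. It is not taken from any of the linked articles and it does not change how the layout model detects the table; it only normalizes the output once cells have been returned. It assumes the table is available as a plain dictionary in the shape of the REST response (`rowCount`, `columnCount`, and `cells` with `rowIndex`, `columnIndex`, optional `rowSpan` / `columnSpan`, optional `kind`, and `content`). The function names and the sample table are made up for the example and do not reflect the layout of the attached PDFs.

```python
from typing import Any


def table_to_grid(table: dict[str, Any]) -> list[list[str]]:
    """Expand a table into a dense grid, repeating the text of spanned cells."""
    rows, cols = table["rowCount"], table["columnCount"]
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in table["cells"]:
        r0, c0 = cell["rowIndex"], cell["columnIndex"]
        # Assumption: a missing rowSpan / columnSpan means a span of 1.
        for r in range(r0, min(r0 + cell.get("rowSpan", 1), rows)):
            for c in range(c0, min(c0 + cell.get("columnSpan", 1), cols)):
                grid[r][c] = cell.get("content", "")
    return grid


def flatten_headers(table: dict[str, Any]) -> tuple[list[str], list[list[str]]]:
    """Join stacked header rows into one label per column; return (headers, data)."""
    grid = table_to_grid(table)
    header_rows: set[int] = set()
    for cell in table["cells"]:
        if cell.get("kind") == "columnHeader":
            start = cell["rowIndex"]
            end = min(start + cell.get("rowSpan", 1), table["rowCount"])
            header_rows.update(range(start, end))
    headers = []
    for c in range(table["columnCount"]):
        parts: list[str] = []
        for r in sorted(header_rows):
            text = grid[r][c]
            # Skip empty cells and labels repeated by a vertical merge.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        headers.append(" ".join(parts))
    data = [row for r, row in enumerate(grid) if r not in header_rows]
    return headers, data


# Hypothetical sample: "Brand" is merged across two columns, "Shares" across two rows.
sample_table = {
    "rowCount": 3,
    "columnCount": 3,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "columnSpan": 2, "kind": "columnHeader", "content": "Brand"},
        {"rowIndex": 0, "columnIndex": 2, "rowSpan": 2, "kind": "columnHeader", "content": "Shares"},
        {"rowIndex": 1, "columnIndex": 0, "kind": "columnHeader", "content": "No"},
        {"rowIndex": 1, "columnIndex": 1, "kind": "columnHeader", "content": "Name"},
        {"rowIndex": 2, "columnIndex": 0, "content": "51"},
        {"rowIndex": 2, "columnIndex": 1, "content": "Fringilla Mi Ltd"},
        {"rowIndex": 2, "columnIndex": 2, "content": "12,762"},
    ],
}

headers, data = flatten_headers(sample_table)
print(headers)  # ['Brand No', 'Brand Name', 'Shares']
print(data)     # [['51', 'Fringilla Mi Ltd', '12,762']]
```

If you work with SDK result objects rather than the raw JSON, the field names may differ (for example snake_case attributes), so the keys above would need adjusting. A flattened header row like this can also make it easier to compare the output of the "wrong" and "properly extracted" documents column by column.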

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.
