Tables Extraction Using Custom Extraction Model (Merged Rows and Row Splitting Issues)

Pankaj Singh Negi 20 Reputation points
2024-12-17T09:30:29.3833333+00:00

Hello,

I am working on extracting tables from PDF documents using a custom extraction model. The documents vary: some contain tables, others have paragraphs, and one is a scanned PDF. On average, each PDF has about 16 pages, with tables typically found on the first 3 to 5 pages. I have annotated fields in the documents and am using a dynamic table field for table extraction.

However, I'm running into an issue: one document gives the expected results, but when I test on another, similar document, the extracted table merges some rows together and splits other single rows into multiple rows.

Additionally, I tested a structured PDF (with paragraphs) and observed unexpected behavior.

Sample Data:

Super Adventure Toy Set Level II involves specialized play features aimed at enhancing imaginative play for kids. Product Tags: 357, 389, 395, 590, 591, 399, 300, 295, 899, 296 are used for tracking, with a price of $82.75 per day of usage.

Ultimate Racing Car Track is designed to improve racing skills and provide hours of competitive fun. Product Tags: 575, 692, 682, 445, 597, 397, 239, 299, 585, 777, 491, 391, 395, 339, 494, 480, 590, 491 are used for this item, with no fixed price as it may vary by location.

Collector's Edition Action Figure Set Phase 2 includes rare and collectible action figures for advanced collectors. Product Tags: 34980, 48091, 48190, 44089, 45198, 46928, 47810, 37991, 48192, 48093, 89291, 349891, priced at $230.90 per set.

In this sample data, instead of showing "no fixed price" in the rate column, the output incorrectly displays the price of the next item ($230.90 per set).
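A shifted value like this can be flagged after extraction with a simple consistency check. The sketch below is only an illustration: the `rows` structure and its `source_text` and `rate` keys are hypothetical stand-ins for however the extracted table is stored, not part of the Document Intelligence output format. It flags any row whose extracted price does not occur in that row's own source text.

```python
import re

# Matches dollar amounts such as "$82.75" or "$230.90".
PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")

def find_misaligned_rates(rows):
    """Return indices of rows whose extracted rate is absent from the row's source text.

    `rows` is a list of dicts with the hypothetical keys "source_text"
    (the paragraph the row came from) and "rate" (the extracted rate cell).
    """
    suspicious = []
    for i, row in enumerate(rows):
        rate = row["rate"]
        price = PRICE_RE.search(rate)
        # Compare on the dollar amount when there is one, else on the raw cell text.
        needle = price.group(0) if price else rate
        if needle not in row["source_text"]:
            suspicious.append(i)
    return suspicious

# Shortened versions of the sample data above, with the faulty extraction in row 1.
rows = [
    {"source_text": "Super Adventure Toy Set Level II, with a price of $82.75 per day of usage.",
     "rate": "$82.75 per day"},
    {"source_text": "Ultimate Racing Car Track, with no fixed price as it may vary by location.",
     "rate": "$230.90 per set"},
    {"source_text": "Collector's Edition Action Figure Set Phase 2, priced at $230.90 per set.",
     "rate": "$230.90 per set"},
]

print(find_misaligned_rates(rows))  # [1]
```

A check like this does not fix the extraction, but it makes it easier to count how often values slip between rows before and after retraining.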

Has anyone experienced similar issues with table extraction in PDFs? Could this be related to the model’s settings, the structure of the documents, or something else? Any advice on how to resolve this or improve the accuracy of table extraction would be greatly appreciated.

Thank you in advance!

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. santoshkc 11,530 Reputation points Microsoft Vendor
    2024-12-19T14:51:16.3533333+00:00

    Hi @Pankaj Singh Negi,

    Thank you for your follow-up query.

    1. To improve model accuracy, include in the training set both variations of the same document (with slight differences in the tables) and a broader range of documents with diverse table structures. This helps the model handle slight variations and adapt to different table formats.
    2. Regarding training via the REST API: you can train custom models through the API by uploading your labeled PDFs and JSON files. For details, refer to the Azure Document Intelligence API documentation.
    3. For technical support or a model review, you can contact Azure support through the portal, especially if you're dealing with inconsistent table structures or scanned PDFs.
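
For point 2, the build request might be assembled roughly as follows. This is a minimal sketch, not an official sample: it assumes the v4.0 `documentModels:build` operation with API version `2024-11-30`, and that the labeled PDFs and JSON files already sit in a blob container reachable through a SAS URL. The endpoint, key, model ID, and container URL are placeholders. Check the route and body fields against the API reference for the version you use.

```python
import json
import urllib.request

def make_build_request(endpoint, api_key, model_id, container_sas_url,
                       build_mode="neural", api_version="2024-11-30"):
    """Prepare (but do not send) a custom-model build request."""
    url = (f"{endpoint.rstrip('/')}/documentintelligence/"
           f"documentModels:build?api-version={api_version}")
    body = {
        "modelId": model_id,
        "buildMode": build_mode,  # "template" or "neural"
        # Container holding the labeled training documents and their JSON files.
        "azureBlobSource": {"containerUrl": container_sas_url},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values; replace with your own resource details.
req = make_build_request(
    "https://example-resource.cognitiveservices.azure.com",
    "<your-key>",
    "toy-catalog-v2",
    "https://example.blob.core.windows.net/training?<sas-token>",
)
print(req.get_method(), req.full_url)
# To submit: urllib.request.urlopen(req). The build runs asynchronously, and the
# response carries an Operation-Location header to poll for the result.
```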

    Hope this helps. If you have any further queries, do let us know.

    If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful".

