Tables Extraction Using Custom Extraction Model (Merged Rows and Row Splitting Issues)

Question

Hello,

I am working on extracting tables from PDF documents using a custom extraction model. The documents vary: some contain tables, others have paragraphs, and one is a scanned PDF. On average, each PDF has about 16 pages, with tables typically found on the first 3 to 5 pages. I have annotated fields in the documents and am using a dynamic table field for table extraction.

However, the results are inconsistent. One document extracts as expected, but when I test a similar document, the extracted table merges some rows together and splits other single rows into multiple rows.

Additionally, I tested a structured PDF (with paragraphs) and observed unexpected behavior.

Sample Data:

Super Adventure Toy Set Level II involves specialized play features aimed at enhancing imaginative play for kids. Product Tags: 357, 389, 395, 590, 591, 399, 300, 295, 899, 296 are used for tracking, with a price of $82.75 per day of usage.

Ultimate Racing Car Track is designed to improve racing skills and provide hours of competitive fun. Product Tags: 575, 692, 682, 445, 597, 397, 239, 299, 585, 777, 491, 391, 395, 339, 494, 480, 590, 491 are used for this item, with no fixed price as it may vary by location.

Collector's Edition Action Figure Set Phase 2 includes rare and collectible action figures for advanced collectors. Product Tags: 34980, 48091, 48190, 44089, 45198, 46928, 47810, 37991, 48192, 48093, 89291, 349891, priced at $230.90 per set.

In this sample data, instead of showing "no fixed price" in the rate column, the output incorrectly displays the price of the next item ($230.90 per set).
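One way to catch this kind of shifted value is to cross-check each extracted rate against the text of its own source paragraph. The following is only a rough sketch under stated assumptions: the regular expressions are tuned to the sample wording above, and `parse_item` and `find_rate_mismatches` are hypothetical helper names, not part of any Azure SDK.

```python
import re

# Assumption: prices look like "$82.75 per day" or "$230.90 per set".
PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?(?:\s+per\s+\w+)?")
# Assumption: tags follow the literal label "Product Tags:" as a comma list.
TAGS_RE = re.compile(r"Product Tags:\s*((?:\d+\s*,\s*)*\d+)")


def parse_item(paragraph):
    """Read tags and rate from a single paragraph only, never its neighbors."""
    tags_match = TAGS_RE.search(paragraph)
    tags = [t.strip() for t in tags_match.group(1).split(",")] if tags_match else []
    price_match = PRICE_RE.search(paragraph)
    rate = price_match.group(0) if price_match else "no fixed price"
    return {"tags": tags, "rate": rate}


def find_rate_mismatches(paragraphs, extracted_rates):
    """Return indices of rows whose extracted rate disagrees with the source text."""
    mismatches = []
    for index, (paragraph, extracted) in enumerate(zip(paragraphs, extracted_rates)):
        if parse_item(paragraph)["rate"] != extracted:
            mismatches.append(index)
    return mismatches
```

Run against the three sample paragraphs, a check like this would flag the second row if the model returned "$230.90 per set" for it, because that paragraph contains no dollar amount of its own.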

Has anyone experienced similar issues with table extraction in PDFs? Could this be related to the model’s settings, the structure of the documents, or something else? Any advice on how to resolve this or improve the accuracy of table extraction would be greatly appreciated.

Thank you in advance!

Accepted Answer

Hi @Pankaj Singh Negi,

Thank you for your follow-up query.

To improve model accuracy, include two kinds of training documents: variations of the same document (with slight differences in the tables) and a broader range of documents with diverse table structures. This helps the model handle small variations while learning to adapt to different table formats.
Regarding training via the REST API: you can train custom models through the API by uploading your labeled PDFs and JSON files. For details, refer to the Azure Document Intelligence API documentation.
For technical support or a model review, you can contact Azure support through the portal, especially if you're dealing with inconsistent table structures or scanned PDFs.
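
As a rough illustration of the REST training call mentioned above, here is a minimal sketch that prepares a build-model request. It assumes the labeled PDFs and JSON files already sit in an Azure Blob container reachable through a SAS URL, and it assumes API version `2024-11-30` with the `documentintelligence/documentModels:build` path; older Form Recognizer versions use a different path, so check the documentation for your resource. The function names, endpoint, and model ID are placeholders.

```python
import json
import urllib.request

API_VERSION = "2024-11-30"  # assumption: adjust to the version your resource supports


def make_build_request(endpoint, api_key, model_id, container_sas_url, build_mode="template"):
    """Prepare (but do not send) a request that starts a custom model build."""
    if build_mode not in ("template", "neural"):
        raise ValueError("build_mode must be 'template' or 'neural'")
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels:build"
        f"?api-version={API_VERSION}"
    )
    body = {
        "modelId": model_id,
        "buildMode": build_mode,
        # Container holding the labeled training documents.
        "azureBlobSource": {"containerUrl": container_sas_url},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": api_key,
        },
        method="POST",
    )


def start_build(request):
    """Send the request; the service replies with a URL to poll for build status."""
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Operation-Location")
```

Calling `start_build(make_build_request(...))` with real values would kick off training; the returned operation URL can then be polled until the build finishes.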

Hope this helps. If you have any further questions, do let us know.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".
