Hello,
I am working on extracting tables from PDF documents using a custom extraction model. The documents vary: some contain tables, others have paragraphs, and one is a scanned PDF. On average, each PDF has about 16 pages, with tables typically found on the first 3 to 5 pages. I have annotated fields in the documents and am using a dynamic table field for table extraction.
However, I'm running into an issue: although I get the expected results on one document, testing on another similar document produces a table in which some rows are merged together and some single rows are split across multiple rows.
Additionally, I tested a structured PDF (with paragraphs) and observed unexpected behavior, which I describe after the sample data.
Sample Data:
Super Adventure Toy Set Level II involves specialized play features aimed at enhancing imaginative play for kids. Product Tags: 357, 389, 395, 590, 591, 399, 300, 295, 899, 296 are used for tracking, with a price of $82.75 per day of usage.
Ultimate Racing Car Track is designed to improve racing skills and provide hours of competitive fun. Product Tags: 575, 692, 682, 445, 597, 397, 239, 299, 585, 777, 491, 391, 395, 339, 494, 480, 590, 491 are used for this item, with no fixed price as it may vary by location.
Collector's Edition Action Figure Set Phase 2 includes rare and collectible action figures for advanced collectors. Product Tags: 34980, 48091, 48190, 44089, 45198, 46928, 47810, 37991, 48192, 48093, 89291, 349891, priced at $230.90 per set.
In this sample data, the second item (Ultimate Racing Car Track) has no fixed price. Instead of showing "no fixed price" in the rate column for that item, the output incorrectly displays the price of the next item ($230.90 per set).
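To make the expected result concrete, here is a minimal Python sketch of the per-item behavior I am after. It is only an illustration and not part of the extraction model: `parse_record`, `TAGS_RE`, and `PRICE_RE` are hypothetical helpers I made up, and the sketch assumes each item arrives as its own paragraph of text with a "Product Tags:" list. The point is that the rate is looked up only inside the item's own paragraph, so a missing price can never be filled from the next item.

```python
import re

# Hypothetical helpers for illustration only; not part of any extraction service.
# A comma-separated run of numbers following "Product Tags:".
TAGS_RE = re.compile(r"Product Tags:\s*((?:\d+\s*,\s*)*\d+)")
# A dollar amount, optionally followed by a unit such as "per day" or "per set".
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s+per\s+\w+)?")


def parse_record(paragraph):
    """Extract tags and rate from ONE item's paragraph, never looking past it."""
    tags_match = TAGS_RE.search(paragraph)
    if tags_match:
        tags = [t.strip() for t in tags_match.group(1).split(",")]
        tail = paragraph[tags_match.end():]  # text after the tag list
    else:
        tags, tail = [], paragraph

    price_match = PRICE_RE.search(tail)
    if price_match:
        rate = price_match.group(0)
    elif "no fixed price" in tail.lower():
        rate = "no fixed price"
    else:
        rate = None  # nothing found; leave empty rather than guess
    return {"tags": tags, "rate": rate}


paragraphs = [
    "Super Adventure Toy Set Level II involves specialized play features aimed at enhancing imaginative play for kids. Product Tags: 357, 389, 395, 590, 591, 399, 300, 295, 899, 296 are used for tracking, with a price of $82.75 per day of usage.",
    "Ultimate Racing Car Track is designed to improve racing skills and provide hours of competitive fun. Product Tags: 575, 692, 682, 445, 597, 397, 239, 299, 585, 777, 491, 391, 395, 339, 494, 480, 590, 491 are used for this item, with no fixed price as it may vary by location.",
    "Collector's Edition Action Figure Set Phase 2 includes rare and collectible action figures for advanced collectors. Product Tags: 34980, 48091, 48190, 44089, 45198, 46928, 47810, 37991, 48192, 48093, 89291, 349891, priced at $230.90 per set.",
]

results = [parse_record(p) for p in paragraphs]
for r in results:
    print(len(r["tags"]), "tags, rate:", r["rate"])
```

With this per-paragraph approach, the second item keeps "no fixed price" and the $230.90 rate stays with the third item, which is the output I expected from the model.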
Has anyone experienced similar issues with table extraction in PDFs? Could this be related to the model’s settings, the structure of the documents, or something else? Any advice on how to resolve this or improve the accuracy of table extraction would be greatly appreciated.
Thank you in advance!