Issues with Microsoft Syntex Document Processing Model: Incomplete Extraction for Multi-Page PDFs

Swami Nawale 20 Reputation points
2025-02-19T06:51:00.6266667+00:00

I'm facing several challenges with the Microsoft Syntex document processing model, particularly when dealing with multi-page PDFs and large tables. I'd appreciate any insights or suggestions from fellow users or Microsoft experts who may have encountered similar issues. Below are the specific problems I'm experiencing:

  1. Unsupported PDF Formats & Multiple Tables on a Single Page:
    • Some PDFs that I try to process seem to be in unsupported formats. In addition, pages containing multiple tables often result in extraction errors or incomplete data. Has anyone else encountered this with complex table layouts in PDFs, and what approaches have you used to resolve it?
  2. Data Extraction from Multi-Page PDFs:
    • When processing PDFs longer than two pages (e.g., six-page PDFs), the model often extracts data correctly from only the first two or three pages. The remaining pages, particularly those with tables spanning multiple pages, are either incomplete or entirely missing. Additionally, large tables (100+ rows) in multi-page PDFs tend to result in inaccurate extraction. Are there any best practices for handling these multi-page table scenarios?
  3. Automatic Processing Issues:
    • Sometimes, the Syntex model doesn't process files automatically. I have to manually select the file and click "Classify" to trigger processing. Is this a known issue, or is there something I might be missing in my setup?
  4. Model Publishing Delays:
    • After publishing changes to the model, it often takes an extended period (up to 30 minutes or more) for the new model to start processing files. In some cases, the files aren't processed at all. Has anyone experienced similar delays after publishing a model, and what could be causing this?
  5. Low Confidence Scores for Multi-Page PDFs:
    • When processing multi-page PDFs with tables, the model returns low confidence scores (below 60%). What steps can I take to improve these accuracy scores, particularly for documents with complex table structures?

Looking forward to your thoughts.


Accepted answer
  1. Yanli Jiang - MSFT 29,686 Reputation points Microsoft Vendor
    2025-02-20T09:39:57.9133333+00:00

    Hi @Swami Nawale ,

    Welcome to the Q&A forum!

    1. Unsupported PDF Formats & Multiple Tables on a Single Page
    • Ensure PDFs Are in a Supported Format: Syntex works best with PDFs that have an embedded text layer. If your PDFs are scanned images or generated in a non‑standard format (for example, using unusual fonts or encodings), the model may have difficulty extracting data. Consider pre‑processing these files with an OCR tool (such as Adobe Acrobat’s OCR or a dedicated OCR service) to convert them into searchable, text‑based PDFs.
    • Simplify the Layout: When a page contains multiple tables, the extraction engine may “get confused” about where one table ends and another begins. If possible, reformat the source documents to separate tables onto different pages or reduce overlapping elements. In cases where you can’t change the source, you might try splitting the page into individual sections before ingestion.
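
A quick way to find out which files need OCR before they reach Syntex is to check whether each page yields any embedded text. The sketch below is only an illustration: the `min_chars` threshold and both function names are assumptions, and the optional helper relies on the third-party `pypdf` package, which is not part of Syntex.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    A page with (almost) no extractable text is probably a scanned image
    and is a candidate for OCR. The threshold is an arbitrary assumption.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the embedded text of every page (requires `pip install pypdf`)."""
    from pypdf import PdfReader  # third-party, imported lazily

    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]
```

For example, `pages_needing_ocr(extract_page_texts("invoice.pdf"))` lists the pages worth sending through OCR first; the file name here is a placeholder.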

    2. Data Extraction from Multi-Page PDFs
    • Split or Pre-Process Documents: If the model consistently extracts data from only the first few pages, consider splitting large PDFs into smaller chunks (for example, one PDF per page or per table section). This can help the model process each segment individually, which may lead to more complete extraction.
    • Augment Your Training Samples: Syntex’s extraction quality depends on the examples you provide during training. Include multi‑page documents and samples with large tables (100+ rows) in your training set so the model learns the structure and nuances of your documents.
    • Ensure Consistent Formatting: Inconsistent table structures (such as merged cells, variable column widths, or split headers) can cause extraction errors. If you have control over the document generation process, standardizing table formats can improve the consistency of extraction.
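
The splitting suggestion can be sketched as follows. This is a minimal example under assumptions, not a Syntex feature: the function names and the output naming scheme are made up for illustration, and writing the chunks uses the third-party `pypdf` package. Page ranges are zero-based and half-open, like Python slices.

```python
from typing import List, Tuple


def chunk_page_ranges(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Return (start, stop) page index pairs covering all pages in order."""
    if total_pages < 0 or pages_per_chunk < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk >= 1")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


def split_pdf(src_path: str, out_prefix: str, pages_per_chunk: int = 2) -> List[str]:
    """Write one PDF per chunk and return the output paths (requires pypdf)."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported lazily

    reader = PdfReader(src_path)
    out_paths = []
    for start, stop in chunk_page_ranges(len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. report_p1-2.pdf
        out_path = f"{out_prefix}_p{start + 1}-{stop}.pdf"
        with open(out_path, "wb") as handle:
            writer.write(handle)
        out_paths.append(out_path)
    return out_paths
```

Fixed-size chunks can separate a table's header row from its continuation pages, so for tables that span pages it may be better to choose the split points per table section rather than per page count.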


    3. Automatic Processing Issues
    • Review Library and Content Type Settings: Automatic classification and processing in Syntex are driven by configuration on your document libraries. Confirm that your library is properly configured to trigger Syntex processing—for example, ensuring the correct content type or metadata is set to prompt the model.
    • Monitor for Known Service Issues: Check the Microsoft 365 Message Center for any advisories. In some cases, intermittent issues with the Syntex service may require a manual “Classify” action. Ensuring you’re running the latest service updates might help.

    4. Model Publishing Delays
    • Expect Propagation Delays: After publishing changes to your Syntex model, it may take up to 30 minutes (or sometimes longer) for the new configuration to be fully propagated. This delay can be due to internal queueing and processing within the Syntex pipeline.
    • Plan Updates During Off-Peak Hours: If possible, schedule model changes when there is less activity so that the delay doesn’t impact critical processing windows.
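
If an automated workflow depends on the newly published model, one option is to poll for the result with a timeout longer than the expected propagation delay instead of assuming the model is live immediately. The sketch below is generic and makes no Syntex API calls: `is_processed` is a hypothetical callback you would supply yourself (for example, a check that a library column has been filled in), and the default timeout and interval are assumptions.

```python
import time
from typing import Callable


def wait_until_processed(
    is_processed: Callable[[], bool],
    timeout_s: float = 45 * 60,   # assumed: a bit longer than the ~30 min delay
    interval_s: float = 60,       # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_processed() until it returns True or the timeout elapses.

    Returns True if processing was observed, False on timeout.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_processed():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A `False` result after the timeout is a reasonable point to fall back to the manual "Classify" action or to check the Message Center for advisories.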

    5. Low Confidence Scores for Multi-Page PDFs
    • Improve Document Quality: Ensure your PDFs are high resolution and that the text is clear. Poor quality scans can lead to lower confidence scores.
    • Increase Training Data: Augment your model’s training set with a variety of multi‑page PDFs—especially ones with complex tables—to help the model better understand how to extract data from these layouts.
    • Use Feedback Loops: Take advantage of any “feedback” mechanism in the Syntex UI to correct extraction errors. Over time, this feedback can improve model performance for similar documents.
    • Custom Extraction Rules: If your documents have predictable patterns (for instance, large tables always follow a specific header format), consider exploring whether you can apply custom extraction rules or templates within Syntex to improve accuracy.
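
While the model is being improved, low-confidence extractions can be routed to manual review rather than accepted as they are. The following sketch assumes a hypothetical data shape (field name mapped to a value and a confidence between 0 and 1); it is not the format Syntex returns, and the 0.60 threshold simply mirrors the 60% figure from the question.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    fields: Dict[str, Tuple[str, float]],
    threshold: float = 0.60,
) -> Tuple[Dict[str, str], List[str]]:
    """Separate extracted fields into accepted values and names needing review.

    fields maps a field name to (extracted value, confidence in [0, 1]).
    Fields at or above the threshold are accepted; the rest are flagged.
    """
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    for name, (value, confidence) in fields.items():
        if confidence >= threshold:
            accepted[name] = value
        else:
            needs_review.append(name)
    return accepted, sorted(needs_review)
```

The flagged field names also indicate which documents are worth adding to the training set, since they show where the model struggles.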

    These challenges are not uncommon when processing complex documents with Syntex.

    Hope this helps.

    Please do let us know if you have any further queries.

    Kindly consider accepting the answer if the information provided is helpful. This can assist other community members in resolving similar issues.


  1. Swami Nawale 20 Reputation points
    2025-02-26T09:19:45.9766667+00:00

    Thank you so much for the detailed explanation and the helpful workaround! I have made adjustments to the model based on your suggestions, and I am getting the expected results.

    Thanks again!

