Hello Aleksandr Kodiakov,
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand that you are getting an unstable table layout when the table has merged column headers.
This issue with Azure Document Intelligence (formerly Form Recognizer) often arises because the layout model struggles with merged cells or uneven alignment, which leads to inconsistent table extraction. The full solution is too long for this page, so please read the points below carefully and follow the links for detailed steps and configuration.
- You will need to use alternative methods to preprocess the PDFs so that the input is consistent, such as the following:
  - Camelot Library: https://camelot-py.readthedocs.io
  - Tabula-py: https://github.com/tabulapdf/tabula-py
  - pdfplumber: https://github.com/jsvine/pdfplumber
  - PyMuPDF (fitz): https://pymupdf.readthedocs.io
- The Azure Document Intelligence layout model documentation explains how to use the layout model for table extraction, including how it handles table structure and cell alignment: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout
- Merging cross-page tables with Document Intelligence: detailed techniques for reconstructing tables that span multiple pages are described here: https://techcommunity.microsoft.com/t5/ai-blog/a-heuristic-method-of-merging-cross-page-tables-based-on/ba-p/4118126
- Enhanced table extraction with custom models: this article covers strategies for handling complex tables and customizing extraction: https://techcommunity.microsoft.com/t5/ai-blog/enhanced-table-extraction-from-documents-with-form-recognizer/ba-p/2058011
- Unstructured's guide to processing PDFs in Python, for extracting data from PDFs with custom Python logic: https://unstructured.io/blog/how-to-process-pdf-in-python
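As one illustration of the "custom Python logic" idea, here is a minimal sketch of flattening merged column headers after extraction, so that every column ends up with a single stable name. It is not taken from any of the linked articles. It assumes each cell is a plain dict shaped like the cells in the layout model's JSON output (`rowIndex`, `columnIndex`, `content`, optional `rowSpan`, `columnSpan`, and `kind`); the function names `expand_cells` and `flatten_headers` and the sample data are hypothetical.

```python
from typing import List, Tuple


def expand_cells(cells: List[dict], row_count: int, column_count: int) -> List[List[str]]:
    """Copy each cell's text into every grid position that its span covers."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        # Assumption: a missing rowSpan/columnSpan means the cell is not merged.
        row_span = cell.get("rowSpan", 1)
        col_span = cell.get("columnSpan", 1)
        text = cell.get("content", "").strip()
        for r in range(cell["rowIndex"], cell["rowIndex"] + row_span):
            for c in range(cell["columnIndex"], cell["columnIndex"] + col_span):
                if r < row_count and c < column_count:
                    grid[r][c] = text
    return grid


def flatten_headers(
    cells: List[dict], row_count: int, column_count: int
) -> Tuple[List[str], List[List[str]]]:
    """Return (column_names, data_rows) with multi-row headers joined per column."""
    grid = expand_cells(cells, row_count, column_count)
    # Every row touched by a cell marked kind == "columnHeader" is a header row.
    header_rows = sorted(
        {
            r
            for cell in cells
            if cell.get("kind") == "columnHeader"
            for r in range(cell["rowIndex"], cell["rowIndex"] + cell.get("rowSpan", 1))
        }
    )
    names = []
    for c in range(column_count):
        parts: List[str] = []
        for r in header_rows:
            text = grid[r][c]
            # Skip blanks and the repeats created by vertically merged cells.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        names.append(" / ".join(parts) or f"column_{c}")
    data_rows = [grid[r] for r in range(row_count) if r not in header_rows]
    return names, data_rows


# Hypothetical sample: "Region" spans two rows, "Sales" spans two columns.
sample_cells = [
    {"kind": "columnHeader", "rowIndex": 0, "columnIndex": 0, "rowSpan": 2, "content": "Region"},
    {"kind": "columnHeader", "rowIndex": 0, "columnIndex": 1, "columnSpan": 2, "content": "Sales"},
    {"kind": "columnHeader", "rowIndex": 1, "columnIndex": 1, "content": "Q1"},
    {"kind": "columnHeader", "rowIndex": 1, "columnIndex": 2, "content": "Q2"},
    {"rowIndex": 2, "columnIndex": 0, "content": "North"},
    {"rowIndex": 2, "columnIndex": 1, "content": "10"},
    {"rowIndex": 2, "columnIndex": 2, "content": "20"},
]

names, rows = flatten_headers(sample_cells, row_count=3, column_count=3)
print(names)  # ['Region', 'Sales / Q1', 'Sales / Q2']
print(rows)   # [['North', '10', '20']]
```

Joining the header rows per column means the column names stay the same even if the merged cell is reported with a slightly different span between runs, which may help when the downstream code looks up columns by name. Please verify the field names against the JSON your API version actually returns before relying on this.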
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close out the thread by upvoting this reply and accepting it as an answer if it is helpful.