I'm using the Azure Document Intelligence service to analyze different types of documents. I set the output content format to Markdown so that the result carries more information about the structure and formatting of the document.
To get the result as Markdown in Python, I use the following code:
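import os

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeResult, ContentFormat
from azure.core.credentials import AzureKeyCredential

# Endpoint and key are read from environment variables.
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=os.getenv("DOCUMENT_INTELLIGENCE_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("DOCUMENT_INTELLIGENCE_KEY")),
)
# Analyze with the layout model and request Markdown output (called from inside a class method).
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    AnalyzeDocumentRequest(bytes_source=self.__file_bytes.read()),
    output_content_format=ContentFormat.MARKDOWN,
)
result: AnalyzeResult = poller.result()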
This code runs without errors and returns correctly formatted Markdown text.
If the original document has a footer containing the page number, the Markdown result includes a specific tag with the detected page number.
Document footer: (image not included)
Markdown result tag: (image not included)
However, if the original document does not have a footer with the page number, the Markdown result contains no indication of where one page ends and the next begins. The whole document comes back as a single block of text.
Looking at the JSON structure returned by the Document Intelligence service, I saw that the document is also broken down into pages, and each page into lines.
(screenshot in python code)
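To illustrate how that page structure might be mapped back onto the Markdown text, here is a minimal sketch. It assumes that each page entry carries a `spans` list of `offset`/`length` pairs and that those offsets index into the top-level `content` string. The dictionary layout, the sample data, and the helper name `split_markdown_by_page` are all hypothetical and not taken from real service output.

```python
from typing import Any


def split_markdown_by_page(result: dict[str, Any]) -> dict[int, str]:
    """Return {page_number: markdown_text} by slicing `content` with page spans."""
    content = result["content"]
    pages: dict[int, str] = {}
    for page in result.get("pages", []):
        # Assumption: a page may have several spans; concatenate them in order.
        parts = [
            content[span["offset"] : span["offset"] + span["length"]]
            for span in page.get("spans", [])
        ]
        pages[page["pageNumber"]] = "".join(parts)
    return pages


# Hand-made sample shaped like the JSON result (not real service output).
sample = {
    "content": "# Title\n\nFirst page.\n\nSecond page.",
    "pages": [
        {"pageNumber": 1, "spans": [{"offset": 0, "length": 22}]},
        {"pageNumber": 2, "spans": [{"offset": 22, "length": 12}]},
    ],
}

pages = split_markdown_by_page(sample)
for number, text in pages.items():
    print(f"--- page {number} ---")
    print(text)
```

Whether the offsets line up with Python string indices depends on the unit the service counts them in, so the slices should be checked against a real result, especially for documents with non-ASCII characters.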
I tried to rebuild the pages from the "content" property of the line elements, but the resulting text does not match the full Markdown text.
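The attempt described above can be sketched roughly as follows. This is only an illustration on a hand-made dictionary shaped like the JSON result; the helper name `rebuild_pages_from_lines` and the sample data are hypothetical. It shows why the rebuilt text differs: the lines carry plain text only, without Markdown markup such as heading markers.

```python
from typing import Any


def rebuild_pages_from_lines(result: dict[str, Any]) -> dict[int, str]:
    """Join each page's line `content` values with newlines."""
    return {
        page["pageNumber"]: "\n".join(
            line["content"] for line in page.get("lines", [])
        )
        for page in result.get("pages", [])
    }


# Hand-made sample (not real service output): the Markdown has a heading
# marker and a blank line that the individual lines do not carry.
sample = {
    "content": "# Title\n\nBody text.",
    "pages": [
        {
            "pageNumber": 1,
            "lines": [{"content": "Title"}, {"content": "Body text."}],
        },
    ],
}

rebuilt = rebuild_pages_from_lines(sample)
print(rebuilt[1])
print(rebuilt[1] == sample["content"])
```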
Furthermore, if a page ends with a table or an image just before the footer with the page number, the page number is not the last actual line of that page in the result. This makes it difficult to design an algorithm that identifies the end of each page in the Markdown result.
Has anyone encountered the same problem? Can anyone recommend an effective method for splitting the Markdown result while preserving the page subdivision of the original document?
Hi!
No, I haven't found a solution to the problem yet. I will open a support request to find out more.