Azure Document Intelligence: Custom Model Fails to Separate Adjacent Document Number and Date

Suma Sai Paluri 0 Reputation points
2025-02-21T15:06:48.36+00:00

Hello Microsoft Community,

I'm encountering an issue with Azure Document Intelligence (API version 2024-11-30-GA, 4.0 General Availability) when trying to extract the Document Number and Date from documents where these fields are adjacent and always appear in the same format. I'm using a custom-trained neural model.

Problem Description:

I have trained two separate labels: "Document Number" and "Date". However, when processing documents, the model often incorrectly extracts the entire line (both the document number and the date) as just the "Document Number." It seems to be having trouble distinguishing between the two fields.

The format of the document is very consistent across all training and test documents. The Document Number and Date are always presented in the same order and with the same separator (a slash).

Image February 20, 2025 - 6:44PM.png

Since the two are recognized as a single value, I used the Draw region feature to label them as two different entities.

Example:

Image February 20, 2025 - 6:48PM.png

As you can see in the image, the "Document Number" should be "0094140914", and the "Date" should be "06.08.2024". Currently, the model is sometimes extracting "0094140914 06.08.2024" as the "Document Number" instead of correctly separating the two fields.
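As a stopgap while the labeling issue is unresolved, the merged value could be split in post-processing. Below is a minimal sketch of that idea, not part of Document Intelligence itself. It assumes the number is a plain run of digits, the date is day-first (DD.MM.YYYY), and the two are separated by whitespace and/or a slash; the names `SplitResult` and `split_number_and_date` are hypothetical.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class SplitResult(NamedTuple):
    document_number: str
    document_date: Optional[date]


# Assumed layout: digits, then whitespace and/or a slash, then DD.MM.YYYY.
_COMBINED = re.compile(r"^\s*(\d+)[\s/]+(\d{2}\.\d{2}\.\d{4})\s*$")


def split_number_and_date(raw: str) -> SplitResult:
    """Split a merged 'Document Number' value into number and date.

    If the text does not match the assumed layout, or the date part is
    not a valid calendar date, the stripped input is returned unchanged
    as the number and the date is None.
    """
    match = _COMBINED.match(raw)
    if match is None:
        return SplitResult(raw.strip(), None)
    number, date_text = match.groups()
    try:
        parsed = datetime.strptime(date_text, "%d.%m.%Y").date()
    except ValueError:
        return SplitResult(raw.strip(), None)
    return SplitResult(number, parsed)


result = split_number_and_date("0094140914 06.08.2024")
print(result.document_number, result.document_date)
```

Values that the model already separates correctly pass through untouched, so the function can be applied to every extracted "Document Number" without checking first.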

User's image

What I've Tried:

  • Separate Labeling: I've confirmed that I'm labeling the Document Number and Date as distinct regions in my training documents using Document Intelligence Studio.

  • Training Data Review: I've checked for labeling errors and ensured that each field is labeled correctly.

  • Multiple Training Documents: I have trained the model on 105 training documents.

Questions:

Is this a known limitation or issue with Azure Document Intelligence, particularly when using custom neural models and dealing with adjacent, consistently formatted fields?

Are there any specific best practices or techniques to improve the extraction accuracy in this scenario?

Is there a recommended way to define a delimiter or separator between the two fields to help the model distinguish them?

Are there other recommendations for me to test the model?

Any insights or suggestions would be greatly appreciated. Thank you!

Azure AI Document Intelligence