Azure Document Intelligence: Custom Model Fails to Separate Adjacent Document Number and Date

Suma Sai Paluri 0

Hello Microsoft Community,

I'm encountering an issue with Azure Document Intelligence (API version 2024-11-30-GA, 4.0 General Availability) when trying to extract the Document Number and Date from documents where these fields are adjacent and always present in the same format. I'm using a custom-trained neural model.

Problem Description:

I have trained two separate labels: "Document Number" and "Date". However, when processing documents, the model often incorrectly extracts the entire line (both the document number and the date) as just the "Document Number." It seems to be having trouble distinguishing between the two fields.

The format of the document is very consistent across all training and test documents. The Document Number and Date are always presented in the same order and with the same separator (a slash).

Image February 20, 2025 - 6:44PM.png

Since they are detected as one value, I have used the Draw region feature to label them as two different entities.

Example:

Image February 20, 2025 - 6:48PM.png

As you can see in the image, the "Document Number" should be "0094140914", and the "Date" should be "06.08.2024". Currently, the model is sometimes extracting "0094140914 06.08.2024" as the "Document Number" instead of correctly separating the two fields.

User's image

What I've Tried:

Separate Labeling: I've confirmed that I'm labeling the Document Number and Date as distinct regions in my training documents using Document Intelligence Studio.

Training Data Review: I've checked for labeling errors and ensured that each field is labeled correctly.

Multiple Training Documents: I have trained the model on a significant number of training documents (exactly 105).

Questions:

Is this a known limitation or issue with Azure Document Intelligence, particularly when using custom neural models and dealing with adjacent, consistently formatted fields?

Are there any specific best practices or techniques to improve the extraction accuracy in this scenario?

Is there a recommended way to define a delimiter or separator between the two fields to help the model distinguish them?

Are there other recommendations for me to test the model?

Any insights or suggestions would be greatly appreciated. Thank you!

Pavankumar Purilla 3,715 Reputation points Microsoft Vendor

2025-02-24T18:41:48.4333333+00:00

Hi Suma Paluri,
Greetings & Welcome to the Microsoft Q&A forum! Thank you for sharing your query.
Azure Document Intelligence's custom neural models can sometimes struggle with adjacent fields, especially if they are consistently formatted and close together.

Since you're already using the Draw region feature, try using the Bounding Box method to ensure the labels are tightly placed around each field.

Avoid overlapping labeled regions. If labels slightly overlap, the model may treat them as a single entity.
If the training data is too consistent, consider adding slight variations to the document structure. This can improve model generalization.
Use Key-Value Pairing instead of plain labels to help the model recognize the relationship between the "Document Number" and "Date" fields more effectively.
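
As a rough way to check the "no overlapping regions" advice above, the labeled regions could be compared programmatically. The sketch below assumes each region is available as a flat list of eight numbers (x1, y1, ..., x4, y4), and the helper names and sample coordinates are hypothetical, chosen only for illustration. It compares axis-aligned bounding boxes, so it is only an approximation for rotated regions.

```python
def to_axis_aligned_box(polygon):
    """Convert a flat [x1, y1, ..., x4, y4] polygon to (min_x, min_y, max_x, max_y)."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def regions_overlap(polygon_a, polygon_b):
    """Return True if the axis-aligned boxes of the two regions share any area.

    Regions that only touch along an edge are not counted as overlapping.
    """
    ax0, ay0, ax1, ay1 = to_axis_aligned_box(polygon_a)
    bx0, by0, bx1, by1 = to_axis_aligned_box(polygon_b)
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


# Hypothetical coordinates for the two drawn regions on one page.
document_number_region = [1.0, 1.0, 2.0, 1.0, 2.0, 1.2, 1.0, 1.2]
date_region = [2.1, 1.0, 3.0, 1.0, 3.0, 1.2, 2.1, 1.2]

if regions_overlap(document_number_region, date_region):
    print("Regions overlap; consider redrawing them more tightly.")
else:
    print("Regions do not overlap.")
```

Running a check like this over every training document may help spot a few loosely drawn regions that are hard to see by eye in the Studio.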

I hope this information helps.
Suma Sai Paluri 0 Reputation points

2025-02-25T14:55:51.51+00:00

Hi @Pavankumar Purilla, thank you very much for your suggestion. Despite carefully using Draw region (like bounding boxes) and following key-value-pair-style labeling, the problem still persists. I have created a new model with fewer documents and also tried to reduce the consistency of the formatting by including a few files in the training data where only the date or only the document number is labelled.

Can you please help me identify any other solution to the problem?

Thanks

Suma
Pavankumar Purilla 3,715 Reputation points Microsoft Vendor

2025-02-26T00:01:16.58+00:00

Hi Suma Paluri,
Since the issue persists despite using bounding boxes, key-value pairs, and varied training data, you can try a few more approaches. First, consider training separate models—one for "Document Number" and another for "Date"—to see if this improves accuracy. If possible, modify input documents by increasing space between fields or adding separators like | or : to help the model distinguish them. If extraction errors still occur, use a post-processing step to split merged values using regex. You can also test Azure’s prebuilt invoice or identity models to check if they perform better. Finally, if post-processing is an option, integrating Azure AI Search with regex-based extraction can refine results.
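
The regex post-processing step mentioned above could look roughly like the following. This is a minimal sketch, not an official Document Intelligence feature: the function name and the pattern are assumptions based on the example value in this thread (a run of digits followed by a DD.MM.YYYY date, with an optional slash or whitespace between them). It would be applied to the string the model returns for the "Document Number" field.

```python
import re
from datetime import datetime

# Assumed format: digits, then an optional slash and/or whitespace,
# then a DD.MM.YYYY date. Adjust the pattern to the real documents.
MERGED_PATTERN = re.compile(
    r"^\s*(?P<number>\d+)\s*/?\s*(?P<date>\d{2}\.\d{2}\.\d{4})\s*$"
)


def split_document_number_and_date(value):
    """Split a merged field value into (document_number, date_text).

    Returns (stripped value, None) when the value does not look like a
    merged number and date, so correctly extracted values pass through.
    """
    match = MERGED_PATTERN.match(value)
    if match is None:
        return value.strip(), None
    date_text = match.group("date")
    try:
        # Reject strings that look like dates but are not valid calendar dates.
        datetime.strptime(date_text, "%d.%m.%Y")
    except ValueError:
        return value.strip(), None
    return match.group("number"), date_text


print(split_document_number_and_date("0094140914 06.08.2024"))
```

Because the date part has a fixed shape, the same pattern also separates the two values when there is no space or delimiter between them, as long as the document number consists only of digits.
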
I hope this information helps.
Pavankumar Purilla 3,715 Reputation points Microsoft Vendor

2025-02-26T17:02:08.8066667+00:00

Hi Suma Paluri,
Just following up to see if you had a chance to review the above response. Thank you!
Suma Sai Paluri 0 Reputation points

2025-02-28T09:53:52.0133333+00:00

Thank you for the suggestions, @Pavankumar Purilla. I followed your advice and trained individual models for Document Number and Date, but each individual model fails to identify its entity in the first place. The second option, modifying the documents by increasing the space or adding delimiters, is not possible because the documents coming into this model will not contain any delimiters or extra space. Therefore I am still trying to find a solution to the problem.
