Hi Andrei Pruteanu,
Welcome to Microsoft Q&A forum. Thank you for posting your query.
From the images you provided, I understand that you are using the prebuilt contract model in Azure Document Intelligence Studio to extract entities from a legal agreement. However, the model is returning duplicate entities on the same tokens, specifically for the Effective Date and Renewal Date fields. Both extracted entities have the same confidence score (99.90%), and one of them is incorrectly labelled.
Some Possible Reasons for Duplicate Entities:
Overlapping Fields in the Prebuilt Model:
The contract model may have predefined labels for similar types of dates (e.g., Effective Date, Expiry Date, and Renewal Date).
The model sometimes assigns multiple labels to the same token when it cannot clearly distinguish between them.
Ambiguity in the Textual Structure:
The agreement mentions multiple dates within close proximity, which may confuse the model.
The model may incorrectly classify the same date under multiple categories due to lack of context.
OCR Errors or Text Positioning Issues:
If the OCR engine detects text slightly differently (e.g., extra spacing, punctuation variations), it may lead to duplicate extractions.
Prebuilt Model's Generalization Issue:
Since prebuilt models are trained on generic datasets, they may not always differentiate between context-specific terms, leading to overlapping entities.
Solutions to Resolve the Issue:
Adjust Model Parameters in Document Intelligence Studio:
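If the Studio or API exposes relevant processing options for your scenario (e.g., OCR-related settings), adjust them and check whether entity recognition improves. A confidence threshold is usually something you apply yourself to the confidence value returned with each extracted field.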
You can also manually validate entities and fine-tune the labelling logic.
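A confidence threshold can be applied on the client side once the results are returned. The following is only a minimal sketch: it assumes the extracted fields have already been copied into plain dictionaries with hypothetical `name`, `value` and `confidence` keys (this is not the SDK's own object model), and the 0.8 threshold is an arbitrary example value.

```python
def split_by_confidence(fields, threshold=0.8):
    """Separate fields into accepted ones and ones that need manual review."""
    accepted, needs_review = [], []
    for field in fields:
        if field["confidence"] >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review
```

Note that a threshold alone will not separate the two date entities in your case, since both are reported at 99.90%.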
Manually Review & Fine-Tune Extraction:
If your use case allows, use human-in-the-loop validation to correct incorrect entity assignments.
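Document Intelligence Studio provides a labelling experience for custom models, where you can review and correct labels and then retrain the model. Note that the prebuilt model itself cannot be retrained.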
Post-Processing Logic to Remove Duplicates:
Implement a post-processing script to filter out incorrect duplicate entities.
Logic:
Compare extracted values for Effective Date and Renewal Date.
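If both have the same value and the same confidence score, the score cannot break the tie, so retain only one based on additional logic (e.g., a fixed priority between the two fields, or the position of the date relative to nearby wording in the text).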
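The logic above could be sketched roughly as follows. This is an illustration under stated assumptions, not the service's own behaviour: the fields are assumed to have been copied into plain dictionaries with hypothetical `name`, `value`, `confidence`, `offset` and `length` keys (offset and length being character positions in the document text), and the `keywords` mapping is a made-up heuristic you would tune for your contracts.

```python
def spans_overlap(a, b):
    """True when two fields cover at least one common character."""
    return (a["offset"] < b["offset"] + b["length"]
            and b["offset"] < a["offset"] + a["length"])


def keyword_distance(content, field, keyword, window=80):
    """Characters between the nearest preceding keyword and the field, or None."""
    start = max(0, field["offset"] - window)
    context = content[start:field["offset"]].lower()
    pos = context.rfind(keyword.lower())
    return None if pos == -1 else len(context) - pos


def resolve_duplicates(content, fields, keywords, window=80):
    """Keep one field per overlapping span.

    Higher confidence wins; on a tie, the field whose keyword sits closest
    before the span wins; if still undecided, the first field is kept.
    """
    def rank(field):
        keyword = keywords.get(field["name"])
        dist = keyword_distance(content, field, keyword, window) if keyword else None
        # Smaller tuple is better: higher confidence, then shorter distance.
        return (-field["confidence"], dist if dist is not None else float("inf"))

    kept = []
    for field in fields:
        for i, other in enumerate(kept):
            if spans_overlap(field, other):
                if rank(field) < rank(other):
                    kept[i] = field
                break
        else:
            kept.append(field)
    return kept


# Example with invented contract text and field values.
content = "This Agreement shall renew on 1 January 2026 unless terminated."
offset = content.index("1 January 2026")
fields = [
    {"name": "EffectiveDate", "value": "1 January 2026",
     "confidence": 0.999, "offset": offset, "length": 14},
    {"name": "RenewalDate", "value": "1 January 2026",
     "confidence": 0.999, "offset": offset, "length": 14},
]
keywords = {"EffectiveDate": "effective", "RenewalDate": "renew"}  # hypothetical
resolved = resolve_duplicates(content, fields, keywords)
print([f["name"] for f in resolved])  # ['RenewalDate']
```

Comparing spans rather than only values avoids dropping a field when the same date legitimately appears in two different places in the agreement.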
Use the Custom Model Instead of the Prebuilt Model:
If this issue persists frequently, train a custom model on your dataset with specific labels for Effective Date and Renewal Date.
A custom model can improve accuracy by understanding your specific contract format.
Hope this helps. Do let us know if you have any further queries.
------------
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".
Thank you.