Decent results using the Custom Classification model for invoices got worse after increasing the number of training documents

Patrick Gonzalez 0 Reputation points
2025-02-04T02:20:23.25+00:00

We initially trained a model to detect invoices using 10 documents. The results were decent, but some documents that were obviously invoices to a human received very low confidence scores. We then increased the number of training documents to 99, and the model began erring in the opposite direction: documents that were obviously not invoices were classified as invoices with very high confidence. Has anyone else seen this behavior?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Vikram Singh 1,630 Reputation points Microsoft Employee
    2025-02-05T09:37:55.14+00:00

    Hi Patrick Gonzalez,

    I understand your frustration with the model's performance after increasing the number of training documents. Let's delve a bit deeper into the issue and explore some additional strategies to improve your custom classification model for invoices.

    When you initially trained the model with 10 documents, it may have performed decently because it had a limited but specific set of data to learn from. When you increased the number of training documents to 99, the model likely encountered more variability and noise, which can lead to overfitting or misclassification. Here are some steps to address this:

    1. Incremental Training: Instead of training the model with all 99 documents at once, try incremental training. Start with a smaller subset of high-quality, diverse documents and gradually add more data while monitoring the model's performance. This approach can help the model adapt better to new data without being overwhelmed.
    2. Data Augmentation: Enhance your training dataset by including variations of the same document type. This can involve slight modifications in layout, text, or format. Data augmentation helps the model generalize better and reduces the risk of overfitting.
    3. Feature Engineering: Focus on extracting more relevant features from your documents. For instance, consider using additional metadata or contextual information that can help the model distinguish between invoices and non-invoices more accurately.
    4. Model Versioning and Monitoring: Keep track of different versions of your model and their performance metrics. This allows you to compare and roll back to a previous version if the new one doesn't perform as expected. Additionally, continuously monitor the model's performance in production to detect any degradation over time.
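
As a rough illustration of point 4, the sketch below compares two model versions on the same held-out set of labeled documents. It does not call the Document Intelligence API; it assumes you have already collected each version's predicted label and confidence yourself. The names `Prediction`, `invoice_metrics`, and `pick_version`, the version labels, the thresholds, and all sample values are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    true_label: str       # label assigned by a human reviewer
    predicted_label: str  # label returned by the classifier
    confidence: float     # confidence returned by the classifier


def invoice_metrics(preds, positive="invoice", threshold=0.5):
    """Precision and recall for the positive class at a confidence threshold."""
    tp = fp = fn = 0
    for p in preds:
        predicted_pos = p.predicted_label == positive and p.confidence >= threshold
        actual_pos = p.true_label == positive
        if predicted_pos and actual_pos:
            tp += 1
        elif predicted_pos:
            fp += 1  # non-invoice accepted as an invoice
        elif actual_pos:
            fn += 1  # real invoice missed or scored below the threshold
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


def pick_version(metrics_by_version, min_precision=0.9):
    """Highest-recall version among those meeting min_precision, else None."""
    eligible = {v: m for v, m in metrics_by_version.items()
                if m["precision"] >= min_precision}
    if not eligible:
        return None
    return max(eligible, key=lambda v: eligible[v]["recall"])


# Made-up held-out results that mimic the two symptoms described in the question.
v1_preds = [  # hypothetical model trained on 10 documents
    Prediction("invoice", "invoice", 0.95),
    Prediction("invoice", "invoice", 0.30),  # obvious invoice, low confidence
    Prediction("invoice", "invoice", 0.92),
    Prediction("receipt", "receipt", 0.88),
    Prediction("letter", "letter", 0.90),
]
v2_preds = [  # hypothetical model trained on 99 documents
    Prediction("invoice", "invoice", 0.99),
    Prediction("invoice", "invoice", 0.97),
    Prediction("invoice", "invoice", 0.96),
    Prediction("receipt", "invoice", 0.98),  # non-invoice, high confidence
    Prediction("letter", "invoice", 0.95),   # non-invoice, high confidence
]

metrics = {
    "v1-10-docs": invoice_metrics(v1_preds),
    "v2-99-docs": invoice_metrics(v2_preds),
}
for version, m in metrics.items():
    print(version, m)
print("keep:", pick_version(metrics))
```

Tracking precision and recall separately, rather than a single accuracy figure, makes the two failure modes visible: low-confidence invoices lower recall, while confidently misclassified non-invoices lower precision. The same comparison can be rerun after each batch of added training documents (point 1) to see at which point performance starts to degrade.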

    For more detailed guidance, see the "Transparency note for Document Intelligence" article for Azure AI services on Microsoft Learn.

    I hope these suggestions help you improve your model's performance. If you have any further questions or need additional assistance, please feel free to ask.

    If this answer is helpful, please mark it as accepted so that it can help others in the community.

    Thanks!
