To enhance the training of your custom extraction model in Azure AI Document Intelligence for analyzing cancer treatment documents, consider the following recommendations based on best practices:
- Increase Training Data: Although you have labeled 30-40 example documents, increasing the dataset can significantly improve model performance. Aim for at least 10-15 examples for each variation of chemotherapy drugs, radiation sites, and surgeries. This helps the model learn from a more diverse set of examples.
- Labeling Consistency: Ensure that the terms used for labeling are consistent across all documents. For instance, if you have different ways of describing chemotherapy drugs or surgeries, standardize these terms to help the model recognize them better.
- Field Naming: When labeling fields, use clear and relevant names that reflect the content. For example, instead of generic names, use specific identifiers like `chemotherapy_drugs` or `surgery_list`. This can improve the accuracy of key-value pair extraction.
- Address Variations: Since the model struggles with multiple surgeries, ensure that your training data includes examples with varying numbers of surgeries. Label each surgery distinctly to help the model learn to identify and extract multiple entries.
- Quality of Documents: Ensure that the documents used for training are of high quality. Low-quality scans or images can affect the model's ability to accurately extract information. Aim for clear, high-resolution images.
- Review and Adjust Training Data: If the model shows low confidence levels, review the training data for any inconsistencies or missing examples. Adding more labeled documents that reflect the expected variations can help improve confidence.
- Test and Iterate: Regularly test the model with new documents and iterate on the training process based on the results. If certain fields consistently yield low confidence, consider retraining the model with additional examples focused on those areas.
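
To make the test-and-iterate step concrete, here is a minimal sketch of how you might find the fields that consistently come back with low confidence. It assumes you have already pulled the per-field confidence values out of your analysis results into plain dictionaries; the 0.8 threshold, the `radiation_site` field name, and the sample numbers are illustrative assumptions, not values from your model.

```python
from collections import defaultdict
from statistics import mean


def low_confidence_fields(results, threshold=0.8):
    """Return {field: mean confidence} for fields averaging below `threshold`.

    `results` holds one dict per analyzed test document, mapping
    field name -> confidence (0.0 to 1.0). Fields missing from a
    document are simply skipped for that document.
    """
    scores = defaultdict(list)
    for doc in results:
        for field, confidence in doc.items():
            scores[field].append(confidence)

    averages = {field: mean(values) for field, values in scores.items()}
    # Worst fields first, so the weakest areas get labeling attention first.
    ranked = sorted(averages.items(), key=lambda item: item[1])
    return {field: round(avg, 3) for field, avg in ranked if avg < threshold}


# Hypothetical confidences from three test documents (assumed values).
sample = [
    {"chemotherapy_drugs": 0.93, "surgery_list": 0.41, "radiation_site": 0.88},
    {"chemotherapy_drugs": 0.90, "surgery_list": 0.55, "radiation_site": 0.79},
    {"chemotherapy_drugs": 0.95, "surgery_list": 0.48},
]

flagged = low_confidence_fields(sample)
print(flagged)
```

Fields that show up in `flagged` across several test rounds are the ones to target with additional labeled examples before retraining.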
By applying these strategies, you should be able to enhance the accuracy and confidence levels of your custom extraction model.
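
Before retraining, it may also help to check whether your labeled set actually covers documents with different numbers of surgeries. The sketch below is one way to do that under stated assumptions: the `labeled_docs` structure is a hypothetical summary you would build yourself (it is not the labeling file format used by Document Intelligence), and `min_examples=10` simply mirrors the lower end of the 10-15 examples suggested above.

```python
from collections import Counter


def surgery_count_coverage(labeled_docs, expected_counts=(1, 2, 3), min_examples=10):
    """Count training documents by number of labeled surgeries.

    Returns the counts plus the list of surgery counts in
    `expected_counts` that have fewer than `min_examples` documents.
    """
    counts = Counter(len(doc["surgeries"]) for doc in labeled_docs)
    # Counter returns 0 for missing keys, so unseen counts are flagged too.
    gaps = [n for n in expected_counts if counts[n] < min_examples]
    return counts, gaps


# Hypothetical training set: 12 documents with one surgery, 3 with two, none with three.
labeled_docs = (
    [{"surgeries": ["mastectomy"]}] * 12
    + [{"surgeries": ["lumpectomy", "mastectomy"]}] * 3
)

counts, gaps = surgery_count_coverage(labeled_docs)
print(dict(counts), gaps)
```

Any surgery count listed in `gaps` is a variation the model has seen too rarely, which is a likely contributor to missed entries when a document lists several surgeries.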