Training Custom Extraction Model in Azure AI Document Intelligence for Medical Data

Mike Schuckenbrock
2025-03-10T20:53:37.26+00:00

Seeking advice on effectively training a custom extraction model in Azure AI Document Intelligence, specifically for analyzing cancer treatment documents. The model is designed to extract key data such as chemotherapy received, radiation sites and levels, and surgeries, but it faces several challenges:

  1. Chemotherapy Information: The model recognizes terms in strings containing chemotherapy drugs and biologic agents, but despite the specific terms being labeled for the appropriate fields, confidence levels are low and extraction is not fully accurate.
  2. Radiation Sites and Doses: While the model captures data accurately, the confidence levels remain too low to be reliable.
  3. Surgery Listings: The model identifies the first surgery correctly but struggles to extract multiple surgeries, even with training examples provided.

Approximately 30-40 example documents have been labeled, covering a range of values across chemotherapy drugs, biologic agents, radiation sites, and different types and numbers of surgeries. Despite following guidelines from Azure documentation, the model continues to struggle with accuracy and confidence levels. Any recommendations for enhancing training techniques would be greatly appreciated.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Q&A Assist
    2025-03-10T20:53:37.28+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    To enhance the training of your custom extraction model in Azure AI Document Intelligence for analyzing cancer treatment documents, consider the following recommendations based on best practices:

    1. Increase Training Data: Although you have labeled 30-40 example documents, increasing the dataset can significantly improve model performance. Aim for at least 10-15 examples for each variation of chemotherapy drugs, radiation sites, and surgeries. This helps the model learn from a more diverse set of examples.
    2. Labeling Consistency: Ensure that the terms used for labeling are consistent across all documents. For instance, if you have different ways of describing chemotherapy drugs or surgeries, standardize these terms to help the model recognize them better.
    3. Field Naming: When labeling fields, use clear and relevant names that reflect the content. For example, instead of generic names, use specific identifiers like chemotherapy_drugs or surgery_list. This can improve the accuracy of key-value pair extraction.
    4. Address Variations: Since the model struggles with multiple surgeries, ensure that your training data includes examples with varying numbers of surgeries. Label each surgery distinctly to help the model learn to identify and extract multiple entries.
    5. Quality of Documents: Ensure that the documents used for training are of high quality. Low-quality scans or images can affect the model's ability to accurately extract information. Aim for clear, high-resolution images.
    6. Review and Adjust Training Data: If the model shows low confidence levels, review the training data for any inconsistencies or missing examples. Adding more labeled documents that reflect the expected variations can help improve confidence.
    7. Test and Iterate: Regularly test the model with new documents and iterate on the training process based on the results. If certain fields consistently yield low confidence, consider retraining the model with additional examples focused on those areas.

    By applying these strategies, you should be able to enhance the accuracy and confidence levels of your custom extraction model.
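    Points 6 and 7 (reviewing low-confidence output and iterating) could be operationalized with a small review gate: any field whose confidence falls below a threshold is routed to a human reviewer, and the corrected documents become additional labeled training examples. The field names, sample values, and 0.80 threshold below are hypothetical assumptions for illustration, not values from the Azure service.

```python
# Sketch: flag extracted fields whose confidence falls below a review
# threshold, so they can be sent to a human reviewer and later fed back
# into training. The `extracted` dict mimics the shape of a custom
# model's field output (value plus confidence); in practice these would
# come from the analysis result returned by the service.

REVIEW_THRESHOLD = 0.80  # hypothetical cutoff; tune per field in practice

def fields_needing_review(fields, threshold=REVIEW_THRESHOLD):
    """Return {field_name: (value, confidence)} for fields whose
    confidence is below the threshold."""
    return {
        name: (value, conf)
        for name, (value, conf) in fields.items()
        if conf < threshold
    }

# Hypothetical output for one analyzed document
extracted = {
    "chemotherapy_drugs": ("cisplatin; etoposide", 0.52),
    "radiation_site": ("left lung", 0.61),
    "radiation_dose": ("60 Gy in 30 fractions", 0.93),
    "surgery_list": ("lobectomy", 0.88),
}

for name, (value, conf) in fields_needing_review(extracted).items():
    print(f"REVIEW {name}: {value!r} (confidence {conf:.2f})")
```

    Tracking which fields are flagged most often also tells you where to focus the next round of labeling, per point 7.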


