Optimizing Machine Learning / Only process new data

Thomas 41 Reputation points
2025-01-14T07:24:07.51+00:00

Hi all,

I have deployed a small setup through Azure AI Studio and everything is working fine. I am using Azure OpenAI Service (GPT-4o mini). As I add new data to my datastore, I need to update the index accordingly (ideally daily). To do this, I have scheduled the original job that was created in Machine Learning Studio (ml.azure.com). However, this always appears to process all the data in the datastore, which takes ~15 hours (22k HTML files, mostly <5 KB per file). I add or modify about 50 files per day and would like to add only those to the index. Is this possible? How?

I am aware that I would need to clean the index (re-create it) occasionally to remove old/obsolete data, unless anyone knows a better way?

The other question is that the training (serverless) is running for a very long time, ~15 hours. Is this expected, or can it be optimized (other than by using a compute instance instead of serverless)?

Thanks for your help/input!

Azure Machine Learning

Accepted answer
  1. Amira Bedhiafi 27,596 Reputation points
    2025-01-14T20:26:41.27+00:00

    Instead of processing all 22k HTML files every time, you can implement an incremental data update process where you only process the newly added or modified files.

    Implement a mechanism to track which files are new or modified. You could store a timestamp or a hash value for each file (in a metadata store or database) and check it before processing. For example, compare each file's last-modified timestamp with the time of the last successful indexing run, and process only the files that changed after it.
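As a rough illustration of the hash-based variant, the sketch below keeps a JSON manifest of SHA-256 digests and reports which files are new, modified, or removed since the last run. It assumes the files are reachable as a local or mounted directory; the function names, the manifest file, and the `*.html` pattern are hypothetical choices for this example, not part of the job that Azure AI Studio generates.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed_files(data_dir: Path, manifest_path: Path, pattern: str = "*.html"):
    """Compare the files under data_dir with the stored manifest.

    Returns (changed, removed, current):
      changed - relative paths that are new or whose content differs
      removed - relative paths in the old manifest that no longer exist
      current - the fresh {relative path: digest} manifest
    """
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    current = {
        p.relative_to(data_dir).as_posix(): file_digest(p)
        for p in sorted(data_dir.rglob(pattern))
    }
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    removed = [name for name in previous if name not in current]
    return changed, removed, current


def save_manifest(manifest_path: Path, manifest: dict) -> None:
    """Write the manifest; call this only after indexing succeeded."""
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Saving the manifest only after a successful run means a failed job simply retries the same files next time. With 22k small files, hashing everything should take far less time than re-indexing everything, though that depends on where the data lives.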

    If you're storing files in Azure Blob Storage, you can use the blob's last modified time to identify new or modified files.
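A minimal sketch of the last-modified approach, assuming the files sit in a Blob Storage container and the `azure-storage-blob` package is installed. The connection string, container name, and cutoff are placeholders you would supply, and the helper names are hypothetical. The filter itself only relies on each item having `name` and `last_modified` attributes, which the blob properties returned by `list_blobs()` provide.

```python
from datetime import datetime
from typing import Iterable, List


def changed_since(blobs: Iterable, cutoff: datetime) -> List[str]:
    """Return the names of blobs modified strictly after cutoff.

    cutoff must be timezone-aware, because the SDK reports
    last_modified as a timezone-aware UTC datetime.
    """
    return [b.name for b in blobs if b.last_modified > cutoff]


def list_changed_blobs(conn_str: str, container: str, cutoff: datetime) -> List[str]:
    """List blobs in a container that changed after cutoff."""
    # Imported lazily so the filter above can be used without the SDK installed.
    from azure.storage.blob import ContainerClient

    client = ContainerClient.from_connection_string(conn_str, container_name=container)
    return changed_since(client.list_blobs(), cutoff)
```

The cutoff would typically be the start time of the last successful run, stored somewhere durable. Note that a listing like this cannot see deleted blobs, so removals need to be tracked separately.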

    Then modify your processing pipeline in Azure Machine Learning so that it only includes the new or modified files instead of reprocessing all 22k files. You could write a script that filters out already indexed files before sending the rest for processing.

    For the other part, you're correct that occasionally cleaning the index will be necessary, especially if you want to remove obsolete or old data.

    Instead of re-creating the entire index, you might be able to perform partial updates (e.g., deleting only obsolete entries). This depends on the specific indexing method you're using.

    If your index platform supports it (e.g., Azure AI Search, formerly Azure Cognitive Search), you could set up incremental updates to only add, update, or delete specific documents based on file changes.
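If the index does live in Azure AI Search, the push side of an incremental update could look roughly like the sketch below. It assumes the `azure-search-documents` package and uses `SearchClient.merge_or_upload_documents` and `SearchClient.delete_documents`. The key field name `id`, the batch size, and the `sync_index` helper are assumptions for illustration. An index created through Azure AI Studio will have its own schema (including embedding fields), so the documents you send must match it.

```python
from typing import Dict, Iterable, Iterator, List


def batched(items: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sync_index(client, upserts: Iterable[Dict], deleted_keys: Iterable[str],
               key_field: str = "id", batch_size: int = 500) -> Dict[str, int]:
    """Upsert changed documents and delete obsolete ones.

    `client` is expected to behave like azure.search.documents.SearchClient.
    `key_field` is an assumed name; use the key field of your own index.
    """
    upsert_list = list(upserts)
    for batch in batched(upsert_list, batch_size):
        # Adds new documents and merges fields into existing ones.
        client.merge_or_upload_documents(documents=batch)

    deletions = [{key_field: key} for key in deleted_keys]
    for batch in batched(deletions, batch_size):
        # Deletion only needs the key of each document.
        client.delete_documents(documents=batch)

    return {"upserted": len(upsert_list), "deleted": len(deletions)}


# Creating a real client (endpoint, index name, and key are placeholders):
#   from azure.core.credentials import AzureKeyCredential
#   from azure.search.documents import SearchClient
#   client = SearchClient(endpoint, index_name, AzureKeyCredential(api_key))
```

Deleting by key this way would also cover the obsolete entries without re-creating the whole index, provided you keep track of which keys belong to removed files.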

    Training time on serverless compute can be long, depending on the complexity and scale of the data. The first thing to do is review the model complexity: if you're using a custom GPT model or fine-tuning, try to simplify or optimize the model architecture where possible.

    For training on large datasets, you should consider breaking the data into smaller chunks and processing them in parallel if possible. This can be done using a distributed processing setup in Azure.
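The chunk-and-parallelize idea can be sketched on a single machine as follows. This is not Azure's distributed setup, only the general pattern: split the file list into batches and hand them to a thread pool. `process_batch` is a hypothetical placeholder for whatever your pipeline does per batch (parsing, chunking, embedding), and the batch size and worker count are arbitrary starting values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def process_in_parallel(items: Sequence[str],
                        process_batch: Callable[[List[str]], object],
                        batch_size: int = 100,
                        max_workers: int = 4) -> list:
    """Run process_batch on each batch concurrently.

    Results come back in the same order as the batches. If a batch
    raises, the exception surfaces when its result is collected.
    """
    batches = split_into_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_batch, batches))
```

Threads suit work that mostly waits on network calls, such as requests to an embedding endpoint. For CPU-heavy parsing, a process pool may be a better fit, and service rate limits can cap how many workers are useful.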

    You've mentioned the serverless compute option, but for large datasets, switching to a dedicated Azure Machine Learning compute instance can reduce training time by providing more resources.

