The Azure OpenAI service enforces quota per model deployment per region, expressed as tokens per minute (TPM) and requests per minute (RPM), rather than a flat monthly token cap. The GPT-4o-mini model has its own default quota, which you can view and adjust under Quotas in the Azure portal. To achieve your goal of processing 15,000 requests of roughly 50,000 tokens each within 8 hours, you need an average of about 31.25 requests per minute (15,000 / 480 minutes), which works out to roughly 1.56 million tokens per minute, so your deployment's TPM quota must be at least that high. However, this is just a theoretical average, and in practice you should budget extra headroom for retries, throttling, and uneven request latency.
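The arithmetic above can be sketched as a quick back-of-the-envelope check (the request count, tokens per request, and time window are taken from your scenario; adjust them to match your actual workload):

```python
# Throughput requirement estimate for the batch job described above.
total_requests = 15_000        # number of documents / API calls
tokens_per_request = 50_000    # approximate tokens per call (input + output)
window_minutes = 8 * 60        # 8-hour processing window

# Average rates the deployment quota must sustain.
requests_per_minute = total_requests / window_minutes
tokens_per_minute = total_requests * tokens_per_request / window_minutes

print(f"~{requests_per_minute:.2f} requests/min, ~{tokens_per_minute:,.0f} tokens/min")
```

Compare the printed tokens-per-minute figure against the TPM quota shown for your GPT-4o-mini deployment in the Azure portal; if it exceeds the quota, you will need a quota increase or a longer processing window.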
To optimize the performance of your solution, you can try the following techniques:
- Implement retry logic in your application to handle any errors or timeouts.
- Avoid sharp changes in the workload; increase the workload gradually.
- Test different load increase patterns to find the optimal rate limit for your solution.
- Create another OpenAI service resource in the same or different regions and distribute the workload among them.
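The first technique above can be sketched as a small retry helper with exponential backoff and jitter. This is a minimal, generic sketch (the `with_retries` wrapper and the flat `Exception` catch are illustrative assumptions); in production you would retry only on throttling responses such as HTTP 429 and honour the `Retry-After` header when the service returns one:

```python
import random
import time


def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter.

    Illustrative sketch: real code should catch only transient errors
    (e.g. rate-limit / timeout exceptions), not every Exception.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Double the delay each attempt; jitter spreads out retries
            # so many clients don't hammer the service in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

You would wrap each API call, e.g. `with_retries(lambda: client.chat.completions.create(...))`. The jitter matters at your scale: without it, thousands of throttled requests would all retry at the same instant and trigger another wave of 429s.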
Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
I hope this helps! Let me know if you have any further questions.