When your Tokens Per Minute (TPM) quota is fully utilized in Azure OpenAI, the service has reached the maximum number of tokens it will process within the current one-minute window, and further API requests are throttled (typically rejected with HTTP 429 "Too Many Requests" responses) until the window resets. To address the issue, you can try the following:
- Wait for the quota reset. The TPM quota replenishes automatically at the start of the next minute, but if you're consistently hitting the limit, you'll experience disruptions repeatedly.
- Optimize current token utilization by keeping your prompts and expected responses as concise as possible, and by setting max_tokens conservatively, since the rate limiter counts the requested maximum rather than only the tokens actually generated. Use response streaming to deliver longer outputs incrementally, and consolidate multiple smaller requests into fewer, larger ones where possible to better manage token usage (see the streaming sketch after this list).
- Request a quota increase:
- Navigate to your Azure OpenAI resource.
- Go to the Quota + Usage tab.
- Submit a support request for a higher TPM limit.
- Regularly monitor token usage with Azure Monitor metrics to identify spikes or trends that need attention, and adjust your application logic to throttle or queue requests during high-usage periods (a client-side throttling sketch follows this list).
- Implement retry logic in your application so that requests rejected with HTTP 429 during quota exhaustion are retried with exponential backoff, honoring the Retry-After header when the service returns one (see the retry sketch below).
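For the token-optimization point, here is a minimal sketch of a streamed chat completion using the openai Python package against an Azure OpenAI deployment. The endpoint, API key, API version, and deployment name are placeholders you would swap for your own values:

```python
from openai import AzureOpenAI

# Placeholder endpoint, key, api_version and deployment below are illustrative only.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

stream = client.chat.completions.create(
    model="<your-deployment-name>",  # the Azure deployment name, not the model family
    messages=[{"role": "user", "content": "Summarize the attached notes in three bullets."}],
    max_tokens=300,                  # a conservative cap also lowers the rate-limit estimate
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the response as it is generated.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```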
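For the monitoring and throttling point, one possible approach is a small client-side token budget fed from each response's `usage` field. The `TokenBudget` helper and the trailing 60-second window below are assumptions for illustration, not an official SDK feature:

```python
import time
from collections import deque

class TokenBudget:
    """Rough client-side guard: pause new calls once roughly `tpm_limit` tokens
    have been consumed in the trailing 60-second window."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.events = deque()  # (timestamp, tokens) pairs

    def record(self, tokens: int) -> None:
        self.events.append((time.monotonic(), tokens))

    def wait_for_capacity(self, estimated_tokens: int) -> None:
        while True:
            now = time.monotonic()
            # Drop usage records older than one minute.
            while self.events and now - self.events[0][0] > 60:
                self.events.popleft()
            used = sum(tokens for _, tokens in self.events)
            if used + estimated_tokens <= self.tpm_limit:
                return
            time.sleep(1)  # queue the request until the window frees up

# Usage sketch: call wait_for_capacity() before each request, then record the
# actual consumption reported in response.usage.total_tokens afterwards.
```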
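For the retry point, a minimal sketch with exponential backoff on `RateLimitError` (the exception the openai package raises for HTTP 429). The deployment name and retry parameters are assumptions you would adapt:

```python
import random
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

def chat_with_retry(messages, deployment="<your-deployment-name>", max_retries=5):
    """Retry chat completions that are rejected while the TPM quota is exhausted."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except RateLimitError as err:
            # Prefer the service's Retry-After hint; otherwise back off exponentially with jitter.
            retry_after = err.response.headers.get("retry-after") if err.response is not None else None
            wait = float(retry_after) if retry_after else (2 ** attempt) + random.random()
            time.sleep(wait)
    raise RuntimeError("Exhausted retries while the TPM quota was saturated")
```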
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin