What options does Azure OpenAI provide for managing TPM quotas and rate limits?

Jeremy Lau 20 Reputation points
2025-02-20T13:53:22.78+00:00

I currently have an Azure OpenAI GPT-4o Global Standard deployment in the Australia East region.

I am finding that my deployment is hitting the TPM rate limit and the API is returning 429 responses.

Rather than defaulting to increasing my deployment's configured TPM rate limit, I am wondering what other options there are to help manage volatility in traffic to the API.

  1. Is there an option where I can programmatically update the TPM rate limit of my deployment (e.g., through an API call to Azure OpenAI) when my application anticipates a spike in traffic?
  2. We're using the GPT-4o endpoint in an agentic application, and the context window grows rapidly over time as the conversation between the agent and a user goes back and forth in a short period. Does Azure (or anyone else) have a recommendation on how to handle a growing context window without compromising the LLM's ability to respond effectively (in both quality and speed of response)? I believe this growth is what is causing us to hit the rate limit.
  3. Does Azure typically provide pay-as-you-go / non-enterprise customers with a total TPM quota of 5 million or more?

Any thoughts or help would be so appreciated please!


2 answers

  1. Amira Bedhiafi 29,711 Reputation points
    2025-02-24T11:31:42.77+00:00

    I am not an expert on this subject, but I did some research online:

    What options does Azure OpenAI provide for managing TPM quotas and rate limits?

    Azure OpenAI offers several strategies to help manage TPM (Tokens Per Minute) quotas and rate limits. While increasing the TPM rate limit is a straightforward solution, there are other approaches to handle traffic volatility. For instance, you can implement client-side rate limiting in your application to ensure that requests stay within the allocated TPM.
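    As a rough illustration, a client-side limiter can be as simple as a token bucket sized to the deployment's TPM. This is a minimal sketch (the class and the budget numbers are hypothetical, not an Azure SDK feature):

    ```python
    import threading
    import time

    class TokenBucket:
        """Client-side pacing: block until enough TPM budget has refilled."""

        def __init__(self, tpm_limit: int):
            self.capacity = tpm_limit
            self.tokens = float(tpm_limit)
            self.refill_rate = tpm_limit / 60.0  # tokens replenished per second
            self.updated = time.monotonic()
            self.lock = threading.Lock()

        def acquire(self, estimated_tokens: int) -> None:
            """Wait until the request's estimated token cost fits in the bucket."""
            while True:
                with self.lock:
                    now = time.monotonic()
                    self.tokens = min(
                        self.capacity,
                        self.tokens + (now - self.updated) * self.refill_rate,
                    )
                    self.updated = now
                    if self.tokens >= estimated_tokens:
                        self.tokens -= estimated_tokens
                        return
                    deficit = estimated_tokens - self.tokens
                time.sleep(deficit / self.refill_rate)  # let the bucket refill

    bucket = TokenBucket(tpm_limit=30_000)  # match your deployment's TPM
    bucket.acquire(estimated_tokens=1_200)  # call before each API request
    ```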

    You can use retry logic with exponential backoff to handle 429 responses gracefully, allowing your application to recover from temporary spikes in traffic.
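    For example, with the openai Python package (v1.x) a minimal retry wrapper might look like the following; the endpoint, key, API version, and backoff parameters are placeholders:

    ```python
    import random
    import time

    from openai import AzureOpenAI, RateLimitError

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<your-key>",
        api_version="2024-06-01",
    )

    def chat_with_backoff(messages, max_retries: int = 5):
        """Retry 429s with exponential backoff plus jitter."""
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(
                    model="gpt-4o",  # your deployment name
                    messages=messages,
                )
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise
                # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
                time.sleep(2 ** attempt + random.random())

    resp = chat_with_backoff([{"role": "user", "content": "Hello"}])
    print(resp.choices[0].message.content)
    ```

    Throttled responses typically include a Retry-After header, which you can honor instead of a fixed backoff schedule.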

    Another option is to distribute traffic across multiple deployments or regions, which can help balance the load and reduce the likelihood of hitting rate limits.
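    As a sketch of that idea (in production this is often done with a gateway such as Azure API Management rather than in client code), you could fail over between two deployments when one is throttled; the endpoints and keys below are placeholders:

    ```python
    from openai import AzureOpenAI, RateLimitError

    # Two hypothetical deployments of the same model in different regions.
    backends = [
        AzureOpenAI(azure_endpoint="https://<resource-aue>.openai.azure.com",
                    api_key="<key-1>", api_version="2024-06-01"),
        AzureOpenAI(azure_endpoint="https://<resource-eus>.openai.azure.com",
                    api_key="<key-2>", api_version="2024-06-01"),
    ]

    def chat_with_failover(messages):
        """Try each backend in turn; move on when one returns 429."""
        last_err = None
        for client in backends:
            try:
                return client.chat.completions.create(
                    model="gpt-4o", messages=messages
                )
            except RateLimitError as err:
                last_err = err  # this backend is throttled, try the next
        raise last_err
    ```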

    Is there an option to programmatically update the TPM rate limit of a deployment?

    Currently, Azure OpenAI doesn't provide an API to programmatically adjust the TPM rate limit of a deployment in real time. These limits are typically set at the deployment level and require manual adjustment through the Azure portal or by contacting Azure support.

    However, you can monitor your usage and proactively adjust the TPM limit based on anticipated traffic spikes.

    How can you handle a growing context window without compromising response quality and speed?

    You can summarize or truncate older parts of the conversation to reduce the token count while retaining essential information. Another approach is to use a sliding window mechanism, where only the most recent interactions are included in the context.
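    For instance, a sliding-window trim that keeps the system prompt plus as many recent turns as fit in a token budget could look like this sketch (token counting uses tiktoken's o200k_base encoding, which recent tiktoken versions use for GPT-4o; the budget is a placeholder):

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # GPT-4o tokenizer

    def num_tokens(message: dict) -> int:
        # Rough per-message count; exact accounting adds a few tokens of overhead.
        return len(enc.encode(message["content"])) + 4

    def sliding_window(messages: list[dict], budget: int = 4_000) -> list[dict]:
        """Keep the system prompt and the newest turns that fit in `budget`."""
        system, rest = messages[0], messages[1:]
        kept, used = [], num_tokens(system)
        for msg in reversed(rest):  # walk from newest to oldest
            cost = num_tokens(msg)
            if used + cost > budget:
                break
            kept.append(msg)
            used += cost
        return [system] + list(reversed(kept))
    ```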

    You can explore fine-tuning the model or using smaller, more efficient models for specific tasks to reduce token consumption.

    Does Azure typically provide pay-as-you-go or non-enterprise customers with a TPM quota of 5 million or more?

    The TPM quotas vary depending on the subscription tier, region, and model type. Pay-as-you-go and non-enterprise customers may have lower default TPM limits compared to enterprise customers.

    However, you can request higher TPM quotas by contacting Azure support and providing justification for your usage needs.


  2. SriLakshmi C 3,015 Reputation points Microsoft External Staff
    2025-02-24T11:45:57.3066667+00:00

    @Jeremy Lau

    Greetings and Welcome to Microsoft Q&A!

    I understand that you are experiencing TPM (tokens per minute) rate limit issues with your Azure OpenAI GPT-4o deployment in Australia East.

    Is there an option where I can programmatically update the TPM rate limit of my deployment (e.g., through an API call to Azure OpenAI) when my application anticipates a spike in traffic?

    Currently, Azure OpenAI does not support dynamic, programmatic adjustments to TPM (Tokens Per Minute) rate limits via API calls. To modify TPM allocations, you must manually adjust the settings through the Azure portal or submit a quota increase request. While there is no direct API to update TPM rate limits, Azure OpenAI offers a dynamic quota feature that allows deployments to utilize additional capacity when available. This feature helps manage traffic spikes without requiring manual intervention, ensuring better flexibility in handling fluctuating workloads.

    Please refer to Automate deployment.
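    For reference, a deployment's capacity (expressed in units of 1,000 TPM) can be re-issued through the management plane, for example with the azure-mgmt-cognitiveservices package. This is a sketch under those assumptions: it is a control-plane change bounded by your subscription's regional quota, not a real-time data-plane call, and every name below is a placeholder:

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
    from azure.mgmt.cognitiveservices.models import (
        Deployment, DeploymentModel, DeploymentProperties, Sku,
    )

    client = CognitiveServicesManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",  # placeholder
    )

    # Re-issue the deployment with a larger capacity (capacity 100 = 100K TPM).
    poller = client.deployments.begin_create_or_update(
        resource_group_name="<resource-group>",
        account_name="<aoai-account>",
        deployment_name="gpt-4o",
        deployment=Deployment(
            sku=Sku(name="GlobalStandard", capacity=100),
            properties=DeploymentProperties(
                model=DeploymentModel(
                    format="OpenAI", name="gpt-4o", version="2024-08-06"
                ),
            ),
        ),
    )
    poller.result()  # blocks until the update completes
    ```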

    We're using the GPT-4o endpoint in an agentic application, and the context window grows rapidly over time as the conversation between the agent and a user goes back and forth in a short period. Does Azure (or anyone else) have a recommendation on how to handle a growing context window without compromising the LLM's ability to respond effectively (in both quality and speed of response)? I believe this growth is what is causing us to hit the rate limit.

    Azure OpenAI's Assistants API offers persistent, automatically managed threads, effectively maintaining conversation state while optimizing context window usage. This eliminates the need for manual context handling, ensuring a seamless user experience.
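    A minimal sketch with the openai package's beta Assistants surface (the deployment name and API version are placeholders, and the exact shape may change while the feature is in preview):

    ```python
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<your-key>",
        api_version="2024-05-01-preview",
    )

    assistant = client.beta.assistants.create(
        model="gpt-4o",  # your deployment name
        instructions="You are a concise support agent.",
    )
    thread = client.beta.threads.create()  # the service stores the history

    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content="What are my TPM options?"
    )
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant.id
    )
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)  # newest message first
    ```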

    Implementing Token-Efficient Formatting helps minimize token usage by structuring messages concisely and avoiding redundant inputs.

    Context Trimming & Summarization further enhances efficiency by summarizing earlier interactions and retaining only the most relevant details, reducing unnecessary token consumption.
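    One possible shape for that, as a sketch (the AzureOpenAI client is assumed to be configured as above, and the thresholds are arbitrary): once the history grows past a few turns, compress everything but the system prompt and the most recent messages into a single summary message.

    ```python
    def compact_history(client, messages: list[dict], keep_last: int = 6) -> list[dict]:
        """Summarize all but the system prompt and the last few turns."""
        if len(messages) <= keep_last + 1:
            return messages
        system = messages[0]
        old, recent = messages[1:-keep_last], messages[-keep_last:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
        summary = client.chat.completions.create(
            model="gpt-4o",  # a cheaper deployment could be used here
            messages=[{
                "role": "user",
                "content": "Summarize this conversation in under 150 words, "
                           "keeping facts, decisions, and open questions:\n" + transcript,
            }],
        ).choices[0].message.content
        return [
            system,
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"},
            *recent,
        ]
    ```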

    Kindly refer to Context window management.

    Does Azure typically provide pay as you go customers / non-enterprise customers with a total TPM quota for 5million or more?

    Azure assigns default TPM quotas based on the subscription type and the selected model. Enterprise agreements generally receive higher quotas, while pay-as-you-go customers have lower default limits. However, pay-as-you-go users can request a quota increase by submitting a detailed request outlining their specific requirements.

    Kindly refer to Manage Azure OpenAI Service quota.

    I hope this helps. If you have any further queries, do let us know.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".

    Thank you!

