Clarification on Token Rate Limits for S0 Standard Pricing Tier and Global Standard Deployment

fiberneptune 40 Reputation points
2024-12-02T12:42:10.11+00:00

I am encountering the following error while using the Azure OpenAI service:

openai.RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-10-21 have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 86400 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}
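For reference, here is a minimal sketch of the kind of call that raises this error for me (the endpoint, key, and deployment name below are placeholders, not my real values):

```python
from openai import AzureOpenAI, RateLimitError

# Placeholder configuration -- substitute your own resource values.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2024-10-21",
)

try:
    response = client.chat.completions.create(
        model="<your-deployment-name>",  # the deployment name, not the base model name
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)
except RateLimitError as e:
    # Raised when the deployment's tokens-per-minute limit is exceeded (HTTP 429).
    print(f"Rate limited: {e}")
```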

I understand that this indicates a token rate limit issue and that I might need to request an increase. However, I have some specific questions for clarification:

Pricing Tier and Deployment Type:

My pricing tier is S0-Standard, as shown on the resource's overview page.

The deployment type is Global Standard, which, as per documentation, has a default rate limit of 450k tokens per minute.

Are the S0-Standard pricing tier and the Global Standard deployment type distinct? If so, which one determines the applicable token rate limit?

Configured Token Limit:

While deploying the model, I could set a tokens per minute rate limit, with a maximum of 30k tokens per minute for my configuration.

Given that Global Standard defaults to 450k tokens per minute, does the limit I set during deployment (30k tokens per minute) override the Global Standard limit?

Rate Limit Increase:

To address this issue, should I request an increase for the S0-Standard pricing tier, for the deployment type (of which there are six), or both?

Also, when I increase either of the two, will it affect the tokens-per-minute rate limit that can be set when deploying the model, which is currently capped at 30k?

 

There are 6 types of deployment:

  • Global Standard

  • Global Batch

  • Global Provisioned

  • Data Zone Standard

  • Standard

  • Provisioned

For context, my current configuration is as follows:

  • Pricing Tier: S0-Standard

  • Deployment Type: Global Standard (450k tokens/min default)

  • Tokens per Minute Rate Limit (set during deployment): 30k/min

So what will be the effective token limit per minute here?


Accepted answer
  Max Lacy 255 Reputation points
    2024-12-02T15:58:23.2166667+00:00

    I understand you're running into an issue with Rate Limits and looking for further clarification on how Rate Limits are impacted by Pricing Tier, Deployment Type, and Configured Token Limit.

    Let's break down your model deployment. I didn't see you mention a specific model, so I'll assume GPT-4o Global Standard in East US, as it is one of the examples in the Azure OpenAI documentation.

    To start, you set up an Azure OpenAI resource with standard billing in a specific Azure region. After that, you can navigate to AI Foundry (Azure OpenAI Service) to deploy a model. Here is an excerpt from an Azure Learn article that further clarifies how deployment type, region, subscription, and model impact quota and TPM:

    "Azure OpenAI's quota feature enables assignment of rate limits to your deployments, up-to a global limit called your “quota.” Quota is assigned to your subscription on a per-region, per-model basis in units of Tokens-per-Minute (TPM). When you onboard a subscription to Azure OpenAI, you'll receive default quota for most available models. Then, you'll assign TPM to each deployment as it is created, and the available quota for that model will be reduced by that amount. You can continue to create deployments and assign them TPM until you reach your quota limit. Once that happens, you can only create new deployments of that model by reducing the TPM assigned to other deployments of the same model (thus freeing TPM for use), or by requesting and being approved for a model quota increase in the desired region."
    and an example of this:

    "*With a quota of 240,000 TPM for GPT-35-Turbo in East US, a customer can create a single deployment of 240K TPM, 2 deployments of 120K TPM each, or any number of deployments in one or multiple Azure OpenAI resources as long as their TPM adds up to less than 240K total in that region."
    *
    Quota - set at the subscription level, on a per-region, per-model basis.

    TPM - set on a specific model deployment; the TPM assigned across deployments is drawn from (and capped by) that quota.
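    Putting those together: S0 is the billing tier of the resource, Global Standard is the deployment type, and the limit your requests actually hit is the TPM you assigned at deployment time (30k in your case). The 450k figure is the default quota ceiling for Global Standard, i.e., the most TPM you could allocate across deployments of that model in the region. One way to confirm the live limit is to read the rate-limit headers on a response. Below is a minimal sketch, assuming the v1 `openai` Python SDK and that your resource returns the `x-ratelimit-remaining-tokens` / `x-ratelimit-remaining-requests` headers (endpoint, key, and deployment name are placeholders):

```python
from openai import AzureOpenAI

# Placeholder values -- substitute your own resource details.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2024-10-21",
)

# with_raw_response exposes the underlying HTTP response so we can read headers.
raw = client.chat.completions.with_raw_response.create(
    model="<your-deployment-name>",
    messages=[{"role": "user", "content": "ping"}],
)

# These headers reflect the deployment-level limit (e.g. your 30k TPM),
# not the 450k Global Standard quota ceiling.
print(raw.headers.get("x-ratelimit-remaining-tokens"))
print(raw.headers.get("x-ratelimit-remaining-requests"))

completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)
```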

    To answer your questions:
    To address this issue, should I request an increase to the S0-Standard pricing tier, the Deployment type as there are 6 types of deployment type, or both? If you are trying to increase the TPM's for a specific model to avoid a rate limit issue.

    - First, try increasing the TPM assigned to that model's deployment on the Quota page of AI Foundry.
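    If you'd rather script the capacity change than use the portal, the management plane can update a deployment's assigned TPM. A minimal sketch, assuming the azure-mgmt-cognitiveservices (v13+) and azure-identity packages; the model name/version and all resource names are placeholders, and the call only succeeds if unallocated quota is available for that model in the region:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment,
    DeploymentModel,
    DeploymentProperties,
    Sku,
)

# Placeholder names -- substitute your subscription, resource group, and resource.
client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

poller = client.deployments.begin_create_or_update(
    resource_group_name="<resource-group>",
    account_name="<azure-openai-resource-name>",
    deployment_name="<deployment-name>",
    deployment=Deployment(
        sku=Sku(name="GlobalStandard", capacity=40),  # capacity is in units of 1,000 TPM
        properties=DeploymentProperties(
            model=DeploymentModel(
                format="OpenAI",
                name="gpt-4o",         # assumed model; use the one you deployed
                version="2024-08-06",  # assumed version; use your deployed version
            ),
        ),
    ),
)
print(poller.result().sku.capacity)
```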

    (screenshot: the Quota page in AI Foundry)

    "Also, when I increase either of the two, will it affect the tokens-per-minute rate limit that can be set when deploying the model, which is currently capped at 30k?" Yes. The 30k maximum you see in the deployment dialog reflects the unallocated quota remaining for that model in that region. If your quota for the model in that region is raised via the quota-increase request, the maximum TPM you can assign to a deployment rises with it; the S0 pricing tier itself stays the same.
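    In the meantime, a client-side retry with exponential backoff is the usual way to ride out intermittent 429s. A minimal sketch using the tenacity package (placeholders as before; note this won't help when the quota is exhausted for a long window, like the 86400 seconds in your error):

```python
import openai
from openai import AzureOpenAI
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

# Placeholder values -- substitute your own resource details.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2024-10-21",
)

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),  # jittered exponential backoff
    stop=stop_after_attempt(6),                   # give up after 6 tries
)
def chat(prompt: str) -> str:
    response = client.chat.completions.create(
        model="<your-deployment-name>",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(chat("Hello"))
```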

    Let me know if this is helpful or if you need more information.
    Max

