Muokkaa

Jaa


Azure OpenAI Service quotas and limits

This article contains a quick reference and a detailed description of the quotas and limits for Azure OpenAI in Azure AI services.

Quotas and limits reference

The following sections provide you with a quick guide to the default quotas and limits that apply to Azure OpenAI:

Limit Name Limit Value
Azure OpenAI resources per region per Azure subscription 30
Default DALL-E 2 quota limits 2 concurrent requests
Default DALL-E 3 quota limits 2 capacity units (6 requests per minute)
Default Whisper quota limits 3 requests per minute
Maximum prompt tokens per request Varies per model. For more information, see Azure OpenAI Service models
Max Standard deployments per resource 32
Max fine-tuned model deployments 5
Total number of training jobs per resource 100
Max simultaneous running training jobs per resource 1
Max training jobs queued 20
Max Files per resource (fine-tuning) 50
Total size of all files per resource (fine-tuning) 1 GB
Max training job time (job will fail if exceeded) 720 hours
Max training job size (tokens in training file) x (# of epochs) 2 Billion
Max size of all files per upload (Azure OpenAI on your data) 16 MB
Max number or inputs in array with /embeddings 2048
Max number of /chat/completions messages 2048
Max number of /chat/completions functions 128
Max number of /chat completions tools 128
Maximum number of Provisioned throughput units per deployment 100,000
Max files per Assistant/thread 10,000 when using the API or Azure AI Foundry portal. In Azure OpenAI Studio the limit was 20.
Max file size for Assistants & fine-tuning 512 MB

200 MB via Azure AI Foundry portal
Max size for all uploaded files for Assistants 100 GB
Assistants token limit 2,000,000 token limit
GPT-4o max images per request (# of images in the messages array/conversation history) 50
GPT-4 vision-preview & GPT-4 turbo-2024-04-09 default max tokens 16

Increase the max_tokens parameter value to avoid truncated responses. GPT-4o max tokens defaults to 4096.
Max number of custom headers in API requests1 10
Max number requests per minute

Current rate limits for real time audio (gpt-4o-realtime-preview) are defined as the number of new websocket connections per minute. For example, 100 requests per minute (RPM) means 100 new connections per minute.
100 new connections per minute

1 Our current APIs allow up to 10 custom headers, which are passed through the pipeline, and returned. Some customers now exceed this header count resulting in HTTP 431 errors. There's no solution for this error, other than to reduce header volume. In future API versions we will no longer pass through custom headers. We recommend customers not depend on custom headers in future system architectures.

Regional quota limits

Region o1-mini o1 GPT-4 GPT-4-32K GPT-4-Turbo GPT-4-Turbo-V gpt-4o gpt-4o-mini GPT-35-Turbo GPT-35-Turbo-Instruct o1-mini - GlobalStandard o1 - GlobalStandard gpt-4o - GlobalStandard gpt-4o-mini - GlobalStandard GPT-4-Turbo - GlobalStandard GPT-4o - Global-Batch GPT-4o-mini - Global-Batch GPT-4 - Global-Batch GPT-4-Turbo - Global-Batch gpt-35-turbo - Global-Batch Text-Embedding-Ada-002 text-embedding-3-small text-embedding-3-large GPT-4o - finetune GPT-4o-mini - finetune GPT-4 - finetune Babbage-002 Babbage-002 - finetune Davinci-002 Davinci-002 - finetune GPT-35-Turbo - finetune GPT-35-Turbo-1106 - finetune GPT-35-Turbo-0125 - finetune
australiaeast - - 40 K 80 K 80 K 30 K - - 300 K - - - 30 M 50 M 2 M - - - - - 350 K - - - - - - - - - - - -
brazilsouth - - - - - - - - - - - - 30 M 50 M 2 M - - - - - 350 K - - - - - - - - - - - -
canadaeast - - 40 K 80 K 80 K - - - 300 K - - - 30 M 50 M 2 M - - - - - 350 K 350 K 350 K - - - - - - - - - -
eastus 1 M 600 K - - 80 K - 1 M 2 M 240 K 240 K 50 M 30 M 30 M 50 M 2 M 5 B 15 B 150 M 300 M 10 B 240 K 350 K 350 K - - - - - - - - - -
eastus2 1 M 600 K - - 80 K - 1 M 2 M 300 K - 50 M 30 M 30 M 50 M 2 M - - - - - 350 K 350 K 350 K 250 K - - - - - - 250 K 250 K 250 K
francecentral - - 20 K 60 K 80 K - - - 240 K - - - 30 M 50 M 2 M - - - - - 240 K - 350 K - - - - - - - - - -
germanywestcentral - - - - - - - - - - - - 30 M 50 M 2 M - - - - - - - - - - - - - - - - - -
japaneast - - - - - 30 K - - 300 K - - - 30 M 50 M 2 M - - - - - 350 K 350 K 350 K - - - - - - - - - -
koreacentral - - - - - - - - - - - - 30 M 50 M 2 M - - - - - - - - - - - - - - - - - -
northcentralus 1 M 600 K - - 80 K - 1 M 2 M 300 K - 50 M 30 M 30 M 50 M 2 M - - - - - 350 K - - 250 K 500 K 100 K 240 K 250 K 240 K 250 K 250 K 250 K 250 K
norwayeast - - - - 150 K - - - - - - - 30 M 50 M 2 M - - - - - 350 K - 350 K - - - - - - - - - -
polandcentral - - - - - - - - - - - - 30 M 50 M 2 M - - - - - - - - - - - - - - - - - -
southafricanorth - - - - - - - - - - - - 30 M 50 M 2 M - - - - - 350 K - - - - - - - - - - - -
southcentralus 1 M 600 K - - 80 K - 1 M 2 M 240 K - 50 M 30 M 30 M 50 M 2 M - - - - - 240 K - - - - - - - - - - - -
southindia - - - - 150 K - - - 300 K - - - 30 M 50 M 2 M - - - - - 350 K - 350 K - - - - - - - - - -
spaincentral - - - - - - - - - - - - 30 M 50 M 2 M - - - - - - - - - - - - - - - - - -
swedencentral 1 M 600 K 40 K 80 K 150 K 30 K 1 M 2 M 300 K 240 K 50 M 30 M 30 M 50 M 2 M 5 B 15 B 150 M 300 M 10 B 350 K - 350 K 250 K 500 K 100 K 240 K 250 K 240 K 250 K 250 K 250 K 250 K
switzerlandnorth - - 40 K 80 K - 30 K - - 300 K - - - 30 M 50 M 2 M - - - - - 350 K - - - - - - - - - - - -
switzerlandwest - - - - - - - - - - - - - - - - - - - - - - - - - - - 250 K - 250 K 250 K 250 K 250 K
uksouth - - - - 80 K - - - 240 K - - - 30 M 50 M 2 M - - - - - 350 K - 350 K - - - - - - - - - -
westeurope - - - - - - - - 240 K - - - 30 M 50 M 2 M - - - - - 240 K - - - - - - - - - - - -
westus 1 M 600 K - - 80 K 30 K 1 M 2 M 300 K - 50 M 30 M 30 M 50 M 2 M 5 B 15 B 150 M 300 M 10 B 350 K - - - - - - - - - - - -
westus3 1 M 600 K - - 80 K - 1 M 2 M 300 K - 50 M 30 M 30 M 50 M 2 M - - - - - 350 K - 350 K - - - - - - - - - -

Global batch limits

Limit Name Limit Value
Max files per resource 500
Max input file size 200 MB
Max requests per file 100,000

Global batch quota

The table shows the batch quota limit. Quota values for global batch are represented in terms of enqueued tokens. When you submit a file for batch processing the number of tokens present in the file are counted. Until the batch job reaches a terminal state, those tokens will count against your total enqueued token limit.

Model Enterprise agreement Default Monthly credit card based subscriptions MSDN subscriptions Azure for Students, Free Trials
gpt-4o 5 B 200 M 50 M 90 K N/A
gpt-4o-mini 15 B 1 B 50 M 90 K N/A
gpt-4-turbo 300 M 80 M 40 M 90 K N/A
gpt-4 150 M 30 M 5 M 100 K N/A
gpt-35-turbo 10 B 1 B 100 M 2 M 50 K

B = billion | M = million | K = thousand

o1-preview & o1-mini rate limits

Important

The ratio of RPM/TPM for quota with o1-series models works differently than older chat completions models:

  • Older chat models: 1 unit of capacity = 6 RPM and 1,000 TPM.
  • o1-preview: 1 unit of capacity = 1 RPM and 6,000 TPM.
  • o1-mini: 1 unit of capacity = 1 RPM per 10,000 TPM.

This is particularly important for programmatic model deployment as this change in RPM/TPM ratio can result in accidental under allocation of quota if one is still assuming the 1:1000 ratio followed by older chat completion models.

There is a known issue with the quota/usages API where it assumes the old ratio applies to the new o1-series models. The API returns the correct base capacity number, but does not apply the correct ratio for the accurate calculation of TPM.

o1-preview & o1-mini global standard

Model Tier Quota Limit in tokens per minute (TPM) Requests per minute
o1-preview Enterprise agreement 30 M 5 K
o1-mini Enterprise agreement 50 M 5 K
o1-preview Default 3 M 500
o1-mini Default 5 M 500

o1-preview & o1-mini standard

Model Tier Quota Limit in tokens per minute (TPM) Requests per minute
o1-preview Enterprise agreement 600 K 100
o1-mini Enterprise agreement 1 M 100
o1-preview Default 300 K 50
o1-mini Default 500 K 50

gpt-4o & GPT-4 Turbo rate limits

gpt-4o and gpt-4o-mini, and gpt-4 (turbo-2024-04-09) have rate limit tiers with higher limits for certain customer types.

gpt-4o & GPT-4 Turbo global standard

Model Tier Quota Limit in tokens per minute (TPM) Requests per minute
gpt-4o Enterprise agreement 30 M 180 K
gpt-4o-mini Enterprise agreement 50 M 300 K
gpt-4 (turbo-2024-04-09) Enterprise agreement 2 M 12 K
gpt-4o Default 450 K 2.7 K
gpt-4o-mini Default 2 M 12 K
gpt-4 (turbo-2024-04-09) Default 450 K 2.7 K

M = million | K = thousand

gpt-4o data zone standard

Model Tier Quota Limit in tokens per minute (TPM) Requests per minute
gpt-4o Enterprise agreement 10 M 60 K
gpt-4o-mini Enterprise agreement 20 M 120 K
gpt-4o Default 300 K 1.8 K
gpt-4o-mini Default 1 M 6 K

M = million | K = thousand

gpt-4o standard

Model Tier Quota Limit in tokens per minute (TPM) Requests per minute
gpt-4o Enterprise agreement 1 M 6 K
gpt-4o-mini Enterprise agreement 2 M 12 K
gpt-4o Default 150 K 900
gpt-4o-mini Default 450 K 2.7 K

M = million | K = thousand

Usage tiers

Global standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with best availability for the customer’s inference requests. Similarly, Data zone standard deployments allow you to leverage Azure global infrastructure to dynamically route traffic to the data center within the Microsoft defined data zone with the best availability for each request. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variability in response latency.

The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer’s usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.

Note

Usage tiers only apply to standard, data zone standard, and global standard deployment types. Usage tiers do not apply to global batch and provisioned throughput deployments.

GPT-4o global standard, data zone standard, & standard

Model Usage Tiers per month
gpt-4o 12 Billion tokens
gpt-4o-mini 85 Billion tokens

GPT-4 standard

Model Usage Tiers per month
gpt-4 + gpt-4-32k (all versions) 6 Billion

Other offer types

If your Azure subscription is linked to certain offer types your max quota values are lower than the values indicated in the above tables.

Tier Quota Limit in tokens per minute (TPM)
Azure for Students, Free Trials 1 K (all models)
MSDN subscriptions GPT 3.5 Turbo Series: 30 K
GPT-4 series: 8 K
Monthly credit card based subscriptions 1 GPT 3.5 Turbo Series: 30 K
GPT-4 series: 8 K

1 This currently applies to offer type 0003P

In the Azure portal you can view what offer type is associated with your subscription by navigating to your subscription and checking the subscriptions overview pane. Offer type corresponds to the plan field in the subscription overview.

General best practices to remain within rate limits

To minimize issues related to rate limits, it's a good idea to use the following techniques:

  • Implement retry logic in your application.
  • Avoid sharp changes in the workload. Increase the workload gradually.
  • Test different load increase patterns.
  • Increase the quota assigned to your deployment. Move quota from another deployment, if necessary.

How to request increases to the default quotas and limits

Quota increase requests can be submitted from the Quotas page in the Azure AI Foundry portal. Due to high demand, quota increase requests are being accepted and will be filled in the order they're received. Priority is given to customers who generate traffic that consumes the existing quota allocation, and your request might be denied if this condition isn't met.

For other rate limits, submit a service request.

Next steps

Explore how to manage quota for your Azure OpenAI deployments. Learn more about the underlying models that power Azure OpenAI.