Azure AI model inference quotas and limits in Azure AI services
This article provides a quick reference and a detailed description of the quotas and limits for Azure AI model inference in Azure AI services. For quotas and limits specific to the Azure OpenAI Service, see Quota and limits in the Azure OpenAI service.
Quotas and limits reference
The following sections provide a quick guide to the default quotas and limits that apply to the Azure AI model inference service in Azure AI services:
Resource limits
Limit name | Limit value |
---|---|
Azure AI services resources per region per Azure subscription | 30 |
Max deployments per resource | 32 |
Rate limits
Limit name | Limit value |
---|---|
Tokens per minute (Azure OpenAI models) | Varies per model and SKU. See limits for Azure OpenAI. |
Tokens per minute (rest of models) | 200,000 |
Requests per minute (Azure OpenAI models) | Varies per model and SKU. See limits for Azure OpenAI. |
Requests per minute (rest of models) | 1,000 |
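As an illustration of how a client might stay under the per-minute limits in the table above, the following is a minimal sketch of a client-side guard. It assumes a fixed one-minute window, which is a simplification; this article doesn't describe how the service measures its windows. The `MinuteBudget` class and its parameters are hypothetical names, and the defaults are the limits listed for non-Azure OpenAI models.

```python
import time


class MinuteBudget:
    """Client-side guard for per-minute request and token limits.

    Assumption: a fixed one-minute window. The service's own accounting
    may differ, so treat this as a conservative local check only.
    """

    def __init__(self, requests_per_minute=1_000, tokens_per_minute=200_000,
                 clock=time.monotonic):
        self.rpm = requests_per_minute
        self.tpm = tokens_per_minute
        self.clock = clock
        self.window_start = clock()
        self.requests = 0
        self.tokens = 0

    def try_acquire(self, tokens):
        """Return True and record the request if it fits in the current window."""
        now = self.clock()
        if now - self.window_start >= 60:
            # A new one-minute window has started; reset the counters.
            self.window_start = now
            self.requests = 0
            self.tokens = 0
        if self.requests + 1 > self.rpm or self.tokens + tokens > self.tpm:
            return False
        self.requests += 1
        self.tokens += tokens
        return True
```

A caller would check `try_acquire(estimated_tokens)` before sending each request and delay the request when it returns `False`.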
Other limits
Limit name | Limit value |
---|---|
Max number of custom headers in API requests¹ | 10 |
¹ Our current APIs allow up to 10 custom headers, which are passed through the pipeline and returned. Some customers now exceed this header count, which results in HTTP 431 errors. The only fix for this error is to reduce the number of headers. Future API versions will no longer pass through custom headers, so we recommend that customers not depend on custom headers in future system architectures.
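To catch this condition before a request is sent, a client could count its custom headers locally. The following is a minimal sketch; `check_custom_headers` is a hypothetical helper, and it assumes the caller passes only the headers it considers custom, because this article doesn't define which headers count toward the limit.

```python
MAX_CUSTOM_HEADERS = 10  # limit taken from the table above


def check_custom_headers(custom_headers, limit=MAX_CUSTOM_HEADERS):
    """Raise ValueError if more custom headers are supplied than the limit.

    Assumption: custom_headers contains only caller-defined headers,
    not standard ones such as Content-Type or Authorization.
    """
    count = len(custom_headers)
    if count > limit:
        raise ValueError(
            f"{count} custom headers supplied; the limit is {limit}"
        )
    return custom_headers
```

Failing fast on the client side gives a clearer error than an HTTP 431 response from the service.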
Usage tiers
Global Standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with the best availability for the customer's inference requests. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variability in response latency.
The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer's usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.
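To make the usage definition concrete, the following sketch totals tokens per model across every deployment, subscription, and region in a tenant. The record layout and the names in it are hypothetical, used only for illustration; they aren't a format exposed by the service.

```python
from collections import defaultdict


def usage_per_model(records):
    """Sum tokens per model over all deployments, subscriptions, and regions.

    Each record is a hypothetical dict with the keys
    'model', 'subscription', 'region', 'deployment', and 'tokens'.
    """
    totals = defaultdict(int)
    for record in records:
        # Only the model matters for grouping; everything else is summed over.
        totals[record["model"]] += record["tokens"]
    return dict(totals)


# Illustrative data for one tenant (all names are made up).
records = [
    {"model": "model-a", "subscription": "sub-1", "region": "eastus",
     "deployment": "dep-1", "tokens": 1_000},
    {"model": "model-a", "subscription": "sub-2", "region": "westeurope",
     "deployment": "dep-2", "tokens": 2_500},
    {"model": "model-b", "subscription": "sub-1", "region": "eastus",
     "deployment": "dep-3", "tokens": 400},
]
totals = usage_per_model(records)
```

In this example, the usage for `model-a` is the combined total from both subscriptions and regions, and that combined figure is what's compared against the Usage Limit.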
General best practices to remain within rate limits
To minimize issues related to rate limits, it's a good idea to use the following techniques:
- Implement retry logic in your application.
- Avoid sharp changes in the workload. Increase the workload gradually.
- Test different load increase patterns.
- Increase the quota assigned to your deployment. Move quota from another deployment, if necessary.
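The first technique in the list, retry logic, can be sketched as follows. This is a minimal example that uses exponential backoff with jitter, not a prescribed implementation. `RateLimitError` is a hypothetical exception standing in for whatever your client raises when the service reports that a rate limit was exceeded, and `send` is any callable that performs the request.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for a rate-limit response."""

    def __init__(self, retry_after=None):
        super().__init__("rate limit exceeded")
        self.retry_after = retry_after  # seconds suggested by the service, if any


def call_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call send() and retry on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            backoff = min(max_delay, base_delay * 2 ** attempt)
            # Honor the service's hint when present; otherwise pick a random
            # delay up to the backoff ceiling so clients don't retry in sync.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = random.uniform(0, backoff)
            sleep(delay)
```

The `sleep` parameter is injectable so that the function can be tested without waiting. In production code, leave it at the default.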
Request increases to the default quotas and limits
Quota increase requests are evaluated on a per-request basis. To request an increase, submit a service request.
Next steps
- Learn more about the models available in the Azure AI model inference service