Azure AI Foundry Serverless Model Rate Limits

Rishab Mehta 60 Reputation points
2025-02-21T21:02:08.97+00:00

Hello,

This page: https://learn.microsoft.com/en-us/azure/ai-foundry/model-inference/quotas-limits

says that the rate limits for deployed serverless models are 200.000 tokens per minute and 1.000 requests per minute, but this page: https://learn.microsoft.com/en-us/azure/ai-studio/how-to/fine-tune-serverless?tabs=chat-completion says that each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute.

So is it 200 tokens and 1 request per minute, or 200,000 tokens and 1,000 requests per minute?
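For context, here is a minimal sketch of how I could probe the deployment myself by watching for HTTP 429 throttling. The endpoint URL, header names, and request shape are placeholders/assumptions based on the usual chat-completions pattern, not values taken from either documentation page:

```python
import time
import requests

# Placeholders: substitute your serverless deployment's endpoint and key.
ENDPOINT = "https://<your-deployment>.<region>.models.ai.azure.com/chat/completions"
API_KEY = "<your-api-key>"

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
body = {"messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}

# Fire a handful of small requests and watch for HTTP 429 (throttled).
# A limit of 1,000 requests/minute should not trip on a few calls,
# while a limit of 1 request/minute would throttle almost immediately.
for i in range(5):
    resp = requests.post(ENDPOINT, headers=headers, json=body, timeout=30)
    print(i, resp.status_code, resp.headers.get("Retry-After"))
    if resp.status_code == 429:
        print("Throttled; the Retry-After header indicates how long to back off.")
    time.sleep(1)
```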

Azure AI services
