How to specify a max total token cap for an individual model deployment

MR 20 Reputation points
2025-02-15T15:37:12.9733333+00:00

Hello,

We are developing a summarization app where we need to specify a max total token cap for each base model deployment. Let's say we want a 10M total token cap to be consumed over six months. I also want to keep the tokens-per-minute (TPM) limit above 10,000 (allowing more requests per minute), since users may submit large texts, such as long email threads, for summarization. We want to make sure the user gets a seamless experience.

Couple of questions:

Is it possible to set a max total token cap for a model deployment, and will the Azure OpenAI API stop processing requests once that cap is reached?

Which API endpoints provide model deployment KPIs, such as current token usage?

Thank you in advance.

Regards

M

Azure OpenAI Service

Accepted answer
  1. Sina Salam 18,046 Reputation points
    2025-02-15T18:13:18.89+00:00

    Hello MR,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to specify a max token cap for an individual model deployment.

    Based on the facts of your scenario and your questions, the solution is to:

    1. Track tokens in real time using tiktoken + Redis/Cosmos DB (see the first sketch after this list).
    2. Enforce a per-day token budget (e.g., ~55K/day for 10M tokens over six months) to avoid early depletion.
    3. Use the Azure Monitor Metrics API and Log Analytics for monitoring (these report usage but do not enforce limits in real time).
    4. Request a higher TPM quota for the deployment and implement exponential backoff when the TPM limit is exceeded (see the second sketch below).
    5. Degrade service gracefully (warnings, throttling, and request queuing) instead of abrupt cutoffs.
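
    To make points 1 and 2 concrete, here is a minimal sketch of per-day budget enforcement with tiktoken and Redis. The key scheme, the 55K/day budget, and the check_and_reserve helper are illustrative assumptions, not a built-in Azure OpenAI feature:

    ```python
    # Per-day token budgeting sketch: tiktoken for counting, Redis for the
    # shared counter. Budget figure and key scheme are illustrative.
    import datetime

    import redis
    import tiktoken

    DAILY_TOKEN_BUDGET = 55_000  # ~10M tokens spread over ~180 days
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding matching your model

    def check_and_reserve(prompt: str, max_completion_tokens: int) -> bool:
        """Reserve the estimated cost against today's budget; False means reject."""
        estimated = len(enc.encode(prompt)) + max_completion_tokens
        key = f"tokens:{datetime.date.today().isoformat()}"
        new_total = r.incrby(key, estimated)  # atomic add-and-read
        r.expire(key, 48 * 3600)              # let stale day counters expire
        if new_total > DAILY_TOKEN_BUDGET:
            r.decrby(key, estimated)          # roll back the failed reservation
            return False
        return True
    ```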
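
    For point 4, here is a sketch of exponential backoff around a chat completion call, assuming the openai Python SDK v1+; the endpoint, key, API version, and deployment name are placeholders:

    ```python
    # Minimal exponential-backoff sketch for 429 responses (openai SDK v1+).
    import time

    from openai import AzureOpenAI, RateLimitError

    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        api_key="YOUR-KEY",
        api_version="2024-02-01",
    )

    def summarize_with_backoff(text: str, max_retries: int = 5) -> str:
        delay = 1.0
        for attempt in range(max_retries):
            try:
                resp = client.chat.completions.create(
                    model="YOUR-DEPLOYMENT",  # deployment name, not model name
                    messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
                )
                return resp.choices[0].message.content
            except RateLimitError:
                time.sleep(delay)  # back off, then retry
                delay *= 2         # exponential growth: 1s, 2s, 4s, ...
        raise RuntimeError("TPM limit still exceeded after retries")
    ```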

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread here by upvoting and accepting this as the answer if it is helpful.


1 additional answer

  1. Marcin Policht 36,260 Reputation points MVP
    2025-02-15T16:29:33.9366667+00:00
    1. Setting a max total token cap for a model deployment
      • Azure OpenAI does not provide a built-in way to set a hard max total token cap on a model deployment. However, you can track usage manually and enforce the limit in your application logic.
      • What happens when the limit is reached?
        • Azure OpenAI does not automatically block requests when a given token cap is met; your subscription is simply billed for the tokens used. You can monitor token consumption and stop requests programmatically once your defined threshold (e.g., 10M tokens over six months) is reached (see the manual-tracking sketch after this list).
    2. Handling requests with a token per minute (TPM) limit
      • Since users might summarize large blocks of text (e.g., long email threads), setting the TPM limit above 10,000 helps ensure a seamless experience.
      • Rate limits for Azure OpenAI
        • Azure OpenAI enforces TPM (tokens per minute) and RPM (requests per minute) limits per deployment.
        • The exact limits depend on pricing tier and quota assigned to your Azure subscription.
    3. Monitoring token usage via API
      To track token usage, you can use Azure Monitor metrics and logs. API endpoints for model deployment KPIs:
      • Azure Monitor Metrics API
        • Use the https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.CognitiveServices/accounts/{accountName}/providers/microsoft.insights/metrics?api-version=2018-01-01 endpoint.
        • This provides token usage stats and helps in tracking overall model consumption.
      • Azure Monitor Logs via Log Analytics
        • Azure OpenAI sends logs to Azure Monitor, where you can set up custom alerts for token usage.
      • As an alternative, you can implement manual tracking in your application (see the sketches after this list)
        • Track total token usage by storing the usage.total_tokens field from API responses in a database.
        • Implement logic to deny requests once the 10M token limit is reached.
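
    A minimal sketch of that manual tracking, assuming the openai Python SDK v1+ and a Redis counter; the key name and helper are illustrative, and the 10M cap comes from the question:

    ```python
    # Sketch: accumulate the API-reported usage.total_tokens and deny
    # requests once the cap is reached. Key name and helper are illustrative.
    import redis
    from openai import AzureOpenAI

    TOTAL_TOKEN_CAP = 10_000_000
    r = redis.Redis(decode_responses=True)
    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        api_key="YOUR-KEY",
        api_version="2024-02-01",
    )

    def summarize(text: str) -> str:
        if int(r.get("total_tokens_used") or 0) >= TOTAL_TOKEN_CAP:
            raise RuntimeError("10M token cap reached; request denied")
        resp = client.chat.completions.create(
            model="YOUR-DEPLOYMENT",
            messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
        )
        # usage.total_tokens is the authoritative count billed for this call
        r.incrby("total_tokens_used", resp.usage.total_tokens)
        return resp.choices[0].message.content
    ```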
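
    And a sketch of pulling token metrics from Azure Monitor over REST; the TokenTransaction metric name is an assumption (list your resource's metric definitions to confirm), and the resource ID segments are placeholders:

    ```python
    # Sketch: query Azure Monitor metrics for a Cognitive Services account.
    # Resource ID segments are placeholders; TokenTransaction is an assumed
    # metric name -- check the metrics your resource actually exposes.
    import requests
    from azure.identity import DefaultAzureCredential

    RESOURCE_ID = (
        "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}"
        "/providers/Microsoft.CognitiveServices/accounts/{accountName}"
    )

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    resp = requests.get(
        f"https://management.azure.com{RESOURCE_ID}/providers/microsoft.insights/metrics",
        params={
            "api-version": "2018-01-01",
            "metricnames": "TokenTransaction",
            "interval": "PT1H",      # hourly buckets
            "aggregation": "Total",
        },
        headers={"Authorization": f"Bearer {token.token}"},
    )
    resp.raise_for_status()
    for metric in resp.json()["value"]:
        for series in metric["timeseries"]:
            for point in series["data"]:
                print(point.get("timeStamp"), point.get("total"))
    ```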

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

