Hello MR,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you would like to set a maximum token cap for an individual model deployment.
Facts about your scenario and questions:
- Azure OpenAI does not provide a built-in feature to set a hard max token cap. Applications need to track and enforce token limits manually - https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
- Azure OpenAI does NOT stop processing requests when a usage limit of your own is reached; instead, it continues billing. Rather than relying on manual intervention, implement a programmatic solution to block further usage.
- To track usage, use the Azure OpenAI Metrics API and store token consumption in a database. - https://learn.microsoft.com/en-us/azure/ai-services/openai/monitor-openai-reference
- To enforce a cap, implement a custom blocking mechanism within the application that stops requests when the threshold is met. - https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/use-blocklists
- For per-minute rate limits, adjust TPM (Tokens Per Minute) settings based on your deployment tier. - https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/quota
- For alerts, use Azure Cost Management and Azure Functions to automatically take action when usage nears the limit. - https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/cost-mgt-alerts-monitor-usage-spending
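As a rough illustration of the custom blocking mechanism described above, here is a minimal in-memory sketch. The class name `TokenBudget` and its methods are hypothetical (they are not part of any Azure SDK), and the token counts are assumed to come from the `usage.total_tokens` value returned with each completion response. A real deployment would keep the counter in a shared store rather than in process memory.

```python
import threading


class TokenBudget:
    """Application-side cap on cumulative token usage (hypothetical helper)."""

    def __init__(self, cap: int) -> None:
        self.cap = cap
        self.used = 0
        self._lock = threading.Lock()

    def allow(self, estimated_tokens: int) -> bool:
        """Return True if a request of this estimated size fits under the cap."""
        with self._lock:
            return self.used + estimated_tokens <= self.cap

    def record(self, total_tokens: int) -> None:
        """Add the tokens reported for a completed request."""
        with self._lock:
            self.used += total_tokens


budget = TokenBudget(cap=1_000)
if budget.allow(estimated_tokens=400):
    # Call the deployment here, then record the token count the response
    # reports (assumed to be usage.total_tokens).
    budget.record(total_tokens=420)
```

The check and the record are separate steps here for readability; if several workers share one budget, you may prefer a single atomic reserve-then-adjust operation so concurrent requests cannot overshoot the cap.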
A solution that can address the issue is to:
- Track tokens in real-time using tiktoken + Redis/CosmosDB.
- Enforce a per-day token limit (e.g., 55K/day) to avoid early depletion.
- Use the Azure Metrics API & Log Analytics for monitoring (not for real-time enforcement).
- Adjust the TPM quota assigned to the deployment & implement exponential backoff for requests rejected for exceeding TPM.
- Gracefully degrade service (warnings, throttling, and request queuing) instead of abrupt cutoffs.
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting this reply and accepting it as an answer if it is helpful.