How to specify a max total token cap for an individual model deployment

MR 20 Reputation points
2025-02-15T15:37:12.9733333+00:00

Hello,

We are developing a summarization app where we need to specify a max total token cap for each base model deployment. Let's say we want a 10M total token cap to be consumed over six months. I also want to keep the tokens-per-minute (TPM) limit above 10,000 (allowing more requests per minute), since users may submit large texts, such as long email threads, for summarization. We want to make sure the user gets a seamless experience.

Couple of questions:

Is it possible to set a max total token cap for a model deployment, and will the Azure OpenAI API stop processing requests once that cap is reached?

Which API endpoints provide model deployment KPIs, such as current token usage?

Thank you in advance.

Regards

M

Azure OpenAI Service

Accepted answer
  1. Sina Salam 18,046 Reputation points
    2025-02-15T18:13:18.89+00:00

    Hello MR,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to specify a max token cap for an individual model deployment.

    Based on the facts of your scenario and your questions, the solution is to:

    1. Track tokens in real time using tiktoken + Redis/Cosmos DB (see the first sketch after this list).
    2. Enforce a per-day token budget (e.g., ~55K/day for 10M tokens over six months) to avoid early depletion.
    3. Use the Azure Monitor Metrics API and Log Analytics for monitoring (these report usage but do not enforce limits in real time).
    4. Request a higher TPM quota for the deployment and implement exponential backoff when the TPM limit is exceeded (see the second sketch below).
    5. Degrade service gracefully (warnings, throttling, and request queuing) instead of abrupt cutoffs.
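
    To make points 1 and 2 concrete, here is a minimal sketch of per-day budget enforcement with tiktoken and Redis. The key scheme, the 55K/day budget, and the check_and_reserve helper are illustrative assumptions, not a built-in Azure OpenAI feature:

    ```python
    # Per-day token budgeting sketch: tiktoken for counting, Redis for the
    # shared counter. Budget figure and key scheme are illustrative.
    import datetime

    import redis
    import tiktoken

    DAILY_TOKEN_BUDGET = 55_000  # ~10M tokens spread over ~180 days
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding matching your model

    def check_and_reserve(prompt: str, max_completion_tokens: int) -> bool:
        """Reserve the estimated cost against today's budget; False means reject."""
        estimated = len(enc.encode(prompt)) + max_completion_tokens
        key = f"tokens:{datetime.date.today().isoformat()}"
        new_total = r.incrby(key, estimated)  # atomic add-and-read
        r.expire(key, 48 * 3600)              # let stale day counters expire
        if new_total > DAILY_TOKEN_BUDGET:
            r.decrby(key, estimated)          # roll back the failed reservation
            return False
        return True
    ```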
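
    For point 4, here is a sketch of exponential backoff around a chat completion call, assuming the openai Python SDK v1+; the endpoint, key, API version, and deployment name are placeholders:

    ```python
    # Minimal exponential-backoff sketch for 429 responses (openai SDK v1+).
    import time

    from openai import AzureOpenAI, RateLimitError

    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        api_key="YOUR-KEY",
        api_version="2024-02-01",
    )

    def summarize_with_backoff(text: str, max_retries: int = 5) -> str:
        delay = 1.0
        for attempt in range(max_retries):
            try:
                resp = client.chat.completions.create(
                    model="YOUR-DEPLOYMENT",  # deployment name, not model name
                    messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
                )
                return resp.choices[0].message.content
            except RateLimitError:
                time.sleep(delay)  # back off, then retry
                delay *= 2         # exponential growth: 1s, 2s, 4s, ...
        raise RuntimeError("TPM limit still exceeded after retries")
    ```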

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread here by upvoting and accepting this as the answer if it is helpful.


1 additional answer

  1. Marcin Policht 36,260 Reputation points MVP
    2025-02-15T16:29:33.9366667+00:00
    1. Setting a max total token cap for a model deployment
      • Azure OpenAI does not provide a built-in way to set a hard max total token cap on a model deployment. However, you can track usage manually and enforce the limit in your application logic.
      • What happens when the limit is reached?
        • Azure OpenAI does not automatically block requests when a given token cap is met; your subscription is simply billed for the tokens used. You can monitor token consumption and stop requests programmatically once your defined threshold (e.g., 10M tokens over six months) is reached (see the manual-tracking sketch after this list).
    2. Handling requests with a token per minute (TPM) limit
      • Since users might summarize large blocks of text (e.g., long email threads), setting the TPM limit above 10,000 helps ensure a seamless experience.
      • Rate limits for Azure OpenAI
        • Azure OpenAI enforces TPM (tokens per minute) and RPM (requests per minute) limits per deployment.
        • The exact limits depend on pricing tier and quota assigned to your Azure subscription.
    3. Monitoring token usage via API
      To track token usage, you can use Azure Monitor metrics and logs. API endpoints for model deployment KPIs:
      • Azure Monitor Metrics API
        • Use the https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.CognitiveServices/accounts/{accountName}/providers/microsoft.insights/metrics?api-version=2018-01-01 endpoint.
        • This provides token usage stats and helps in tracking overall model consumption.
      • Azure Monitor Logs via Log Analytics
        • Azure OpenAI sends logs to Azure Monitor, where you can set up custom alerts for token usage.
      • As an alternative, you can implement manual tracking in your application (see the sketches after this list)
        • Track total token usage by storing the usage.total_tokens field from API responses in a database.
        • Implement logic to deny requests once the 10M token limit is reached.
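
    A minimal sketch of that manual tracking, assuming the openai Python SDK v1+ and a Redis counter; the key name and helper are illustrative, and the 10M cap comes from the question:

    ```python
    # Sketch: accumulate the API-reported usage.total_tokens and deny
    # requests once the cap is reached. Key name and helper are illustrative.
    import redis
    from openai import AzureOpenAI

    TOTAL_TOKEN_CAP = 10_000_000
    r = redis.Redis(decode_responses=True)
    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        api_key="YOUR-KEY",
        api_version="2024-02-01",
    )

    def summarize(text: str) -> str:
        if int(r.get("total_tokens_used") or 0) >= TOTAL_TOKEN_CAP:
            raise RuntimeError("10M token cap reached; request denied")
        resp = client.chat.completions.create(
            model="YOUR-DEPLOYMENT",
            messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
        )
        # usage.total_tokens is the authoritative count billed for this call
        r.incrby("total_tokens_used", resp.usage.total_tokens)
        return resp.choices[0].message.content
    ```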
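
    And a sketch of pulling token metrics from Azure Monitor over REST; the TokenTransaction metric name is an assumption (list your resource's metric definitions to confirm), and the resource ID segments are placeholders:

    ```python
    # Sketch: query Azure Monitor metrics for a Cognitive Services account.
    # Resource ID segments are placeholders; TokenTransaction is an assumed
    # metric name -- check the metrics your resource actually exposes.
    import requests
    from azure.identity import DefaultAzureCredential

    RESOURCE_ID = (
        "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}"
        "/providers/Microsoft.CognitiveServices/accounts/{accountName}"
    )

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    resp = requests.get(
        f"https://management.azure.com{RESOURCE_ID}/providers/microsoft.insights/metrics",
        params={
            "api-version": "2018-01-01",
            "metricnames": "TokenTransaction",
            "interval": "PT1H",      # hourly buckets
            "aggregation": "Total",
        },
        headers={"Authorization": f"Bearer {token.token}"},
    )
    resp.raise_for_status()
    for metric in resp.json()["value"]:
        for series in metric["timeseries"]:
            for point in series["data"]:
                print(point.get("timeStamp"), point.get("total"))
    ```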

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

