Azure AI Foundry Completion Token Limit
Hello, I have deployed a Llama 3.3 70B model using Azure AI Foundry. According to the model details on this page, the output limit should be 8192 tokens.
The problem is that when I use the model through Azure AI Inference Completions, the max token limit is 4096, and I see no way to adjust this API limit in AI Foundry. If I set max tokens above 4096, the API call fails with azure.core.exceptions.HttpResponseError: (Bad Request) max_tokens must be less than or equal to 4096.
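For reference, a minimal sketch of the kind of call that triggers this error, assuming a serverless endpoint and the azure-ai-inference Python SDK; the endpoint URL and key are placeholders:

```python
# pip install azure-ai-inference
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError

# Placeholder endpoint/key for a serverless Llama 3.3 70B deployment.
client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.<region>.models.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

try:
    response = client.complete(
        messages=[UserMessage(content="Write a long essay about token limits.")],
        max_tokens=8192,  # anything above 4096 is rejected by the endpoint
    )
    print(response.choices[0].message.content)
except HttpResponseError as e:
    # Surfaces as: (Bad Request) max_tokens must be less than or equal to 4096
    print(e)
```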
Azure AI services
-
Vikram Singh • 1,980 Reputation points • Microsoft Employee
2025-02-24T06:18:40.0833333+00:00 Hi Rishab Mehta,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
This issue is due to a discrepancy between the expected token limit of 8192 and the enforced API limit of 4096 when using Azure AI Foundry with the Llama 3.3 70B model.
Azure AI Foundry Token Limits: Azure AI Foundry documentation (as of July 2024) states that certain models, including Llama 3.3 70B, support up to 8192 tokens for the total context window (input + output; see the budget sketch after this list). However, API-level limits may override this based on:
- Deployment SKU: Lower-cost SKUs (e.g., Standard tier) may enforce stricter token limits (e.g., 4096 tokens) to manage resource allocation.
- Regional/Service Constraints: Some Azure regions or AI Inference endpoints may apply default token caps for stability.
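To make the input + output budget concrete, a small illustrative sketch; the 8192-token window and the 4096 per-call cap are the figures from this thread, not independently verified:

```python
# Hypothetical budget arithmetic: output room shrinks as the prompt grows.
CONTEXT_WINDOW = 8192   # total tokens (input + output), per this thread
API_CAP = 4096          # per-request max_tokens cap enforced by the endpoint

def max_output_tokens(prompt_tokens: int) -> int:
    """Largest max_tokens value that fits both the context window and the API cap."""
    return max(0, min(CONTEXT_WINDOW - prompt_tokens, API_CAP))

print(max_output_tokens(1000))  # 4096 -> the API cap binds
print(max_output_tokens(6000))  # 2192 -> the context window binds
```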
Why You’re Seeing a 4096 Token Limit
- API Defaults: The Azure AI Completions API often enforces a max_tokens parameter cap of 4096 for safety/performance reasons, even if the model supports higher limits (see the fallback sketch after this list).
- Deployment Configuration: If your model was deployed without explicit token-limit overrides, it inherits Azure's default API settings.
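If the cap cannot be raised, one pragmatic pattern is to request the larger value and fall back on a 400 error; a hedged sketch, assuming the azure-ai-inference client from the question's setup:

```python
from azure.core.exceptions import HttpResponseError

def complete_with_fallback(client, messages, requested=8192, cap=4096):
    """Try the requested max_tokens; if the endpoint rejects it with a
    400 (Bad Request), retry at the known enforced cap."""
    try:
        return client.complete(messages=messages, max_tokens=requested)
    except HttpResponseError as err:
        if err.status_code == 400:
            return client.complete(messages=messages, max_tokens=cap)
        raise
```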
Solutions to Increase the Token Limit
Adjust Deployment Configuration
- Redeploy the model in Azure AI Studio with a higher-tier SKU (e.g., Premium or Enterprise SKUs) that explicitly supports 8192 tokens.
- Specify max_total_tokens: 8192 in the deployment configuration file (e.g., deployment.yml):

```yaml
model:
  name: llama-3-70b
  version: 3.3
compute:
  sku: Premium
properties:
  max_total_tokens: 8192
```
API Request Parameters: Ensure your Completions API request includes max_tokens=8192. For example:

```python
response = client.completions.create(
    model="your-deployment-id",
    prompt="Your prompt here",
    max_tokens=8192,  # Explicitly set this value
)
```
If this fails, the deployment SKU or API endpoint may not support the higher limit. Unfortunately, there is no way to adjust this API limit directly within AI Foundry. You can either optimize your prompts or use streaming to manage longer outputs.
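For the streaming suggestion, a minimal sketch with the azure-ai-inference client; note that streaming does not raise the max_tokens cap, it only lets you consume long outputs as they arrive (endpoint and key are placeholders):

```python
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.<region>.models.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

# Streaming variant: tokens arrive incrementally; the max_tokens cap still applies.
result = client.complete(
    messages=[UserMessage(content="Summarize this long document ...")],
    max_tokens=4096,
    stream=True,
)
for update in result:
    if update.choices and update.choices[0].delta.content:
        print(update.choices[0].delta.content, end="")
```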
I hope this helps! Do let me know if you have any other questions.
Thanks
-
Rishab Mehta • 80 Reputation points
2025-02-24T19:09:49.94+00:00 Thanks for the response! I do have a couple of questions:
Is there some way to know what SKU tier the model has been deployed with, and is there someplace I can see the various SKU tier options for AI Foundry? I tried to deploy the model again, and I can't figure out how to choose the tier I'm deploying with.
Is a deployment file the only way to deploy a model with higher max tokens, or can it also be done from the AI Foundry web console interface?
Finally, is the only way to know whether this is possible for the Llama model to actually try deploying it? Is there no documentation for this, other than trying and checking whether the deployment succeeds?
-
Vikram Singh • 1,980 Reputation points • Microsoft Employee
2025-02-25T05:44:01.94+00:00 Hi Rishab Mehta,
I'm glad the previous response was helpful. Let's address your follow-up questions one by one:
Is there some way to know what SKU tier the model has been deployed with, and is there someplace I can see the various SKU tier options for AI Foundry? I tried to deploy the model again, and I can't figure out how to choose the tier I'm deploying with.
You can check the SKU under the Virtual Machine selection while deploying the model.
For Llama-3.3-70B-Instruct, common SKUs include:
- GPU-Intensive: Standard_NC24s_v3, Standard_ND40rs_v2
- High Memory: Standard_E64s_v3
Make sure you check Azure VM SKUs for region-specific availability, as there may not be any premium tier configuration for this model.
Refer: Azure VM SKUs
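If you prefer to check SKU availability programmatically rather than in the portal, a hedged sketch using the azure-mgmt-compute package (subscription ID and region are placeholders):

```python
# pip install azure-mgmt-compute azure-identity
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute_client = ComputeManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# List GPU-oriented VM SKUs available in a given region (placeholder region).
for sku in compute_client.resource_skus.list(filter="location eq 'eastus'"):
    if sku.resource_type == "virtualMachines" and sku.name.startswith("Standard_NC"):
        print(sku.name)
```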
Is a deployment file the only way to deploy a model with higher max tokens, or can it also be done from the AI Foundry web console interface?
Currently, deploying a model with higher max tokens through the AI Foundry web console interface is not supported. You need to use a deployment file to specify the max_total_tokens parameter. Here's how you can do it:

- Create a Deployment File: Use a YAML file to define the deployment configuration, including the max_total_tokens parameter. Sample example:

```yaml
model:
  name: llama-3-70b
  version: 3.3
compute:
  sku: Premium
properties:
  max_total_tokens: 8192
```
- Deploy Using the Deployment File: Upload the deployment file through the Azure AI Foundry portal, or use the Azure CLI or SDK to deploy the model with the specified configuration (a sketch follows below). For more information, refer to: Use Terraform to create an Azure AI Foundry hub - Azure AI Foundry | Microsoft Learn
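For the SDK route, a hedged Python sketch using the azure-ai-ml package; note that the YAML schema shown above comes from this answer and may need adapting to the SDK's online-deployment schema (all identifiers are placeholders):

```python
# pip install azure-ai-ml azure-identity
from azure.ai.ml import MLClient, load_online_deployment
from azure.identity import DefaultAzureCredential

# Placeholder identifiers; substitute your own.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<project-name>",
)

# Load the deployment definition from the YAML file and apply it.
# NOTE: assumes deployment.yml conforms to the SDK's online-deployment schema.
deployment = load_online_deployment(source="deployment.yml")
ml_client.online_deployments.begin_create_or_update(deployment).result()
```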
Finally, is the only way to know whether this is possible for the Llama model to actually try deploying it? Is there no documentation for this, other than trying and checking whether the deployment succeeds?
To know whether deploying the Llama model with higher max tokens is possible, refer to the Azure AI Foundry documentation:
- How to use the Meta Llama family of models with Azure AI Foundry - Azure AI Foundry | Microsoft Learn
- Fine-tune Llama models in Azure AI Foundry portal - Azure AI Foundry | Microsoft Learn
I hope this is helpful!
Thanks
-
Rishab Mehta • 80 Reputation points
2025-02-25T17:46:44.6966667+00:00 One thing I'd like to clarify:
The Llama model was deployed as a serverless model, so there is no VM selection as far as I know.
-
Vikram Singh • 1,980 Reputation points • Microsoft Employee
2025-02-26T08:07:47.97+00:00 Thank you for the clarification! Since the Llama model was deployed as a serverless model, there wouldn't be a VM selection involved. In this case, the SKU tier and configuration options might be managed differently.
If you need to deploy a model with higher max tokens, you would still need to use a deployment file to specify the 'max_total_tokens' parameter, as mentioned earlier. This approach should work for serverless deployments as well.
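To confirm what a serverless deployment exposes, a hedged sketch using the azure-ai-ml SDK's serverless endpoint operations (identifiers are placeholders, and the available properties may vary by SDK version):

```python
# pip install azure-ai-ml azure-identity
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<project-name>",
)

# Serverless deployments have no VM SKU; list endpoints and their model IDs.
for endpoint in ml_client.serverless_endpoints.list():
    print(endpoint.name, endpoint.model_id)
```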
I hope this is helpful!
Thanks
-
Vikram Singh • 1,980 Reputation points • Microsoft Employee
2025-02-28T05:20:28.1233333+00:00 Greetings.
Just following up to check if my suggestion helped. Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
Thank you