Azure AI Llama-3.2-11B-Vision-Instruct shows a dramatically smaller context length than it should

JP 0 Reputation points
2025-01-15T22:24:30.5733333+00:00

Hi community,

Yesterday I created a serverless deployment of Llama-3.2-11B-Vision-Instruct in my Azure AI Studio project and then created a backend route to call it with context (for chat completion). I am not sure if I am missing anything, but Meta says the context length of the model is 128K. I am assuming this is 128K tokens. However, when I call the Azure AI serverless endpoint, I see this message:

Llama API Error: {"error":{"code":"Bad Request","message":"{\"object\":\"error\",\"message\":\"This model's maximum context length is 8192 tokens. However, you requested 113137 tokens (111089 in the messages, 2048 in the completion). Please reduce the length of the messages or completion.\",\"type\":\"BadRequestError\",\"param\":null,\"code\":400}","status":400}}

Also, I can't seem to set max_tokens to anything above 4096.

Is Azure throttling me for some reason? Why would it do so if the model has a context length of 128K?

Need some assistance here.

I am using NestJS and made a simple backend call to the endpoint.
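
For context, this is roughly what the backend call looks like (a minimal sketch, not my exact code; the endpoint URL, env variable names, and exact path are placeholders for my deployment, and depending on the deployment the auth header may be `api-key` instead of a bearer token):

```typescript
// Minimal NestJS service that forwards a chat history to the serverless endpoint.
import { Injectable } from '@nestjs/common';

@Injectable()
export class LlamaService {
  // Placeholders: e.g. https://<deployment>.<region>.models.ai.azure.com
  private readonly endpoint = process.env.LLAMA_ENDPOINT!;
  private readonly apiKey = process.env.LLAMA_API_KEY!;

  async chat(messages: { role: string; content: string }[]) {
    const res = await fetch(`${this.endpoint}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        messages,
        max_tokens: 2048, // anything above 4096 gets rejected for me
      }),
    });
    if (!res.ok) {
      // This is where the 8192-token error above surfaces.
      throw new Error(`Llama API Error: ${await res.text()}`);
    }
    return res.json();
  }
}
```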

Azure OpenAI Service
Azure AI services

2 answers

  1. JP 0 Reputation points
    2025-01-17T17:37:06+00:00

    I don't think this answer is accurate. You are setting max_tokens to 1K, while I am talking about 128K, which is the maximum context length of the model.

    Basically, in the serverless implementation the context length is capped at 8K tokens. I just heard back from expert support. A rough workaround sketch follows below.
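
    Until Microsoft raises that cap, the practical workaround is trimming the chat history so the prompt plus completion fits under 8K. A rough sketch of what I mean (the ~4 characters per token estimate is a heuristic, not the real Llama tokenizer, so keep a safety margin):

    ```typescript
    // Drop the oldest turns (keeping the system prompt) until the estimated
    // prompt size plus the completion budget fits under the 8192-token cap.
    type Message = { role: string; content: string };

    const CONTEXT_CAP = 8192; // serverless cap reported by the error message
    const COMPLETION = 2048;  // tokens reserved for the reply
    const MARGIN = 256;       // safety margin, since the estimate is rough

    // Heuristic: ~4 characters per token, plus a little per-message overhead.
    const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4) + 4;

    function trimHistory(messages: Message[]): Message[] {
      const budget = CONTEXT_CAP - COMPLETION - MARGIN;
      const trimmed = [...messages];
      let total = trimmed.reduce((sum, m) => sum + estimateTokens(m), 0);
      // Keep index 0 (the system prompt) and drop the oldest turn after it.
      while (total > budget && trimmed.length > 2) {
        total -= estimateTokens(trimmed[1]);
        trimmed.splice(1, 1);
      }
      return trimmed;
    }
    ```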


  2. JP 0 Reputation points
    2025-01-17T17:53:56.98+00:00

    The real reason is that Microsoft Azure is capping the context length of these models at 8K tokens (internally via configs), which doesn't even make any sense because that is only about 6% of the model's 128K context window.
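
    If anyone wants to verify the effective cap on their own deployment, the error body itself states it. A quick sketch that deliberately oversends and parses the limit out of the response (the endpoint and key are the same placeholders as in my question, and the regex simply matches the wording of the error above):

    ```typescript
    // Discover a deployment's real context cap by sending an oversized prompt
    // and parsing "maximum context length is N tokens" out of the error body.
    async function probeContextCap(endpoint: string, apiKey: string): Promise<number | null> {
      const hugePrompt = 'x '.repeat(600_000); // ~300K tokens by the rough 4-chars/token estimate
      const res = await fetch(`${endpoint}/v1/chat/completions`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({
          messages: [{ role: 'user', content: hugePrompt }],
          max_tokens: 16,
        }),
      });
      const body = await res.text();
      const match = body.match(/maximum context length is (\d+) tokens/);
      return match ? Number(match[1]) : null;
    }
    ```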

