Azure AI Llama-3.2-11B-Vision-Instruct shows a dramatically smaller context length than it should

JP 0 Reputation points
2025-01-15T22:24:30.5733333+00:00

Hi community,

Yesterday I created a serverless deployment of Llama-3.2-11B-Vision-Instruct in my Azure AI Studio project and then created a backend route to call it with context (for chat completion). I am not sure if I am missing anything, but Meta says the context length of the model is 128K. I am assuming this is 128K tokens. However, when I call the Azure AI serverless endpoint, I see this message:

Llama API Error: {"error":{"code":"Bad Request","message":"{\"object\":\"error\",\"message\":\"This model's maximum context length is 8192 tokens. However, you requested 113137 tokens (111089 in the messages, 2048 in the completion). Please reduce the length of the messages or completion.\",\"type\":\"BadRequestError\",\"param\":null,\"code\":400}","status":400}}

Also, I can't seem to set max_tokens to anything above 4096.

Is Azure throttling me for some reason? Why would it do so if the model has a context length of 128K?

Need some assistance here.

I am using NestJS and made a simple backend call to the endpoint.
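
For context, this is roughly what the backend call looks like (a minimal sketch, not my exact code; the endpoint URL, env variable names, and exact path are placeholders for my deployment, and depending on the deployment the auth header may be `api-key` instead of a bearer token):

```typescript
// Minimal NestJS service that forwards a chat history to the serverless endpoint.
import { Injectable } from '@nestjs/common';

@Injectable()
export class LlamaService {
  // Placeholders: e.g. https://<deployment>.<region>.models.ai.azure.com
  private readonly endpoint = process.env.LLAMA_ENDPOINT!;
  private readonly apiKey = process.env.LLAMA_API_KEY!;

  async chat(messages: { role: string; content: string }[]) {
    const res = await fetch(`${this.endpoint}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        messages,
        max_tokens: 2048, // anything above 4096 gets rejected for me
      }),
    });
    if (!res.ok) {
      // This is where the 8192-token error above surfaces.
      throw new Error(`Llama API Error: ${await res.text()}`);
    }
    return res.json();
  }
}
```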

Azure OpenAI Service
Azure AI services

2 answers

  1. JP 0 Reputation points
    2025-01-17T17:37:06+00:00

    I don't think this answer is accurate. You are setting max_tokens to 1K, while I am talking about 128K, which is the maximum context length of the model.

    Basically, in the serverless implementation the context length is capped at 8K tokens. I just heard back from expert support. A rough workaround sketch follows below.
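
    Until Microsoft raises that cap, the practical workaround is trimming the chat history so the prompt plus completion fits under 8K. A rough sketch of what I mean (the ~4 characters per token estimate is a heuristic, not the real Llama tokenizer, so keep a safety margin):

    ```typescript
    // Drop the oldest turns (keeping the system prompt) until the estimated
    // prompt size plus the completion budget fits under the 8192-token cap.
    type Message = { role: string; content: string };

    const CONTEXT_CAP = 8192; // serverless cap reported by the error message
    const COMPLETION = 2048;  // tokens reserved for the reply
    const MARGIN = 256;       // safety margin, since the estimate is rough

    // Heuristic: ~4 characters per token, plus a little per-message overhead.
    const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4) + 4;

    function trimHistory(messages: Message[]): Message[] {
      const budget = CONTEXT_CAP - COMPLETION - MARGIN;
      const trimmed = [...messages];
      let total = trimmed.reduce((sum, m) => sum + estimateTokens(m), 0);
      // Keep index 0 (the system prompt) and drop the oldest turn after it.
      while (total > budget && trimmed.length > 2) {
        total -= estimateTokens(trimmed[1]);
        trimmed.splice(1, 1);
      }
      return trimmed;
    }
    ```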


  2. JP 0 Reputation points
    2025-01-17T17:53:56.98+00:00

    The real reason is that Microsoft Azure is capping the context length of these models at 8K tokens (internally via configs), which doesn't even make any sense because that is only about 6% of the model's 128K context window.
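
    If anyone wants to verify the effective cap on their own deployment, the error body itself states it. A quick sketch that deliberately oversends and parses the limit out of the response (the endpoint and key are the same placeholders as in my question, and the regex simply matches the wording of the error above):

    ```typescript
    // Discover a deployment's real context cap by sending an oversized prompt
    // and parsing "maximum context length is N tokens" out of the error body.
    async function probeContextCap(endpoint: string, apiKey: string): Promise<number | null> {
      const hugePrompt = 'x '.repeat(600_000); // ~300K tokens by the rough 4-chars/token estimate
      const res = await fetch(`${endpoint}/v1/chat/completions`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({
          messages: [{ role: 'user', content: hugePrompt }],
          max_tokens: 16,
        }),
      });
      const body = await res.text();
      const match = body.match(/maximum context length is (\d+) tokens/);
      return match ? Number(match[1]) : null;
    }
    ```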

