Documentation for the Llama 3.2 11B Vision Instruct model says 128K context window, but the model cannot process more than 8K tokens

Maheshbabu Boggu 0 Reputation points
2025-01-23T12:07:14.96+00:00

I am writing to inquire about the context window of the Llama 3.2 11B Vision Instruct model.

The documentation states that the context window is 128K tokens. However, when using the model, I am unable to provide input exceeding 8192 tokens. I would appreciate it if you could clarify this discrepancy and provide guidance on how to utilize the full 128K context window.
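For reference, here is a minimal sketch of how I am calling the model, assuming a serverless API (Models-as-a-Service) deployment and the azure-ai-inference Python SDK; the endpoint URL and key are placeholders:

```python
# Minimal sketch, assuming a serverless API deployment of
# Llama-3.2-11B-Vision-Instruct and the azure-ai-inference Python SDK.
# The endpoint URL and key below are placeholders.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.<region>.models.ai.azure.com",  # placeholder
    credential=AzureKeyCredential("<your-api-key>"),                    # placeholder
)

long_prompt = "..."  # a prompt of roughly 10,000 tokens

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content=long_prompt),
    ],
    max_tokens=512,
)

# Prompts longer than roughly 8,192 tokens are rejected by the endpoint,
# even though the model card lists a 128K context window.
print(response.choices[0].message.content)
```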

Thank you for your time and assistance.

Azure OpenAI Service, Azure AI services