Hi community,
Yesterday I deployed Llama-3.2-11B-Vision-Instruct as a serverless endpoint in my Azure AI Studio project and then created a backend route to call it with context (for chat completion). I am not sure if I am missing something, but Meta says the context length of the model is 128K, which I assume means 128K tokens. However, when I call the Azure AI serverless endpoint, I see this message:
Llama API Error: {"error":{"code":"Bad Request","message":"{"object":"error","message":"This model's maximum context length is 8192 tokens. However, you requested 113137 tokens (111089 in the messages, 2048 in the completion). Please reduce the length of the messages or completion.","type":"BadRequestError","param":null,"code":400}","status":400}}
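For reference, here is roughly the shape of the request body I am sending (a simplified sketch with placeholder values; longContext stands in for the actual context my route injects). The numbers in the error line up with it: the serialized messages tokenize to ~111089 tokens, plus the 2048-token completion budget, gives the 113137 total it rejects:

```ts
// Sketch of the request body (placeholder values). The endpoint counts the
// tokenized `messages` plus `max_tokens` against its context limit:
// 111089 (messages) + 2048 (completion) = 113137 > 8192.
const body = {
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: longContext }, // longContext: the large context my route passes in
  ],
  max_tokens: 2048, // completion budget, also counted toward the limit
};
```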
Also, I can't seem to set max_tokens to anything above 4096.
Is Azure throttling me for some reason? Why would it do that if the model has a 128K context length?
Need some assistance here.
I am using NestJS and made a simple backend call to the endpoint.
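Here is a minimal sketch of the service method making that call (simplified from my actual code; LLAMA_ENDPOINT_URL and LLAMA_API_KEY are placeholder env names, and the route/auth header follow the standard chat-completions shape my deployment exposes):

```ts
// Minimal sketch of my NestJS service (simplified; endpoint URL and key are
// placeholders read from env). Uses Node 18+'s global fetch.
import { Injectable } from '@nestjs/common';

type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

@Injectable()
export class LlamaService {
  // e.g. the serverless endpoint URL shown in Azure AI Studio for the deployment
  private readonly endpoint = process.env.LLAMA_ENDPOINT_URL;
  private readonly apiKey = process.env.LLAMA_API_KEY;

  async chat(messages: ChatMessage[]) {
    const res = await fetch(`${this.endpoint}/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${this.apiKey}`, // my deployment accepts a bearer key
      },
      body: JSON.stringify({
        messages,         // full chat history + context
        max_tokens: 2048, // anything above 4096 gets rejected for me
      }),
    });

    if (!res.ok) {
      // This is where the "maximum context length is 8192 tokens" error surfaces.
      throw new Error(`Llama API Error: ${await res.text()}`);
    }
    return res.json();
  }
}
```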