Throttling issue with Azure OpenAI API (concurrent chat completion requests)
Hi, I'm using the Azure OpenAI API in my RAG project and I've noticed something unusual. In one of the steps, I send 3-4 chat completion requests concurrently. However, the total response time is much longer than that of a single request, which suggests the requests may not actually be processed concurrently on Azure OpenAI's side. Could you explain why this happens and what options I have to address it?
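For context, here is a minimal sketch of how I issue the concurrent calls (simplified from my actual pipeline; the endpoint, API key, and prompts below are placeholders):

```python
import asyncio
import time

from openai import AsyncAzureOpenAI

# Placeholder endpoint and key; real values come from my config.
client = AsyncAzureOpenAI(
    azure_endpoint="https://<my-resource>.openai.azure.com",
    api_key="<my-api-key>",
    api_version="2024-08-01-preview",
)

async def ask(i: int) -> None:
    """Send one chat completion request and log its latency."""
    start = time.perf_counter()
    await client.chat.completions.create(
        model="gpt-4o",  # deployment name
        messages=[{"role": "user", "content": f"Placeholder question {i}"}],
    )
    print(f"request {i} finished in {time.perf_counter() - start:.1f}s")

async def main() -> None:
    # Fire all 4 requests at once rather than awaiting them sequentially.
    await asyncio.gather(*(ask(i) for i in range(4)))

asyncio.run(main())
```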
Deployment details:
- Model: gpt-4o (version 2024-08-06)
- Region: eastus
- Deployment type: Standard
- Tokens-per-minute rate limit: 400K (the maximum)
I've attached the actual log from a run with 4 requests; there's a 9-second gap between the first and the last response.