Scaling a RAG chat app with Azure OpenAI and Azure AI Search

Tomás Novo 0 Reputation points
2025-01-31T10:29:13.27+00:00

Hello,

I have a RAG chat app based on this GitHub repository from Microsoft.

I'm currently using these services and tiers:

  • Azure AI Search - Standard Tier
  • Azure OpenAI - Standard tier with a GPT-4o model deployment with a rate limit of 450,000 tokens/minute
  • Azure Cosmos DB - free tier (used for chat history)
  • App Service - Basic B1 tier.

I want to scale my app, making it capable of sustaining 9000 users.

However, I'm not sure about which tiers to choose. What would be the best solution, and why?

I'm almost sure the 9000 users will be using the app simultaneously, but it could be that an even larger number of users are on it at times, and I would like to avoid bottlenecks or compromising the quality of the solution!

I've also read a bit about auto-scaling, but I'm unsure if this would be the best solution.

Thank you in advance for your support!

Best regards,

Tomás


2 answers

  1. Sina Salam 16,766 Reputation points
    2025-01-31T19:33:58.0466667+00:00

    Hello Tomás Novo,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you want to scale the app to sustain 9,000 simultaneous users without bottlenecks or loss of quality, and that you are considering auto-scaling.

    From a solution architecture perspective, here is how I would approach each service:

    1. For Azure AI Search, upgrade to the Standard S3 tier. This tier supports up to 36 search units (SUs), built from up to 12 partitions and up to 12 replicas, where replicas × partitions cannot exceed 36 (so 12 × 12 is not a valid combination). Replicas are what add query throughput, so they matter most for high query volumes - https://learn.microsoft.com/en-us/azure/search/search-limits-quotas-capacity and https://learn.microsoft.com/en-us/azure/search/search-sku-tier
    2. For Azure OpenAI, check whether the rate limit of 450,000 tokens per minute is sufficient for your expected traffic. If it is not, request a quota increase or scale horizontally by adding more GPT-4o deployments and spreading requests across them. Provisioned throughput is another option to evaluate for a steady, high load - https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models and https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
    3. For Azure Cosmos DB, upgrade to a Standard provisioned throughput model with autoscale enabled - https://learn.microsoft.com/en-us/azure/cosmos-db/free-tier and https://azure.microsoft.com/en-us/pricing/details/cosmos-db/autoscale-provisioned
    4. For App Service, upgrade to the Standard S2 tier. This tier provides more CPU, memory, and scaling capabilities, supporting up to 10 instances. You may need to scale out to multiple instances to handle the load - https://learn.microsoft.com/en-us/azure/app-service/manage-scale-up and https://azure.microsoft.com/en-us/pricing/details/app-service/windows/
    5. For auto-scaling, enable autoscaling wherever the service supports it so that resource allocation follows demand. Azure's built-in autoscaling mechanisms can add or remove resources automatically based on predefined rules or real-time metrics - https://learn.microsoft.com/en-us/azure/architecture/best-practices/auto-scaling and https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-autoscale-overview
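
    To make the search-unit arithmetic in point 1 concrete, here is a minimal Python sketch. The limits it encodes (36 SUs, up to 12 replicas, partition counts of 1, 2, 3, 4, 6 or 12) are my reading of the linked limits page, and the function names are hypothetical, so please verify the numbers against the documentation before planning capacity with them.

```python
# Sketch only: validate replica/partition combinations for a Standard-tier
# Azure AI Search service. The constants are assumptions taken from the
# service-limits documentation; re-check them before relying on this.
MAX_SEARCH_UNITS = 36
MAX_REPLICAS = 12
ALLOWED_PARTITIONS = (1, 2, 3, 4, 6, 12)


def search_units(replicas: int, partitions: int) -> int:
    """Return replicas x partitions, or raise if the combination is invalid."""
    if not 1 <= replicas <= MAX_REPLICAS:
        raise ValueError(f"replicas must be 1..{MAX_REPLICAS}, got {replicas}")
    if partitions not in ALLOWED_PARTITIONS:
        raise ValueError(f"partitions must be one of {ALLOWED_PARTITIONS}")
    units = replicas * partitions
    if units > MAX_SEARCH_UNITS:
        raise ValueError(f"{replicas} x {partitions} = {units} SUs exceeds the cap")
    return units


def max_replicas_for(partitions: int) -> int:
    """Largest replica count that keeps replicas x partitions within the cap."""
    if partitions not in ALLOWED_PARTITIONS:
        raise ValueError(f"partitions must be one of {ALLOWED_PARTITIONS}")
    return min(MAX_REPLICAS, MAX_SEARCH_UNITS // partitions)


if __name__ == "__main__":
    for p in ALLOWED_PARTITIONS:
        print(f"{p} partition(s): up to {max_replicas_for(p)} replicas")
```

    The takeaway is that a query-heavy chat workload with a modest index can keep the partition count low and spend most of the 36 SUs on replicas.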

    NOTE:

    Be mindful of the cost implications of upgrading these services. Use the Azure Pricing Calculator to estimate the costs based on your specific usage patterns and requirements.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.


  2. SriLakshmi C 2,330 Reputation points Microsoft Vendor
    2025-01-31T20:24:36.8166667+00:00

    Hello Tomás Novo,

    Greetings and Welcome to Microsoft Q&A! Thanks for posting the question.

    To scale your RAG chat application efficiently for 9,000 simultaneous users, you need to size each Azure service for performance and reliability. Consider the following:

    Azure AI Search: Since you are currently on the Standard Tier, you should evaluate the maximum throughput and capacity of this tier. The Standard Tier allows for a certain number of indexes and documents, which may need to be increased depending on the number of users and queries.

    Please refer to https://learn.microsoft.com/en-us/azure/search/search-sku-tier.

    Azure OpenAI: Your current rate limit of 450,000 tokens/minute should be assessed against the expected usage. If each user is expected to generate a significant number of tokens, you may need to consider increasing your capacity or deploying additional instances to handle the load effectively.
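
    One way to assess the rate limit against expected usage is a back-of-the-envelope token budget. The sketch below is only illustrative: the 450,000 tokens/minute limit and the 9,000 users come from the question, while the tokens per request and the message rate per user are placeholder assumptions that you should replace with values measured from your own app. The function names are hypothetical.

```python
import math


def required_tpm(active_users: int,
                 requests_per_user_per_minute: float,
                 tokens_per_request: int) -> int:
    """Tokens per minute needed if every active user sends at the given rate."""
    return math.ceil(active_users * requests_per_user_per_minute * tokens_per_request)


def sustainable_requests_per_minute(tpm_limit: int, tokens_per_request: int) -> int:
    """Requests per minute that fit inside a tokens-per-minute limit."""
    return tpm_limit // tokens_per_request


if __name__ == "__main__":
    TPM_LIMIT = 450_000          # from the question
    USERS = 9_000                # from the question
    TOKENS_PER_REQUEST = 3_000   # ASSUMPTION: prompt + retrieved sources + answer
    REQ_PER_USER_PER_MIN = 0.5   # ASSUMPTION: one message every two minutes

    needed = required_tpm(USERS, REQ_PER_USER_PER_MIN, TOKENS_PER_REQUEST)
    fits = sustainable_requests_per_minute(TPM_LIMIT, TOKENS_PER_REQUEST)
    print(f"Needed: {needed:,} tokens/minute; limit: {TPM_LIMIT:,}")
    print(f"Current limit sustains about {fits} requests/minute")
```

    Under these placeholder numbers the required budget is far above the current limit, which is why measuring real token usage per request is the first step before deciding how much capacity to add.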

    Azure Cosmos DB: Since you are on the free tier, it may not provide sufficient throughput for 9000 users. Upgrading to a paid tier will allow for more requests and better performance, especially for chat history retrieval.
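
    To judge whether the free tier's throughput is enough, you can estimate the request units (RU) per second that chat history would consume. This is a rough sketch under stated assumptions: the free-tier allowance of 1,000 RU/s is from the free-tier documentation, but the RU cost per read and write and the number of operations per chat turn are placeholders. Replace them with the request charge that Cosmos DB reports for your own operations. The function name is hypothetical.

```python
import math

FREE_TIER_RU_PER_SECOND = 1_000  # free-tier throughput allowance


def required_ru_per_second(turns_per_second: float,
                           ru_per_write: float,
                           ru_per_read: float,
                           writes_per_turn: int = 2,
                           reads_per_turn: int = 1) -> int:
    """RU/s needed when each chat turn does a fixed number of reads and writes."""
    ru_per_turn = writes_per_turn * ru_per_write + reads_per_turn * ru_per_read
    return math.ceil(turns_per_second * ru_per_turn)


if __name__ == "__main__":
    # ASSUMPTION: 9,000 users each sending one message every two minutes,
    # which is 4,500 turns per minute, or 75 turns per second.
    turns_per_second = 9_000 * 0.5 / 60
    # ASSUMPTION: placeholder costs; measure the real request charge instead.
    needed = required_ru_per_second(turns_per_second, ru_per_write=10, ru_per_read=5)
    print(f"Estimated need: {needed} RU/s (free tier: {FREE_TIER_RU_PER_SECOND} RU/s)")
    print("Exceeds free tier" if needed > FREE_TIER_RU_PER_SECOND else "Fits free tier")
```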

    App Service: The Basic B1 tier may not be sufficient for handling 9000 simultaneous users. You might want to consider scaling up to a higher tier, such as the Standard or Premium tiers, which offer better performance and auto-scaling capabilities.

    Auto-Scaling: Implementing auto-scaling can help manage fluctuating loads by automatically adjusting resources based on demand. This can be beneficial in preventing bottlenecks during peak usage times.
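
    Conceptually, a rule-based autoscaler compares a metric against thresholds and moves the instance count up or down within fixed bounds. The sketch below only simulates that decision logic in plain Python; it does not call any Azure API, and the thresholds, bounds, and names (`ScaleRule`, `next_instance_count`) are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleRule:
    """Hypothetical thresholds and bounds for a CPU-based autoscale rule."""
    scale_out_above: float = 70.0  # add an instance above this average CPU %
    scale_in_below: float = 30.0   # remove an instance below this average CPU %
    min_instances: int = 2
    max_instances: int = 10


def next_instance_count(current: int, avg_cpu_percent: float,
                        rule: ScaleRule = ScaleRule()) -> int:
    """Return the instance count after applying the rule to one metric sample."""
    if avg_cpu_percent > rule.scale_out_above:
        target = current + 1
    elif avg_cpu_percent < rule.scale_in_below:
        target = current - 1
    else:
        target = current
    # Clamp to the configured bounds so the rule never over- or under-scales.
    return max(rule.min_instances, min(rule.max_instances, target))


if __name__ == "__main__":
    instances = 2
    for cpu in [45, 82, 91, 76, 50, 22, 18]:  # made-up CPU samples
        instances = next_instance_count(instances, cpu)
        print(f"avg CPU {cpu:>3}% -> {instances} instance(s)")
```

    Real autoscale settings also include a cool-down period between actions, so that a brief spike does not cause the instance count to oscillate.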

    I hope this helps. Do let me know if you have any further queries.


    If the response helped, please click "Accept Answer" and select "Yes" for "Was this answer helpful".

    Thank you!

