共用方式為


Key considerations for designing a GenAI gateway solution

1. Scalability

Scaling Consumers through Request Load Balancing: Enterprises often face challenges in increasing the number of consumers due to TPM and RPM limits when creating a GenAI gateway. Here are some potential situations and solutions for managing this challenge at the GenAI gateway.

1.1 Load Balancing for Multiple Pay-As-You-Go AOAI Instances

Supporting High Consumer Concurrency: To handle numerous LLM requests, segregate consumers into distinct regions. Azure OpenAI quotas are regional, so deploying in multiple regions allows concurrent operation. The GenAI gateway can balance loads by distributing requests across regions. Cross-region deployments may introduce latency. This latency can be mitigated by implementing region affinity, routing requests to the nearest regional deployment, or using benchmarking to identify optimal regions.

For example, consider two scenarios, the first with a single deployment region and the second with deployments in two regions. Since quota is per-region, the overall maximum RPM is higher in the second scenario, as shown below.

Description Single region deployment Multi region deployment
Total TPM limit 240,000 RegionA: 240,000
RegionB: 240,000
RPM enforced per 1000 TPM 6 RegionA: 6
RegionB: 6
Total RPM 1,440 RegionA: 1,440
RegionB: 1,440
Total RPM across all deployments 1,440 2,880

In a multi-region deployment, higher throughput allows processing more concurrent requests. Azure OpenAI evaluates requests in short periods (1 sec or 10 sec) and extrapolates RPM and TPM, throttling overflow requests. Multiple deployments distribute the load across resources, reducing the likelihood of hitting enforced limits.

1.2 Managing Spikes on PTUs with PAYG Endpoints

Enterprises often choose Provisioned Throughput Units (PTUs) with Azure OpenAI (AOAI) for stable and predictable performance over Pay-As-You-Go (PAYG). To manage sudden demand surges, a 'spillover' strategy can be used. Initially, route traffic to PTU-enabled deployments, and if PTU limits are reached, redirect overflow to TPM (Tokens Per Minute)-enabled AOAI endpoints. This strategy ensures all requests are processed.

Here is a detailed write-up and an implementation of this using APIM as a GenAI gateway.

2. Performance Efficiency

2.1 Improving Consumer Latency

Consumer latency is critical when designing a GenAI gateway. It refers to the time taken for a user's request to travel from the client to the gateway, then to Azure OpenAI (AOAI) services, and back. Minimizing this latency ensures a responsive user experience.

One way to reduce latency is by using AOAI Streaming endpoints. AOAI Streaming endpoints allow quicker responses by streaming completions in parts before the full completion is finished. Both OpenAI and AOAI use Server Sent Events (SSE) for streaming.

However, consider the downsides of using streaming. The GenAI gateway must handle SSE streaming, read each chunk of the event, process the "content" portion, stream it back to the application, and close the connection on stream termination.

2.2 Quota Management

AOAI's quota feature assigns rate limits to deployments using Tokens Per Minute (TPM) and Requests Per Minute (RPM). Learn more about AOAI quota management here.

In large enterprises with multiple applications accessing GenAI resources, managing quota distribution is crucial for fair usage and optimized resource allocation. Each application should benchmark its TPM and RPM requirements before integrating with the GenAI gateway. This benchmarking allows the gateway to allocate resources appropriately.

Benchmarking Token Consumption (PTU and PAYG): Use the azure openai benchmark tool to perform benchmarking of the application. Go over their guidelines to perform the benchmarking.

Here are some suggestions for the approaches for managing quota at a consumer level.

  • Setting up dedicated endpoints for consumers

    For a limited number of consumers, assign dedicated endpoints to individual consumers or groups with similar requirements. Configure the GenAI Gateway to route traffic based on the consumer's identity. This approach is effective for managing a smaller consumer base.

    Quota distribution is set at endpoint creation and requires continuous monitoring to ensure efficient utilization. Some consumers may underutilize their resources while others may experience shortages, leading to inefficient consumption. Regular assessment and reallocation of quotas are necessary to maintain optimal resource usage.

    Refer to the best practices for setting up Multi-tenancy for Azure OpenAI for deployment configuration.

  • Assign rate limits at the consumer level

    An alternative approach is to apply rate limits at the consumer level in the GenAI Gateway. If a consumer surpasses their limit, the GenAI Gateway can:

    • Restrict access to the GenAI resource until the quota is replenished

    • Degrade the consumer experience based on the defined contract

      This access restriction eliminates the need for deployment separation at the Azure OpenAI level. Consumers can implement retry logic for better resiliency.

      Additionally, a GenAI Gateway can enforce rate limiting best practices at the consumer level. These practices ensure consumers set max_tokens and a small best_of value to avoid draining tokens.

2.3 Consumer-Based Request Prioritization

Multiple consumers with varying priorities may try to access AOAI deployments. Since AOAI imposes hard constraints on token consumption per second, request prioritization must ensure critical workloads access GenAI resources first.

Requests can be categorized by priority. Low-priority requests can be queued until capacity becomes available. Continuously monitor AOAI resources to track available capacity. As capacity becomes available, an automated process can execute queued requests. Different approaches to monitor PTU capacity are discussed here.

Leveraging Circuit-Breaker technique to prioritize requests:

The Circuit-Breaker technique in an API gateway can prioritize requests during peak loads. By designating certain consumers as prioritized and others as non-prioritized, the gateway monitors backend response codes. When the backend returns response codes like 429 (quota reached), the circuit-breaker triggers a longer break for non-prioritized consumers, temporarily halting their requests to reduce backend stress. For prioritized consumers, the break is shorter, ensuring quicker resumption of service for critical processes.

3. Security and Data Integrity

3.1 Authentication

Azure OpenAI (AOAI) supports two forms of authentication.

  • Microsoft Entra ID: AOAI supports Microsoft Entra authentication with managed identities or Entra application. Managed identities enable keyless authentication between the consumer and AOAI. Entra application-based authentication requires each consumer to maintain a client ID and secret for the Entra app with access to AOAI resources.

  • API Key: Consumers can use a secret key to authenticate with AOAI. API keys are secret and must be managed by the consumer. Distributing requests across multiple AOAI endpoints requires managing each API key independently. This increases the risk of security breaches if a key is compromised and complicates key management practices like rotation and blocking specific consumers.

The Gateway might interface with endpoints that are not AOAI, which could have different authentication methods.

A suggested approach is to offload authentication to AOAI (or other GenAI endpoints) to the GenAI Gateway and terminate consumer authentication at the Gateway level. This approach decouples GenAI endpoint authentication from end consumers, allowing the use of a uniform enterprise-wide authentication mechanism like OAuth. It also mitigates the risks mentioned earlier. For AOAI endpoints, the GenAI Gateway can use managed identity for authentication. When authentication is offloaded, the GenAI resource cannot recognize individual consumers, as the Gateway uses its own credentials for all requests.

3.2 Personally Identifiable Information (PII) and Data Masking

The GenAI gateway acts as a broker between the consumer and backend AOAI services. Using the GenAI gateway for PII detection and data masking is crucial. This setup allows:

  • Centralized handling of sensitive data
  • Ensuring personal information is identified and managed before processing by AOAI

A centralized approach standardizes PII handling practices across multiple consumer applications, leading to consistent and maintainable data privacy protocols.

Automated processes at the Gateway level can intercept requests and detect PII before processing by Azure OpenAI services. Once detected, PII data can be redacted or replaced with generic placeholders.

  • Detecting PII

    • Services such as Azure AI Language can be used for identifying and categorizing PII information in text data. Azure Purview can also help in detecting and surfacing PII information.

    • Microsoft Presidio can be used for fine-grain control over identification and anonymization of PII data.

    • For more specific or customized PII detection, a custom domain-specific ML model can be trained using Azure Machine Learning service. A REST endpoint exposes the model.

However, integrating an extra layer for PII detection and masking can increase the overall response latency for the consumers. This factor must be balanced against the need for data privacy and compliance when designing the system.

3.3 Data Sovereignty

Data sovereignty in the context of AOAI refers to the legal and regulatory requirements related to the storage and processing of data within the geographic boundaries of a specific country or region. Planning for data sovereignty is critical for a business to avoid non-compliance with local data protection laws resulting in hefty fines.

The GenAI gateway can play a crucial role in data sovereignty by utilizing region affinity based on the consumer's location. It can intelligently redirect traffic to backend AOAI instances and other cloud services for processing requests. The request will be located in regions that comply with the relevant data residency and sovereignty laws. In a hybrid setup that combines on-premises custom Large Language Models (LLMs) with AOAI, it is essential to ensure that the hybrid system also adheres to multi-region availability requirements to support consumer affinity.

3.4 Content Moderation

With the rise of LLM-based chat applications, organizations must prevent users from disclosing sensitive data to externally hosted LLMs. Similarly, the response data from LLMs must be screened to exclude any profanity.

The GenAI gateway design allows enterprises to implement a centralized content moderation strategy for their GenAI applications. For Azure OpenAI, default content filtering occurs within Azure, and enterprises can configure the level of content moderation within Azure.

For more content moderation needs, integrate the GenAI gateway with a content moderation service. Here are some suggestions:

  • Azure Content Moderator Service: Scans text, image, and video content for potential risky, offensive, or undesirable aspects.

  • AI Content Safety: Detects harmful user-generated and AI-generated content in applications and services. Includes text and image APIs for detecting harmful material. Enabled by default when OpenAI is deployed.

Refer to this document for details.

4. Operational Excellence

4.1 Context Length and Modality

Context length is the number of input tokens that the model can handle. LLMs are rapidly evolving to support longer context lengths, resulting in larger request bodies. Additionally, some models can handle different data modes and produce varied data types like images and videos.

The GenAI Gateway design must account for these advancements. It should efficiently manage large, mixed-content requests and support diverse output types, ensuring versatility and robustness in handling complex LLM functionalities.

4.2 Monitoring and Observability

Monitoring and Observability are essential for creating robust and fault-tolerant systems. When building a GenAI gateway, it is key to measure and monitor the overall performance. The overall performance includes tracking various facets such as:

  • Error rates
  • Total time for requests and responses
  • Latency introduced by the gateway layer
  • Latency due to cross-region calls between the gateway and AOAI instances

Azure OpenAI Metrics via Azure Monitor: The Azure OpenAI service default metrics are available via Azure Monitor. Using these default metrics allows downstream systems (for example, GenAI gateway) to:

  • Perform custom operations
  • Build dashboards
  • Set up alerts

However, consider the latency involved with Azure Monitor as it is crucial for real-time monitoring and decision-making processes.

Generating Custom Metrics and Logs via GenAI Gateway: Enterprises may need more information beyond AOAI metrics, such as capturing gateway-induced latency and custom business metrics. This information may be needed on a real-time or near-real-time basis by downstream systems.

Suggested approaches for monitoring and observability using GenAI gateway:

  • Emitting Custom Events to Real-Time Messaging System: The GenAI gateway can intercept requests or responses, extract relevant information, and create events. These events can be pushed asynchronously into real-time messaging systems like Kafka and Azure EventHub. These events can be consumed by a streaming event aggregator (for example, Azure Stream Analytics) to:

    • Populate a data store
    • Provide data for dashboards
    • Trigger actions based on certain rules
  • Emitting Custom Metrics to a Metrics Collector: The GenAI gateway can emit custom metrics to support specific business needs to a metrics collector (with a time-series database). The metric collector can power dashboards, alerts, and other custom functionalities. Azure Monitor offers mechanisms for emitting and collecting custom metrics. Open-source alternatives like Prometheus can also be implemented, as described in this post.

It's essential to understand that these custom metrics differ significantly from metrics generated by the AOAI service. Hence, a careful assessment of when to use which metrics is crucial.

4.3 Using Hybrid LLMs

The GenAI gateway in an enterprise acts as a frontend for all GenAI deployments. It covers both Azure OpenAI and custom LLM deployments either on On-Premises Datacenters or on other cloud providers.

Accessing these differently hosted LLMs may vary in multiple aspects:

  • Consumer authentication
  • Emitted metrics
  • Quota management
  • Latency requirements
  • Content moderation approaches

Hence, while designing the GenAI gateway, it's crucial to understand the organization's hybrid strategy considering the above-mentioned aspects. This understanding will dictate how the gateway interfaces with various LLMs and other hybrid services, ensuring efficient and secure access while meeting specific operational requirements.

4.4 Model version management

In the rapidly evolving landscape of LLMs, the capability to seamlessly transition between model versions is crucial for rapid experimentation, swift adoption of performance improvements, or security upgrades.

The GenAI gateway should support Model Version Management, enabling smooth integration of new LLM versions while maintaining operational continuity for consumer applications.

The gateway should facilitate key Model version management features, such as:

Testing and Rollout: Execute a comprehensive test suite to ensure performance, reliability, and compatibility of new LLM versions within the existing ecosystem before a broader rollout. The gateway must support these testing requirements by exposing test-specific endpoints and facilitating a controlled rollout to a subset of consumers.

Ease of version upgrades and rollbacks: The gateway must have mechanisms to quickly roll-forward to newer, stable versions or roll back to previous versions in response to any critical issues that may arise post-deployment.

4.5 Resilience and Fault Tolerance

Resilience and fault tolerance are critical aspects of any GenAI gateway design. The gateway should be designed to handle failures gracefully and ensure minimal disruption to consumer applications. The following are some key considerations for building a resilient and fault-tolerant GenAI gateway:

  • Backoff and Retry Mechanisms: Implementing backoff and retry mechanisms in the gateway can help manage transient failures and reduce the impact of service disruptions. The gateway should be able to intelligently retry requests based on the type of error and the current load on the system.
  • Backup Models and Fallback Strategies: The gateway should have the ability to switch to backup models or fallback strategies if there are model failures or service outages. This strategy ensures that consumer applications can continue to function even when primary models are unavailable.
  • Regional Fail-over: The gateway should be designed to support regional failover to ensure high availability and reliability. In the event of a regional outage, the gateway should be able to redirect traffic to alternative regions to minimize downtime.

5. Cost Optimization

5.1 Effective Utilization of PTUs

The Azure OpenAI Sizing tool helps enterprises plan their Azure OpenAI (AOAI) capacity based on their requirements. Procuring Provisioned Throughput Units (PTUs) provides predictable performance but requires advance payment and reservation of AOAI quotas. Underutilized reserved capacity can lead to inefficient resource allocation and financial overhead.

To mitigate this inefficiency, consider the following approaches:

Spillover Strategy to Control Costs: Utilize pre-purchased PTUs first, then route excess traffic to Pay-As-You-Go (PAYG) endpoints. This strategy allows a lower PTU capacity to be used.

Effective PTU Consumption: Separate consumers into real-time and batch (scheduled/on-demand) categories. Apply monitoring to ensure batch consumers use PTUs only when underutilized. A detailed approach is available here.

5.2 Tracking Resource Consumption at Consumer Level

In a large enterprise setup, operational costs are shared among different business units through a charge-back model. For GenAI resources, this tracking involves:

  • Measuring consumption per consumer for both PTU (Reserved capacity) and TPMs (Pay-as-you-go) quota
  • Providing transparent cost reporting, quota allocation vs. consumed reporting, and cost attribution functionalities

Consumption Tracking in AOAI

The approach for consumption tracking depends on the mode of interaction with AOAI services.

Batch Processing Mode:

  • Send a set of inputs all at once
  • Receive outputs after the model processes the entire batch

Usage information in the response body includes the total number of tokens consumed:

"usage": {
  "prompt_tokens": 14,
  "completion_tokens": 436,
  "total_tokens": 450
}

Streaming Mode:

In streaming mode, AOAI does not return usage statistics in the response. To count tokens:

  • Measure prompt tokens: Calculate from the request using a library like tiktoken.
  • Measure completion tokens: Count the number of events in the stream while iterating and streaming the response.

Total tokens are the sum of prompt and completion tokens. This count is an approximation, as each chunk of the response may not correspond to a single token.