GenAI gateway reference architecture using APIM
This section presents reference architectures for a GenAI gateway in an enterprise that needs to access both Azure OpenAI (AOAI) resources and custom LLM deployments hosted on its own premises. A GenAI gateway can be designed in many ways by combining various Azure services; this section demonstrates using the Azure API Management (APIM) service as the main component for building the necessary features of a GenAI gateway solution.
Reference Architectures using Azure API Management
The Azure API Management (APIM) Landing Zone accelerator provides a comprehensive solution for deploying a GenAI gateway using Azure API Management, with best practices for security and operational excellence. A GenAI gateway using APIM is one of the reference scenarios implemented in this accelerator.
Cloud-based GenAI Gateway
This design shows how to use APIM to create a GenAI gateway that integrates smoothly with AOAI services in the cloud and with any on-premises custom LLMs deployed and available as REST endpoints. The architecture incorporates elements engineered for batch use cases, with the aim of optimizing PTU utilization as described here.
Figure 1: Cloud-Based GenAI using APIM
APIM products and subscription features can enable various generative AI scenarios in an enterprise. Different products can offer the following functionalities:
- Creating content.
- Producing embeddings.
- Searching.
Subscriptions allow different teams to access these functionalities.
Considerations for the cloud based approach: Keep in mind that the gateway component is cloud-based, meaning the Azure network processes every request before applying gateway policies. This can increase latency for on-premises services. Additionally, ensure proper network setup for inbound connections if LLM models are deployed on-premises.
On-premises GenAI Gateway using APIM Self-Hosted Gateways
Many enterprises want to use existing in-house capabilities, but network constraints prevent them from allowing inbound connections from Azure into their internal network.
Azure API Management (APIM) self-hosted gateways can be used to create a GenAI gateway that seamlessly integrates with AOAI services and on-premises applications. The self-hosted APIM gateway acts as a crucial component, bridging AOAI services with the enterprise's internal network.
Figure 2: On-premises Self-Hosted APIM Gateway
With APIM self-hosted gateway, the requests from the enterprise's internal network stay within the network unless they reach out to the AOAI resource. This approach enables all the features of the gateway inside the network and eliminates the need for inbound connection from the cloud.
The gateway can use any existing on-premises queue deployment for scheduling requests and connect with an enterprise-wide monitoring system. This queue enables gateway logs and metrics to be combined with existing consumer application logs and metrics.
Considerations for the on-premises approach: Organizations must deploy and maintain the self-hosted gateway, ensuring it scales horizontally to handle load and remains elastic for request surges. If using a custom metrics store, they must build their own monitoring and alerting solutions to support the following actions:
- Dynamically scheduling requests.
- Generating charge-back reports.
Reference Design for Key Individual Capabilities
The following outlines the reference design for key GenAI gateway capabilities using Azure API Management (APIM) as the foundational technology.
1. Scalability
The Premium tier of APIM provides the capability to extend a single APIM instance across multiple Azure regions.
1.1 Supporting High Consumer Concurrency
A single Premium tier APIM service instance can do the following:
- Support multi-region deployments.
- Support multiple AOAI account configurations.
- Facilitate efficient traffic routing across regions.
- Support high consumer concurrency.
The following diagram illustrates this setup, where APIM efficiently routes traffic to multiple AOAI instances deployed in distinct regions. This geographical distribution of resources enhances the performance and availability of the service.
For more information, see multiple regions.
Figure 1: Handling High Consumer Concurrency
Scenario: Managing spikes with Provisioned Throughput Units (PTUs) and Pay As You Go (PAYG) endpoints
This diagram shows the implementation of a spillover strategy: traffic is initially routed to PTU-enabled deployments, and when PTU limits are reached, the overflow is redirected to TPM (Tokens Per Minute)-enabled Azure OpenAI (AOAI) endpoints. This redirection ensures that all requests are processed.
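A minimal policy sketch of such a spillover strategy is shown below. The backend IDs `ptu-backend` and `paygo-backend` are hypothetical names for backends you would configure in your own APIM instance:

```xml
<policies>
    <inbound>
        <base />
        <!-- Route to the PTU-enabled deployment first -->
        <set-backend-service backend-id="ptu-backend" />
    </inbound>
    <backend>
        <!-- On a 429 from the PTU deployment, retry once against the PAYG deployment -->
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="1" first-fast-retry="true">
            <choose>
                <when condition="@(context.Response != null &amp;&amp; context.Response.StatusCode == 429)">
                    <set-backend-service backend-id="paygo-backend" />
                </when>
            </choose>
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```

With `first-fast-retry="true"`, the switch to the PAYG backend happens without delay as soon as the PTU deployment throttles.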
For more information, see Scaling (Single Region).
Figure 2: Managing Spikes on PTUs with PAYG
1.2 Load Balancing across Multiple AOAI Instances
API Management supports backend pools, which are useful when you want to implement multiple backends for an API and load-balance requests across them.
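As a sketch, once a backend pool has been defined (here assumed to be named `aoai-pool` and to contain multiple AOAI backends), routing requests through it only requires setting the backend service in the inbound policy:

```xml
<policies>
    <inbound>
        <base />
        <!-- "aoai-pool" is a hypothetical load-balanced backend pool defined in APIM -->
        <set-backend-service backend-id="aoai-pool" />
    </inbound>
    <backend>
        <forward-request />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```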
An alternate load-balancing strategy can be implemented by authoring custom policies within APIM. Refer to this implementation of such a strategy using custom APIM policies.
2. Performance Efficiency
APIM policies can be used to enforce rate limits based on requests per minute (RPM) and tokens per minute (TPM).
2.1 Quota Management for Consumers
Different rate-limit values can be set for different use cases based on their subscription IDs, and limits can be enforced on both RPM and TPM. Throttling occurs when either limit is exceeded.
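A minimal inbound policy sketch that enforces both limits per subscription might look like the following; the numeric limits are illustrative, not recommendations:

```xml
<inbound>
    <base />
    <!-- RPM limit: at most 60 calls per 60-second window, keyed by subscription ID -->
    <rate-limit-by-key calls="60" renewal-period="60"
        counter-key="@(context.Subscription.Id)" />
    <!-- TPM limit: at most 10,000 AOAI tokens per minute, keyed by subscription ID -->
    <azure-openai-token-limit tokens-per-minute="10000"
        counter-key="@(context.Subscription.Id)"
        estimate-prompt-tokens="true" />
</inbound>
```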
2.1.1 Rate Limit based on TPM consumption
The Azure OpenAI token limit policy allows implementation of throttling based on Tokens Per Minute (TPM) consumption.
2.1.2 Retries for increased service availability
There are scenarios where AOAI responds with HTTP 429 when the TPM quota for a specific deployment is exceeded. To mitigate these quota limits, retries become an essential tool for ensuring service availability. Request throttling typically lasts for a window of a few seconds to minutes, so a retry strategy with exponential back-off can be implemented at the gateway layer. This strategy helps ensure that consumer requests are eventually served.
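A sketch of such a retry policy in the backend section follows; specifying `interval`, `delta`, and `max-interval` together enables APIM's exponential interval retry algorithm:

```xml
<backend>
    <!-- Retry up to 3 times on HTTP 429, with exponentially growing wait intervals -->
    <retry condition="@(context.Response.StatusCode == 429)"
           count="3" interval="1" delta="2" max-interval="30"
           first-fast-retry="false">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```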
For examples, refer to Sample APIM Policies for AOAI.
3. Security and data integrity
This section covers authentication options, including managed identities, and the protection of PII through detection and masking.
3.1 Authentication
APIM with AOAI provides several options for authentication including the following:
- API keys.
- Managed identities.
- Service principal.
The managed identity approach can be used to authenticate between APIM and backend Azure services that support managed identities.
The managed identity can be granted the appropriate access, such as "Azure AI Service User" on the AOAI instance, as mentioned in How to configure OpenAI. APIM then transparently authenticates to the backend, that is, AOAI.
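A minimal inbound policy sketch for this approach:

```xml
<inbound>
    <base />
    <!-- Acquire a token for the AOAI resource using APIM's managed identity;
         the token is attached to the backend request as a Bearer token -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>
```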
3.2 PII and data masking
This diagram shows how PII detection and data masking are enabled using the GenAI gateway. Upon receiving a request, the payload is sent to an Azure function for PII detection. This function can use services such as PII detection in Azure AI Language, Microsoft Presidio, or a custom machine learning model to identify PII. The detected entities are then used to mask the request, and the masked data is forwarded through APIM to AOAI.
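One way to sketch this flow in an APIM policy is to call the masking function with `send-request` before forwarding; the function URL `https://pii-masker-fn.azurewebsites.net/api/mask` is a hypothetical endpoint assumed to accept the raw payload and return a masked version of it:

```xml
<inbound>
    <base />
    <!-- Hypothetical: send the original payload to an Azure function that masks PII -->
    <send-request mode="new" response-variable-name="maskedBody" timeout="20" ignore-error="false">
        <set-url>https://pii-masker-fn.azurewebsites.net/api/mask</set-url>
        <set-method>POST</set-method>
        <set-body>@(context.Request.Body.As<string>(preserveContent: true))</set-body>
    </send-request>
    <!-- Replace the request body with the masked version before forwarding to AOAI -->
    <set-body>@(((IResponse)context.Variables["maskedBody"]).Body.As<string>())</set-body>
</inbound>
```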
Figure 3: PII and Data Masking
3.3 Data Sovereignty
This diagram shows how data is restricted to customer-specific regions using GenAI Gateway. Each region hosts AI-enabled applications, APIM, and AOAI. Traffic is routed to region-specific APIM and OpenAI using Traffic Manager. APIM routes requests to the region-specific Azure OpenAI instance.
To learn more about APIM multi-regional deployment, refer to Deploy Azure API management to multiple Azure regions.
Figure 4: Data Sovereignty via Multiple APIMs
Figure 5: Data Sovereignty via multi-instance APIM
4. Operational Excellence
4.1 Monitoring and Observability
Azure Monitor Integration
With APIM's native integration with Azure Monitor, requests, responses (payload), and APIM metrics can be logged into Azure Monitor. Additionally, Azure Monitor can collect and log metrics from other Azure services like AOAI, making it the default choice for monitoring and observability.
Figure 6: Monitoring using Azure Monitor
Azure Monitor provides a low-code/no-code way of generating insights, but it has some limitations:
- Ingestion latency can range from 30 seconds to 15 minutes, which is significant for real-time monitoring and decision-making.
- Capturing request/response payloads requires configuring the sampling rate in APIM. A high sampling rate can impact APIM throughput and increase latency.
- Large payloads may not be fully logged due to log size limitations. Azure Monitor has a log size limit of 32 KB, and if the combined size of all logged headers and payloads exceeds this limit, some logs may not be recorded.
Monitoring via Custom Events
Figure 7: Monitoring using Custom Events
In this approach, requests, responses, and other data from Azure API Management (APIM) can be logged as custom events to a messaging system like Event Hubs. The event stream from Event Hubs can be consumed by other services for near-real-time data aggregation, generating alerts, or performing other actions.
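A sketch of emitting one custom event per request is shown below; `gw-eventhub-logger` is an assumed APIM logger configured against an Event Hub, and the event fields are illustrative:

```xml
<outbound>
    <base />
    <!-- Emit one JSON event per request to Event Hubs for near-real-time processing -->
    <log-to-eventhub logger-id="gw-eventhub-logger">@{
        return new JObject(
            new JProperty("timestamp", DateTime.UtcNow),
            new JProperty("subscriptionId", context.Subscription?.Id),
            new JProperty("operation", context.Operation.Name),
            new JProperty("statusCode", context.Response.StatusCode)
        ).ToString();
    }</log-to-eventhub>
</outbound>
```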
While this approach offers an experience closer to real time, it requires writing custom aggregation services.
5. Cost Optimization
5.1 Tracking Consumption
The Emit token metric policy allows users to track the token consumption of AOAI services by emitting the Total Tokens, Prompt Tokens, and Completion Tokens as custom metrics to Application Insights. These metrics can be aggregated to generate reports for internal charge-back of consumers, and the policy supports both streaming and non-streaming AOAI responses.
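A sketch of the policy follows, emitting token metrics dimensioned by subscription and API so that charge-back reports can be broken down per consumer; the namespace is an arbitrary example:

```xml
<inbound>
    <base />
    <!-- Emit Total/Prompt/Completion token counts as custom metrics to Application Insights -->
    <azure-openai-emit-token-metric namespace="genai-gateway">
        <dimension name="Subscription ID" />
        <dimension name="API ID" />
    </azure-openai-emit-token-metric>
</inbound>
```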