Continuously monitor your generative AI applications
Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Continuous advancements in generative AI have led organizations to build increasingly complex applications to solve various problems (chatbots, RAG systems, agentic systems, and so on). These applications are used to drive innovation, improve customer experiences, and enhance decision-making. Although the models (for example, GPT-4) powering these generative AI applications are extremely capable, continuous monitoring has never been more important to ensure high-quality, safe, and reliable results. Continuous monitoring is most effective when an application is observed from multiple perspectives: token usage and cost; operational metrics such as latency and request count; and, importantly, continuous evaluation. To learn more about evaluation, see Evaluation of generative AI applications.
Azure AI and Azure Monitor provide tools for you to continuously monitor the performance of your generative AI applications from multiple perspectives. With Azure AI Online Evaluation, you can continuously evaluate your application regardless of where it's deployed or which orchestration framework it uses (for example, LangChain). You can use various built-in evaluators, which maintain parity with the Azure AI Evaluation SDK, or define your own custom evaluators. By continuously running the right evaluators over your collected trace data, your team can more effectively identify and mitigate security, quality, and safety concerns as they arise, whether in pre-production or post-production. Azure AI Online Evaluation is fully integrated with the suite of observability tooling available in Azure Monitor Application Insights, enabling you to build custom dashboards, visualize your evaluation results over time, and configure alerting for advanced application monitoring.
In summary, monitoring your generative AI applications has never been more important, given the complexity and rapid evolution of the AI industry. Azure AI Online Evaluation, integrated with Azure Monitor Application Insights, enables you to continuously evaluate your deployed applications to ensure that they're performant, safe, and produce high-quality results in production.
How online evaluation works
In this section, you'll learn how Azure AI Online Evaluation works, how it integrates with Azure Monitor Application Insights, and how you can use it to run continuous evaluations over trace data from your generative AI applications.
Tracing your generative AI application
The first step in continuously monitoring your application is to ensure that its telemetry data is captured and stored for analysis. To accomplish this, you'll need to instrument your generative AI application’s code to use the Azure AI Tracing package to log trace data to an Azure Monitor Application Insights resource of your choice. This package fully conforms with the OpenTelemetry standard for observability. After you have instrumented your application's code, the trace data will be logged to your Application Insights resource.
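As an illustration, here's a minimal instrumentation sketch. It assumes the Azure Monitor OpenTelemetry distro (azure-monitor-opentelemetry) as the exporter, and the connection string, span name, and attributes are placeholders; follow the Azure AI Tracing package guidance for the exact setup that matches your framework:

# Minimal sketch, assuming the azure-monitor-opentelemetry package; the connection string,
# span name, and attributes below are illustrative placeholders.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Point the OpenTelemetry exporter at your Application Insights resource.
configure_azure_monitor(connection_string="<your-application-insights-connection-string>")

tracer = trace.get_tracer(__name__)

# Wrap a model call in a span so its inputs and outputs are captured as trace data.
with tracer.start_as_current_span("chat_completion") as span:
    span.set_attribute("gen_ai.system", "az.ai.openai")  # illustrative GenAI semantic-convention attribute
    # ... call your model or orchestration framework here ...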
After you have included tracing in your application code, you can view the trace data in Azure AI Foundry or in your Azure Monitor Application Insights resource. To learn more about how to do this, see monitor your generative AI application.
Online Evaluation
After your application is instrumented to send trace data to Application Insights, it’s time to set up an Online Evaluation schedule to continuously evaluate this data. Azure AI Online Evaluation is a service that uses Azure AI compute to continuously run a set of evaluators. After you have set up an Online Evaluation schedule with the Azure AI Project SDK, it runs on a customizable schedule. Each time the service runs, it performs the following steps:
- Query application trace data from the connected Application Insights resource using the provided Kusto query.
- Run each evaluator over the trace data and calculate each metric (for example, groundedness: 3).
- Write evaluation scores back to each trace using standardized semantic conventions.
Note
Azure AI Online Evaluation supports the same metrics as Azure AI Evaluation. For more information on how evaluation works and which evaluation metrics are supported, see Evaluate your Generative AI application with the Azure AI Evaluation SDK.
For example, let’s say you have a deployed chat application that receives many customer questions on a daily basis. You want to continuously evaluate the quality of the responses from your application. You set up an Online Evaluation schedule with a daily recurrence. You configure the evaluators: Groundedness, Coherence, and Fluency. Every day, the service computes the evaluation scores for these metrics and writes the data back to Application Insights for each trace that was collected during the recurrence time window (in this example, the past 24 hours). Then, the data can be queried from each trace and made accessible in Azure AI Foundry and Azure Monitor Application Insights.
The evaluation results written back to each trace within Application Insights use the following conventions. A unique span is added to each trace for each evaluation metric.
Property | Application Insights Table | Fields for a given operation_Id | Example value |
---|---|---|---|
Evaluation metric | traces, AppTraces | customDimensions["event.name"] | gen_ai.evaluation.relevance |
Evaluation metric score | traces, AppTraces | customDimensions["gen_ai.evaluation.score"] | 3 |
Evaluation metric comment (if applicable) | traces, AppTraces | message | {"comment": "I like the response"} |
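If you want to read these results back programmatically rather than through the portal, the following sketch shows one possible approach. It assumes a workspace-based Application Insights resource and the azure-monitor-query package (not required by the setup steps below); the workspace ID is a placeholder:

# Minimal sketch, assuming azure-monitor-query (pip install azure-monitor-query).
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

logs_client = LogsQueryClient(DefaultAzureCredential())

# Pull evaluation scores written back by the Online Evaluation service,
# following the conventions in the table above.
EVALUATION_RESULTS_QUERY = """
traces
| where tostring(customDimensions["event.name"]) startswith "gen_ai.evaluation."
| project
    timestamp,
    operation_Id,
    metric = tostring(customDimensions["event.name"]),
    score = todouble(customDimensions["gen_ai.evaluation.score"])
| order by timestamp desc
"""

response = logs_client.query_workspace(
    workspace_id="<your-log-analytics-workspace-id>",  # placeholder
    query=EVALUATION_RESULTS_QUERY,
    timespan=timedelta(days=1),
)
for row in response.tables[0].rows:
    print(row)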
Now that you understand how Azure AI Online Evaluation works and how it connects to Azure Monitor Application Insights, it’s time to learn how to set up the service.
Set up Online Evaluation
In this section, you'll learn how to configure an Online Evaluation schedule to continuously monitor your deployed generative AI application. The Azure AI Project SDK offers these capabilities via a Python API and supports all of the features available in local evaluations. Use the following steps to submit your Online Evaluation schedule on your data using built-in or custom evaluators.
Note
Evaluations are only supported in the same regions as AI-assisted risk and safety metrics.
Prerequisites
Complete the following prerequisite steps to set up your environment and authentication to the necessary resources:
- An Azure Subscription.
- A Resource Group in an Evaluation-supported region.
- A new User-assigned Managed Identity in the same resource group and region. Make a note of the `clientId`; you'll need it later.
- An Azure AI Hub in the same resource group and region.
- An Azure AI project in this hub, see Create a project in Azure AI Foundry portal.
- An Azure Monitor Application Insights resource.
- Navigate to the hub page in the Azure portal and add the Application Insights resource, see Update Azure Application Insights and Azure Container Registry.
- An Azure OpenAI Deployment with a GPT model supporting `chat completion`, for example `gpt-4`.
- A `Connection String` for your Azure AI project to easily create the `AIProjectClient` object. You can get the Project connection string under Project details from the project's Overview page.
- Navigate to your Application Insights resource in the Azure portal and use the Access control (IAM) tab to add the `Log Analytics Contributor` role to the User-assigned Managed Identity you created previously.
- Attach the User-assigned Managed Identity to your project.
- Navigate to your Azure AI Services in the Azure portal and use the Access control (IAM) tab to add the `Cognitive Services OpenAI Contributor` role to the User-assigned Managed Identity you created previously.
- Make sure you're first logged into your Azure subscription by running `az login`.
Installation Instructions
- Create a virtual environment of your choice. To create one using conda, run the following command:
conda create -n online-evaluation
conda activate online-evaluation
- Install the required packages by running the following command:
pip install azure-identity azure-ai-projects azure-ai-ml
Tip
Optionally, you can `pip install azure-ai-evaluation` if you want a code-first experience to fetch the evaluator IDs for built-in evaluators in code. To learn how to do this, see Specifying evaluators from evaluator library.
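For example, the following sketch (consistent with how CoherenceEvaluator.id is used later in this article) shows how you might fetch a built-in evaluator's ID in code; the same pattern is assumed to apply to other built-in evaluators:

# Requires: pip install azure-ai-evaluation
from azure.ai.evaluation import CoherenceEvaluator, RelevanceEvaluator

# The class-level id resolves to the evaluator's registry ID, which you can pass
# to EvaluatorConfiguration in the Online Evaluation setup below.
print(CoherenceEvaluator.id)
print(RelevanceEvaluator.id)  # assumed to follow the same pattern as CoherenceEvaluator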
Set up tracing for your generative AI application
The first step in monitoring your application is to set up tracing. To learn how to configure tracing so that data is logged to Application Insights, see set up tracing for your generative AI application.
Using service name in trace data
To identify your service via a unique ID in Application Insights, you can use the service name OpenTelemetry property in your trace data. This is particularly useful if you're logging data from multiple applications to the same Application Insights resource and want to differentiate between them. For example, let's say you have two applications: App-1 and App-2, with tracing configured to log data to the same Application Insights resource. Perhaps you'd like to set up App-1 to be evaluated continuously by Relevance and App-2 to be evaluated continuously by Groundedness. You can use the service name to differentiate between the applications in your Online Evaluation configurations.
To set up the service name property, you can do so directly in your application code by following the steps in Using multiple tracer providers with different Resource. Alternatively, you can set the `OTEL_SERVICE_NAME` environment variable before deploying your app. To learn more about working with the service name, see OTEL Environment Variables and Service Resource Semantic Conventions.
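As an illustration, here's a minimal sketch of setting the service name in code. It assumes the standard OpenTelemetry SDK for Python, and App-1 is a placeholder application name:

# Option 1: set the environment variable before starting the app (no code changes):
#   export OTEL_SERVICE_NAME="App-1"
# Option 2: set the service name on the tracer provider's Resource in code, before configuring tracing.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({SERVICE_NAME: "App-1"})  # "App-1" is a placeholder
trace.set_tracer_provider(TracerProvider(resource=resource))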
To query trace data for a given service name, query the `cloud_RoleName` property. Add the following line to the KQL query you use within your Online Evaluation set-up:
| where cloud_RoleName == "service_name"
Query stored trace data in Application Insights
Using the Kusto Query Language (KQL), you can query your generative AI application's trace data from Application Insights to use for continuous Online Evaluation. If you use the Azure AI Tracing package to trace your generative AI application, you can use the following Kusto query to view the data in Application Insights:
Important
The KQL query used by the Online Evaluation service must output the following columns: `operation_Id`, `operation_ParentId`, and `gen_ai_response_id`. Additionally, each evaluator has its own input data requirements; the KQL query must also output the columns that are used as inputs to the evaluators themselves. For a list of data requirements for evaluators, see data requirements for built-in evaluators.
let gen_ai_spans = (
dependencies
| where isnotnull(customDimensions["gen_ai.system"])
| extend response_id = tostring(customDimensions["gen_ai.response.id"])
| project id, operation_Id, operation_ParentId, timestamp, response_id
);
let gen_ai_events = (
traces
| where message in ("gen_ai.choice", "gen_ai.user.message", "gen_ai.system.message")
or tostring(customDimensions["event.name"]) in ("gen_ai.choice", "gen_ai.user.message", "gen_ai.system.message")
| project
id = operation_ParentId,
operation_Id,
operation_ParentId,
user_input = iff(
message == "gen_ai.user.message" or tostring(customDimensions["event.name"]) == "gen_ai.user.message",
parse_json(iff(message == "gen_ai.user.message", tostring(customDimensions["gen_ai.event.content"]), message)).content,
""
),
system = iff(
message == "gen_ai.system.message" or tostring(customDimensions["event.name"]) == "gen_ai.system.message",
parse_json(iff(message == "gen_ai.system.message", tostring(customDimensions["gen_ai.event.content"]), message)).content,
""
),
llm_response = iff(
message == "gen_ai.choice",
parse_json(tostring(parse_json(tostring(customDimensions["gen_ai.event.content"])).message)).content,
iff(tostring(customDimensions["event.name"]) == "gen_ai.choice", parse_json(parse_json(message).message).content, "")
)
| summarize
operation_ParentId = any(operation_ParentId),
Input = maxif(user_input, user_input != ""),
System = maxif(system, system != ""),
Output = maxif(llm_response, llm_response != "")
by operation_Id, id
);
gen_ai_spans
| join kind=inner (gen_ai_events) on id, operation_Id
| project Input, System, Output, operation_Id, operation_ParentId, gen_ai_response_id = response_id
Optionally, you can use the sample operator or take operator in your Kusto query so that it returns only a subset of traces (for example, append `| sample 100` to the end of the query). Since AI-assisted evaluations can be costly at scale, this approach can help you control costs by evaluating only a random sample (or `n` traces) of your data.
Set up Online Evaluation with Azure AI Project SDK
You can submit an Online Evaluation scheduled job with the Azure AI Project SDK via a Python API. See the script below to learn how to set up Online Evaluation with performance and quality (AI-assisted) evaluators. To view a comprehensive list of supported evaluators, see Evaluate with the Azure AI Evaluation SDK. To learn how to use custom evaluators, see custom evaluators.
Start by importing the required packages and configuring the required variables:
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.projects.models import (
ApplicationInsightsConfiguration,
EvaluatorConfiguration,
EvaluationSchedule,
RecurrenceTrigger,
)
from azure.ai.evaluation import CoherenceEvaluator
# This sample includes the setup for an online evaluation schedule using the Azure AI Project SDK and Azure AI Evaluation SDK
# The schedule is configured to run daily over the collected trace data while running two evaluators: CoherenceEvaluator and RelevanceEvaluator
# This sample can be modified to fit your application's requirements
# Name of your online evaluation schedule
SAMPLE_NAME = "online_eval_name"
# Name of your generative AI application (will be available in trace data in Application Insights)
SERVICE_NAME = "service_name"
# Connection string to your Azure AI Foundry project
# Currently, it should be in the format "<HostName>;<AzureSubscriptionId>;<ResourceGroup>;<HubName>"
PROJECT_CONNECTION_STRING = "<HostName>;<AzureSubscriptionId>;<ResourceGroup>;<HubName>"
# Your Application Insights resource ID
APPLICATION_INSIGHTS_RESOURCE_ID = "appinsights_resource_id"
# Kusto Query Language (KQL) query to query data from Application Insights resource
# This query is compatible with data logged by the Azure AI Inferencing Tracing SDK (linked in documentation)
# You can modify it depending on your data schema
# The KQL query must output these required columns: operation_Id, operation_ParentId, and gen_ai_response_id
# You can choose which other columns to output as required by the evaluators you are using
KUSTO_QUERY = "let gen_ai_spans=(dependencies | where isnotnull(customDimensions[\"gen_ai.system\"]) | extend response_id = tostring(customDimensions[\"gen_ai.response.id\"]) | project id, operation_Id, operation_ParentId, timestamp, response_id); let gen_ai_events=(traces | where message in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") or tostring(customDimensions[\"event.name\"]) in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") | project id= operation_ParentId, operation_Id, operation_ParentId, user_input = iff(message == \"gen_ai.user.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.user.message\", parse_json(iff(message == \"gen_ai.user.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), system = iff(message == \"gen_ai.system.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.system.message\", parse_json(iff(message == \"gen_ai.system.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), llm_response = iff(message == \"gen_ai.choice\", parse_json(tostring(parse_json(tostring(customDimensions[\"gen_ai.event.content\"])).message)).content, iff(tostring(customDimensions[\"event.name\"]) == \"gen_ai.choice\", parse_json(parse_json(message).message).content, \"\")) | summarize operation_ParentId = any(operation_ParentId), Input = maxif(user_input, user_input != \"\"), System = maxif(system, system != \"\"), Output = maxif(llm_response, llm_response != \"\") by operation_Id, id); gen_ai_spans | join kind=inner (gen_ai_events) on id, operation_Id | project Input, System, Output, operation_Id, operation_ParentId, gen_ai_response_id = response_id"
Next, define a client and an Azure OpenAI GPT deployment (such as `GPT-4`) which will be used to run your Online Evaluation schedule. Also, connect to your Application Insights resource:
# Connect to your Azure AI Foundry Project
project_client = AIProjectClient.from_connection_string(
credential=DefaultAzureCredential(),
conn_str=PROJECT_CONNECTION_STRING
)
# Connect to your Application Insights resource
app_insights_config = ApplicationInsightsConfiguration(
resource_id=APPLICATION_INSIGHTS_RESOURCE_ID,
query=KUSTO_QUERY,
service_name=SERVICE_NAME
)
# Connect to your AOAI resource, you must use an AOAI GPT model
deployment_name = "gpt-4"
api_version = "2024-08-01-preview"
# This is your AOAI connection name, which can be found in your AI Foundry project under the 'Models + Endpoints' tab
default_connection = project_client.connections._get_connection(
"aoai_connection_name"
)
model_config = {
"azure_deployment": deployment_name,
"api_version": api_version,
"type": "azure_openai",
"azure_endpoint": default_connection.properties["target"]
}
Next, configure the evaluators you wish to use:
# RelevanceEvaluator
# id for each evaluator can be found in your AI Foundry registry - please see documentation for more information
# init_params is the configuration for the model to use to perform the evaluation
# data_mapping is used to map the output columns of your query to the names required by the evaluator
relevance_evaluator_config = EvaluatorConfiguration(
id="azureml://registries/azureml-staging/models/Relevance-Evaluator/versions/4",
init_params={"model_config": model_config},
data_mapping={"query": "${data.Input}", "response": "${data.Output}"}
)
# CoherenceEvaluator
coherence_evaluator_config = EvaluatorConfiguration(
id=CoherenceEvaluator.id,
init_params={"model_config": model_config},
data_mapping={"query": "${data.Input}", "response": "${data.Output}"}
)
Lastly, define the recurrence and create the schedule:
Note: In the prerequisite steps, you created a User-assigned managed identity to authenticate the Online Evaluation schedule to your Application Insights resource. The `AzureMSIClientId` in the `properties` parameter of the `EvaluationSchedule` class is the `clientId` of this identity.
# Frequency to run the schedule
recurrence_trigger = RecurrenceTrigger(frequency="day", interval=1)
# Dictionary of evaluators
evaluators = {
"relevance": relevance_evaluator_config,
"coherence" : coherence_evaluator_config
}
name = SAMPLE_NAME
description = f"{SAMPLE_NAME} description"
# AzureMSIClientId is the clientID of the User-assigned managed identity created during set-up - see documentation for how to find it
properties = {"AzureMSIClientId": "your_client_id"}
# Configure the online evaluation schedule
evaluation_schedule = EvaluationSchedule(
data=app_insights_config,
evaluators=evaluators,
trigger=recurrence_trigger,
description=description,
properties=properties)
# Create the online evaluation schedule
created_evaluation_schedule = project_client.evaluations.create_or_replace_schedule(name, evaluation_schedule)
print(f"Successfully submitted the online evaluation schedule creation request - {created_evaluation_schedule.name}, currently in {created_evaluation_schedule.provisioning_state} state.")
Perform operations on an Online Evaluation schedule
You can get, list, and disable Online Evaluation schedules by adding the following code to your Online Evaluation configuration script:
Warning: Wait a short amount of time (roughly 30 seconds) between creating an Online Evaluation schedule and running the `get_schedule()` API.
Get an Online Evaluation schedule:
name = "<my-online-evaluation-name>"
get_evaluation_schedule = project_client.evaluations.get_schedule(name)
List all Online Evaluation schedules:
count = 0
for evaluation_schedule in project_client.evaluations.list_schedule():
count += 1
print(f"{count}. {evaluation_schedule.name} "
f"[IsEnabled: {evaluation_schedule.is_enabled}]")
print(f"Total evaluation schedules: {count}")
Disable (soft-delete) Online Evaluation schedule:
name = "<my-online-evaluation-name>"
project_client.evaluations.disable_schedule(name)
Monitor your generative AI application
In this section, you'll learn how Azure AI integrates with Azure Monitor Application Insights to give you an out-of-the-box dashboard tailored with insights about your generative AI app, so you can stay up to date on the latest status of your application.
Insights for your generative AI application
If you haven’t set this up, here are some quick steps:
- Navigate to your project in Azure AI Foundry.
- Select the Tracing page on the left-hand side.
- Connect your Application Insights resource to your project.
If you already set up tracing in Azure AI Foundry portal, all you need to do is select the link to Check out your Insights for Generative AI application dashboard.
Once your data is streaming into your Application Insights resource, you can see it automatically populate this customized dashboard.
This view is a great place for you to get started with your monitoring needs.
- You can view token consumption over time to understand if you need to increase your usage limits or do additional cost analysis.
- You can view evaluation metrics as trend lines to understand the quality of your app on a daily basis.
- You can debug when exceptions take place and drill into traces using the Azure Monitor End-to-end transaction details view to figure out what went wrong.
This dashboard is an Azure Workbook that queries data stored in your Application Insights resource. You can customize the workbook to fit your business needs; to learn more, see editing Azure Workbooks. For example, you can add other custom evaluators that you might have logged, or markdown text to share summaries and use for reporting.
You can also share the workbook with your team so everyone stays up to date.