Monitor quality and token usage of deployed prompt flow applications

Article
03/10/2025

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Monitoring applications that are deployed to production is an essential part of the generative AI application lifecycle. Changes in data and consumer behavior can influence your application over time. The changes can result in outdated systems that negatively affect business outcomes and expose organizations to compliance, economic, and reputation risks.

Note

For an improved way to perform continuous monitoring of deployed applications (other than prompt flow), consider using Azure AI online evaluations.

By using Azure AI monitoring for generative AI applications, you can monitor your applications in production for token usage, generation quality, and operational metrics.

Integrations for monitoring a prompt flow deployment allow you to:

Collect production inference data from your deployed prompt flow application.
Apply Responsible AI evaluation metrics such as groundedness, coherence, fluency, and relevance, which are interoperable with prompt flow evaluation metrics.
Monitor prompts, completion, and total token usage across each model deployment in your prompt flow.
Monitor operational metrics, such as request count, latency, and error rate.
Use preconfigured alerts and defaults to run monitoring on a recurring basis.
Consume data visualizations and configure advanced behavior in the Azure AI Foundry portal.

Before you follow the steps in this article, make sure that you have the following prerequisites:

An Azure subscription with a valid payment method. Free or trial Azure subscriptions aren't supported for this scenario. If you don't have an Azure subscription, create a paid Azure account to begin.
An Azure AI Foundry project.
A prompt flow ready for deployment. If you don't have one, see Develop a prompt flow.
Azure role-based access controls are used to grant access to operations in the Azure AI Foundry portal. To perform the steps in this article, your user account must be assigned the Azure AI Developer role on the resource group. For more information on permissions, see Role-based access control in the Azure AI Foundry portal.

Install the Azure Machine Learning SDK for Python.

pip install -U azure-ai-ml

Requirements for monitoring metrics

Generative pretrained transformer (GPT) language models generate monitoring metrics that are configured with specific evaluation instructions (prompt templates). These models act as evaluator models for sequence-to-sequence tasks. Use of this technique to generate monitoring metrics shows strong empirical results and high correlation with human judgment when compared to standard generative AI evaluation metrics. For more information about prompt flow evaluation, see Submit a batch test and evaluate a flow and Evaluation and monitoring metrics for generative AI.

The following GPT models generate monitoring metrics. These GPT models are supported with monitoring and configured as your Azure OpenAI resource:

GPT-3.5 Turbo
GPT-4
GPT-4-32k

Supported metrics for monitoring

The following metrics are supported for monitoring.

Metric	Description
Groundedness	Measures how well the model's generated answers align with information from the source data (user-defined context.)
Relevance	Measures the extent to which the model's generated responses are pertinent and directly related to the given questions.
Coherence	Measures the extent to which the model's generated responses are logically consistent and connected.
Fluency	Measures the grammatical proficiency of a generative AI's predicted answer.

Column name mapping

When you create your flow, you need to ensure that your column names are mapped. The following input data column names are used to measure generation safety and quality.

Input column name	Definition	Required/Optional
Question	The original prompt given (also known as "inputs" or "question").	Required
Answer	The final completion from the API call that's returned (also known as "outputs" or "answer").	Required
Context	Any context data that's sent to the API call, together with the original prompt. For example, if you hope to get search results only from certain certified information sources or websites, you can define this context in the evaluation steps.	Optional

Parameters required for metrics

The parameters that are configured in your data asset dictate what metrics you can produce, according to this table.

Metric	Question	Answer	Context
Coherence	Required	Required	-
Fluency	Required	Required	-
Groundedness	Required	Required	Required
Relevance	Required	Required	Required

For more information on the specific data mapping requirements for each metric, see Query and response metric requirements.

Set up monitoring for a prompt flow

To set up monitoring for your prompt flow application, you first have to deploy your prompt flow application with inferencing data collection. Then you can configure monitoring for the deployed application.

Deploy your prompt flow application with inferencing data collection

In this section, you learn how to deploy your prompt flow with inferencing data collection enabled. For more information on how to deploy your prompt flow, see Deploy a flow for real-time inference.

Sign in to Azure AI Foundry.
If you're not already in your project, select it.
On the left pane, select Prompt flow.
Select the prompt flow that you created previously.

This article assumes that you created a prompt flow that's ready for deployment. If you don't have one, see Develop a prompt flow.
Confirm that your flow runs successfully and that the required inputs and outputs are configured for the metrics that you want to assess.

Supplying the minimum required parameters (question/inputs and answer/outputs) provides only two metrics: coherence and fluency. You must configure your flow as described in the section for Requirements for monitoring metrics. This example uses question (Question) and chat_history (Context) as the flow inputs, and answer (Answer) is used as the flow output.
Select Deploy to begin deploying your flow.
In the deployment window, ensure that Inferencing data collection is enabled to seamlessly collect your application's inference data to Azure Blob Storage. This data collection is required for monitoring.
Proceed through the steps in the deployment window to complete the advanced settings.
On the Review page, review the deployment configuration and select Create to deploy your flow.

By default, all inputs and outputs of your deployed prompt flow application are collected to your blob storage. As users invoke the deployment, the data is collected for your monitor to use.
Select the Test tab on the deployment page. Then test your deployment to ensure that it's working properly.

Monitoring requires that at least one data point comes from a source other than the Test tab in the deployment. We recommend using the REST API available on the Consume tab to send sample requests to your deployment. For more information on how to send sample requests to your deployment, see Create an online deployment.

Configure monitoring

In this section, you learn how to configure monitoring for your deployed prompt flow application.

Studio
Python SDK

On the left pane, go to My assets > Models + endpoints.
Select the prompt flow deployment that you created.
In the Enable generation quality monitoring box, select Enable.
Begin to configure monitoring by selecting the metrics that you want.
Confirm that your column names are mapped from your flow as defined in Column name mapping.
Select the Azure OpenAI Connection and Deployment values that you want to use to perform monitoring for your prompt flow application.
Select Advanced options to see more options to configure.
Adjust the sampling rate and the thresholds for your configured metrics. Specify the email addresses that should receive alerts when the average score for a given metric falls below the threshold.

If data collection isn't enabled for your deployment, creation of a monitor enables collection of inferencing data to your blob storage. This task takes the deployment offline for a few minutes.
Select Create to create your monitor.

Use the following code to set up monitoring for your deployed prompt flow application:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    GenerationTokenStatisticsMonitorMetricThreshold,
    GenerationTokenStatisticsSignal,
    GenerationSafetyQualityMonitoringMetricThreshold,
    GenerationSafetyQualitySignal,
    BaselineDataRange,
    LlmData,
)
from azure.ai.ml.entities._inputs_outputs import Input
from azure.ai.ml.constants import MonitorTargetTasks, MonitorDatasetContext

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Update your Azure resources details
subscription_id = "INSERT YOUR SUBSCRIPTION ID"
resource_group = "INSERT YOUR RESOURCE GROUP NAME"
project_name = "INSERT YOUR PROJECT NAME" # This is the same as your Azure AI Foundry project name
endpoint_name = "INSERT YOUR ENDPOINT NAME" # This is your deployment name without the suffix (e.g., deployment is "contoso-chatbot-1", endpoint is "contoso-chatbot")
deployment_name = "INSERT YOUR DEPLOYMENT NAME"
aoai_deployment_name ="INSERT YOUR AOAI DEPLOYMENT NAME"
aoai_connection_name = "INSERT YOUR AOAI CONNECTION NAME"

# These variables can be renamed but it is not necessary
app_trace_name = "app_traces"
app_trace_Version = "1"
monitor_name ="gen_ai_monitor_both_signals" 
defaulttokenstatisticssignalname ="token-usage-signal" 
defaultgsqsignalname ="gsq-signal"

# Determine the frequency to run the monitor, and the emails to receive email alerts
trigger_schedule = CronTrigger(expression="15 10 * * *")
notification_emails_list = ["test@example.com", "def@example.com"]

ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=project_name,
)

spark_compute = ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3")
monitoring_target = MonitoringTarget(
    ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
    endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}",
)

# Set thresholds for passing rate (0.7 = 70%)
aggregated_groundedness_pass_rate = 0.7
aggregated_relevance_pass_rate = 0.7
aggregated_coherence_pass_rate = 0.7
aggregated_fluency_pass_rate = 0.7

# Create an instance of a gsq signal
generation_quality_thresholds = GenerationSafetyQualityMonitoringMetricThreshold(
    groundedness = {"aggregated_groundedness_pass_rate": aggregated_groundedness_pass_rate},
    relevance={"aggregated_relevance_pass_rate": aggregated_relevance_pass_rate},
    coherence={"aggregated_coherence_pass_rate": aggregated_coherence_pass_rate},
    fluency={"aggregated_fluency_pass_rate": aggregated_fluency_pass_rate},
)
input_data = Input(
    type="uri_folder",
    path=f"{endpoint_name}-{deployment_name}-{app_trace_name}:{app_trace_Version}",
)
data_window = BaselineDataRange(lookback_window_size="P7D", lookback_window_offset="P0D")
production_data = LlmData(
    data_column_names={"prompt_column": "question", "completion_column": "answer", "context_column": "context"},
    input_data=input_data,
    data_window=data_window,
)

gsq_signal = GenerationSafetyQualitySignal(
    connection_id=f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{project_name}/connections/{aoai_connection_name}",
    metric_thresholds=generation_quality_thresholds,
    production_data=[production_data],
    sampling_rate=1.0,
    properties={
        "aoai_deployment_name": aoai_deployment_name,
        "enable_action_analyzer": "false",
        "azureml.modelmonitor.gsq_thresholds": '[{"metricName":"average_fluency","threshold":{"value":4}},{"metricName":"average_coherence","threshold":{"value":4}}]',
    },
)

# Create an instance of a token statistic signal
token_statistic_signal = GenerationTokenStatisticsSignal()

monitoring_signals = {
    defaultgsqsignalname: gsq_signal,
    defaulttokenstatisticssignalname: token_statistic_signal,
}

monitor_settings = MonitorDefinition(
compute=spark_compute,
monitoring_target=monitoring_target,
monitoring_signals = monitoring_signals,
alert_notification=AlertNotification(emails=notification_emails_list),
)

model_monitor = MonitorSchedule(
    name = monitor_name,
    trigger=trigger_schedule,
    create_monitor=monitor_settings
)

ml_client.schedules.begin_create_or_update(model_monitor)

Consume monitoring results

After you create your monitor, it runs daily to compute the token usage and generation quality metrics.

Go to the Monitoring (preview) tab in the deployment to view the monitoring results. Here, you see an overview of monitoring results during the selected time window. Use the date picker to change the time window of data that you're monitoring. The following metrics are available in this overview:
- Total request count: The total number of requests sent to the deployment during the selected time window.
- Total token count: The total number of tokens used by the deployment during the selected time window.
- Prompt token count: The number of prompt tokens used by the deployment during the selected time window.
- Completion token count: The number of completion tokens used by the deployment during the selected time window.
View the metrics on the Token usage tab. (This tab is selected by default.) Here, you can view the token usage of your application over time. You can also view the distribution of prompt and completion tokens over time. You can change the Trendline scope value to monitor all tokens in the entire application or token usage for a particular deployment (for example, GPT-4) used within your application.
Go to the Generation quality tab to monitor the quality of your application over time. The following metrics are shown in the time chart:
- Violation count: The violation count for a given metric (for example, fluency) is the sum of violations over the selected time window. A violation occurs for a metric when the metrics are computed (default is daily) if the computed value for the metric falls below the set threshold value.
- Average score: The average score for a given metric (for example, fluency) is the sum of the scores for all instances (or requests) divided by the number of instances (or requests) over the selected time window.
The Generation quality violations card shows the violation rate over the selected time window. The violation rate is the number of violations divided by the total number of possible violations. You can adjust the thresholds for metrics in the settings. By default, metrics are computed daily. You can also adjust this frequency in the settings.
On the Monitoring (Preview) tab, you can also view a comprehensive table of all sampled requests sent to the deployment during the selected time window.

Monitoring sets the default sampling rate at 10%. For example, if 100 requests are sent to your deployment, 10 get sampled and are used to compute the generation quality metrics. You can adjust the sampling rate in the settings.
Select the Trace button on the right side of a row in the table to see tracing details for a given request. This view provides comprehensive trace details for the request to your application.
Close the trace view.
Go to the Operational tab to view the operational metrics for the deployment in near real time. We support the following operational metrics:
- Request count
- Latency
- Error rate

The results on the Monitoring (preview) tab of your deployment provide insights to help you proactively improve the performance of your prompt flow application.

Advanced monitoring configuration with SDK v2

Monitoring also supports advanced configuration options with the SDK v2. The following scenarios are supported.

Enable monitoring for token usage

If you want to enable only token usage monitoring for your deployed prompt flow application, adapt the following script to your scenario:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    GenerationTokenStatisticsSignal,
)
from azure.ai.ml.entities._inputs_outputs import Input
from azure.ai.ml.constants import MonitorTargetTasks, MonitorDatasetContext

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Update your Azure resources details
subscription_id = "INSERT YOUR SUBSCRIPTION ID"
resource_group = "INSERT YOUR RESOURCE GROUP NAME"
project_name = "INSERT YOUR PROJECT NAME" # This is the same as your Azure AI Foundry project name
endpoint_name = "INSERT YOUR ENDPOINT NAME" # This is your deployment name without the suffix (e.g., deployment is "contoso-chatbot-1", endpoint is "contoso-chatbot")
deployment_name = "INSERT YOUR DEPLOYMENT NAME"

# These variables can be renamed but it is not necessary
monitor_name ="gen_ai_monitor_tokens" 
defaulttokenstatisticssignalname ="token-usage-signal" 

# Determine the frequency to run the monitor, and the emails to recieve email alerts
trigger_schedule = CronTrigger(expression="15 10 * * *")
notification_emails_list = ["test@example.com", "def@example.com"]

ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=project_name,
)

spark_compute = ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3")
monitoring_target = MonitoringTarget(
    ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
    endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}",
)

# Create an instance of a token statistic signal
token_statistic_signal = GenerationTokenStatisticsSignal()

monitoring_signals = {
    defaulttokenstatisticssignalname: token_statistic_signal,
}

monitor_settings = MonitorDefinition(
compute=spark_compute,
monitoring_target=monitoring_target,
monitoring_signals = monitoring_signals,
alert_notification=AlertNotification(emails=notification_emails_list),
)

model_monitor = MonitorSchedule(
    name = monitor_name,
    trigger=trigger_schedule,
    create_monitor=monitor_settings
)

ml_client.schedules.begin_create_or_update(model_monitor)

Enable monitoring for generation quality

If you want to enable only generation quality monitoring for your deployed prompt flow application, adapt the following script to your scenario:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    GenerationSafetyQualityMonitoringMetricThreshold,
    GenerationSafetyQualitySignal,
    BaselineDataRange,
    LlmData,
)
from azure.ai.ml.entities._inputs_outputs import Input
from azure.ai.ml.constants import MonitorTargetTasks, MonitorDatasetContext

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Update your Azure resources details
subscription_id = "INSERT YOUR SUBSCRIPTION ID"
resource_group = "INSERT YOUR RESOURCE GROUP NAME"
project_name = "INSERT YOUR PROJECT NAME" # This is the same as your Azure AI Foundry project name
endpoint_name = "INSERT YOUR ENDPOINT NAME" # This is your deployment name without the suffix (e.g., deployment is "contoso-chatbot-1", endpoint is "contoso-chatbot")
deployment_name = "INSERT YOUR DEPLOYMENT NAME"
aoai_deployment_name ="INSERT YOUR AOAI DEPLOYMENT NAME"
aoai_connection_name = "INSERT YOUR AOAI CONNECTION NAME"

# These variables can be renamed but it is not necessary
app_trace_name = "app_traces"
app_trace_Version = "1"
monitor_name ="gen_ai_monitor_generation_quality" 
defaultgsqsignalname ="gsq-signal"

# Determine the frequency to run the monitor and the emails to receive email alerts
trigger_schedule = CronTrigger(expression="15 10 * * *")
notification_emails_list = ["test@example.com", "def@example.com"]

ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=project_name,
)

spark_compute = ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3")
monitoring_target = MonitoringTarget(
    ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
    endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}",
)

# Set thresholds for the passing rate (0.7 = 70%)
aggregated_groundedness_pass_rate = 0.7
aggregated_relevance_pass_rate = 0.7
aggregated_coherence_pass_rate = 0.7
aggregated_fluency_pass_rate = 0.7

# Create an instance of a gsq signal
generation_quality_thresholds = GenerationSafetyQualityMonitoringMetricThreshold(
    groundedness = {"aggregated_groundedness_pass_rate": aggregated_groundedness_pass_rate},
    relevance={"aggregated_relevance_pass_rate": aggregated_relevance_pass_rate},
    coherence={"aggregated_coherence_pass_rate": aggregated_coherence_pass_rate},
    fluency={"aggregated_fluency_pass_rate": aggregated_fluency_pass_rate},
)
input_data = Input(
    type="uri_folder",
    path=f"{endpoint_name}-{deployment_name}-{app_trace_name}:{app_trace_Version}",
)
data_window = BaselineDataRange(lookback_window_size="P7D", lookback_window_offset="P0D")
production_data = LlmData(
    data_column_names={"prompt_column": "question", "completion_column": "answer", "context_column": "context"},
    input_data=input_data,
    data_window=data_window,
)

gsq_signal = GenerationSafetyQualitySignal(
    connection_id=f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{project_name}/connections/{aoai_connection_name}",
    metric_thresholds=generation_quality_thresholds,
    production_data=[production_data],
    sampling_rate=1.0,
    properties={
        "aoai_deployment_name": aoai_deployment_name,
        "enable_action_analyzer": "false",
        "azureml.modelmonitor.gsq_thresholds": '[{"metricName":"average_fluency","threshold":{"value":4}},{"metricName":"average_coherence","threshold":{"value":4}}]',
    },
)

monitoring_signals = {
    defaultgsqsignalname: gsq_signal,
}

monitor_settings = MonitorDefinition(
compute=spark_compute,
monitoring_target=monitoring_target,
monitoring_signals = monitoring_signals,
alert_notification=AlertNotification(emails=notification_emails_list),
)

model_monitor = MonitorSchedule(
    name = monitor_name,
    trigger=trigger_schedule,
    create_monitor=monitor_settings
)

ml_client.schedules.begin_create_or_update(model_monitor)

After you create your monitor from the SDK, you can consume the monitoring results in the Azure AI Foundry portal.

Learn more about what you can do in Azure AI Foundry.
Get answers to frequently asked questions in the Azure AI Foundry FAQ.

Share via

Monitor quality and token usage of deployed prompt flow applications

Prerequisites

Requirements for monitoring metrics

Supported metrics for monitoring

Column name mapping

Parameters required for metrics

Set up monitoring for a prompt flow

Deploy your prompt flow application with inferencing data collection

Configure monitoring

Consume monitoring results

Advanced monitoring configuration with SDK v2

Enable monitoring for token usage

Enable monitoring for generation quality

Feedback

Additional resources

Share via

Monitor quality and token usage of deployed prompt flow applications

Prerequisites

Requirements for monitoring metrics

Supported metrics for monitoring

Column name mapping

Parameters required for metrics

Set up monitoring for a prompt flow

Deploy your prompt flow application with inferencing data collection

Configure monitoring

Consume monitoring results

Advanced monitoring configuration with SDK v2

Enable monitoring for token usage

Enable monitoring for generation quality

Related content

Feedback

Additional resources