运行评估并查看结果

项目
02/10/2025

重要

此功能目前以公共预览版提供。

本文介绍如何在开发 AI 应用程序时运行评估并查看结果。有关如何监视生产流量上已部署代理的质量的信息，请参阅如何监视代理在生产流量上的质量。

若要评估代理，必须指定评估集。评估数据集至少是一组应用程序的用户请求集合，这些请求可能来自一组特选的评估请求，也可能来自代理用户的记录。有关详细信息，请参阅评估集和代理评估输入架构。

运行评估

若要运行评估，请使用 MLflow API 中的 mlflow.evaluate（）方法，将 model_type 指定为 databricks-agent，以便对 Databricks 启用代理评估，内置 AI 法官。

以下示例为全局准则 AI 裁判指定了一组全局响应准则，当响应不符合这些准则时，会导致评估失败。使用此方法时，无需收集每个请求的标签即可评估代理。

import mlflow
from mlflow.deployments import get_deploy_client

# The guidelines below will be used to evaluate any response of the agent.
global_guidelines = [
  "If the request is unrelated to Databricks, the response must should be a rejection of the request",
  "If the request is related to Databricks, the response must should be concise",
  "If the request is related to Databricks and question about API, the response must have code",
  "The response must be professional."
]

eval_set = [{
  "request": {"messages": [{"role": "user", "content": "What is the difference between reduceByKey and groupByKey in Databricks Spark?"}]}
}, {
  "request": "What is the weather today?",
}]

# Define a very simple system-prompt agent.
@mlflow.trace(span_type="AGENT")
def llama3_agent(messages):
  SYSTEM_PROMPT = """
    You are a chatbot that answers questions about Databricks.
    For requests unrelated to Databricks, reject the request.
  """
  return get_deploy_client("databricks").predict(
    endpoint="databricks-meta-llama-3-3-70b-instruct",
    inputs={"messages": [{"role": "system", "content": SYSTEM_PROMPT}, *messages]}
  )

# Evaluate the Agent with the evaluation set and log it to the MLFlow run "system_prompt_v0".
with mlflow.start_run(run_name="system_prompt_v0") as run:
  mlflow.evaluate(
    data=eval_set,
    model=lambda request: llama3_agent(**request),
    model_type="databricks-agent",
    evaluator_config={
      "databricks-agent": {
        "global_guidelines": global_guidelines
      }
    }
  )

此示例运行以下不需要基本事实标签的判定：准则遵循、与查询的相关性、安全性。

如果将代理与检索器配合使用，则运行以下判定：有据性、区块相关性

mlflow.evaluate() 还计算每个评估记录的延迟和成本指标，聚合给定运行的所有输入的结果。这些称为评估结果。评估结果与其他命令记录的信息（例如模型参数）会记录在封闭的运行中。如果在 MLflow 运行外部调用 mlflow.evaluate()，则会创建新的运行。

使用基本事实标签进行评估

以下示例指定了每行基本事实标签：expected_facts 和 guidelines，分别将运行正确性和准则判定标准。使用每行基本事实标签分别处理单个评估。

%pip install databricks-agents
dbutils.library.restartPython()

import mlflow
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from mlflow.deployments import get_deploy_client
import dataclasses

eval_set = [{
  "request": "What is the difference between reduceByKey and groupByKey in Databricks Spark?",
  "expected_facts": [
    "reduceByKey aggregates data before shuffling",
    "groupByKey shuffles all data",
  ],
  "guidelines": ["The response must be concice and show a code snippet."]
}, {
  "request": "What is the weather today?",
  "guidelines": ["The response must reject the request."]
}]

# Define a very simple system-prompt agent.
@mlflow.trace(span_type="AGENT")
def llama3_agent(messages):
  SYSTEM_PROMPT = """
    You are a chatbot that answers questions about Databricks.
    For requests unrelated to Databricks, reject the request.
  """
  return get_deploy_client("databricks").predict(
    endpoint="databricks-meta-llama-3-3-70b-instruct",
    inputs={"messages": [{"role": "system", "content": SYSTEM_PROMPT}, *messages]}
  )

# Evaluate the agent with the evaluation set and log it to the MLFlow run "system_prompt_v0".
with mlflow.start_run(run_name="system_prompt_v0") as run:
  mlflow.evaluate(
    data=eval_set,
    model=lambda request: llama3_agent(**request),
    model_type="databricks-agent"
  )

此示例不仅运行与上述判定相同的判定，还运行以下判定：正确性、相关性、安全性

如果将代理与检索器配合使用，则运行以下判定：上下文充分性

要求

必须为工作区启用 Azure AI 支持的 AI 辅助功能。

向评估运行提供输入

可通过两种方式为评估运行提供输入：

提供先前生成的输出，以便与评估集进行比较。 如果要评估已部署到生产的应用程序的输出，或者要比较评估配置之间的评估结果，则建议使用此选项。

使用此选项，可以指定一个评估集，如以下代码所示。评估集必须包含以前生成的输出。有关更详细的示例，请参阅示例：如何将以前生成的输出传递给代理评估。
```
evaluation_results = mlflow.evaluate(
    data=eval_set_with_chain_outputs_df,  # pandas DataFrame with the evaluation set and application outputs
    model_type="databricks-agent",
)
```
将应用程序作为输入参数传递。mlflow.evaluate()针对评估集中的每个输入调用应用程序，并报告每个生成的输出的质量评估和其他指标。如果应用程序是使用启用了 MLflow 跟踪的 MLflow 记录的，或者应用程序是作为笔记本中的 Python 函数实现的，则建议使用此选项。如果应用程序是在 Databricks 之外开发的或在 Databricks 之外部署的，则不建议使用此选项。

使用此选项，可以在函数调用中指定计算集和应用程序，如以下代码所示。有关更详细的示例，请参阅示例：如何将应用程序传递给代理评估。
```
evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame containing just the evaluation set
    model=model,  # Reference to the MLflow model that represents the application
    model_type="databricks-agent",
)
```

有关评估集架构的详细信息，请参阅代理评估输入架构。

评估输出

代理评估从 mlflow.evaluate() 数据帧返回其输出，并将这些输出记录到 MLflow 运行中。可以在笔记本中或从相应 MLflow 运行的页面检查输出。

查看笔记本中的输出

以下代码演示了一些有关如何从笔记本中查看评估运行结果的示例。

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

###
# Run evaluation
###
evaluation_results = mlflow.evaluate(..., model_type="databricks-agent")

###
# Access aggregated evaluation results across the entire evaluation set
###
results_as_dict = evaluation_results.metrics
results_as_pd_df = pd.DataFrame([evaluation_results.metrics])

# Sample usage
print(f"The percentage of generated responses that are grounded: {results_as_dict['response/llm_judged/groundedness/percentage']}")

###
# Access data about each question in the evaluation set
###

per_question_results_df = evaluation_results.tables['eval_results']

# Show information about responses that are not grounded
per_question_results_df[per_question_results_df["response/llm_judged/groundedness/rating"] == "no"].display()

数据 per_question_results_df 帧包括输入架构中的所有列，以及特定于每个请求的所有评估结果。有关计算结果的更多详细信息，请参阅代理评估如何评估质量、成本和延迟。

使用 MLflow UI 查看输出

评估结果也可以在 MLflow UI 中找到。若要访问 MLflow UI，请单击笔记本右边栏中的“试验”图标试验图标，然后单击相应的运行，或者单击运行 mlflow.evaluate() 的笔记本单元格的单元格结果中显示的链接。

查看单个运行的评估结果

本部分介绍如何查看单个运行的评估结果。若要比较各运行的结果，请参阅比较各运行之间的评估结果。

LLM 评委的质量评估概述

每个请求的法官评估在 databricks-agents 版本 0.3.0 及更高版本中可用。

若要查看评估集中每个请求的 LLM 判断质量概述，请单击 MLflow 运行页上的“评估结果 ”选项卡。本页显示每个评估运行的摘要表。有关详细信息，请单击运行的评估 ID。

overview_judges

本概述显示每个请求的不同评委的评估、基于这些评估的每个请求的质量通过/失败状态，以及失败请求的根本原因。单击表中的一行将带你访问该请求的详细信息页面，其中包括以下内容：

模型输出：代理应用生成的响应及其跟踪（如有包含）。
预期输出：每个请求的预期响应。
详细评估：LLM 的评估就是根据此数据判定的。单击“查看详细信息”显示判定提供的理由。

details_judges

整个评估集的聚合结果

若要查看整个评估集的聚合结果，请单击“ 概述 ”选项卡（对于数值）或 “模型指标 ”选项卡（对于图表）。

评估指标，值

评估指标，图表

比较各运行中的评估结果

请务必比较各运行中的评估结果，以了解代理应用程序如何响应更改。比较结果有助于了解更改是否对质量产生积极影响，或帮助你对更改行为进行故障排除。

比较各运行的每个请求结果

若要比较各运行每个单独请求的数据，请单击“试验”页面上的“评估”选项卡。表格将显示评估集中的每个问题。使用下拉菜单选择要查看的列。

评估集中的各个问题

比较跨运行聚合的结果

可以从“试验”页访问相同的聚合结果，这也允许跨不同运行比较结果。若要访问“试验”页面，请单击笔记本右边栏中的“试验”图标试验图标，或者单击运行 mlflow.evaluate() 的笔记本单元格的单元格结果中显示的链接。

在“试验”页上，单击。这样，就可以直观显示所选运行的聚合结果，并将其与过去的运行进行比较。

聚合结果

运行哪些评委

默认情况下，对于每个评估记录，马赛克 AI 代理评估会应用与记录中提供的信息最匹配的法官子集。具体而言：

如果记录包括基本事实响应，代理评估将应用 context_sufficiency、groundedness、correctness、safety 和 guideline_adherence 判定标准。
如果记录不包含基本事实响应，代理评估将应用 chunk_relevance、groundedness、relevance_to_query、safety 和 guideline_adherence 判定标准。

有关更多详细信息，请参阅：

有关 LLM 法官的信任和安全信息，请参阅有关为 LLM 法官提供支持的模型的信息。

示例：如何将应用程序传递给代理评估

若要将应用程序 mlflow_evaluate()传递给该参数，请使用 model 参数。参数中 model 传递应用程序有 5 个选项。

在 Unity 目录中注册的模型。
当前 MLflow 试验中的 MLflow 记录模型。
笔记本中加载的 PyFunc 模型。
笔记本中的本地函数。
已部署的代理终结点。

有关演示每个选项的代码示例，请参阅以下部分。

选项 1. 在 Unity 目录中注册的模型

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame with just the evaluation set
    model = "models:/catalog.schema.model_name/1"  # 1 is the version number
    model_type="databricks-agent",
)

选项 2. 当前 MLflow 试验中的 MLflow 记录模型

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

# In the following lines, `6b69501828264f9s9a64eff825371711` is the run_id, and `chain` is the artifact_path that was
# passed with mlflow.xxx.log_model(...).
# If you called model_info = mlflow.langchain.log_model() or mlflow.pyfunc.log_model(), you can access this value using `model_info.model_uri`.
evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame with just the evaluation set
    model = "runs:/6b69501828264f9s9a64eff825371711/chain"
    model_type="databricks-agent",
)

选项 3. 笔记本中加载的 PyFunc 模型

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame with just the evaluation set
    model = mlflow.pyfunc.load_model(...)
    model_type="databricks-agent",
)

选项 4：笔记本中的本地函数

该函数接收格式如下的输入：

{
  "messages": [
    {
      "role": "user",
      "content": "What is MLflow?",
    }
  ],
  ...
}

该函数必须使用以下三种支持格式之一返回值：

包含模型的响应的纯字符串。

格式的 ChatCompletionResponse 字典。例如：

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "MLflow is a machine learning toolkit.",
      },
     ...
    }
  ],
  ...,
}

格式的 StringResponse 字典，例如 { "content": "MLflow is a machine learning toolkit.", ... }。

以下示例使用本地函数包装基础模型终结点并对其进行评估：

  %pip install databricks-agents pandas
  dbutils.library.restartPython()

  import mlflow
  import pandas as pd

  def model(model_input):
    client = mlflow.deployments.get_deploy_client("databricks")
    return client.predict(endpoint="endpoints:/databricks-meta-llama-3-1-405b-instruct", inputs={"messages": model_input["messages"]})

  evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame with just the evaluation set
    model = model
    model_type="databricks-agent",
  )

选项 5. 已部署的代理终结点

仅当使用已使用 databricks.agents.deploy SDK databricks-agents 版本 0.8.0 或更高版本部署的代理终结点时，此选项才有效。对于基础模型或较旧的 SDK 版本，请使用选项 4 将模型包装在本地函数中。

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

# In the following lines, `endpoint-name-of-your-agent` is the name of the agent endpoint.
evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame with just the evaluation set
    model = "endpoints:/endpoint-name-of-your-agent"
    model_type="databricks-agent",
)

如何在调用中包含 `mlflow_evaluate()` 应用程序时传递评估集

在以下代码中， data 是具有评估集的 pandas DataFrame。这些是简单的示例。有关详细信息，请参阅输入架构。

# You do not have to start from a dictionary - you can use any existing pandas or Spark DataFrame with this schema.

# Minimal evaluation set
bare_minimum_eval_set_schema = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }]

# Complete evaluation set
complete_eval_set_schema = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                # In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
                "content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }]

# Convert dictionary to a pandas DataFrame
eval_set_df = pd.DataFrame(bare_minimum_eval_set_schema)

# Use a Spark DataFrame
import numpy as np
spark_df = spark.table("catalog.schema.table") # or any other way to get a Spark DataFrame
eval_set_df = spark_df.toPandas()

示例：如何将以前生成的输出传递给代理评估

本部分介绍如何在调用中 mlflow_evaluate() 传递以前生成的输出。有关所需的评估集架构，请参阅代理评估输入架构。

在以下代码中，是一个 pandas DataFrame， data 其中包含由应用程序生成的评估集和输出。这些是简单的示例。有关详细信息，请参阅输入架构。

%pip install databricks-agents pandas
dbutils.library.restartPython()

import mlflow
import pandas as pd

evaluation_results = mlflow.evaluate(
    data=eval_set_with_app_outputs_df,  # pandas DataFrame with the evaluation set and application outputs
    model_type="databricks-agent",
)

# You do not have to start from a dictionary - you can use any existing pandas or Spark DataFrame with this schema.

# Minimum required input
bare_minimum_input_schema = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }]

# Input including optional arguments
complete_input_schema  = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                # In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
                "content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional. If provided, the Databricks Context Relevance LLM Judge is executed to check the `content`'s relevance to the `request`.
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
        "guidelines": [
          "The response must be in English",
        ]
    }]

# Convert dictionary to a pandas DataFrame
eval_set_with_app_outputs_df = pd.DataFrame(bare_minimum_input_schema)

# Use a Spark DataFrame
import numpy as np
spark_df = spark.table("catalog.schema.table") # or any other way to get a Spark DataFrame
eval_set_with_app_outputs_df = spark_df.toPandas()

示例：使用自定义函数处理来自 LangGraph 的响应

LangGraph 代理（尤其是具有聊天功能的代理）可以为单个推理调用返回多个消息。用户负责将代理的响应转换为代理评估支持的格式。

一种方法是使用自定义函数来处理响应。以下示例演示一个自定义函数，用于从 LangGraph 模型中提取最后一条聊天消息。然后，此函数用于 mlflow.evaluate() 返回单个字符串响应，可以将其与 ground_truth 列进行比较。

示例代码做出以下假设：

模型接受采用 {“messages”： [{“role”： “user”， “content”： “hello”}]} 格式的输入。
模型返回格式为 [“response 1”， “response 2”的字符串列表。

以下代码以以下格式向法官发送串联响应：“response 1nresponse2”

import mlflow
import pandas as pd
from typing import List

loaded_model = mlflow.langchain.load_model(model_uri)
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "expected_response": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

def custom_langgraph_wrapper(model_input):
    predictions = loaded_model.invoke({"messages": model_input["messages"]})
    # Assuming `predictions` is a list of strings
    return predictions.join("\n")

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        custom_langgraph_wrapper,  # Pass the function defined above
        data=eval_data,
        model_type="databricks-agent",
    )

print(results.metrics)

使用指标创建仪表板

在循环访问代理的质量时，你可能希望与利益干系人共享仪表板，显示质量随时间推移如何改进。可以从 MLflow 评估运行中提取指标，将值保存到 Delta 表中，并创建仪表板。

以下示例演示如何从笔记本中的最新评估运行中提取和保存指标值：

uc_catalog_name = "catalog"
uc_schema_name = "schema"
table_name = "results"

eval_results = mlflow.evaluate(
    model=logged_agent_info.model_uri, # use the logged Agent
    data=evaluation_set, # Run the logged Agent for all queries defined above
    model_type="databricks-agent", # use Agent Evaluation
)

# The `append_metrics_to_table function` is defined below
append_metrics_to_table("<identifier-for-table>", eval_results.metrics, f"{uc_catalog_name}.{uc_schema_name}.{table_name}")

以下示例演示如何提取和保存在 MLflow 试验中保存的过去运行的指标值。

import pandas as pd

def get_mlflow_run(experiment_name, run_name):
  runs = mlflow.search_runs(experiment_names=[experiment_name], filter_string=f"run_name = '{run_name}'", output_format="list")

  if len(runs) != 1:
    raise ValueError(f"Found {len(runs)} runs with name {run_name}. {run_name} must identify a single run. Alternatively, you can adjust this code to search for a run based on `run_id`")

   return runs[0]

run = get_mlflow_run(experiment_name ="/Users/<user_name>/db_docs_mlflow_experiment", run_name="evaluation__2024-10-09_02:27:17_AM")

# The `append_metrics_to_table` function is defined below
append_metrics_to_table("<identifier-for-table>", run.data.metrics, f"{uc_catalog_name}.{uc_schema_name}.{table_name}")

现在可以使用此数据创建仪表板。

以下代码定义在前面的示例中使用的函数 append_metrics_to_table 。

# Definition of `append_metrics_to_table`

def append_metrics_to_table(run_name, mlflow_metrics, delta_table_name):
  data = mlflow_metrics.copy()

  # Add identifying run_name and timestamp
  data["run_name"] = run_name
  data["timestamp"] = pd.Timestamp.now()

  # Remove metrics with error counts
  data = {k: v for k, v in mlflow_metrics.items() if "error_count" not in k}

  # Convert to a Spark DataFrame(
  metrics_df = pd.DataFrame([data])
  metrics_df_spark = spark.createDataFrame(metrics_df)

  # Append to the Delta table
  metrics_df_spark.write.mode("append").saveAsTable(delta_table_name)

有关为 LLM 判定提供支持的模型的信息

LLM 判定可能会使用第三方服务来评估 GenAI 应用程序，包括由 Microsoft 运营的 Azure OpenAI。
对于 Azure OpenAI，Databricks 已选择退出“滥用监视”，因此不会通过 Azure OpenAI 存储任何提示或响应。
对于欧盟 (EU) 工作区，LLM 判定使用托管在 EU 的模型。所有其他区域使用托管在美国的模型。
禁用 Azure AI 支持的 AI 辅助功能即会阻止 LLM 判定调用 Azure AI 支持的模型。
发送到 LLM 判定的数据不用于任何模型训练。
LLM 判定旨在帮助客户评估其 RAG 应用程序，LLM 判定输出不应用于训练、改进或微调 LLM。

通过

运行评估并查看结果

运行评估

使用基本事实标签进行评估

要求

向评估运行提供输入

评估输出

查看笔记本中的输出

使用 MLflow UI 查看输出

查看单个运行的评估结果

LLM 评委的质量评估概述

整个评估集的聚合结果

比较各运行中的评估结果

比较各运行的每个请求结果

比较跨运行聚合的结果

运行哪些评委

示例：如何将应用程序传递给代理评估

选项 1. 在 Unity 目录中注册的模型

选项 2. 当前 MLflow 试验中的 MLflow 记录模型

选项 3. 笔记本中加载的 PyFunc 模型

选项 4：笔记本中的本地函数

选项 5. 已部署的代理终结点

如何在调用中包含 `mlflow_evaluate()` 应用程序时传递评估集

示例：如何将以前生成的输出传递给代理评估

示例：使用自定义函数处理来自 LangGraph 的响应

使用指标创建仪表板

有关为 LLM 判定提供支持的模型的信息

反馈

其他资源

通过

运行评估并查看结果

运行评估

使用基本事实标签进行评估

要求

向评估运行提供输入

评估输出

查看笔记本中的输出

使用 MLflow UI 查看输出

查看单个运行的评估结果

LLM 评委的质量评估概述

整个评估集的聚合结果

比较各运行中的评估结果

比较各运行的每个请求结果

比较跨运行聚合的结果

运行哪些评委

示例：如何将应用程序传递给代理评估

选项 1. 在 Unity 目录中注册的模型

选项 2. 当前 MLflow 试验中的 MLflow 记录模型

选项 3. 笔记本中加载的 PyFunc 模型

选项 4： 笔记本中的本地函数

选项 5. 已部署的代理终结点

如何在调用中包含 mlflow_evaluate() 应用程序时传递评估集

示例：如何将以前生成的输出传递给代理评估

示例：使用自定义函数处理来自 LangGraph 的响应

使用指标创建仪表板

有关为 LLM 判定提供支持的模型的信息

反馈

其他资源

选项 4：笔记本中的本地函数

如何在调用中包含 `mlflow_evaluate()` 应用程序时传递评估集