Customize LLM judges
Important
This feature is in Public Preview.
This article describes several techniques you can use to customize the LLM judges used to evaluate the quality and latency of agentic AI applications. It covers the following techniques:
- Create custom LLM judges.
- Provide few-shot examples to LLM judges.
- Evaluate applications using only a subset of LLM judges.
Select built-in judges to run
By default, for each evaluation record, Agent Evaluation applies the subset of built-in judges that best matches the information present in the record. You can explicitly specify the judges to apply to each request by using the evaluator_config argument of mlflow.evaluate(). For details, see Which judges are run.
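As a rough sketch, such a configuration might be assembled as follows. The "metrics" key and the judge names chunk_relevance and groundedness are assumptions that should be checked against Which judges are run, and build_evaluator_config is a hypothetical helper that only builds a dictionary.

```python
def build_evaluator_config(judge_names):
    """Return an evaluator_config dict that restricts evaluation to the given judges.

    Assumption: the "databricks-agent" section accepts a "metrics" list of judge names.
    """
    if not judge_names:
        raise ValueError("Provide at least one judge name.")
    return {"databricks-agent": {"metrics": list(judge_names)}}


config = build_evaluator_config(["chunk_relevance", "groundedness"])
print(config)

# The dictionary would then be passed to mlflow.evaluate(), for example:
# mlflow.evaluate(data=evals, model_type="databricks-agent", evaluator_config=config)
```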
Create custom LLM judges
The following are common use cases where customer-defined judges might be useful:
- Evaluate your application against criteria that are specific to your business use case. For example:
  - Assess whether your application produces responses that align with your corporate tone of voice.
  - Determine whether your application’s response always follows a specific format.
- Test and iterate on guardrails. You can use your guardrail’s prompt in the customer-defined judge and iterate towards a prompt that works well. You would then implement the guardrail and use the LLM judge to evaluate how often the guardrail is or isn’t working.
Databricks refers to these use cases as assessments. There are two types of customer-defined LLM assessments:
Type | What does it assess? | How is the score reported? |
---|---|---|
Answer assessment | The LLM judge is called for each generated answer. For example, if you had 5 questions with corresponding answers, the judge would be called 5 times (once for each answer). | For each answer, a yes or no is reported based on your criteria. yes outputs are aggregated to a percentage for the entire evaluation set. |
Retrieval assessment | The LLM judge is called for each retrieved chunk (if the application performs retrieval). For each question, the judge is called once for each chunk that was retrieved for that question. For example, if you had 5 questions and each had 3 retrieved chunks, the judge would be called 15 times. | For each chunk, a yes or no is reported based on your criteria. For each question, the percent of yes chunks is reported as a precision. Precision per question is aggregated to an average precision for the entire evaluation set. |
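The two aggregation rules in the table can be illustrated with a short sketch. The function names and the sample ratings are hypothetical, and the code only mirrors the arithmetic described above, not Agent Evaluation's internal implementation. Both functions assume non-empty inputs.

```python
def answer_pass_rate(ratings):
    """Percentage of "yes" ratings across all answers in the evaluation set."""
    return 100.0 * sum(r == "yes" for r in ratings) / len(ratings)


def retrieval_average_precision(chunk_ratings_per_question):
    """Average, over questions, of the fraction of retrieved chunks rated "yes"."""
    per_question = [
        sum(r == "yes" for r in chunks) / len(chunks)
        for chunks in chunk_ratings_per_question
    ]
    return sum(per_question) / len(per_question)


# Answer assessment: 5 answers, so 5 judge calls and 5 ratings.
answer_ratings = ["yes", "no", "yes", "yes", "no"]

# Retrieval assessment: 2 questions with 2 chunks each, so 4 judge calls.
chunk_ratings = [["yes", "no"], ["yes", "yes"]]

pass_rate = answer_pass_rate(answer_ratings)                  # 3 of 5 answers pass
avg_precision = retrieval_average_precision(chunk_ratings)    # mean of 0.5 and 1.0
print(pass_rate, avg_precision)
```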
You can configure a customer-defined LLM judge using the following parameters:
Option | Description | Requirements |
---|---|---|
model | The endpoint name for the Foundation Model API endpoint that is to receive requests for this custom judge. | Endpoint must support the /llm/v1/chat signature. |
name | The name of the assessment that is also used for the output metrics. | |
judge_prompt | The prompt that implements the assessment, with variables enclosed in curly braces. For example, “Here is a definition that uses {request} and {response}”. | |
metric_metadata | A dictionary that provides additional parameters for the judge. Notably, the dictionary must include an "assessment_type" with value either "RETRIEVAL" or "ANSWER" to specify the assessment type. | |
The prompt contains variables that are substituted by the contents of the evaluation set before it is sent to the endpoint specified by model to retrieve the response. The prompt is minimally wrapped in formatting instructions so that a numerical score in [1,5] and a rationale can be parsed from the judge’s output. The parsed score is then transformed into yes if it is higher than 3 and no otherwise (the default threshold of 3 can be changed through metric_metadata). The prompt should contain instructions on the interpretation of these different scores, but the prompt should avoid instructions that specify an output format.
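The substitution and thresholding steps can be sketched in plain Python. This is only an illustration under stated assumptions: fill_prompt and score_to_rating are hypothetical helpers, the threshold parameter stands in for whatever setting metric_metadata exposes, and the formatting wrapper and score parsing performed by Agent Evaluation are not shown.

```python
def fill_prompt(judge_prompt, record):
    """Substitute {variable} placeholders with values from one evaluation record."""
    return judge_prompt.format(**record)


def score_to_rating(score, threshold=3):
    """Map a parsed score in [1, 5] to "yes" (above the threshold) or "no" (otherwise)."""
    if not 1 <= score <= 5:
        raise ValueError(f"Score must be in [1, 5], got {score}")
    return "yes" if score > threshold else "no"


judge_prompt = (
    "Rate from 1 to 5 how professional the response is. "
    "Request: '{request}' Response: '{response}'"
)
record = {
    "request": "What is Spark?",
    "response": "Spark is a data analytics framework.",
}

filled = fill_prompt(judge_prompt, record)
print(filled)
print(score_to_rating(5), score_to_rating(3))  # a score of exactly 3 does not pass
```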
The following variables are supported:
Variable | ANSWER assessment | RETRIEVAL assessment |
---|---|---|
request | Request column of the evaluation data set | Request column of the evaluation data set |
response | Response column of the evaluation data set | Response column of the evaluation data set |
expected_response | expected_response column of the evaluation data set | expected_response column of the evaluation data set |
retrieved_context | Concatenated contents from retrieved_context column | Individual content in retrieved_context column |
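The difference in how retrieved_context is filled for the two assessment types can be sketched as follows. The helper judge_inputs is hypothetical, and the newline separator used for concatenation is an assumption, since the actual separator is not documented here.

```python
def judge_inputs(record, assessment_type):
    """Return one dict of prompt variables per judge call for a single record."""
    base = {
        key: record[key]
        for key in ("request", "response", "expected_response")
        if key in record
    }
    contents = [chunk["content"] for chunk in record.get("retrieved_context", [])]
    if assessment_type == "ANSWER":
        # One call per answer; chunk contents are concatenated (separator is assumed).
        return [{**base, "retrieved_context": "\n".join(contents)}]
    if assessment_type == "RETRIEVAL":
        # One call per retrieved chunk; each call sees only that chunk's content.
        return [{**base, "retrieved_context": content} for content in contents]
    raise ValueError(f"Unknown assessment type: {assessment_type}")


record = {
    "request": "What is Spark?",
    "response": "Spark is a data analytics framework.",
    "retrieved_context": [
        {"doc_uri": "context1.txt", "content": "Spark was open sourced in 2013."},
        {"doc_uri": "context2.txt", "content": "Use toPandas() to convert a DataFrame."},
    ],
}

answer_calls = judge_inputs(record, "ANSWER")
retrieval_calls = judge_inputs(record, "RETRIEVAL")
print(len(answer_calls), len(retrieval_calls))
```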
Important
For all custom judges, Agent Evaluation assumes that yes corresponds to a positive assessment of quality. That is, an example that passes the judge’s evaluation should always return yes. For example, a judge should evaluate “is the response safe?” or “is the tone friendly and professional?”, not “does the response contain unsafe material?” or “is the tone unprofessional?”.
The following example uses MLflow’s make_genai_metric_from_prompt API to define the no_pii metric, which is passed in a list to the extra_metrics argument of mlflow.evaluate during evaluation.
from mlflow.metrics.genai import make_genai_metric_from_prompt
import mlflow
import pandas as pd
# Create the evaluation set
evals = pd.DataFrame({
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
    ],
    "response": [
        "Spark is a data analytics framework. And my email address is noreply@databricks.com",
        "This is not possible as Spark is not a panda.",
    ],
})
# `make_genai_metric_from_prompt` assumes that a score greater than 3 is passing and a score of 3 or less is failing.
# Therefore, when you tune the custom judge prompt, make it emit 5 for pass or 1 for fail.
# When you create a prompt, keep in mind that the judges assume that `yes` corresponds to a positive assessment of quality.
# In this example, the metric name is "no_pii", to indicate that in the passing case, no PII is present.
# When the metric passes, it emits "5" and the UI shows a green "pass".
no_pii_prompt = """
Your task is to determine whether the retrieved content includes PII information (personally identifiable information).
You should output a 5 if there is no PII, a 1 if there is PII. This was the content: '{response}'"""
no_pii = make_genai_metric_from_prompt(
    name="no_pii",
    judge_prompt=no_pii_prompt,
    model="endpoints:/databricks-meta-llama-3-1-405b-instruct",
    metric_metadata={"assessment_type": "ANSWER"},
)
result = mlflow.evaluate(
    data=evals,
    # model=logged_model.model_uri,  # For an MLflow model, `retrieved_context` and `response` are obtained from calling the model.
    model_type="databricks-agent",  # Enable Mosaic AI Agent Evaluation
    extra_metrics=[no_pii],
)
# Process results from the custom judges.
per_question_results_df = result.tables['eval_results']
# Show information about responses that have PII.
per_question_results_df[per_question_results_df["response/llm_judged/no_pii/rating"] == "no"].display()
Provide examples to the built-in LLM judges
You can pass domain-specific examples to the built-in judges by providing a few "yes" or "no" examples for each type of assessment. These examples are referred to as few-shot examples and can help the built-in judges align better with domain-specific rating criteria. See Create few-shot examples.
Databricks recommends providing at least one "yes" and one "no" example. The best examples are the following:
- Examples that the judges previously got wrong, where you provide a correct response as the example.
- Challenging examples, such as examples that are nuanced or difficult to determine as true or false.
Databricks also recommends that you provide a rationale for the response. This helps improve the judge’s ability to explain its reasoning.
To pass the few-shot examples, you need to create a dataframe that mirrors the output of mlflow.evaluate() for the corresponding judges. Here is an example for the answer-correctness, groundedness, and chunk-relevance judges:
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
examples = {
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
        "What is Apache Spark?"
    ],
    "response": [
        "Spark is a data analytics framework.",
        "This is not possible as Spark is not a panda.",
        "Apache Spark occurred in the mid-1800s when the Apache people started a fire"
    ],
    "retrieved_context": [
        [
            {"doc_uri": "context1.txt", "content": "In 2013, Spark, a data analytics framework, was open sourced by UC Berkeley's AMPLab."}
        ],
        [
            {"doc_uri": "context2.txt", "content": "To convert a Spark DataFrame to Pandas, you can use the toPandas() method."}
        ],
        [
            {"doc_uri": "context3.txt", "content": "Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing."}
        ]
    ],
    "expected_response": [
        "Spark is a data analytics framework.",
        "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
        "Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing."
    ],
    "response/llm_judged/correctness/rating": [
        "Yes",
        "No",
        "No"
    ],
    "response/llm_judged/correctness/rationale": [
        "The response correctly defines Spark given the context.",
        "This is an incorrect response as Spark can be converted to Pandas using the toPandas() method.",
        "The response is incorrect and irrelevant."
    ],
    "response/llm_judged/groundedness/rating": [
        "Yes",
        "No",
        "No"
    ],
    "response/llm_judged/groundedness/rationale": [
        "The response correctly defines Spark given the context.",
        "The response is not grounded in the given context.",
        "The response is not grounded in the given context."
    ],
    "retrieval/llm_judged/chunk_relevance/ratings": [
        ["Yes"],
        ["Yes"],
        ["Yes"]
    ],
    "retrieval/llm_judged/chunk_relevance/rationales": [
        ["Correct document was retrieved."],
        ["Correct document was retrieved."],
        ["Correct document was retrieved."]
    ]
}
examples_df = pd.DataFrame(examples)
Include the few-shot examples in the evaluator_config parameter of mlflow.evaluate.
evaluation_results = mlflow.evaluate(
    ...,
    model_type="databricks-agent",
    evaluator_config={"databricks-agent": {"examples_df": examples_df}}
)
Create few-shot examples
The following steps are guidelines to create a set of effective few-shot examples.
- Try to find groups of similar examples that the judge gets wrong.
- For each group, pick a single example and adjust the label or justification to reflect the desired behavior. Databricks recommends providing a rationale that explains the rating.
- Re-run the evaluation with the new example.
- Repeat as needed to target different categories of errors.
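The selection steps above might be sketched like this. The record fields category, judge_rating, and human_rating are hypothetical bookkeeping that you would maintain yourself during review; they are not columns produced by Agent Evaluation.

```python
from collections import defaultdict


def pick_few_shot_candidates(records, max_examples=5):
    """Group records the judge got wrong by category and pick one per group."""
    groups = defaultdict(list)
    for rec in records:
        if rec["judge_rating"] != rec["human_rating"]:
            groups[rec["category"]].append(rec)
    # Largest groups first, so the most common error categories are targeted first.
    ordered = sorted(groups.values(), key=len, reverse=True)
    picked = []
    for group in ordered[:max_examples]:
        example = dict(group[0])
        example["rating"] = example["human_rating"]  # corrected label for the example
        picked.append(example)
    return picked


reviewed = [
    {"request": "q1", "category": "tone", "judge_rating": "yes", "human_rating": "no"},
    {"request": "q2", "category": "tone", "judge_rating": "yes", "human_rating": "no"},
    {"request": "q3", "category": "format", "judge_rating": "no", "human_rating": "yes"},
    {"request": "q4", "category": "format", "judge_rating": "yes", "human_rating": "yes"},
]

candidates = pick_few_shot_candidates(reviewed)
print([(c["request"], c["rating"]) for c in candidates])
```

Each picked example would still need a rationale written by hand before it is added to the few-shot dataframe.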
Note
Multiple few-shot examples can negatively impact judge performance. During evaluation, a limit of five few-shot examples is enforced. Databricks recommends using fewer, targeted examples for best performance.
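A small pre-flight check can catch few-shot sets that ignore these recommendations. The sketch below is hypothetical: check_few_shot operates on a plain dictionary of columns shaped like the examples dictionary shown earlier, and it assumes the five-example limit applies to the number of rows.

```python
MAX_FEW_SHOT = 5  # limit stated in the note above; assumed to apply per row


def check_few_shot(examples):
    """Return a list of warnings for a few-shot dict (column name -> list of values)."""
    warnings = []
    n_rows = len(examples["request"])
    if n_rows > MAX_FEW_SHOT:
        warnings.append(f"{n_rows} examples exceed the limit of {MAX_FEW_SHOT}")
    for column, values in examples.items():
        if not column.endswith(("/rating", "/ratings")):
            continue
        # Retrieval ratings are lists (one rating per chunk); flatten them.
        flat = []
        for value in values:
            flat.extend(value if isinstance(value, list) else [value])
        seen = {str(value).lower() for value in flat}
        for label in ("yes", "no"):
            if label not in seen:
                warnings.append(f"{column}: no '{label}' example")
    return warnings


few_shot = {
    "request": ["What is Spark?", "What is Apache Spark?"],
    "response/llm_judged/correctness/rating": ["Yes", "No"],
    "retrieval/llm_judged/chunk_relevance/ratings": [["Yes"], ["Yes"]],
}

problems = check_few_shot(few_shot)
print(problems)
```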