Evaluate a model's response

In this quickstart, you create an MSTest app to evaluate the chat response of a model. The test app uses the Microsoft.Extensions.AI.Evaluation libraries.

Prerequisites

  • .NET 8.0 SDK or a later version installed on your local machine.
  • Ollama installed locally on your device.

Run the local AI model

Complete the following steps to configure and run a local AI model on your device. For this quickstart, you use the general-purpose phi3:mini model, which is a small but capable generative AI model created by Microsoft.

  1. Open a terminal window and verify that Ollama is available on your device:

    ollama
    

    If Ollama is available, it displays a list of available commands.

  2. Start Ollama:

    ollama serve
    

    Ollama starts the server and listens for requests. If the command reports that the address is already in use, Ollama is already running and you can continue to the next step.

  3. Pull the phi3:mini model from the Ollama registry and wait for it to download:

    ollama pull phi3:mini
    
  4. After the download completes, run the model:

    ollama run phi3:mini
    

    Ollama starts the phi3:mini model and provides a prompt for you to interact with it.

Create the test app

Complete the following steps to create an MSTest project that connects to your local phi3:mini AI model.

  1. In a terminal window, navigate to the directory where you want to create your app, and create a new MSTest app with the dotnet new command:

    dotnet new mstest -o TestAI
    
  2. Navigate to the TestAI directory, and add the necessary packages to your app:

    dotnet add package Microsoft.Extensions.AI.Ollama --prerelease
    dotnet add package Microsoft.Extensions.AI.Abstractions --prerelease
    dotnet add package Microsoft.Extensions.AI.Evaluation --prerelease
    dotnet add package Microsoft.Extensions.AI.Evaluation.Quality --prerelease
    
  3. Open the new app in your editor of choice, such as Visual Studio Code.

    code .
    

Add the test app code

  1. Rename the file Test1.cs to MyTests.cs, and then open the file and rename the class to MyTests.
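
    The code in the remaining steps uses types from the Microsoft.Extensions.AI namespaces. As a minimal sketch (the TestAI file-scoped namespace is an assumption based on the project name, and the MSTest template typically imports Microsoft.VisualStudio.TestTools.UnitTesting through a global using), the top of MyTests.cs might look like this after the rename:

    using Microsoft.Extensions.AI;
    using Microsoft.Extensions.AI.Evaluation;
    using Microsoft.Extensions.AI.Evaluation.Quality;

    namespace TestAI;

    [TestClass]
    public sealed class MyTests
    {
        // The members and methods from the following steps go here.
    }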

  2. Add the private ChatConfiguration, chat message, and chat response members to the MyTests class. The s_messages field is a list that contains two ChatMessage objects: a system prompt that sets the behavior of the chat bot, and the question from the user.

    private static ChatConfiguration? s_chatConfiguration;
    private static IList<ChatMessage> s_messages = [
        new ChatMessage(
            ChatRole.System,
            """
            You're an AI assistant that can answer questions related to astronomy.
            Keep your responses concise and try to stay under 100 words.
            Use the imperial measurement system for all measurements in your response.
            """),
        new ChatMessage(
            ChatRole.User,
            "How far is the planet Venus from Earth at its closest and furthest points?")];
    private static ChatMessage s_response = new();
    
  3. Add the InitializeAsync method to the MyTests class.

    [ClassInitialize]
    public static async Task InitializeAsync(TestContext _)
    {
        /// Set up the <see cref="ChatConfiguration"/>,
        /// which includes the <see cref="IChatClient"/> that the
        /// evaluator uses to communicate with the model.
        s_chatConfiguration = GetOllamaChatConfiguration();
    
        var chatOptions =
            new ChatOptions
            {
                Temperature = 0.0f,
                ResponseFormat = ChatResponseFormat.Text
            };
    
        // Fetch the response to be evaluated
        // and store it in a static variable.
        ChatResponse response = await s_chatConfiguration.ChatClient.GetResponseAsync(s_messages, chatOptions);
        s_response = response.Message;
    }
    

    This method accomplishes the following tasks:

    • Sets up the ChatConfiguration, which includes the IChatClient that the evaluator uses to communicate with the model.
    • Sets the ChatOptions, including the Temperature and the ResponseFormat.
    • Fetches the response to be evaluated and stores it in a static variable.

  4. Add the GetOllamaChatConfiguration method, which creates the IChatClient that the evaluator uses to communicate with the model.

    private static ChatConfiguration GetOllamaChatConfiguration()
    {
        // Get a chat client for the Ollama endpoint.
        IChatClient client =
            new OllamaChatClient(
                new Uri("http://localhost:11434"),
                modelId: "phi3:mini");
    
        return new ChatConfiguration(client);
    }
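
    Later, if you want to evaluate against a different model (as suggested in the Next steps section), only the model ID passed to OllamaChatClient needs to change. Here's a minimal sketch of a parameterized version, assuming you've already pulled the other model with ollama pull (the llama3.2 model name is only an illustrative example, not part of this quickstart):

    private static ChatConfiguration GetOllamaChatConfiguration(string modelId = "phi3:mini")
    {
        // Get a chat client for the local Ollama endpoint,
        // using whichever locally pulled model is specified.
        // For example: GetOllamaChatConfiguration("llama3.2")
        IChatClient client =
            new OllamaChatClient(
                new Uri("http://localhost:11434"),
                modelId: modelId);

        return new ChatConfiguration(client);
    }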
    
  5. Add a test method to evaluate the model's response.

    [TestMethod]
    public async Task TestCoherence()
    {
        IEvaluator coherenceEvaluator = new CoherenceEvaluator();
        EvaluationResult result = await coherenceEvaluator.EvaluateAsync(
            s_messages,
            s_response,
            s_chatConfiguration);
    
        /// Retrieve the score for coherence from the <see cref="EvaluationResult"/>.
        NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
    
        // Validate the default interpretation
        // for the returned coherence metric.
        Assert.IsFalse(coherence.Interpretation!.Failed);
        Assert.IsTrue(coherence.Interpretation.Rating is EvaluationRating.Good or EvaluationRating.Exceptional);
    
        // Validate that no diagnostics are present
        // on the returned coherence metric.
        Assert.IsFalse(coherence.ContainsDiagnostics());
    }
    

    This method does the following:

    • Invokes the CoherenceEvaluator to evaluate the coherence of the response. The EvaluateAsync(IEnumerable<ChatMessage>, ChatMessage, ChatConfiguration, IEnumerable<EvaluationContext>, CancellationToken) method returns an EvaluationResult that contains a NumericMetric. A NumericMetric contains a numeric value that typically represents a score within a well-defined range.
    • Retrieves the coherence score from the EvaluationResult.
    • Validates the default interpretation for the returned coherence metric. Evaluators can include a default interpretation for the metrics they return. You can also change the default interpretation to suit your specific requirements, if needed.
    • Validates that no diagnostics are present on the returned coherence metric. Evaluators can include diagnostics on the metrics they return to indicate errors, warnings, or other exceptional conditions encountered during evaluation.
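
    Other evaluators in the Microsoft.Extensions.AI.Evaluation.Quality package follow the same pattern. As a sketch of what a second test might look like, assuming the package's FluencyEvaluator exposes a FluencyMetricName constant analogous to CoherenceEvaluator.CoherenceMetricName:

    [TestMethod]
    public async Task TestFluency()
    {
        // Evaluate the same stored response, this time for fluency.
        IEvaluator fluencyEvaluator = new FluencyEvaluator();
        EvaluationResult result = await fluencyEvaluator.EvaluateAsync(
            s_messages,
            s_response,
            s_chatConfiguration);

        // Retrieve the fluency score and validate its default interpretation.
        NumericMetric fluency = result.Get<NumericMetric>(FluencyEvaluator.FluencyMetricName);
        Assert.IsFalse(fluency.Interpretation!.Failed);
        Assert.IsFalse(fluency.ContainsDiagnostics());
    }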

Run the test/evaluation

Run the test using your preferred test workflow, for example, by using the CLI command dotnet test or through Test Explorer.
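
For example, run the following command from the TestAI directory:

    dotnet test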

Next steps

Next, try evaluating against different models to see if the results change. Then, check out the extensive examples in the dotnet/ai-samples repo to see how to invoke multiple evaluators, add additional context, invoke a custom evaluator, attach diagnostics, or change the default interpretation of metrics.