Evaluate a model's response
In this quickstart, you create an MSTest app to evaluate the chat response of a model. The test app uses the Microsoft.Extensions.AI.Evaluation libraries.
Prerequisites
- Install .NET 8.0 or a later version
- Install Ollama locally on your machine
- Visual Studio Code (optional)
Run the local AI model
Complete the following steps to configure and run a local AI model on your device. For this quickstart, you'll use the general-purpose phi3:mini model, which is a small but capable generative AI model created by Microsoft.
Open a terminal window and verify that Ollama is available on your device:
```
ollama
```
If Ollama is available, it displays a list of available commands.
Start Ollama:
```
ollama serve
```

This command starts the Ollama server. Leave it running while you complete the quickstart.
Pull the phi3:mini model from the Ollama registry and wait for it to download:

```
ollama pull phi3:mini
```
After the download completes, run the model:
```
ollama run phi3:mini
```
Ollama starts the phi3:mini model and provides a prompt for you to interact with it.
Create the test app
Complete the following steps to create an MSTest project that connects to your local phi3:mini AI model.
In a terminal window, navigate to the directory where you want to create your app, and create a new MSTest app with the dotnet new command:

```
dotnet new mstest -o TestAI
```
Navigate to the TestAI directory, and add the necessary packages to your app:

```
dotnet add package Microsoft.Extensions.AI.Ollama --prerelease
dotnet add package Microsoft.Extensions.AI.Abstractions --prerelease
dotnet add package Microsoft.Extensions.AI.Evaluation --prerelease
dotnet add package Microsoft.Extensions.AI.Evaluation.Quality --prerelease
```
Open the new app in your editor of choice, such as Visual Studio Code:

```
code .
```
Add the test app code
Rename the file Test1.cs to MyTests.cs, and then open the file and rename the class to MyTests.

Add the private ChatConfiguration, chat message, and response members to the MyTests class. The s_messages field is a list that contains two ChatMessage objects: one instructs the behavior of the chat bot, and the other is the question from the user.

```csharp
private static ChatConfiguration? s_chatConfiguration;

private static IList<ChatMessage> s_messages =
    [
        new ChatMessage(
            ChatRole.System,
            """
            You're an AI assistant that can answer questions related to astronomy.
            Keep your responses concise and try to stay under 100 words.
            Use the imperial measurement system for all measurements in your response.
            """),
        new ChatMessage(
            ChatRole.User,
            "How far is the planet Venus from Earth at its closest and furthest points?")
    ];

private static ChatMessage s_response = new();
```
Add the InitializeAsync method to the MyTests class.

```csharp
[ClassInitialize]
public static async Task InitializeAsync(TestContext _)
{
    /// Set up the <see cref="ChatConfiguration"/>,
    /// which includes the <see cref="IChatClient"/> that the
    /// evaluator uses to communicate with the model.
    s_chatConfiguration = GetOllamaChatConfiguration();

    var chatOptions = new ChatOptions
    {
        Temperature = 0.0f,
        ResponseFormat = ChatResponseFormat.Text
    };

    // Fetch the response to be evaluated
    // and store it in a static variable.
    ChatResponse response =
        await s_chatConfiguration.ChatClient.GetResponseAsync(s_messages, chatOptions);
    s_response = response.Message;
}
```
This method accomplishes the following tasks:
- Sets up the ChatConfiguration.
- Sets the ChatOptions, including the Temperature and the ResponseFormat.
- Fetches the response to be evaluated by calling GetResponseAsync(IList<ChatMessage>, ChatOptions, CancellationToken), and stores it in a static variable.
Add the GetOllamaChatConfiguration method, which creates the IChatClient that the evaluator uses to communicate with the model.

```csharp
private static ChatConfiguration GetOllamaChatConfiguration()
{
    // Get a chat client for the Ollama endpoint.
    IChatClient client = new OllamaChatClient(
        new Uri("http://localhost:11434"),
        modelId: "phi3:mini");

    return new ChatConfiguration(client);
}
```
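Nothing in the test code depends on the phi3:mini model specifically. As a minimal sketch, assuming you've pulled another model with ollama pull (the llama3.1 ID below is purely illustrative), you could make the model configurable instead of hard-coding it:

```csharp
// Sketch only: a variation of the method above that accepts the model ID
// as a parameter, so the same tests can run against different local models.
private static ChatConfiguration GetOllamaChatConfiguration(string modelId = "phi3:mini")
{
    // Get a chat client for the Ollama endpoint, targeting whichever
    // locally pulled model the caller specifies.
    IChatClient client = new OllamaChatClient(
        new Uri("http://localhost:11434"),
        modelId: modelId);

    return new ChatConfiguration(client);
}

// Example usage (hypothetical model ID; use any model you've pulled):
// s_chatConfiguration = GetOllamaChatConfiguration("llama3.1");
```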
Add a test method to evaluate the model's response.
```csharp
[TestMethod]
public async Task TestCoherence()
{
    IEvaluator coherenceEvaluator = new CoherenceEvaluator();
    EvaluationResult result = await coherenceEvaluator.EvaluateAsync(
        s_messages,
        s_response,
        s_chatConfiguration);

    /// Retrieve the score for coherence from the <see cref="EvaluationResult"/>.
    NumericMetric coherence =
        result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);

    // Validate the default interpretation
    // for the returned coherence metric.
    Assert.IsFalse(coherence.Interpretation!.Failed);
    Assert.IsTrue(coherence.Interpretation.Rating is EvaluationRating.Good or EvaluationRating.Exceptional);

    // Validate that no diagnostics are present
    // on the returned coherence metric.
    Assert.IsFalse(coherence.ContainsDiagnostics());
}
```
This method does the following:
- Invokes the CoherenceEvaluator to evaluate the coherence of the response. The EvaluateAsync(IEnumerable<ChatMessage>, ChatMessage, ChatConfiguration, IEnumerable<EvaluationContext>, CancellationToken) method returns an EvaluationResult that contains a NumericMetric. A NumericMetric contains a numeric value that's typically used to represent numeric scores that fall within a well-defined range.
- Retrieves the coherence score from the EvaluationResult. (For a sketch of asserting on the raw score itself, see the example after this list.)
- Validates the default interpretation for the returned coherence metric. Evaluators can include a default interpretation for the metrics they return. You can also change the default interpretation to suit your specific requirements, if needed.
- Validates that no diagnostics are present on the returned coherence metric. Evaluators can include diagnostics on the metrics they return to indicate errors, warnings, or other exceptional conditions encountered during evaluation.
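The default interpretation is a convenient starting point, but you can also assert on the numeric score directly. The following is a minimal sketch of additional assertions you could append to TestCoherence; it assumes the coherence score is reported on a 1-to-5 scale where higher is better, and treats 4 or above as passing. Check the evaluator's documentation for the exact range and adjust the threshold to your own quality bar.

```csharp
// Hypothetical follow-on assertions inside TestCoherence, using the
// coherence metric retrieved above.
// Assumption: the score falls on a 1-to-5 scale, higher meaning more coherent.
Assert.IsNotNull(coherence.Value);
Assert.IsTrue(coherence.Value >= 4, $"Coherence score was {coherence.Value}.");
```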
Run the test/evaluation
Run the test using your preferred test workflow, for example, by using the CLI command dotnet test or through Test Explorer.
Next steps
Next, try evaluating against different models to see if the results change. Then, check out the extensive examples in the dotnet/ai-samples repo to see how to invoke multiple evaluators, add additional context, invoke a custom evaluator, attach diagnostics, or change the default interpretation of metrics.
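For example, other evaluators in the Microsoft.Extensions.AI.Evaluation.Quality package follow the same pattern as CoherenceEvaluator. The following is a minimal sketch, assuming FluencyEvaluator and its FluencyMetricName constant are available in the version of the package you installed, that adds a second test alongside TestCoherence:

```csharp
[TestMethod]
public async Task TestFluency()
{
    // Same pattern as TestCoherence, but with a different evaluator.
    // Assumption: FluencyEvaluator (and FluencyMetricName) exists in the
    // installed version of Microsoft.Extensions.AI.Evaluation.Quality.
    IEvaluator fluencyEvaluator = new FluencyEvaluator();
    EvaluationResult result = await fluencyEvaluator.EvaluateAsync(
        s_messages,
        s_response,
        s_chatConfiguration);

    NumericMetric fluency =
        result.Get<NumericMetric>(FluencyEvaluator.FluencyMetricName);

    Assert.IsFalse(fluency.Interpretation!.Failed);
    Assert.IsFalse(fluency.ContainsDiagnostics());
}
```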