Discrepancies in Evaluation Metrics for Identical Outputs in Azure AI Studio Using GPT-3.5-Turbo Prompt Flows

Question

Hi,

I have a question about evaluation metrics for question-answering with context using GPT-3.5-turbo in Prompt Flow. I created two prompt flows to generate model answers and assess performance metrics. Both flows have identical structures, with the only difference being the deployment name.

I fine-tuned two different model deployments from the same base model (gpt-3.5-turbo-0125) using identical settings but different training datasets. For evaluation, I used the same test set across both prompt flows and deployment names.

After the evaluation, the generated answers from the two models were identical character for character, with no differences in whitespace or special characters. All inputs (question, context, and ground truth) are also the same, because both evaluation pipelines read the same file. However, metrics such as coherence, fluency, groundedness, similarity, and relevance showed significant differences.

According to the documentation, these metrics don't require a model as input. For example, Relevance requires only the question, the context, and the model's answer.

Could you help me identify whether I have missed something, or why the model seems to affect the metric calculation even though all the inputs are identical?

Answer

Although the outputs from both models are identical, the two evaluation pipelines may differ slightly in configuration or version, and that could influence the metric calculations. For instance, settings such as threshold values or weights for certain metrics might not match between the two evaluation runs, leading to different results even with the same answers.

Some evaluation metrics also introduce a degree of randomness in their computation. In Azure AI Studio, coherence, fluency, groundedness, relevance, and similarity are AI-assisted metrics: a GPT model is prompted to grade each answer. Language model output is not fully deterministic, so identical inputs can receive somewhat different scores from one run to the next.
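One way to check whether run-to-run variance alone could explain the gap is to repeat the evaluation a few times per deployment and compare the difference in mean score against the spread between repeats. The following is a minimal sketch of that idea, not part of Azure AI Studio: the function names, the factor `k`, and the scores are all made up for illustration, and it assumes you have exported per-row scores for each repeated run yourself.

```python
from statistics import mean, stdev


def run_to_run_spread(runs):
    """Summarize repeated evaluation runs of one deployment.

    runs: list of runs, each a list of per-row scores for the same rows.
    Returns (mean of the run means, sample standard deviation of the run means).
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    run_means = [mean([float(s) for s in run]) for run in runs]
    return mean(run_means), stdev(run_means)


def gap_exceeds_noise(runs_a, runs_b, k=2.0):
    """Return True if the gap between deployments is larger than k times
    the larger run-to-run standard deviation (a rough rule of thumb,
    not a formal significance test)."""
    mean_a, sd_a = run_to_run_spread(runs_a)
    mean_b, sd_b = run_to_run_spread(runs_b)
    return abs(mean_a - mean_b) > k * max(sd_a, sd_b)


# Hypothetical coherence scores: three repeated runs over the same four rows.
coherence_a = [[4, 5, 4, 3], [4, 4, 4, 3], [5, 5, 4, 3]]
coherence_b = [[4, 4, 4, 3], [4, 5, 4, 4], [4, 5, 4, 3]]

print(run_to_run_spread(coherence_a))
print(gap_exceeds_noise(coherence_a, coherence_b))
```

If the gap between the two deployments is no larger than the spread you see when re-running a single deployment, the difference is most likely scoring noise rather than an effect of the model.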

Finally, even though the generated answers are identical, hidden context, metadata, or other factors may still be passed into the evaluation pipeline. For example, fine-tuning metadata or deployment-specific attributes might affect the evaluation metrics.
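To rule out hidden differences in what the evaluator actually received, it can help to compare the two evaluation result sets row by row. The sketch below assumes each result set has been loaded as a list of dictionaries (for example from an exported CSV or JSONL file); the column names `question`, `context`, `ground_truth`, `answer`, and the metric names are hypothetical and should be replaced with whatever your export contains.

```python
def compare_eval_rows(rows_a, rows_b, input_cols, metric_cols):
    """Compare two evaluation result sets row by row.

    Returns (input_mismatches, metric_deltas), where
    input_mismatches is a list of (row_index, column) whose values differ, and
    metric_deltas is a list of (row_index, metric, score_b - score_a)
    for every metric whose score differs.
    """
    if len(rows_a) != len(rows_b):
        raise ValueError("result sets have different numbers of rows")
    input_mismatches = []
    metric_deltas = []
    for i, (a, b) in enumerate(zip(rows_a, rows_b)):
        for col in input_cols:
            # Exact string comparison, so stray whitespace counts as a mismatch.
            if a[col] != b[col]:
                input_mismatches.append((i, col))
        for col in metric_cols:
            delta = float(b[col]) - float(a[col])
            if delta != 0:
                metric_deltas.append((i, col, delta))
    return input_mismatches, metric_deltas


# Made-up rows standing in for the two exported result files.
rows_a = [{"question": "q1", "context": "c1", "ground_truth": "g1",
           "answer": "a1", "coherence": 4, "relevance": 5}]
rows_b = [{"question": "q1", "context": "c1", "ground_truth": "g1",
           "answer": "a1", "coherence": 5, "relevance": 5}]

mismatches, deltas = compare_eval_rows(
    rows_a, rows_b,
    input_cols=["question", "context", "ground_truth", "answer"],
    metric_cols=["coherence", "relevance"],
)
print(mismatches)
print(deltas)
```

An empty mismatch list together with non-zero metric deltas confirms that the scores differ even though the evaluator saw the same text, which points back to the scoring step itself rather than to the inputs.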
