Discrepancies in Evaluation Metrics for Identical Outputs in Azure AI Studio Using GPT-3.5-Turbo Prompt Flows

Kornkamol 0 Reputation points
2024-09-04T12:02:43.39+00:00

Hi,

I have a question about evaluation metrics for question-answering with context using GPT-3.5-turbo in Prompt Flow. I created two prompt flows to generate model answers and assess performance metrics. Both flows have identical structures, with the only difference being the deployment name.


I fine-tuned two different model deployments from the same base model (gpt-3.5-turbo-0125) using identical settings but different training datasets. For evaluation, I used the same test set across both prompt flows and deployment names.


After the evaluation, the generated answers were identical between the two models, with no differences even in whitespace or special characters, and all inputs (question, context, and ground truth) are the same because both evaluation pipelines read the same file. However, the metrics (coherence, fluency, groundedness, similarity, and relevance) showed significant differences.


According to the documentation, these metrics don't take the model itself as an input. For example, Relevance only requires the question, the context, and the model's answer.


Could you help me identify whether I missed something, or why the model seems to impact the metric calculation even though all the inputs are identical?


1 answer

  1. Amira Bedhiafi 27,446 Reputation points
    2024-09-06T17:24:03.52+00:00

    Although the outputs from both models are identical, it’s possible that the evaluation pipelines have small differences in configurations or versions that might be influencing the metric calculations. For instance, settings like threshold values or weights for certain metrics might differ slightly across the two deployments, leading to different results even with the same answers.

    More importantly, the GPT-assisted quality metrics (coherence, fluency, groundedness, relevance, similarity) are themselves computed by prompting a language model to score each answer. That scoring step is non-deterministic: unless the judge model's sampling is pinned down (temperature 0, fixed seed), repeated runs over identical inputs can return different scores, so two evaluation runs can disagree even when every input row matches exactly.
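    To illustrate that point, here is a toy sketch (not the actual Azure evaluator; `judge_score` and its scoring scale are invented for the example) showing how a sampled judge produces varying scores for the same answer, while pinning the temperature to zero makes the scores repeatable:

    ```python
    import random

    def judge_score(answer: str, temperature: float, seed=None) -> int:
        """Toy stand-in for an LLM-based grader returning a 1-5 score.

        With temperature > 0 the score is jittered, mimicking sampling
        noise in a judge model; at temperature 0 it is deterministic.
        """
        rng = random.Random(seed)
        base = 4  # pretend this is the 'true' quality of the answer
        if temperature == 0:
            return base
        # +/-1 jitter stands in for non-deterministic sampling
        return max(1, min(5, base + rng.choice([-1, 0, 1])))

    answer = "Paris is the capital of France."

    # Two evaluation runs over the same answer can disagree:
    run_a = [judge_score(answer, temperature=0.7) for _ in range(10)]
    run_b = [judge_score(answer, temperature=0.7) for _ in range(10)]

    # Pinning temperature to 0 makes repeated runs agree:
    det_a = [judge_score(answer, temperature=0.0) for _ in range(10)]
    det_b = [judge_score(answer, temperature=0.0) for _ in range(10)]
    print(det_a == det_b)  # deterministic runs agree
    ```

    The practical takeaway is the same for the real service: if the evaluator's underlying model samples with any randomness, identical answers do not guarantee identical metric values across two evaluation runs.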

    Even though the generated answers are identical, there might be hidden context, metadata, or other factors being passed into the evaluation pipeline, for example, fine-tuning metadata or deployment-specific attributes might still affect the evaluation metrics.
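    One way to rule out hidden input differences is to fingerprint only the fields the metric is supposed to see, ignoring any deployment metadata. A minimal sketch (the field names and `row_fingerprint` helper are assumptions for illustration, not part of the Azure SDK):

    ```python
    import hashlib
    import json

    FIELDS = ("question", "context", "answer", "ground_truth")

    def row_fingerprint(row: dict) -> str:
        """Hash only the metric-relevant fields, so metadata such as the
        deployment name cannot affect the comparison."""
        payload = json.dumps({f: row.get(f) for f in FIELDS}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    # Two rows that differ only in deployment metadata fingerprint identically:
    row_flow1 = {"question": "Q1", "context": "C1", "answer": "A1",
                 "ground_truth": "G1", "deployment": "ft-model-a"}
    row_flow2 = {"question": "Q1", "context": "C1", "answer": "A1",
                 "ground_truth": "G1", "deployment": "ft-model-b"}
    print(row_fingerprint(row_flow1) == row_fingerprint(row_flow2))  # True
    ```

    Running such a check over both flows' output files would confirm whether the evaluator truly received identical inputs; if it did, the remaining variance points back at the non-determinism of the scoring step itself.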

