Azure OpenAI-evaluatie (preview)

Artikel
01/29/2025

De evaluatie van grote taalmodellen is een belangrijke stap bij het meten van de prestaties voor verschillende taken en dimensies. Dit is vooral belangrijk voor nauwkeurig afgestemde modellen, waarbij het beoordelen van de prestatieverbeteringen (of verliezen) van training cruciaal is. Grondige evaluaties kunnen u helpen inzicht te krijgen in hoe verschillende versies van het model van invloed kunnen zijn op uw toepassing of scenario.

Met azure OpenAI-evaluatie kunnen ontwikkelaars evaluatieuitvoeringen maken om te testen op basis van verwachte invoer-/uitvoerparen, de prestaties van het model beoordelen op basis van belangrijke metrische gegevens, zoals nauwkeurigheid, betrouwbaarheid en algehele prestaties.

Ondersteuning voor evaluaties

Regionale beschikbaarheid

VS - oost 2
VS - noord-centraal
Zweden - centraal
Zwitserland - west

Ondersteunde implementatietypen

Standaard
Algemene standaard
Standaard gegevenszone
Ingericht beheerd
Globaal ingericht beheerd
Gegevenszone ingericht beheerd

Evaluatiepijplijn

Testgegevens

U moet een basisgegevensset samenstellen waarop u wilt testen. Het maken van gegevenssets is doorgaans een iteratief proces dat ervoor zorgt dat uw evaluaties na verloop van tijd relevant blijven voor uw scenario's. Deze gegevensset voor de grondwaar is doorgaans handgeschreven en vertegenwoordigt het verwachte gedrag van uw model. De gegevensset wordt ook gelabeld en bevat de verwachte antwoorden.

Notitie

Voor sommige evaluatietests, zoals Sentiment en geldige JSON of XML , zijn geen grondwaargegevens vereist.

Uw gegevensbron moet de JSONL-indeling hebben. Hieronder ziet u twee voorbeelden van de JSONL-evaluatiegegevenssets:

{"question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.", "subject": "abstract_algebra", "A": "0", "B": "4", "C": "2", "D": "6", "answer": "B", "completion": "B"}
{"question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.", "subject": "abstract_algebra", "A": "8", "B": "2", "C": "24", "D": "120", "answer": "C", "completion": "C"}
{"question": "Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5", "subject": "abstract_algebra", "A": "0", "B": "1", "C": "0,1", "D": "0,4", "answer": "D", "completion": "D"}

Wanneer u uw evaluatiebestand uploadt en selecteert, wordt een voorbeeld van de eerste drie regels geretourneerd:

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}

Wanneer u een evaluatiebestand uploadt en selecteert, wordt een voorbeeld van de eerste drie regels geretourneerd:

U kunt bestaande eerder geüploade gegevenssets kiezen of een nieuwe gegevensset uploaden.

Antwoorden maken (optioneel)

De prompt die u in uw evaluatie gebruikt, moet overeenkomen met de prompt die u in productie wilt gebruiken. Deze aanwijzingen bevatten de instructies die het model moet volgen. Net als bij de speeltuinervaringen kunt u meerdere invoerwaarden maken om enkele shotvoorbeelden in uw prompt op te nemen. Zie prompt engineeringtechnieken voor meer informatie over enkele geavanceerde technieken in promptontwerp en prompt engineering.

U kunt verwijzen naar uw invoergegevens binnen de prompts met behulp van de {{input.column_name}} indeling, waarbij column_name overeenkomt met de namen van de kolommen in uw invoerbestand.

Uitvoer die tijdens de evaluatie wordt gegenereerd, wordt in de volgende stappen verwezen met behulp van de {{sample.output_text}} indeling.

Notitie

U moet dubbele accolades gebruiken om ervoor te zorgen dat u naar uw gegevens verwijst.

Modelimplementatie

Als onderdeel van het maken van evaluaties kiest u welke modellen moeten worden gebruikt bij het genereren van antwoorden (optioneel) en welke modellen moeten worden gebruikt bij het beoordelen van modellen met specifieke testcriteria.

In Azure OpenAI wijst u specifieke modelimplementaties toe die moeten worden gebruikt als onderdeel van uw evaluaties. U kunt meerdere modelimplementaties vergelijken in één evaluatieuitvoering.

U kunt basis- of nauwkeurig afgestemde modelimplementaties evalueren. De implementaties die beschikbaar zijn in uw lijst, zijn afhankelijk van de implementaties die u in uw Azure OpenAI-resource hebt gemaakt. Als u de gewenste implementatie niet kunt vinden, kunt u een nieuwe maken op de azure OpenAI-evaluatiepagina.

Testcriteria

Testcriteria worden gebruikt om de effectiviteit te beoordelen van elke uitvoer die door het doelmodel wordt gegenereerd. Met deze tests worden de invoergegevens vergeleken met de uitvoergegevens om consistentie te garanderen. U hebt de flexibiliteit om verschillende criteria te configureren om de kwaliteit en relevantie van de uitvoer op verschillende niveaus te testen en te meten.

Aan de slag

Selecteer de Azure OpenAI-evaluatie (PREVIEW) in de Azure AI Foundry-portal. Als u deze weergave wilt zien als een optie, moet u mogelijk eerst een bestaande Azure OpenAI-resource selecteren in een ondersteunde regio.
Nieuwe evaluatie selecteren
Voer een naam in voor uw evaluatie. Standaard wordt automatisch een willekeurige naam gegenereerd, tenzij u deze bewerkt en vervangt. Selecteer Nieuwe gegevensset uploaden.

Selecteer uw evaluatie die de .jsonl indeling heeft. Als u een voorbeeldtestbestand nodig hebt, kunt u deze tien regels opslaan in een bestand met de naam eval-test.jsonl:

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}      
{"input": [{"role": "system", "content": "Distill the technical content into a clear, concise summary that preserves the most important concepts and their practical applications."}, {"role": "user", "content": "Cosine similarity is a metric for measuring similarity between vectors by calculating the cosine of the angle between them. The outcome ranges from -1 to 1, with 1 indicating identical orientation, 0 orthogonality, and -1 opposite orientation. In machine learning and NLP, cosine similarity is useful for comparing document vectors, word embeddings, or any high-dimensional representations. It's often preferred over Euclidean distance because it's invariant to vector magnitude, focusing instead on orientation. This makes it particularly useful for comparing documents or embeddings of various lengths or scales."}], "output": "Cosine similarity evaluates vector similarity by computing their angle's cosine, with values from -1 to 1. It's widely used in ML and NLP for comparing document vectors and embeddings due to its magnitude-invariant nature."}      
{"input": [{"role": "system", "content": "Provide a succinct summary that captures the key technical concepts and their relationships while maintaining clarity and accuracy."}, {"role": "user", "content": "Neural network activation functions are crucial for introducing non-linearity to deep learning models. Commonly used activation functions include ReLU (Rectified Linear Unit), which outputs the input directly if positive and zero otherwise, sigmoid, which compresses values between 0 and 1, and tanh, which normalizes values between -1 and 1. Each function has specific benefits and applications: ReLU avoids vanishing gradients and enables sparse activation, sigmoid is useful for binary classification, and tanh is often used in hidden layers for its zero-centered output. The activation function choice greatly affects model training dynamics and ultimate performance."}], "output": "Activation functions introduce non-linearity to neural networks, with ReLU, sigmoid, and tanh being the most common. Each serves distinct purposes: ReLU prevents vanishing gradients, sigmoid is suitable for binary classification, and tanh offers zero-centered output, all affecting model performance differently."}      
{"input": [{"role": "system", "content": "Create a focused summary that emphasizes the main technical concepts and their practical implications while maintaining accuracy."}, {"role": "user", "content": "Batch normalization is a technique that stabilizes and accelerates deep neural network training by normalizing each layer's inputs. It normalizes activations using the mean and variance computed over the current data batch. This normalization reduces internal covariate shift, where the input distribution changes during training. Batch normalization includes learnable parameters (gamma and beta) allowing the network to reverse normalization if necessary. Benefits include enabling higher learning rates, lessening the importance of precise parameter initialization, and serving as a regularizer, potentially removing the need for dropout."}], "output": "Batch normalization stabilizes neural network training by normalizing layer inputs using batch statistics. It addresses internal covariate shift with learnable parameters, allowing higher learning rates and improved training dynamics, while offering regularization benefits."}      
{"input": [{"role": "system", "content": "Summarize the technical content while preserving the essential concepts and their interconnections, ensuring clarity and precision."}, {"role": "user", "content": "Gradient descent optimization algorithms are foundational to training machine learning models. The basic version updates parameters in the steepest descent direction of the loss function. Variants like Stochastic Gradient Descent (SGD) use random samples to estimate gradients, enhancing computational efficiency. Advanced optimizers like Adam combine momentum (helping overcome local minima) and RMSprop (adapting learning rates for each parameter). These optimizers maintain running averages of gradients and their squares, enabling adaptive learning rates and more efficient training. The optimizer choice can significantly influence model convergence speed and final performance."}], "output": "Gradient descent algorithms optimize ML models by updating parameters based on loss gradients. Variants like SGD use sampling for efficiency, while advanced optimizers like Adam combine momentum with adaptive learning rates for improved training performance."}      
{"input": [{"role": "system", "content": "Deliver a concise yet comprehensive summary that highlights the key technical aspects and their practical significance."}, {"role": "user", "content": "Cross-validation is a resampling method used to evaluate machine learning models on limited data. The most common form, k-fold cross-validation, involves dividing the dataset into k equal parts, training the model on k-1 folds while validating on the remaining fold, and rotating through all combinations. This provides a more robust model performance assessment than a single train-test split, helping detect overfitting and estimate model generalization to unseen data. Variations include stratified k-fold (maintaining class distribution), leave-one-out (k equals dataset size), and time series cross-validation for temporal data."}], "output": "Cross-validation evaluates ML models by training and testing on different data splits, typically using k-fold methodology. This approach offers better performance assessment than single splits, with variations for different data types and requirements."}      
{"input": [{"role": "system", "content": "Generate a clear and focused summary that captures the essential technical details while maintaining their relationships and significance."}, {"role": "user", "content": "Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. This approach is powerful in deep learning, where pre-trained models on large datasets (like ImageNet for computer vision or BERT for NLP) are fine-tuned on specific downstream tasks. Transfer learning reduces the need for large amounts of task-specific training data and computational resources, as the model has already learned useful features from the source domain. Common strategies include feature extraction (freezing pre-trained layers) and fine-tuning (updating all or some pre-trained weights)."}], "output": "Transfer learning reuses models trained on one task for different tasks, particularly effective in deep learning. It leverages pre-trained models through feature extraction or fine-tuning, reducing data and computational needs for new tasks."}      
{"input": [{"role": "system", "content": "Provide a precise and informative summary that distills the key technical concepts while maintaining their relationships and practical importance."}, {"role": "user", "content": "Ensemble methods combine multiple machine learning models to create a more robust and accurate predictor. Common techniques include bagging (training models on random data subsets), boosting (sequentially training models to correct earlier errors), and stacking (using a meta-model to combine base model predictions). Random Forests, a popular bagging method, create multiple decision trees using random feature subsets. Gradient Boosting builds trees sequentially, with each tree correcting the errors of previous ones. These methods often outperform single models by reducing overfitting and variance while capturing different data aspects."}], "output": "Ensemble methods enhance prediction accuracy by combining multiple models through techniques like bagging, boosting, and stacking. Popular implementations include Random Forests (using multiple trees with random features) and Gradient Boosting (sequential error correction), offering better performance than single models."}

U ziet de eerste drie regels van het bestand als voorbeeld:

Selecteer onder Antwoorden de knop Maken . Selecteer {{item.input}} in de vervolgkeuzelijst Maken met een sjabloon . Hiermee worden de invoervelden uit ons evaluatiebestand ingevoerd in afzonderlijke prompts voor een nieuwe modeluitvoering die we willen vergelijken met onze evaluatiegegevensset. Het model neemt die invoer en genereert zijn eigen unieke uitvoer die in dit geval wordt opgeslagen in een variabele met de naam {{sample.output_text}}. Vervolgens gebruiken we die voorbeelduitvoertekst later als onderdeel van onze testcriteria. U kunt ook handmatig uw eigen aangepaste systeembericht en afzonderlijke berichtvoorbeelden opgeven.
Selecteer welk model u wilt genereren op basis van uw evaluatie. Als u geen model hebt, kunt u er een maken. Voor dit voorbeeld gebruiken we een standaardimplementatie van gpt-4o-mini.

Het instellingen-/tandwielsymbool bepaalt de basisparameters die aan het model worden doorgegeven. Op dit moment worden alleen de volgende parameters ondersteund:

Temperatuur
Maximumlengte
Bovenste P

De maximale lengte wordt momenteel beperkt tot 2048, ongeacht het model dat u selecteert.

Selecteer Testcriteria toevoegen, selecteer Toevoegen.
Selecteer Semantic Similarity> Under Compare add {{item.output}} under With add {{sample.output_text}}. Hiermee wordt de oorspronkelijke referentie-uitvoer uit uw evaluatiebestand .jsonl gebruikt en vergeleken met de uitvoer die wordt gegenereerd door de modelprompts te geven op basis van uw {{item.input}}.
Selecteer Toevoegen> op dit punt, u kunt aanvullende testcriteria toevoegen of u selecteer Maken om de uitvoering van de evaluatietaak te starten.
Zodra u Maken hebt geselecteerd, wordt u naar een statuspagina voor uw evaluatietaak geleid.
Zodra uw evaluatietaak is gemaakt, kunt u de taak selecteren om de volledige details van de taak weer te geven:
Voor semantische overeenkomsten bevat uitvoerdetails een JSON-weergave die u kunt kopiëren/plakken van uw geslaagde tests.

Details van testcriteria

Azure OpenAI-evaluatie biedt meerdere testcriteriaopties. In de onderstaande sectie vindt u aanvullende informatie over elke optie.

Feitelijkheid

Beoordeelt de feitelijke nauwkeurigheid van een ingediend antwoord door het te vergelijken met een deskundig antwoord.

De feitelijkheid evalueert de feitelijke nauwkeurigheid van een ingediend antwoord door het te vergelijken met een deskundig antwoord. Door gebruik te maken van een gedetailleerde coT-prompt (chain-of-thought), bepaalt de grader of het ingediende antwoord consistent is met, een subset van, een superset van of in conflict is met het antwoord van de expert. Het negeert verschillen in stijl, grammatica of interpunctie, waarbij uitsluitend aandacht wordt besteed aan feitelijke inhoud. Feitelijkheid kan nuttig zijn in veel scenario's, waaronder maar niet beperkt tot inhoudsverificatie en educatieve hulpprogramma's die de nauwkeurigheid van antwoorden van AI garanderen.

U kunt de prompttekst weergeven die wordt gebruikt als onderdeel van deze testcriteria door de vervolgkeuzelijst naast de prompt te selecteren. De huidige prompttekst is:

Prompt
You are comparing a submitted answer to an expert answer on a given question.
Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {ideal}
************
[Submission]: {completion}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Semantische gelijkenis

Meet de mate van gelijkenis tussen het antwoord van het model en de verwijzing. Grades: 1 (completely different) - 5 (very similar).

Gevoel

Pogingen om de emotionele toon van de uitvoer te identificeren.

U kunt de prompttekst weergeven die wordt gebruikt als onderdeel van deze testcriteria door de vervolgkeuzelijst naast de prompt te selecteren. De huidige prompttekst is:

Prompt
You will be presented with a text generated by a large language model. Your job is to rate the sentiment of the text. Your options are:

A) Positive
B) Neutral
C) Negative
D) Unsure

[BEGIN TEXT]
***
[{text}]
***
[END TEXT]

First, write out in a step by step manner your reasoning about the answer to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line

Tekenreekscontrole

Controleert of de uitvoer exact overeenkomt met de verwachte tekenreeks.

Met een tekenreekscontrole worden verschillende binaire bewerkingen uitgevoerd op twee tekenreeksvariabelen, waardoor diverse evaluatiecriteria mogelijk zijn. Het helpt bij het verifiëren van verschillende tekenreeksrelaties, waaronder gelijkheid, insluiting en specifieke patronen. Deze evaluator maakt hoofdlettergevoelige of hoofdlettergevoelige vergelijkingen mogelijk. Het biedt ook opgegeven cijfers voor waar- of onwaarresultaten, waardoor aangepaste evaluatieresultaten worden toegestaan op basis van het vergelijkingsresultaat. Dit is het type bewerkingen dat wordt ondersteund:

equals: Controleert of de uitvoertekenreeks exact gelijk is aan de evaluatietekenreeks.
contains: Controleert of de evaluatietekenreeks een subtekenreeks van de uitvoertekenreeks is.
starts-with: Controleert of de uitvoertekenreeks begint met de evaluatietekenreeks.
ends-with: Controleert of de uitvoertekenreeks eindigt op de evaluatietekenreeks.

Notitie

Wanneer u bepaalde parameters instelt in uw testcriteria, kunt u kiezen tussen de variabele en de sjabloon. Selecteer een variabele als u naar een kolom in uw invoergegevens wilt verwijzen. Kies een sjabloon als u een vaste tekenreeks wilt opgeven.

Geldige JSON of XML

Controleert of de uitvoer geldige JSON of XML is.

Komt overeen met het schema

Controleert of de uitvoer de opgegeven structuur volgt.

Criteriaovereenkomst

Bepaal of het antwoord van het model overeenkomt met uw criteria. Cijfer: slagen of mislukken.

U kunt de prompttekst weergeven die wordt gebruikt als onderdeel van deze testcriteria door de vervolgkeuzelijst naast de prompt te selecteren. De huidige prompttekst is:

Prompt
Your job is to assess the final response of an assistant based on conversation history and provided criteria for what makes a good response from the assistant. Here is the data:

[BEGIN DATA]
***
[Conversation]: {conversation}
***
[Response]: {response}
***
[Criteria]: {criteria}
***
[END DATA]

Does the response meet the criteria? First, write out in a step by step manner your reasoning about the criteria to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. "Y" for yes if the response meets the criteria, and "N" for no if it does not. At the end, repeat just the letter again by itself on a new line.
Reasoning:

Tekstkwaliteit

Evalueer de kwaliteit van tekst door te vergelijken met verwijzingstekst.

Summary:

BLEU Score: Evalueert de kwaliteit van gegenereerde tekst door deze te vergelijken met een of meer hoogwaardige referentievertalingen met behulp van de BLEU-score.
ROUGE Score: evalueert de kwaliteit van gegenereerde tekst door deze te vergelijken met referentieoverzichten met behulp van ROUGE-scores.
Cosinus: ook wel cosinus-overeenkomsten genoemd, meet hoe dicht twee tekstinsluitingen, zoals modeluitvoer en referentieteksten, in betekenis worden uitgelijnd, zodat de semantische gelijkenis tussen beide wordt beoordeeld. Dit wordt gedaan door hun afstand in vectorruimte te meten.

Details:

DE SCORE VOOR DE BLAUW-evaluatie (BiUlatl Evaluation Understudy) wordt vaak gebruikt in natuurlijke taalverwerking (NLP) en machinevertaling. Deze wordt veel gebruikt in gebruiksvoorbeelden voor tekstsamenvatting en het genereren van tekst. Hiermee wordt geëvalueerd hoe dicht de gegenereerde tekst overeenkomt met de verwijzingstekst. De BLUE score varieert van 0 tot 1, met hogere scores die een betere kwaliteit aangeven.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is een set metrische gegevens die worden gebruikt om automatische samenvatting en automatische vertaling te evalueren. Hiermee wordt de overlap tussen gegenereerde tekst en referentieoverzichten berekend. ROUGE richt zich op relevante maatregelen om te beoordelen hoe goed de gegenereerde tekst de verwijzingstekst bedekt. De ROUGE-score biedt verschillende metrische gegevens, waaronder: • ROUGE-1: Overlap van unigrammen (enkele woorden) tussen gegenereerde en referentietekst. • ROUGE-2: Overlapping van bigrams (twee opeenvolgende woorden) tussen gegenereerde en verwijzingstekst. • ROUGE-3: Overlapping van trigrammen (drie opeenvolgende woorden) tussen gegenereerde en referentietekst. • ROUGE-4: Overlap van vier gram (vier opeenvolgende woorden) tussen gegenereerde en verwijzingstekst. • ROUGE-5: Overlap van vijf gram (vijf opeenvolgende woorden) tussen gegenereerde en referentietekst. • ROUGE-L: Overlapping van L-grammen (L opeenvolgende woorden) tussen gegenereerde en verwijzingstekst. Tekstsamenvatting en documentvergelijking zijn een van de optimale gebruiksvoorbeelden voor ROUGE, met name in scenario's waarin tekstcoherentie en relevantie essentieel zijn.

Cosinus-overeenkomsten meten hoe dicht twee tekstinsluitingen, zoals modeluitvoer en referentieteksten, in betekenis worden uitgelijnd, zodat de semantische gelijkenis tussen beide wordt beoordeeld. Hetzelfde als andere op modellen gebaseerde evaluators, moet u een modelimplementatie bieden die wordt gebruikt voor evaluatie.

Belangrijk

Alleen het insluiten van modellen wordt ondersteund voor deze evaluator:

text-embedding-3-small
text-embedding-3-large
text-embedding-ada-002

Aangepaste prompt

Hiermee wordt het model gebruikt om de uitvoer te classificeren in een set opgegeven labels. Deze evaluator gebruikt een aangepaste prompt die u moet definiëren.

Delen via

Azure OpenAI-evaluatie (preview)

Ondersteuning voor evaluaties

Regionale beschikbaarheid

Ondersteunde implementatietypen

Evaluatiepijplijn

Testgegevens

Evaluatie-indeling

Antwoorden maken (optioneel)

Modelimplementatie

Testcriteria

Aan de slag

Details van testcriteria

Feitelijkheid

Semantische gelijkenis

Gevoel

Tekenreekscontrole

Geldige JSON of XML

Komt overeen met het schema

Criteriaovereenkomst

Tekstkwaliteit

Aangepaste prompt

Feedback

Aanvullende resources