자습서: 3부 - Azure AI Foundry SDK를 사용하여 사용자 지정 채팅 애플리케이션 평가

아티클
02/25/2025

이 자습서에서는 Azure AI SDK(및 기타 라이브러리)를 사용하여 자습서 시리즈의 2부에서 빌드한 채팅 앱을 평가합니다. 이 3부에서는 다음을 수행하는 방법을 알아봅니다.

평가 데이터 세트 만들기
Azure AI 평가자를 사용하여 채팅 앱 평가
앱 반복 및 개선

이 자습서는 3부로 구성된 자습서 중 제3부입니다.

필수 조건

자습서 시리즈의 2부를 완료하여 채팅 애플리케이션을 빌드합니다.
2부에서 원격 분석 로깅을 추가하는 단계를 완료했는지 확인합니다.

채팅 앱 응답의 품질 평가

이제 채팅 기록을 사용하는 것을 포함하여 채팅 앱이 쿼리에 잘 응답한다는 것을 알게 되었으므로 이제 몇 가지 다른 메트릭과 더 많은 데이터에서 채팅 앱이 어떻게 작동하는지 평가할 차례입니다.

평가 데이터 세트와 get_chat_response() 대상 함수가 있는 평가기를 사용한 다음 평가 결과를 평가합니다.

평가를 실행하면 시스템 프롬프트를 개선하고 채팅 앱 응답이 어떻게 변경되고 개선되는지 관찰하는 등 논리를 개선할 수 있습니다.

평가 데이터 세트 만들기

예시 질문과 예상 답변(truth)이 포함된 다음 평가 데이터 세트를 사용합니다.

자산 폴더에 chat_eval_data.jsonl이라는 파일을 만듭니다.

다음 데이터 세트를 파일에 붙여넣습니다.

{"query": "Which tent is the most waterproof?", "truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"}
{"query": "Which camping table holds the most weight?", "truth": "The Adventure Dining Table has a higher weight capacity than all of the other camping tables mentioned"}
{"query": "How much do the TrailWalker Hiking Shoes cost? ", "truth": "The Trailewalker Hiking Shoes are priced at $110"}
{"query": "What is the proper care for trailwalker hiking shoes? ", "truth": "After each use, remove any dirt or debris by brushing or wiping the shoes with a damp cloth."}
{"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}
{"query": "How do I carry the TrailMaster tent around? ", "truth": " Carry bag included for convenient storage and transportation"}
{"query": "What is the floor area for Floor Area? ", "truth": "80 square feet"}
{"query": "What is the material for TrailBlaze Hiking Pants?", "truth": "Made of high-quality nylon fabric"}
{"query": "What color does TrailBlaze Hiking Pants come in?", "truth": "Khaki"}
{"query": "Can the warrenty for TrailBlaze pants be transfered? ", "truth": "The warranty is non-transferable and applies only to the original purchaser of the TrailBlaze Hiking Pants. It is valid only when the product is purchased from an authorized retailer."}
{"query": "How long are the TrailBlaze pants under warranty for? ", "truth": " The TrailBlaze Hiking Pants are backed by a 1-year limited warranty from the date of purchase."}
{"query": "What is the material for PowerBurner Camping Stove? ", "truth": "Stainless Steel"}
{"query": "Is France in Europe?", "truth": "Sorry, I can only queries related to outdoor/camping gear and equipment"}

Azure AI 평가자를 사용하여 평가

이제 다음을 수행할 평가 스크립트를 정의합니다.

채팅 앱 논리를 중심으로 대상 함수 래퍼를 생성합니다.
샘플 .jsonl 데이터 세트를 로드합니다.
대상 함수를 사용하고 평가 데이터 세트를 채팅 앱의 응답과 병합하는 평가를 실행합니다.
채팅 앱 응답의 품질을 평가하기 위해 GPT 지원 메트릭 집합(관련성, 접지성 및 일관성)을 생성합니다.
결과를 로컬로 출력하고 결과를 클라우드 프로젝트에 기록합니다.

이 스크립트를 사용하면 명령줄 및 json 파일에 결과를 출력하여 결과를 로컬로 검토할 수 있습니다.

또한 이 스크립트는 UI에서 평가 실행을 비교할 수 있도록 평가 결과를 클라우드 프로젝트에 기록합니다.

주 폴더에 evaluate.py 파일을 만듭니다.

다음 코드를 추가하여 필요한 라이브러리를 가져오고, 프로젝트 클라이언트를 만들고, 일부 설정을 구성합니다.

import os
import pandas as pd
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import ConnectionType
from azure.ai.evaluation import evaluate, GroundednessEvaluator
from azure.identity import DefaultAzureCredential

from chat_with_products import chat_with_products

# load environment variables from the .env file at the root of this repo
from dotenv import load_dotenv

load_dotenv()

# create a project client using environment variables loaded from the .env file
project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
)

connection = project.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI, include_credentials=True)

evaluator_model = {
    "azure_endpoint": connection.endpoint_url,
    "azure_deployment": os.environ["EVALUATION_MODEL"],
    "api_version": "2024-06-01",
    "api_key": connection.key,
}

groundedness = GroundednessEvaluator(evaluator_model)

쿼리 및 응답 평가를 위해 평가 인터페이스를 구현하는 래퍼 함수를 만드는 코드를 추가합니다.

def evaluate_chat_with_products(query):
    response = chat_with_products(messages=[{"role": "user", "content": query}])
    return {"response": response["message"].content, "context": response["context"]["grounding_data"]}

마지막으로, 평가를 실행하고, 결과를 로컬로 보고, Azure AI Foundry 포털에서 평가 결과에 대한 링크를 제공하는 코드를 추가합니다.

# Evaluate must be called inside of __main__, not on import
if __name__ == "__main__":
    from config import ASSET_PATH

    # workaround for multiprocessing issue on linux
    from pprint import pprint
    from pathlib import Path
    import multiprocessing
    import contextlib

    with contextlib.suppress(RuntimeError):
        multiprocessing.set_start_method("spawn", force=True)

    # run evaluation with a dataset and target function, log to the project
    result = evaluate(
        data=Path(ASSET_PATH) / "chat_eval_data.jsonl",
        target=evaluate_chat_with_products,
        evaluation_name="evaluate_chat_with_products",
        evaluators={
            "groundedness": groundedness,
        },
        evaluator_config={
            "default": {
                "query": {"${data.query}"},
                "response": {"${target.response}"},
                "context": {"${target.context}"},
            }
        },
        azure_ai_project=project.scope,
        output_path="./myevalresults.json",
    )

    tabular_result = pd.DataFrame(result.get("rows"))

    pprint("-----Summarized Metrics-----")
    pprint(result["metrics"])
    pprint("-----Tabular Result-----")
    pprint(tabular_result)
    pprint(f"View evaluation results in AI Studio: {result['studio_url']}")

평가 모델 구성

평가 스크립트는 모델을 여러 번 호출하므로 평가 모델의 분당 토큰 수를 늘릴 수 있습니다.

이 자습서 시리즈의 1부에서는 평가 모델의 gpt-4o-mini이름을 지정하는 .env 파일을 만들었습니다. 사용 가능한 할당량이 있는 경우 이 모델에 대한 분당 토큰 제한을 늘려 보세요. 값을 늘릴 할당량이 충분하지 않은 경우 걱정하지 마세요. 스크립트는 제한 오류를 처리하도록 설계되었습니다.

Azure AI Foundry 포털의 프로젝트에서 모델 + 엔드포인트를 선택합니다.
gpt-4o-mini를 선택합니다.
편집을 선택합니다.
분당 토큰을 늘릴 할당량이 있는 경우 30 이상으로 늘려 보세요.
저장 후 닫기를 선택합니다.

평가 스크립트 실행

콘솔에서 Azure CLI를 사용하여 Azure 계정에 로그인합니다.
```
az login
```

필요한 패키지를 설치합니다.

pip install azure-ai-evaluation[remote]

이제 평가 스크립트를 실행합니다.
```
python evaluate.py
```

평가를 완료하는 데 몇 분 정도 걸릴 것으로 예상합니다.

평가 출력 해석

콘솔 출력에는 각 질문에 대한 답변과 요약된 메트릭이 있는 테이블이 표시됩니다. (출력에 다른 열이 표시될 수 있습니다.)

모델에 대한 분당 토큰 제한을 늘릴 수 없는 경우 예상되는 몇 가지 시간 제한 오류가 표시될 수 있습니다. 평가 스크립트는 이러한 오류를 처리하고 계속 실행하도록 설계되었습니다.

참고 항목

많은 항목이 표시 WARNING:opentelemetry.attributes: 될 수도 있습니다. 이러한 항목은 안전하게 무시될 수 있으며 평가 결과에 영향을 미치지 않습니다.

====================================================
'-----Summarized Metrics-----'
{'groundedness.gpt_groundedness': 1.6666666666666667,
 'groundedness.groundedness': 1.6666666666666667}
'-----Tabular Result-----'
                                     outputs.response  ... line_number
0   Could you specify which tent you are referring...  ...           0
1   Could you please specify which camping table y...  ...           1
2   Sorry, I only can answer queries related to ou...  ...           2
3   Could you please clarify which aspects of care...  ...           3
4   Sorry, I only can answer queries related to ou...  ...           4
5   The TrailMaster X4 Tent comes with an included...  ...           5
6                                            (Failed)  ...           6
7   The TrailBlaze Hiking Pants are crafted from h...  ...           7
8   Sorry, I only can answer queries related to ou...  ...           8
9   Sorry, I only can answer queries related to ou...  ...           9
10  Sorry, I only can answer queries related to ou...  ...          10
11  The PowerBurner Camping Stove is designed with...  ...          11
12  Sorry, I only can answer queries related to ou...  ...          12

[13 rows x 8 columns]
('View evaluation results in Azure AI Foundry portal: '
 'https://xxxxxxxxxxxxxxxxxxxxxxx')

Azure AI Foundry 포털에서 평가 결과 보기

평가 실행이 완료되면 링크를 따라 Azure AI Foundry 포털의 평가 페이지에서 평가 결과를 확인합니다.

스크린샷은 Azure AI Foundry 포털의 평가 개요를 보여줍니다.

개별 행을 보고, 행당 메트릭 점수를 확인하며, 검색된 전체 컨텍스트/문서를 볼 수도 있습니다. 이러한 메트릭은 평가 결과를 해석하고 디버깅하는 데 유용할 수 있습니다.

스크린샷은 Azure AI Foundry 포털의 평가 결과 행을 보여줍니다.

Azure AI Foundry 포털의 평가 결과에 대한 자세한 내용은 Azure AI Foundry 포털에서 평가 결과를 보는 방법을 참조 하세요.

반복 및 개선

응답이 잘 접지되지 않습니다. 대부분의 경우 모델은 답변이 아닌 질문으로 응답합니다. 프롬프트 템플릿 지침의 결과입니다.

assets/grounded_chat.prompty 파일에서 "질문이 야외/캠핑 장비 및 의류와 관련이 없는 경우 '죄송합니다. 야외/캠핑 장비 및 의류와 관련된 쿼리에만 대답할 수 있습니다. 그래서, 어떻게 도울 수 있습니까?'"
문장을 "질문이 야외/캠핑 장비 및 의류와 관련이 있지만 모호한 경우 참조 문서에 따라 답변한 다음 명확한 질문을 요청합니다."
파일을 저장하고 평가 스크립트를 다시 실행합니다.

프롬프트 템플릿에 대한 다른 수정을 시도하거나 다른 모델을 시도하여 변경 내용이 평가 결과에 미치는 영향을 확인합니다.

리소스 정리

불필요한 Azure 비용이 발생하지 않도록 하려면 이 자습서에서 만든 리소스가 더 이상 필요하지 않은 경우 삭제해야 합니다. 리소스를 관리하려면 Azure Portal을 사용할 수 있습니다.

Azure AI Foundry SDK에 대해 자세히 알아보기

다음을 통해 공유

자습서: 3부 - Azure AI Foundry SDK를 사용하여 사용자 지정 채팅 애플리케이션 평가

필수 조건

채팅 앱 응답의 품질 평가

평가 데이터 세트 만들기

Azure AI 평가자를 사용하여 평가

평가 모델 구성

평가 스크립트 실행

평가 출력 해석

Azure AI Foundry 포털에서 평가 결과 보기

반복 및 개선

리소스 정리

피드백

추가 리소스

다음을 통해 공유

자습서: 3부 - Azure AI Foundry SDK를 사용하여 사용자 지정 채팅 애플리케이션 평가

필수 조건

채팅 앱 응답의 품질 평가

평가 데이터 세트 만들기

Azure AI 평가자를 사용하여 평가

평가 모델 구성

평가 스크립트 실행

평가 출력 해석

Azure AI Foundry 포털에서 평가 결과 보기

반복 및 개선

리소스 정리

관련 콘텐츠

피드백

추가 리소스