Tutorial: Part 3 - Evaluate a custom chat application with the Azure AI Foundry SDK

Article
02/25/2025

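In this tutorial, you use the Azure AI SDK (and other libraries) to evaluate the chat application you built in Part 2 of the tutorial series. In this third part, you learn how to:

Create an evaluation dataset
Evaluate the chat application with Azure AI evaluators
Iterate on and improve your application

This tutorial is part three of a three-part tutorial.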

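Prerequisites

Complete Part 2 of the tutorial series to build the chat application.
Make sure you've completed the steps in Part 2 that add telemetry logging.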

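Evaluate the quality of the chat app responses

Now that you know your chat application responds well to your queries, including with chat history, it's time to evaluate how it performs across different metrics and on more data.

You use an evaluator with an evaluation dataset and the evaluate_chat_with_products() target function, which is defined later in this tutorial, and then assess the evaluation results.

After you run an evaluation, you can improve your logic, for example by refining the system prompt, and observe how the chat app responses change and improve.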

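Create evaluation dataset

Use the following evaluation dataset, which contains example questions and expected answers (truth).

Create a file called chat_eval_data.jsonl in your assets folder.

Paste this dataset into the file: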

{"query": "Which tent is the most waterproof?", "truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"}
{"query": "Which camping table holds the most weight?", "truth": "The Adventure Dining Table has a higher weight capacity than all of the other camping tables mentioned"}
{"query": "How much do the TrailWalker Hiking Shoes cost? ", "truth": "The TrailWalker Hiking Shoes are priced at $110"}
{"query": "What is the proper care for trailwalker hiking shoes? ", "truth": "After each use, remove any dirt or debris by brushing or wiping the shoes with a damp cloth."}
{"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}
{"query": "How do I carry the TrailMaster tent around? ", "truth": " Carry bag included for convenient storage and transportation"}
{"query": "What is the floor area for Floor Area? ", "truth": "80 square feet"}
{"query": "What is the material for TrailBlaze Hiking Pants?", "truth": "Made of high-quality nylon fabric"}
{"query": "What color does TrailBlaze Hiking Pants come in?", "truth": "Khaki"}
{"query": "Can the warrenty for TrailBlaze pants be transfered? ", "truth": "The warranty is non-transferable and applies only to the original purchaser of the TrailBlaze Hiking Pants. It is valid only when the product is purchased from an authorized retailer."}
{"query": "How long are the TrailBlaze pants under warranty for? ", "truth": " The TrailBlaze Hiking Pants are backed by a 1-year limited warranty from the date of purchase."}
{"query": "What is the material for PowerBurner Camping Stove? ", "truth": "Stainless Steel"}
{"query": "Is France in Europe?", "truth": "Sorry, I can only answer queries related to outdoor/camping gear and equipment"}
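Before running an evaluation, it can help to confirm that every line of the dataset parses as JSON and carries both fields. The following is a minimal sketch using only the Python standard library. The helper name validate_jsonl and the sample lines are illustrative assumptions, not part of the tutorial's code.

```python
import json

# Field names used by the dataset above.
REQUIRED_KEYS = {"query", "truth"}


def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems


# Illustrative input: one valid record, one missing "truth", one that is not JSON.
sample = [
    '{"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}',
    '{"query": "Is France in Europe?"}',
    "not json",
]
print(validate_jsonl(sample))
```

To check the real file, pass an open file object instead of the sample list, since iterating over a text file yields one line at a time.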

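Evaluate with Azure AI evaluators

Now define an evaluation script that will:

Generate a target function wrapper around the chat app logic.
Load the sample .jsonl dataset.
Run the evaluation, which takes the target function and merges the evaluation dataset with the responses from the chat app.
Generate a set of GPT-assisted metrics (relevance, groundedness, and coherence) to evaluate the quality of the chat app responses. The script in this tutorial configures the groundedness evaluator only.
Output the results locally, and log the results to the cloud project.

The script lets you review the results locally by writing them to the command line and to a JSON file.

The script also logs the evaluation results to the cloud project so that you can compare evaluation runs in the UI.

Create a file called evaluate.py in your main folder.

Add the following code to import the required libraries, create a project client, and configure some settings: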

import os
import pandas as pd
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import ConnectionType
from azure.ai.evaluation import evaluate, GroundednessEvaluator
from azure.identity import DefaultAzureCredential

from chat_with_products import chat_with_products

# load environment variables from the .env file at the root of this repo
from dotenv import load_dotenv

load_dotenv()

# create a project client using environment variables loaded from the .env file
project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
)

connection = project.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI, include_credentials=True)

evaluator_model = {
    "azure_endpoint": connection.endpoint_url,
    "azure_deployment": os.environ["EVALUATION_MODEL"],
    "api_version": "2024-06-01",
    "api_key": connection.key,
}

groundedness = GroundednessEvaluator(evaluator_model)
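The script reads two settings from the environment, AIPROJECT_CONNECTION_STRING and EVALUATION_MODEL, and raises a KeyError if either is missing. As a sketch, a small check like the one below could report missing settings up front. The helper missing_settings is hypothetical and not part of the tutorial's code.

```python
import os

# Variable names taken from the script above; the helper itself is an assumption.
REQUIRED_VARS = ("AIPROJECT_CONNECTION_STRING", "EVALUATION_MODEL")


def missing_settings(environ=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not environ.get(name, "").strip()]


# Demonstration with a plain dict instead of the real environment.
example_env = {
    "AIPROJECT_CONNECTION_STRING": "<connection-string>",
    "EVALUATION_MODEL": "",
}
print(missing_settings(example_env))  # ['EVALUATION_MODEL']
```

Calling missing_settings() with no arguments after load_dotenv() would check the real environment.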

Add code to create a wrapper function that implements the evaluation interface for query and response evaluation:

def evaluate_chat_with_products(query):
    response = chat_with_products(messages=[{"role": "user", "content": query}])
    return {"response": response["message"].content, "context": response["context"]["grounding_data"]}
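The wrapper's job is to turn one query into a dictionary with response and context keys. To exercise that mapping without calling Azure, one option is to inject a stand-in chat function. The sketch below assumes the response shape the wrapper reads (a message object with a content attribute, and a context dictionary with grounding_data). The names fake_chat_with_products and evaluate_target are hypothetical.

```python
from types import SimpleNamespace


def fake_chat_with_products(messages):
    """Hypothetical stand-in that mimics the response shape the wrapper reads."""
    question = messages[-1]["content"]
    return {
        "message": SimpleNamespace(content=f"(stub answer to: {question})"),
        "context": {"grounding_data": ["(stub product document)"]},
    }


def evaluate_target(query, chat_fn=fake_chat_with_products):
    """Same mapping as evaluate_chat_with_products, with the chat function injectable."""
    response = chat_fn(messages=[{"role": "user", "content": query}])
    return {
        "response": response["message"].content,
        "context": response["context"]["grounding_data"],
    }


output = evaluate_target("What brand is TrailMaster tent?")
print(sorted(output))  # ['context', 'response']
```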

Finally, add code to run the evaluation, view the results locally, and get a link to the evaluation results in the Azure AI Foundry portal:

# Evaluate must be called inside of __main__, not on import
if __name__ == "__main__":
    from config import ASSET_PATH

    # workaround for multiprocessing issue on linux
    from pprint import pprint
    from pathlib import Path
    import multiprocessing
    import contextlib

    with contextlib.suppress(RuntimeError):
        multiprocessing.set_start_method("spawn", force=True)

    # run evaluation with a dataset and target function, log to the project
    result = evaluate(
        data=Path(ASSET_PATH) / "chat_eval_data.jsonl",
        target=evaluate_chat_with_products,
        evaluation_name="evaluate_chat_with_products",
        evaluators={
            "groundedness": groundedness,
        },
        # map each evaluator input to a dataset column or a target output
        evaluator_config={
            "default": {
                "query": "${data.query}",
                "response": "${target.response}",
                "context": "${target.context}",
            }
        },
        azure_ai_project=project.scope,
        output_path="./myevalresults.json",
    )

    tabular_result = pd.DataFrame(result.get("rows"))

    pprint("-----Summarized Metrics-----")
    pprint(result["metrics"])
    pprint("-----Tabular Result-----")
    pprint(tabular_result)
    pprint(f"View evaluation results in Azure AI Foundry portal: {result['studio_url']}")
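The ${data.…} and ${target.…} placeholders in evaluator_config name where each evaluator input comes from: a column of the dataset row or a key of the target function's output. The sketch below illustrates that idea only. It is not the SDK's implementation, and resolve_mapping is a hypothetical helper.

```python
import re

# Matches "${data.<field>}" or "${target.<field>}".
PLACEHOLDER = re.compile(r"^\$\{(data|target)\.([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_mapping(mapping, data_row, target_output):
    """Resolve placeholder templates into concrete evaluator inputs."""
    sources = {"data": data_row, "target": target_output}
    resolved = {}
    for argument, template in mapping.items():
        match = PLACEHOLDER.match(template)
        if match is None:
            raise ValueError(f"unsupported placeholder: {template!r}")
        source, field = match.groups()
        resolved[argument] = sources[source][field]
    return resolved


mapping = {
    "query": "${data.query}",
    "response": "${target.response}",
    "context": "${target.context}",
}
row = {"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}
target = {"response": "(stub response)", "context": "(stub context)"}
print(resolve_mapping(mapping, row, target))
```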

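Configure the evaluation model

Because the evaluation script calls the model many times, you might want to increase the number of tokens per minute for the evaluation model.

In Part 1 of this tutorial series, you created an .env file that specifies the name of the evaluation model, gpt-4o-mini. If you have available quota, try to increase the tokens per minute limit for this model. If you don't have enough quota to increase the value, don't worry. The script is designed to handle limit errors.

In your project in the Azure AI Foundry portal, select Models + endpoints.
Select gpt-4o-mini.
Select Edit.
If you have quota to increase the Tokens per Minute Rate Limit, try increasing it to 30 or higher.
Select Save and close.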
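The tutorial says the script tolerates rate-limit errors but doesn't describe how. A common general pattern for this situation is retrying with exponential backoff and jitter, sketched below. This is a generic illustration under stated assumptions, not the mechanism the evaluation SDK uses. RateLimitError and call_with_backoff are hypothetical names.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a 'too many requests' error."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying with exponential backoff plus jitter on rate limits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the error
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Demonstration: a call that is rate limited twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("rate limit exceeded")
    return "ok"


delays = []  # record the delays instead of actually sleeping
print(call_with_backoff(flaky_call, sleep=delays.append))  # ok
```

Passing sleep as a parameter keeps the demonstration instant and makes the retry schedule easy to inspect.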

Run the evaluation script

From your console, sign in to your Azure account with the Azure CLI:
```
az login
```

Install the required package:

pip install "azure-ai-evaluation[remote]"

Now run the evaluation script:
```
python evaluate.py
```

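Expect the evaluation to take a few minutes to complete.

Interpret the evaluation output

In the console output, you see an answer for each question, followed by a table with summarized metrics. (You might see different columns in your output.)

If you weren't able to increase the tokens per minute limit for your model, you might see some time-out errors, which are expected. The evaluation script is designed to handle these errors and continue running.

Note

You might also see many WARNING:opentelemetry.attributes: messages. These can be safely ignored and don't affect the evaluation results.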

====================================================
'-----Summarized Metrics-----'
{'groundedness.gpt_groundedness': 1.6666666666666667,
 'groundedness.groundedness': 1.6666666666666667}
'-----Tabular Result-----'
                                     outputs.response  ... line_number
0   Could you specify which tent you are referring...  ...           0
1   Could you please specify which camping table y...  ...           1
2   Sorry, I only can answer queries related to ou...  ...           2
3   Could you please clarify which aspects of care...  ...           3
4   Sorry, I only can answer queries related to ou...  ...           4
5   The TrailMaster X4 Tent comes with an included...  ...           5
6                                            (Failed)  ...           6
7   The TrailBlaze Hiking Pants are crafted from h...  ...           7
8   Sorry, I only can answer queries related to ou...  ...           8
9   Sorry, I only can answer queries related to ou...  ...           9
10  Sorry, I only can answer queries related to ou...  ...          10
11  The PowerBurner Camping Stove is designed with...  ...          11
12  Sorry, I only can answer queries related to ou...  ...          12

[13 rows x 8 columns]
('View evaluation results in Azure AI Foundry portal: '
 'https://xxxxxxxxxxxxxxxxxxxxxxx')
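To find which questions pull the summary score down, you can inspect the per-row scores. The sketch below assumes that each entry of result["rows"] carries a numeric per-row score and that the summarized value is the mean of those scores. The column name outputs.groundedness.groundedness is an assumption (check the columns in your own output), and the rows are hypothetical stand-ins rather than real results.

```python
import math
from statistics import mean

SCORE_KEY = "outputs.groundedness.groundedness"  # assumed column name


def is_score(value):
    """True for real numbers; failed rows may hold None or NaN instead."""
    return isinstance(value, (int, float)) and not math.isnan(value)


def summarize(rows, score_key=SCORE_KEY, threshold=3):
    """Average the valid scores and list the rows that fall below the threshold."""
    scored = [row for row in rows if is_score(row.get(score_key))]
    average = mean(row[score_key] for row in scored) if scored else None
    low = [row["line_number"] for row in scored if row[score_key] < threshold]
    return average, low


# Hypothetical rows, shaped like entries of result["rows"].
rows = [
    {"line_number": 0, "outputs.response": "(stub)", SCORE_KEY: 1},
    {"line_number": 1, "outputs.response": "(stub)", SCORE_KEY: 5},
    {"line_number": 2, "outputs.response": "(Failed)", SCORE_KEY: None},
]
average, low_rows = summarize(rows)
print(average, low_rows)
```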

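View evaluation results in Azure AI Foundry portal

After the evaluation run completes, follow the link to view the evaluation results on the Evaluation page in the Azure AI Foundry portal.

Screenshot: evaluation overview in the Azure AI Foundry portal.

You can also look at the individual rows, see the metric scores for each row, and view the full context/documents that were retrieved. These metrics can help you interpret and debug evaluation results.

Screenshot: rows of evaluation results in the Azure AI Foundry portal.

For more information about evaluation results in the Azure AI Foundry portal, see How to view evaluation results in Azure AI Foundry portal.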

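Iterate and improve

Notice that the responses are not well grounded. In many cases, the model replies with a question rather than an answer. This is a result of the prompt template instructions.

In your assets/grounded_chat.prompty file, find the sentence "If the question is not related to outdoor/camping gear and clothing, just say 'Sorry, I only can answer queries related to outdoor/camping gear and clothing. So, how can I help?'"
Change the sentence to "If the question is related to outdoor/camping gear and clothing but vague, try to answer based on the reference documents, then ask for clarifying questions."
Save the file and rerun the evaluation script.

Try other modifications to the prompt template, or try different models, to see how the changes affect the evaluation results.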
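When you iterate, it helps to compare the summarized metrics of two runs side by side. The sketch below assumes you have kept the result["metrics"] dictionary from each run. The helper compare_metrics is hypothetical, and the numbers are placeholders rather than real results.

```python
def compare_metrics(before, after):
    """Return {metric: (before, after, delta)} for metrics present in both runs."""
    shared = sorted(set(before) & set(after))
    return {
        name: (before[name], after[name], round(after[name] - before[name], 4))
        for name in shared
    }


# Placeholder values standing in for result["metrics"] from two runs.
baseline = {"groundedness.groundedness": 2.0}
candidate = {"groundedness.groundedness": 3.5}

for name, (old, new, delta) in compare_metrics(baseline, candidate).items():
    print(f"{name}: {old} -> {new} ({delta:+})")
```

A positive delta means the metric improved after the change. Because each run makes fresh model calls, small differences between runs may be noise.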

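Clean up resources

To avoid incurring unnecessary Azure costs, delete the resources you created in this tutorial if you no longer need them. To manage resources, you can use the Azure portal.

Learn more about Azure AI Foundry SDK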
