Evaluation sets
Important
This feature is in Public Preview.
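To measure the quality of an agentic application, you need to be able to define a representative set of requests along with criteria that characterize high-quality responses. You do that by providing an evaluation set. This article covers the various options for your evaluation set and some best practices for creating one.
Databricks recommends creating a human-labeled evaluation set, which consists of representative questions and ground-truth answers. If your application includes a retrieval step, you can optionally provide the supporting documents on which you expect the response to be based. Although a human-labeled evaluation set is recommended, Agent Evaluation works equally well with synthetically generated evaluation sets.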
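A good evaluation set has the following characteristics:
- Representative: It should accurately reflect the range of requests the application will encounter in production.
- Challenging: It should include difficult and diverse cases to effectively test the full range of the application's capabilities.
- Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.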
For the required schema of an evaluation set, see Agent Evaluation input schema.
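Before running an evaluation, it can help to sanity-check the rows of an evaluation set. The following is a minimal sketch, not part of Agent Evaluation: the helper `check_eval_set` and the `KNOWN_FIELDS` set are hypothetical, and the field names are limited to the ones used in the sample evaluation sets in this article. The full input schema may define additional fields or stricter rules.

```python
# Field names taken from the sample evaluation sets in this article.
# Assumption: the real input schema may allow more fields than these.
KNOWN_FIELDS = {
    "request_id",
    "request",
    "response",
    "expected_response",
    "retrieved_context",
    "expected_retrieved_context",
}


def check_eval_set(eval_set):
    """Return a list of human-readable problems found in eval_set."""
    problems = []
    for i, row in enumerate(eval_set):
        # Every sample in this article carries a non-empty request.
        if not row.get("request"):
            problems.append(f"row {i}: missing or empty 'request'")
        # Catch misspelled keys such as 'expected_retrieved_content'.
        unknown = sorted(set(row) - KNOWN_FIELDS)
        if unknown:
            problems.append(f"row {i}: unexpected field(s) {unknown}")
    return problems


print(check_eval_set([
    {"request": "What is the difference between reduceByKey and groupByKey in Spark?"},
    {"request": "Another question", "expected_retrieved_content": []},
]))
```

A check like this is most useful for catching misspelled field names early, since a misspelled key would otherwise simply be absent from the data the evaluation sees.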
Sample evaluation sets
This section includes simple examples of evaluation sets.
Sample evaluation set with only request
eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]
Sample evaluation set with request and expected_response
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]
Sample evaluation set with request, expected_response, and expected_retrieved_context
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]
Sample evaluation set with only request and response
eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]
Sample evaluation set with request, response, and retrieved_context
eval_set = [
    {
        "request_id": "request-id",  # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
Sample evaluation set with request, response, retrieved_context, and expected_response
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
Sample evaluation set with request, response, retrieved_context, expected_response, and expected_retrieved_context
eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
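When a row carries both expected_retrieved_context and retrieved_context, the two lists can be compared by doc_uri. The following is a minimal sketch that assumes a simple set-based recall; the helper `doc_recall` is hypothetical and is meant only to illustrate how the two fields relate, not to describe how Agent Evaluation scores retrieval.

```python
def doc_recall(row):
    """Fraction of expected doc_uri values that also appear in retrieved_context."""
    expected = {d["doc_uri"] for d in row.get("expected_retrieved_context", [])}
    retrieved = {d["doc_uri"] for d in row.get("retrieved_context", [])}
    if not expected:
        return None  # no ground-truth documents to compare against
    return len(expected & retrieved) / len(expected)


# Same doc_uri values as the sample evaluation set above.
row = {
    "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    "expected_retrieved_context": [
        {"doc_uri": "doc_uri_2_1"},
        {"doc_uri": "doc_uri_2_2"},
    ],
    "retrieved_context": [
        {"doc_uri": "doc_uri_2_1"},
        {"doc_uri": "doc_uri_6_extra"},
    ],
}

print(doc_recall(row))  # 0.5: one of the two expected documents was retrieved
```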
Best practices for developing an evaluation set
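- Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and the ability to infer answers from indirect evidence.
- Consider testing adversarial scenarios from malicious users.
- There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.
- Consider including examples that are very challenging, even for humans to answer.
- Whether you are building a general-purpose application or targeting a specific domain, your app will likely encounter a wide variety of questions. The evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still consider testing other domains (for example, operations) to ensure that the application does not hallucinate or provide harmful responses.
- High-quality, consistent human-generated labels are the best way to ensure that the ground-truth values you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels are the following:
  - Aggregate responses (labels) from multiple human labelers for the same question.
  - Ensure that labeling instructions are clear and that the human labelers are consistent.
  - Ensure that the conditions of the human-labeling process match the format of requests submitted to the RAG application.
- Human labelers are by nature noisy and inconsistent, for example because of different interpretations of the question. This is an important part of the process. Using human labeling can reveal interpretations of questions that you had not considered, and that might provide insight into the behavior you observe in your application.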
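One way to aggregate labels from several labelers and surface the disagreement described above is a majority vote with an agreement score. The following is a minimal sketch under stated assumptions: labels are short categorical judgments (free-text ground-truth answers would need a different merge step), and the helper `aggregate_labels` and its `min_agreement` threshold of 0.6 are hypothetical values chosen for illustration.

```python
from collections import Counter


def aggregate_labels(labels, min_agreement=0.6):
    """Majority-vote the labels for one question and flag low agreement."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    return {
        "label": top_label,
        "agreement": agreement,
        # Low agreement may point to an ambiguous question or unclear instructions.
        "needs_review": agreement < min_agreement,
    }


# Hypothetical labels from several labelers, keyed by question ID.
votes = {
    "q1": ["correct", "correct", "correct"],
    "q2": ["correct", "incorrect", "incorrect", "unsure", "incorrect"],
    "q3": ["correct", "incorrect", "unsure"],
}
for question_id, labels in votes.items():
    print(question_id, aggregate_labels(labels))
```

Questions flagged with `needs_review` are good candidates for a closer look, because the disagreement may reflect an interpretation of the question that was not considered when the evaluation set was written.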