Tutorial: Maximize relevance (RAG in Azure AI Search)

Article
11/19/2024

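In this tutorial, you learn how to improve the relevance of search results used in RAG solutions. Relevance tuning can be an important factor in delivering a RAG solution that meets user expectations. In Azure AI Search, relevance tuning includes L2 semantic ranking and scoring profiles.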

To implement these capabilities, you revisit the index schema to add configurations for semantic ranking and scoring profiles. You then rerun the queries using the new constructs.

In this tutorial, you modify the existing search index and queries to use:

L2 semantic ranking
Scoring profile for document boosting

This tutorial updates the search index created by the indexing pipeline. The updates don't affect existing content, so you don't need to rebuild the index or rerun the indexer.

Note

More relevance features are available in preview, including vector query weighting and minimum thresholds. This tutorial omits them because they are still in preview.

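Prerequisites

Visual Studio Code with the Python extension and the Jupyter package.
Azure AI Search, Basic tier or above for managed identity and semantic ranking, in the same region as Azure OpenAI and Azure AI Services.
Azure OpenAI, with deployments of text-embedding-002 and gpt-35-turbo, in the same region as Azure AI Search. The code in this tutorial sets deployment_name to "gpt-4o"; replace it with the name of your own chat deployment.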

Download the sample

The sample notebook includes an updated index and query request.

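Run a baseline query for comparison

Let's start with a new query: "Are there any cloud formations specific to oceans and large bodies of water?"

To compare outcomes after adding relevance features, run the query against the existing index schema before you add semantic ranking or a scoring profile.

For the Azure Government cloud, change the API endpoint on the token provider to "https://cognitiveservices.azure.us/.default".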
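One way to keep the endpoint switch in a single place is a small lookup, sketched below. The cloud labels ("public", "usgov") and the helper name `cognitive_services_scope` are hypothetical and not part of the sample notebook; only the two scope URLs come from this tutorial.

```python
# Token scopes quoted in this tutorial, keyed by a hypothetical cloud label.
COGNITIVE_SERVICES_SCOPES = {
    "public": "https://cognitiveservices.azure.com/.default",
    "usgov": "https://cognitiveservices.azure.us/.default",
}


def cognitive_services_scope(cloud: str = "public") -> str:
    """Return the token scope for the given (hypothetical) cloud label."""
    try:
        return COGNITIVE_SERVICES_SCOPES[cloud]
    except KeyError:
        raise ValueError(f"Unknown cloud label: {cloud!r}") from None


print(cognitive_services_scope("usgov"))
```

The returned string can then be passed as the second argument to get_bearer_token_provider in the query code that follows.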

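# credential, AZURE_OPENAI_ACCOUNT, AZURE_SEARCH_SERVICE, and index_name are
# assumed to be defined in earlier notebook cells.
from azure.identity import get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")
openai_client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint=AZURE_OPENAI_ACCOUNT,
    azure_ad_token_provider=token_provider
)

deployment_name = "gpt-4o"

search_client = SearchClient(
    endpoint=AZURE_SEARCH_SERVICE,
    index_name=index_name,
    credential=credential
)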

GROUNDED_PROMPT="""
You are an AI assistant that helps users learn from the information found in the source material.
Answer the query using only the sources provided below.
Use bullets if the answer has multiple points.
If the answer is longer than 3 sentences, provide a summary.
Answer ONLY with the facts listed in the list of sources below. Cite your source when you answer the question.
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Query: {query}
Sources:\n{sources}
"""

# Focused query on cloud formations and bodies of water
query="Are there any cloud formations specific to oceans and large bodies of water?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

search_results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "chunk", "locations"],
    top=5,
)

sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])

response = openai_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": GROUNDED_PROMPT.format(query=query, sources=sources_formatted)
        }
    ],
    model=deployment_name
)

print(response.choices[0].message.content)

Output from this request might look like the following example.

Yes, there are cloud formations specific to oceans and large bodies of water. 
A notable example is "cloud streets," which are parallel rows of clouds that form over 
the Bering Strait in the Arctic Ocean. These cloud streets occur when wind blows from 
a cold surface like sea ice over warmer, moister air near the open ocean, leading to 
the formation of spinning air cylinders. Clouds form along the upward cycle of these cylinders, 
while skies remain clear along the downward cycle (Source: page-21.pdf).
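To see what the LLM actually receives as sources, the formatting step of the baseline query can be run on its own. The sketch below uses invented stand-in documents rather than real search results, and the helper name `format_sources` is hypothetical; the f-string and separator mirror the tutorial code.

```python
# Stand-in documents shaped like the fields selected in the query
# (title, chunk, locations). The values are invented for illustration.
sample_results = [
    {"title": "page-21.pdf", "chunk": "Cloud streets form over open water.", "locations": ["Bering Strait"]},
    {"title": "page-31.pdf", "chunk": "Ship tracks appear over the ocean.", "locations": ["Pacific Ocean", "California"]},
]


def format_sources(results) -> str:
    """Join one entry per document with the same separator the tutorial uses."""
    return "=================\n".join(
        f'TITLE: {doc["title"]}, CONTENT: {doc["chunk"]}, LOCATIONS: {doc["locations"]}'
        for doc in results
    )


sources_formatted = format_sources(sample_results)
print(sources_formatted)
```

Note that the separator is appended directly to the end of each entry, so the row of equals signs follows the LOCATIONS list on the same line.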

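Update the index for semantic ranking and scoring profiles

In a previous tutorial, you designed an index schema for RAG workloads. That schema purposely omitted relevance enhancements so that you could focus on the fundamentals. Deferring relevance to a separate exercise gives you a before-and-after comparison of search result quality once the updates are made.

Update the import statements to include classes for semantic ranking and scoring profiles.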

from azure.identity import DefaultAzureCredential
from azure.identity import get_bearer_token_provider
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    ScoringProfile,
    TagScoringFunction,
    TagScoringParameters
)

Add the following semantic configuration to the search index. This example can be found in the update schema step in the notebook.

# New semantic configuration
semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        keywords_fields=[SemanticField(field_name="locations")],
        content_fields=[SemanticField(field_name="chunk")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])

A semantic configuration has a name and a prioritized list of fields that helps optimize the inputs to the semantic ranker. For more information, see Configure semantic ranking.
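For readers who inspect the index definition as JSON, the same configuration can be pictured as a plain dictionary. This is a sketch only: the property names below follow the REST index definition as best understood here and should be treated as an assumption to verify against the API version you target.

```python
import json

# Dictionary form of the semantic configuration defined above.
# Property names are an assumption based on the REST index definition;
# verify them against your API version before relying on them.
semantic_config_json = {
    "name": "my-semantic-config",
    "prioritizedFields": {
        "titleField": {"fieldName": "title"},
        "prioritizedContentFields": [{"fieldName": "chunk"}],
        "prioritizedKeywordsFields": [{"fieldName": "locations"}],
    },
}

print(json.dumps({"semantic": {"configurations": [semantic_config_json]}}, indent=2))
```

The three entries correspond to the title_field, content_fields, and keywords_fields arguments in the Python example.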

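Next, add a scoring profile definition. As with a semantic configuration, a scoring profile can be added to an index schema at any time. This example is also in the update schema step in the notebook, following the semantic configuration.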
```
# New scoring profile
scoring_profiles = [  
    ScoringProfile(  
        name="my-scoring-profile",
        functions=[
            TagScoringFunction(  
                field_name="locations",  
                boost=5.0,  
                parameters=TagScoringParameters(  
                    tags_parameter="tags",  
                ),  
            ) 
        ]
    )
]
```
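This profile uses the tag function, which boosts the scores of documents where a match is found in the locations field. Recall that the search index has a vector field and several nonvector fields for title, chunk, and locations. The locations field is a string collection, and string collections can be boosted using the tag function in a scoring profile. For more information, see Add a scoring profile and Enhancing Search Relevance with Document Boosting (blog post).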
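The idea behind tag boosting can be illustrated with a toy check that flags which documents would qualify for the boost. This is not how Azure AI Search computes scores; the sample documents are invented, the helper name `has_tag_match` is hypothetical, and the exact, case-insensitive comparison is an assumption of this sketch.

```python
def has_tag_match(locations, tags):
    """Return True if any value in the locations collection equals a tag.

    Comparison is exact and case-insensitive, which is an assumption of
    this sketch rather than documented service behavior.
    """
    wanted = {tag.lower() for tag in tags}
    return any(location.lower() in wanted for location in locations)


# Invented documents with a string-collection field named "locations".
docs = [
    {"title": "page-23.pdf", "locations": ["Mauritania", "ocean"]},
    {"title": "page-10.pdf", "locations": ["Sahara"]},
]
tags = ["ocean", "sea surface", "seas", "surface"]

boost_candidates = [doc["title"] for doc in docs if has_tag_match(doc["locations"], tags)]
print(boost_candidates)  # ['page-23.pdf']
```

In the real service, the tags arrive at query time through the parameter named in tags_parameter, and the boost value weights the matching documents.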

Update the index definition on the search service.

# Update the search index with the semantic configuration and scoring profile
# (index_client, fields, and vector_search are assumed to be defined in earlier notebook cells)
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search, scoring_profiles=scoring_profiles)
result = index_client.create_or_update_index(index)
print(f"{result.name} updated")

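Update queries for semantic ranking and scoring profiles

In a previous tutorial, you ran queries that execute on the search engine and passed the response and other information to an LLM for chat completion.

This example modifies the query request to include the semantic configuration and scoring profile.

For the Azure Government cloud, change the API endpoint on the token provider to "https://cognitiveservices.azure.us/.default".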
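The updated query passes the tags as a single string, "tags-ocean, 'sea surface', seas, surface", where the part before the hyphen is the parameter name from the scoring profile. A small helper can assemble that string from a list, as sketched below. The helper name `build_scoring_parameter` is hypothetical, and the quoting rule (single quotes around values that contain spaces, commas, or quotes, with embedded quotes doubled) is an assumption modeled on the tutorial's example.

```python
def build_scoring_parameter(name, values):
    """Build a 'name-value1, value2' string for the scoring_parameters argument."""

    def quote(value):
        # Assumption: wrap values that contain spaces, commas, or single
        # quotes in single quotes, doubling any embedded single quote.
        if any(ch in value for ch in " ,'"):
            return "'" + value.replace("'", "''") + "'"
        return value

    return f"{name}-" + ", ".join(quote(value) for value in values)


param = build_scoring_parameter("tags", ["ocean", "sea surface", "seas", "surface"])
print(param)  # tags-ocean, 'sea surface', seas, surface
```

The result can be supplied as scoring_parameters=[param] in the search call.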

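# Import libraries
# credential, AZURE_OPENAI_ACCOUNT, AZURE_SEARCH_SERVICE, and index_name are
# assumed to be defined in earlier notebook cells.
from azure.identity import get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")
openai_client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint=AZURE_OPENAI_ACCOUNT,
    azure_ad_token_provider=token_provider
)

deployment_name = "gpt-4o"

search_client = SearchClient(
    endpoint=AZURE_SEARCH_SERVICE,
    index_name=index_name,
    credential=credential
)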

# Prompt is unchanged in this update
GROUNDED_PROMPT="""
You are an AI assistant that helps users learn from the information found in the source material.
Answer the query using only the sources provided below.
Use bullets if the answer has multiple points.
If the answer is longer than 3 sentences, provide a summary.
Answer ONLY with the facts listed in the list of sources below. Cite your source when you answer the question.
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Query: {query}
Sources:\n{sources}
"""

# Queries are unchanged in this update
query="Are there any cloud formations specific to oceans and large bodies of water?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

# Add query_type semantic and semantic_configuration_name
# Add scoring_profile and scoring_parameters
search_results = search_client.search(
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    scoring_profile="my-scoring-profile",
    scoring_parameters=["tags-ocean, 'sea surface', seas, surface"],
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "chunk", "locations"],
    top=5,
)
sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])

response = openai_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": GROUNDED_PROMPT.format(query=query, sources=sources_formatted)
        }
    ],
    model=deployment_name
)

print(response.choices[0].message.content)

Output from a semantically ranked and boosted query might look like the following example.

Yes, there are specific cloud formations influenced by oceans and large bodies of water:

- **Stratus Clouds Over Icebergs**: Low stratus clouds can frame holes over icebergs, 
such as Iceberg A-56 in the South Atlantic Ocean, likely due to thermal instability caused 
by the iceberg (source: page-39.pdf).

- **Undular Bores**: These are wave structures in the atmosphere created by the collision 
of cool, dry air from a continent with warm, moist air over the ocean, as seen off the 
coast of Mauritania (source: page-23.pdf).

- **Ship Tracks**: These are narrow clouds formed by water vapor condensing around tiny 
particles from ship exhaust. They are observed over the oceans, such as in the Pacific Ocean 
off the coast of California (source: page-31.pdf).

These specific formations are influenced by unique interactions between atmospheric conditions 
and the presence of large water bodies or objects within them.

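Adding semantic ranking and scoring profiles positively affects the response from the LLM by promoting results that meet the scoring criteria and are semantically relevant.

Now that you have a better understanding of index and query design, let's move on to optimizing for speed and concision. The next tutorial revisits the schema definition to implement quantization and storage reduction, while the rest of the pipeline and the models remain intact.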
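Beyond comparing the LLM answers, it can help to look at the scores attached to each search result. The sketch below assumes each result behaves like a dictionary carrying "@search.score" and, when semantic ranking is applied, "@search.reranker_score". The helper name `summarize_scores` is hypothetical and the sample values are invented, not taken from a real query.

```python
def summarize_scores(results):
    """Return one line per result with its title and whichever scores are present."""
    lines = []
    for doc in results:
        score = doc.get("@search.score")
        # Expected to be missing (None) when the query doesn't use semantic ranking.
        reranker = doc.get("@search.reranker_score")
        lines.append(f'{doc.get("title")}: score={score}, reranker_score={reranker}')
    return lines


# Invented stand-in results; real values come from search_client.search(...).
sample = [
    {"title": "page-39.pdf", "@search.score": 0.03, "@search.reranker_score": 2.8},
    {"title": "page-23.pdf", "@search.score": 0.02},
]

for line in summarize_scores(sample):
    print(line)
```

Running the helper on the baseline results and again on the updated results gives a side-by-side view of how the ordering changed. Search results are consumed as they are iterated, so collect them into a list first if you also need them for the prompt.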

Next step

Minimize vector storage and costs
