Chunk and vectorize by document layout or structure

Article
11/24/2024

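Note

This feature is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.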

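Text chunking strategies play a key role in optimizing RAG responses and performance. With the new Document Layout skill, currently in preview, you can chunk content based on document structure: the skill captures headings and chunks the content body by semantic coherence, such as paragraphs and sentences. Chunks are processed independently. Because an LLM works with multiple chunks, the overall relevance of a query improves when those chunks are of higher quality and semantically coherent.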

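The Document Layout skill calls the layout model in Document Intelligence. The model expresses content structure in JSON using Markdown syntax (headings and content), and the headings and content are stored in corresponding fields of a search index on Azure AI Search. The searchable content produced by the Document Layout skill is plain text, but you can apply integrated vectorization to generate embeddings for any field in your source documents, including images.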
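
To make the "headings and content" structure concrete, here is a minimal Python sketch that groups Markdown text into sections keyed by the headings above them. It is only an illustration, not the layout model's implementation: the function name `split_markdown_sections` is hypothetical, and the output shape (a `content` string plus a `sections` object with `h1` through `h3`) is an assumption modeled on the enrichment paths used later in this article.

```python
import re

# Matches ATX-style Markdown headings such as "## Costs".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(markdown: str, max_depth: int = 3) -> list[dict]:
    """Group Markdown text into sections keyed by the headings above them.

    Hypothetical helper for illustration only; the real skill relies on the
    Document Intelligence layout model rather than on regular expressions.
    """
    sections = []
    current_headers = {}  # for example {"h1": "Plan", "h2": "Costs"}
    buffer = []

    def flush():
        # Emit the buffered body text together with the headings in effect.
        content = "\n".join(buffer).strip()
        if content:
            sections.append({
                "content": content,
                "sections": {
                    f"h{d}": current_headers.get(f"h{d}", "")
                    for d in range(1, max_depth + 1)
                },
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_depth:
            flush()
            depth = len(match.group(1))
            current_headers[f"h{depth}"] = match.group(2)
            # A new heading invalidates any deeper headings.
            for d in range(depth + 1, max_depth + 1):
                current_headers.pop(f"h{d}", None)
        else:
            # Headings deeper than max_depth are kept as body text.
            buffer.append(line)
    flush()
    return sections
```

Each returned item pairs a block of body text with the headings that were in effect when it appeared, which is the kind of context the header fields in the search index are meant to preserve.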

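In this article, learn how to:

Use the Document Layout skill to recognize document structure
Use the Text Split skill to constrain the chunk size within each Markdown section
Generate embeddings for each chunk
Use index projections to map embeddings to fields in a search index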

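For illustration purposes, this article uses the sample health plan PDFs, uploaded to Azure Blob Storage and then indexed with the Import and vectorize data wizard.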

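Prerequisites

An indexer-based indexing pipeline with an index that accepts the output. The index must have fields for receiving headings and content.
A supported data source that contains the text content you want to chunk.
A skillset with the Document Layout skill, which splits documents based on paragraph boundaries.
An Azure OpenAI Embedding skill that generates vector embeddings.
An index projection for one-to-many indexing.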

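Prepare data files

The raw input must be in a supported data source, and each file must be in a format that the Document Layout skill supports.

Supported file formats include: PDF, JPEG, JPG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.
Supported indexers can be any indexer that handles the supported file formats. These include Blob indexers, OneLake indexers, and File indexers.
Supported regions for this feature include: East US, West US 2, West Europe, North Central US. Be sure to check this list for updates on regional availability.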

You can use the Azure portal, the REST APIs, or an Azure SDK package to create a data source.

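Tip

Upload the sample health plan PDF files to your supported data source to try out the Document Layout skill and structure-aware chunking on your own search service. The Import and vectorize data wizard is an easy, code-free way to try this skill. Be sure to select the default parsing mode to use structure-aware chunking. Otherwise, the Markdown parsing mode is used instead.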

Create an index for one-to-many indexing

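The following example shows an index whose search documents are designed around chunks. Whenever you work with chunks, you need a chunk field and a parent field that identifies the origin of the chunk. In this example, the parent field is text_parent_id. The child fields are the vector and nonvector chunks of the Markdown section.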

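The Document Layout skill outputs headings and content. In this example, header_1 through header_3 store the document headings that the skill detects. Other content, such as paragraphs, is stored in chunk. The text_vector field is the vector representation of the chunk field's content.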
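
The parent-child relationship can be sketched in Python as follows. This is only an illustration of the resulting document shape, under stated assumptions: `build_chunk_documents` is a hypothetical helper, the input `sections` structure (a `sections` object with headings plus a `pages` list of chunk strings) is assumed, and the hashed `chunk_id` is a stand-in, because the search service generates its own keys for projected documents. The `text_vector` field is omitted because it would come from an embedding model.

```python
import hashlib


def build_chunk_documents(parent_id: str, title: str, sections: list[dict]) -> list[dict]:
    """Fan one parent document out into one search document per chunk.

    Hypothetical illustration: the key scheme below is an assumption and
    does not reproduce the keys that the search service generates.
    """
    docs = []
    for s_idx, section in enumerate(sections):
        headers = section.get("sections", {})
        for p_idx, page in enumerate(section.get("pages", [])):
            raw_key = f"{parent_id}_{s_idx}_{p_idx}"
            docs.append({
                # Unique key per chunk (assumed scheme).
                "chunk_id": hashlib.sha1(raw_key.encode("utf-8")).hexdigest(),
                # Every chunk points back to the document it came from.
                "text_parent_id": parent_id,
                "chunk": page,
                "title": title,
                "header_1": headers.get("h1", ""),
                "header_2": headers.get("h2", ""),
                "header_3": headers.get("h3", ""),
            })
    return docs
```

Filtering on text_parent_id then returns every chunk that belongs to a given source document, which is why that field is marked filterable in the index.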

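You can use the Import and vectorize data wizard in the Azure portal, the REST APIs, or an Azure SDK to create an index. The following index is very similar to what the wizard creates by default. If you add image vectorization, you might have more fields.

If you aren't using the wizard, the index must exist on the search service before you create the skillset or run the indexer.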

{
  "name": "my_consolidated_index",
  "fields": [
    {
      "name": "chunk_id",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": false,
      "key": true,
      "analyzer": "keyword"
    },
    {
      "name": "text_parent_id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "chunk",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "header_1",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "header_2",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "header_3",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "text_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "dimensions": 1536,
      "vectorSearchProfile": "profile"
    }
  ],
  "vectorSearch": {
    "profiles": [
      {
        "name": "profile",
        "algorithm": "algorithm"
      }
    ],
    "algorithms": [
      {
        "name": "algorithm",
        "kind": "hnsw"
      }
    ]
  }
}
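
Before you send an index definition like this one, it can help to sanity-check it locally. The following Python sketch is an assumption-laden convenience, not part of any Azure SDK: `check_index_definition` and `reject_duplicate_keys` are hypothetical names. It checks that there is exactly one key field, that each vector field declares dimensions and references an existing profile and algorithm, and that no JSON object repeats a key (Python's `json.loads` otherwise silently keeps the last value).

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that fails on repeated keys instead of keeping the last one."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def check_index_definition(raw_json: str) -> list[str]:
    """Return a list of problems found in an index definition (empty if none)."""
    index = json.loads(raw_json, object_pairs_hook=reject_duplicate_keys)
    problems = []
    fields = index.get("fields", [])

    # A search index has exactly one key field.
    key_fields = [f["name"] for f in fields if f.get("key")]
    if len(key_fields) != 1:
        problems.append(f"expected exactly one key field, found {key_fields}")

    vector_search = index.get("vectorSearch", {})
    profiles = {p["name"]: p for p in vector_search.get("profiles", [])}
    algorithms = {a["name"] for a in vector_search.get("algorithms", [])}

    for field in fields:
        if field.get("type") != "Collection(Edm.Single)":
            continue
        name = field.get("name")
        if "dimensions" not in field:
            problems.append(f"vector field {name!r} has no dimensions")
        profile = profiles.get(field.get("vectorSearchProfile"))
        if profile is None:
            problems.append(f"vector field {name!r} references an unknown profile")
        elif profile.get("algorithm") not in algorithms:
            problems.append(f"profile {profile['name']!r} references an unknown algorithm")
    return problems
```

Pass the request body as a string, for example the contents of a local file that holds the index JSON, and review the returned list before you create the index.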

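Define a skillset for structure-aware chunking and vectorization

Because the Document Layout skill is in preview, you must use the Create Skillset 2024-11-01-preview REST API for this step. You can also use the Azure portal.

This section shows an example skillset definition that projects individual Markdown sections, chunks, and their vector equivalents as fields in the search index. It uses the Document Layout skill to detect headings and populate a content field based on semantically coherent paragraphs and sentences in the source document. It uses the Text Split skill to split the Markdown content into chunks. It uses the Azure OpenAI Embedding skill to vectorize the chunks and any other field for which you want embeddings.

Besides skills, the skillset includes indexProjections and cognitiveServices:

indexProjections are used for indexes that contain chunked documents. The projections specify how parent-child content maps to fields in a search index for one-to-many indexing. For more information, see Define an index projection.
cognitiveServices attaches an Azure AI multi-service account for billing purposes (the Document Layout skill is available through pay-as-you-go pricing).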

POST {endpoint}/skillsets?api-version=2024-11-01-preview

{
  "name": "my_skillset",
  "description": "A skillset for structure-aware chunking and vectorization with an index projection around markdown section",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "name": "my_document_intelligence_layout_skill",
      "context": "/document",
      "outputMode": "oneToMany",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "markdown_document",
          "targetName": "markdownDocument"
        }
      ],
      "markdownHeaderDepth": "h3"
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "my_markdown_section_split_skill",
      "description": "A skill that splits text into chunks",
      "context": "/document/markdownDocument/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/markdownDocument/*/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "my_azure_openai_embedding_skill",
      "context": "/document/markdownDocument/*/pages/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/markdownDocument/*/pages/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "text_vector"
        }
      ],
      "resourceUri": "https://<subdomain>.openai.azure.com",
      "deploymentId": "text-embedding-3-small",
      "apiKey": "<Azure OpenAI api key>",
      "modelName": "text-embedding-3-small"
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "key": "<Cognitive Services api key>"
  },
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "my_consolidated_index",
        "parentKeyFieldName": "text_parent_id",
        "sourceContext": "/document/markdownDocument/*/pages/*",
        "mappings": [
          {
            "name": "text_vector",
            "source": "/document/markdownDocument/*/pages/*/text_vector"
          },
          {
            "name": "chunk",
            "source": "/document/markdownDocument/*/pages/*"
          },
          {
            "name": "title",
            "source": "/document/title"
          },
          {
            "name": "header_1",
            "source": "/document/markdownDocument/*/sections/h1"
          },
          {
            "name": "header_2",
            "source": "/document/markdownDocument/*/sections/h2"
          },
          {
            "name": "header_3",
            "source": "/document/markdownDocument/*/sections/h3"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
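
To show what maximumPageLength and pageOverlapLength mean in the Text Split skill definition above, here is a naive Python sketch of fixed-size chunking with overlap. It is a simplification under stated assumptions: `split_with_overlap` is a hypothetical function, it cuts strictly on character counts, and it doesn't attempt the sentence-aware splitting that the real skill performs.

```python
def split_with_overlap(text: str, max_len: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into pages of at most max_len characters.

    Each page after the first starts `overlap` characters before the end of
    the previous page. Naive sketch only; not the Text Split skill's algorithm.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")

    step = max_len - overlap  # how far the window advances each time
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the current page already reaches the end of the text
        start += step
    return pages
```

With the values from the skillset (2000 and 500), consecutive pages share 500 characters, so a sentence that straddles a page boundary is more likely to appear whole in at least one chunk.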

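Configure and run the indexer

After you create a data source, index, and skillset, you're ready to create and run the indexer. This step puts the pipeline into execution.

When you use the Document Layout skill, be sure to set the following parameters on the indexer definition:

Set the allowSkillsetToReadFileData parameter to true.
Set the parsingMode parameter to default.

You don't need to set outputFieldMappings in this scenario, because indexProjections handle the associations between source fields and search fields. Index projections handle field associations for the Document Layout skill and also for regular chunking with the Text Split skill in Import and vectorize data workloads. Output field mappings are still necessary for transformations or complex data mappings with functions, which apply in other scenarios. For n chunks per document, however, index projections handle this functionality natively.

Here's an example of an indexer creation request.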

POST {endpoint}/indexers?api-version=2024-11-01-preview

{
  "name": "my_indexer",
  "dataSourceName": "my_blob_datasource",
  "targetIndexName": "my_consolidated_index",
  "skillsetName": "my_skillset",
  "parameters": {
    "batchSize": 1,
    "configuration": {
        "dataToExtract": "contentAndMetadata",
        "parsingMode": "default",
        "allowSkillsetToReadFileData": true
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "title"
    }
  ],
  "outputFieldMappings": []
}
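
If you prefer to script this step, the following Python sketch builds the same request with the standard library and checks the two required settings before anything is sent. The helper name `build_indexer_request`, the endpoint value, and the use of an admin key in the `api-key` header are assumptions for illustration; substitute your own service details and authentication method.

```python
import json
import urllib.request


def build_indexer_request(endpoint: str, api_key: str, indexer: dict,
                          api_version: str = "2024-11-01-preview") -> urllib.request.Request:
    """Build (but don't send) a Create Indexer request.

    Hypothetical helper: it fails early if the settings that the
    Document Layout skill depends on are missing from the definition.
    """
    config = indexer.get("parameters", {}).get("configuration", {})
    if config.get("allowSkillsetToReadFileData") is not True:
        raise ValueError("allowSkillsetToReadFileData must be true")
    if config.get("parsingMode") != "default":
        raise ValueError("parsingMode must be 'default'")

    url = f"{endpoint.rstrip('/')}/indexers?api-version={api_version}"
    return urllib.request.Request(
        url,
        data=json.dumps(indexer).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`. Building and sending are kept separate so that the validation can be exercised without network access.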

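The indexer runs when you send the request to the search service.

Verify results

After processing finishes, you can query the search index to test your solution.

To check the results, run a query against the index. Use Search Explorer as a search client, or any tool that sends HTTP requests. The following query selects fields that contain the nonvector content of the Markdown sections and its vector output.

For Search Explorer, you can copy just the JSON and paste it into the JSON view to run the query.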

POST /indexes/[index name]/docs/search?api-version=[api-version]
{
  "search": "copay for in-network providers",
  "count": true,
  "searchMode": "all",
  "vectorQueries": [
    {
      "kind": "text",
      "text": "*",
      "fields": "text_vector,image_vector"
    }
  ],
  "queryType": "semantic",
  "semanticConfiguration": "healthplan-doc-layout-test-semantic-configuration",
  "captions": "extractive",
  "answers": "extractive|count-3",
  "queryLanguage": "en-us",
  "select": "header_1, header_2, header_3"
}

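If you used the health plan PDFs to test this skill, the Search Explorer results for the example query should look similar to the results in the following screenshot.

The query is a hybrid query over text and vectors with semantic ranking, so you see a @search.rerankerScore, and results are ranked by that score. searchMode=all means that all query terms must be present for a document to count as a match (the default is any).
The query uses semantic ranking, so you see captions (it also has answers, but those aren't shown in the screenshot). The results are the ones most semantically relevant to the query input, as determined by the semantic ranker.
The select statement (not shown in the screenshot) specifies the header fields that the Document Layout skill detects and populates. You can add more fields to the select clause to inspect the content of chunks, the title, or any other human-readable field.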
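
If you run the query from code rather than Search Explorer, a small helper can make the response easier to scan. The Python sketch below is a hypothetical example: `summarize_results` isn't part of any SDK, and it assumes only the common response fields (`value`, `@search.rerankerScore`, `@search.captions`) plus the header fields selected in the query.

```python
def summarize_results(response: dict) -> list[str]:
    """Return one summary line per result, highest reranker score first.

    Hypothetical helper for inspecting a search response that has already
    been parsed from JSON into a dictionary.
    """
    rows = sorted(
        response.get("value", []),
        key=lambda row: row.get("@search.rerankerScore", 0.0),
        reverse=True,
    )
    lines = []
    for row in rows:
        # Join the detected headings into a breadcrumb, skipping empty levels.
        headers = [row.get(name, "") for name in ("header_1", "header_2", "header_3")]
        breadcrumb = " > ".join(h for h in headers if h)
        captions = row.get("@search.captions") or []
        caption = captions[0].get("text", "") if captions else ""
        score = row.get("@search.rerankerScore", 0.0)
        lines.append(f"{score:.2f} | {breadcrumb} | {caption}")
    return lines
```

The breadcrumb shows at a glance which section of the source document each chunk came from, which is a quick way to confirm that the header fields were populated as expected.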
