Quickstart: Vectorize text and images by using the Azure portal

This quickstart helps you get started with integrated vectorization by using the Import and vectorize data wizard in the Azure portal. The wizard chunks your content and calls an embedding model to vectorize content during indexing and for queries.

Prerequisites

Supported data sources

The Import and vectorize data wizard supports a wide range of Azure data sources, but this quickstart provides steps for just those data sources that work with whole files:

Supported embedding models

Use an embedding model on an Azure AI platform in the same region as Azure AI Search. Deployment instructions are in this article.

Provider | Supported models
Azure OpenAI Service | text-embedding-ada-002, text-embedding-3-large, text-embedding-3-small
Azure AI Foundry model catalog | For text: Cohere-embed-v3-english, Cohere-embed-v3-multilingual. For images: Facebook-DinoV2-Image-Embeddings-ViT-Base, Facebook-DinoV2-Image-Embeddings-ViT-Giant
Azure AI services multi-service account | Azure AI Vision multimodal for image and text vectorization, available in selected regions. Depending on how you attach the multi-service resource, the multi-service account might need to be in the same region as Azure AI Search.

If you use the Azure OpenAI Service, the endpoint must have an associated custom subdomain. A custom subdomain is an endpoint that includes a unique name (for example, https://hereismyuniquename.cognitiveservices.azure.com). If the service was created through the Azure portal, this subdomain is automatically generated as part of your service setup. Ensure that your service includes a custom subdomain before using it with the Azure AI Search integration.
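If you want to check an endpoint before you start the wizard, the following Python sketch tests whether a URL looks like a custom subdomain. It rests on an assumption that isn't stated in this article: custom subdomain endpoints end in .cognitiveservices.azure.com or .openai.azure.com, while regional endpoints look like https://<region>.api.cognitive.microsoft.com. The function name is hypothetical, and the check is a heuristic rather than an official validation.

```python
from urllib.parse import urlparse

# Assumption: these suffixes identify custom subdomain endpoints. Regional
# endpoints (for example, https://eastus.api.cognitive.microsoft.com) don't match.
CUSTOM_SUBDOMAIN_SUFFIXES = (".cognitiveservices.azure.com", ".openai.azure.com")


def has_custom_subdomain(endpoint: str) -> bool:
    """Return True if the endpoint host looks like <unique-name>.<known suffix>."""
    # urlparse returns no hostname when the scheme is missing, so such input
    # is treated as "not a custom subdomain".
    host = (urlparse(endpoint).hostname or "").lower()
    return any(host.endswith(suffix) for suffix in CUSTOM_SUBDOMAIN_SUFFIXES)
```

For example, has_custom_subdomain("https://hereismyuniquename.cognitiveservices.azure.com") evaluates to True, and a regional endpoint evaluates to False.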

Azure OpenAI Service resources (with access to embedding models) that were created in Azure AI Foundry portal aren't supported. Only the Azure OpenAI Service resources created in the Azure portal are compatible with the Azure OpenAI Embedding skill integration.

Public endpoint requirements

For the purposes of this quickstart, all of the preceding resources must have public access enabled so that the Azure portal nodes can access them. Otherwise, the wizard fails. After the wizard runs, you can enable firewalls and private endpoints on the integration components for security. For more information, see Secure connections in the import wizards.

If private endpoints are already present and you can't disable them, the alternative option is to run the respective end-to-end flow from a script or program on a virtual machine. The virtual machine must be on the same virtual network as the private endpoint. Here's a Python code sample for integrated vectorization. The same GitHub repo has samples in other programming languages.

Permissions

You can use key authentication and full access connection strings, or Microsoft Entra ID with role assignments. We recommend role assignments for search service connections to other resources.

  1. On Azure AI Search, enable roles.

  2. Configure your search service to use a managed identity.

  3. On your data source platform and embedding model provider, create role assignments that allow search service to access data and models. Prepare sample data provides instructions for setting up roles for each supported data source.

A free search service supports role-based connections to Azure AI Search, but it doesn't support managed identities on outbound connections to Azure Storage or Azure AI Vision. This level of support means you must use key-based authentication on connections between a free search service and other Azure services.

For more secure connections:

Note

If you can't progress through the wizard because options aren't available (for example, you can't select a data source or an embedding model), revisit the role assignments. Error messages might indicate that models or deployments don't exist, when the real cause is that the search service doesn't have permission to access them.

Check for space

If you're starting with the free service, you're limited to three each of indexes, data sources, skillsets, and indexers. The Basic tier limits you to 15 of each. Make sure you have room for extra items before you begin. This quickstart creates one of each object.
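The space check is simple arithmetic, and a minimal Python sketch of it follows. The limits come from this section. The function names and the shape of the counts dictionary are hypothetical, and you would need to get the current counts yourself, for example from the portal.

```python
# Per-object-type limits stated in this section (free: 3, Basic: 15).
OBJECT_LIMITS = {"free": 3, "basic": 15}


def full_object_types(tier: str, current_counts: dict, needed: int = 1) -> list:
    """Return the object types that can't accept `needed` more items."""
    limit = OBJECT_LIMITS[tier]
    return [name for name, count in current_counts.items() if count + needed > limit]


def has_room(tier: str, current_counts: dict, needed: int = 1) -> bool:
    """Return True if every object type has room for `needed` more items."""
    return not full_object_types(tier, current_counts, needed)
```

Because the wizard creates one index, data source, skillset, and indexer, the default of needed=1 matches this quickstart.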

Prepare sample data

This section points you to the content that works for this quickstart.

  1. Sign in to the Azure portal with your Azure account, and go to your Azure Storage account.

  2. On the left pane, under Data Storage, select Containers.

  3. Create a new container and then upload the health-plan PDF documents used for this quickstart.

  4. On the left pane, under Access control, assign the Storage Blob Data Reader role to the search service identity. Or, get a connection string to the storage account from the Access keys page.

  5. Optionally, synchronize the deletions in your container with deletions in the search index. These next steps allow you to configure the indexer for deletion detection:

    1. Enable soft delete on your storage account.

    2. If you're using native soft delete, no further steps are required on Azure Storage.

    3. Otherwise, add custom metadata that an indexer can scan to determine which blobs are marked for deletion. Give your custom property a descriptive name. For example, you could name the property "IsDeleted", set to false. Do this for every blob in the container. Later, when you want to delete the blob, change the property to true. For more information, see Change and delete detection when indexing from Azure Storage.

Set up embedding models

The wizard can use embedding models deployed from Azure OpenAI, Azure AI Vision, or from the model catalog in Azure AI Foundry portal.

The wizard supports text-embedding-ada-002, text-embedding-3-large, and text-embedding-3-small. Internally, the wizard calls the AzureOpenAIEmbedding skill to connect to Azure OpenAI.

  1. Sign in to the Azure portal with your Azure account, and go to your Azure OpenAI resource.

  2. Set up permissions:

    1. On the left menu, select Access control.

    2. Select Add, and then select Add role assignment.

    3. Under Job function roles, select Cognitive Services OpenAI User, and then select Next.

    4. Under Members, select Managed identity, and then select Members.

    5. Filter by subscription and resource type (search services), and then select the managed identity of your search service.

    6. Select Review + assign.

  3. On the Overview page, select Click here to view endpoints or Click here to manage keys if you need to copy an endpoint or API key. You can paste these values into the wizard if you're using an Azure OpenAI resource with key-based authentication.

  4. Under Resource Management and Model deployments, select Manage Deployments to open Azure AI Foundry.

  5. Copy the deployment name of text-embedding-ada-002 or another supported embedding model. If you don't have an embedding model, deploy one now.
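After you copy the endpoint and deployment name, you might want to confirm that they fit together. The following Python sketch builds the URL and body for an embeddings request. The function name is hypothetical, and the api-version value is an assumption that you should replace with a version your resource supports. The sketch doesn't send anything over the network.

```python
def embeddings_request(endpoint: str, deployment: str, text: str,
                       api_version: str = "2023-05-15"):
    """Build (url, body) for an Azure OpenAI embeddings call.

    Assumption: api_version is only an example value, not a recommendation.
    """
    base = endpoint.rstrip("/")
    url = f"{base}/openai/deployments/{deployment}/embeddings?api-version={api_version}"
    body = {"input": text}
    return url, body
```

To run the request with key-based authentication, you could POST the body as JSON to the URL with an api-key header that contains the key you copied in step 3.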

Start the wizard

  1. Sign in to the Azure portal with your Azure account, and go to your Azure AI Search service.

  2. On the Overview page, select Import and vectorize data.

    Screenshot of the command to open the wizard for importing and vectorizing data.

Connect to your data

The next step is to connect to a data source to use for the search index.

  1. On Connect to your data, select Azure Blob Storage.

  2. Specify the Azure subscription.

  3. Choose the storage account and container that provide the data.

  4. Specify whether you want deletion detection support. On subsequent indexing runs, the search index is updated to remove any search documents based on soft-deleted blobs on Azure Storage.

    • Blobs support either Native blob soft delete or Soft delete using custom data.
    • You must have previously enabled soft delete on Azure Storage, and optionally added custom metadata that indexing can recognize as a deletion flag. For more information about these steps, see Prepare sample data.
    • If you configured your blobs for soft delete using custom data, provide the metadata property name-value pair in this step. We recommend "IsDeleted". If "IsDeleted" is set to true on a blob, the indexer drops the corresponding search document on the next indexer run.

    The wizard doesn't check Azure Storage for valid settings or throw an error if the requirements aren't met. Instead, deletion detection doesn't work, and your search index is likely to collect orphaned documents over time.

    Screenshot of the data source page with deletion detection options.

  5. Specify whether you want your search service to connect to Azure Storage using its managed identity.

    • You're prompted to choose either a system-managed or user-managed identity.
    • The identity should have a Storage Blob Data Reader role on Azure Storage.
    • Don't skip this step. A connection error occurs during indexing if the wizard can't connect to Azure Storage.
  6. Select Next.
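For orientation, the choices on this page correspond roughly to a data source definition like the one that the following Python sketch builds. It's an approximation of a blob data source in the Azure AI Search REST API, not the exact object the wizard emits. The function name is hypothetical, and the "IsDeleted" property follows the recommendation in this section.

```python
from typing import Optional

SOFT_DELETE_COLUMN = "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy"
NATIVE_SOFT_DELETE = "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"


def blob_data_source(name: str, container: str, connection: str,
                     deletion: Optional[str] = None) -> dict:
    """Build a data source definition for Azure Blob Storage.

    connection: a full access connection string or, for a managed identity,
        a "ResourceId=<storage account resource ID>;" string.
    deletion: None, "native" (native blob soft delete), or "custom"
        (soft delete using custom metadata).
    """
    definition = {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection},
        "container": {"name": container},
    }
    if deletion == "native":
        definition["dataDeletionDetectionPolicy"] = {"@odata.type": NATIVE_SOFT_DELETE}
    elif deletion == "custom":
        definition["dataDeletionDetectionPolicy"] = {
            "@odata.type": SOFT_DELETE_COLUMN,
            "softDeleteColumnName": "IsDeleted",
            "softDeleteMarkerValue": "true",
        }
    elif deletion is not None:
        raise ValueError("deletion must be None, 'native', or 'custom'")
    return definition
```

With deletion="custom", a blob whose IsDeleted metadata is set to true is treated as deleted on the next indexer run.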

Vectorize your text

In this step, specify the embedding model for vectorizing chunked data.

Chunking is built in and nonconfigurable. The effective settings are:

"textSplitMode": "pages",
"maximumPageLength": 2000,
"pageOverlapLength": 500,
"maximumPagesToTake": 0, #unlimited
"unit": "characters"
  1. On the Vectorize your text page, choose the source of the embedding model:

    • Azure OpenAI
    • Azure AI Foundry model catalog
    • An existing Azure AI Vision multimodal resource in the same region as Azure AI Search. If there's no Azure AI Services multi-service account in the same region, this option isn't available.
  2. Choose the Azure subscription.

  3. Make selections according to the resource:

    • For Azure OpenAI, choose an existing deployment of text-embedding-ada-002, text-embedding-3-large, or text-embedding-3-small.

    • For Azure AI Foundry catalog, choose an existing deployment of an Azure or Cohere embedding model.

    • For AI Vision multimodal embeddings, select the account.

    For more information, see Set up embedding models earlier in this article.

  4. Specify whether you want your search service to authenticate using an API key or managed identity.

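    • For Azure OpenAI, the identity should have the Cognitive Services OpenAI User role on the Azure OpenAI resource. For Azure AI Vision, the identity should have the Cognitive Services User role on the Azure AI multi-service account.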
  5. Select the checkbox that acknowledges the billing effects of using these resources.

    Screenshot of the vectorize text page in the wizard.

  6. Select Next.
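To get a feel for what the chunking settings listed earlier in this section mean, the following Python sketch splits text into pages of at most 2,000 characters, where each page starts 500 characters before the previous page ends. It's a deliberate simplification with a hypothetical function name: the Text Split skill also takes sentence boundaries into account, so real chunks won't line up exactly with this output.

```python
def split_pages(text: str, maximum_page_length: int = 2000,
                page_overlap_length: int = 500) -> list:
    """Split text into fixed-size character windows that overlap."""
    if page_overlap_length >= maximum_page_length:
        raise ValueError("page_overlap_length must be smaller than maximum_page_length")
    # Each new page begins this many characters after the previous one.
    step = maximum_page_length - page_overlap_length
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + maximum_page_length])
        if start + maximum_page_length >= len(text):
            break  # the last page reached the end of the text
        start += step
    return pages
```

Under these settings, a 4,000-character document yields three pages that start at offsets 0, 1,500, and 3,000.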

Vectorize and enrich your images

The health plan PDFs include a corporate logo, but otherwise there are no images. You can skip this step if you're using the sample documents.

However, if you work with content that includes useful images, you can apply AI in two ways:

  • Use a supported image embedding model from the catalog, or choose the Azure AI Vision multimodal embeddings API to vectorize images.

  • Use optical character recognition (OCR) to recognize text in images. This option invokes the OCR skill to read text from images.

Azure AI Search and your Azure AI resource must be in the same region or configured for keyless billing connections.

  1. On the Vectorize your images page, specify the kind of connection the wizard should make. For image vectorization, the wizard can connect to embedding models in Azure AI Foundry portal or Azure AI Vision.

  2. Specify the subscription.

  3. For the Azure AI Foundry model catalog, specify the project and deployment. For more information, see Set up embedding models earlier in this article.

  4. Optionally, you can crack binary images (for example, scanned document files) and use OCR to recognize text.

  5. Select the checkbox that acknowledges the billing effects of using these resources.

    Screenshot of the vectorize images page in the wizard.

  6. Select Next.

Add semantic ranking

On the Advanced settings page, you can optionally add semantic ranking to rerank results at the end of query execution. Reranking promotes the most semantically relevant matches to the top.
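If you add semantic ranking, the index gets a semantic configuration. The following Python sketch shows the approximate shape of that section of an index definition. The configuration name is the one used in the sample queries later in this article. The choice of title as the title field and chunk as the content field is an assumption about what the wizard selects, and the function name is hypothetical.

```python
def semantic_section(name: str = "my-demo-semantic-configuration",
                     title_field: str = "title",
                     content_field: str = "chunk") -> dict:
    """Build the "semantic" section of an index definition."""
    return {
        "semantic": {
            "configurations": [
                {
                    "name": name,
                    "prioritizedFields": {
                        # Assumption: these field choices mirror the generated schema.
                        "titleField": {"fieldName": title_field},
                        "prioritizedContentFields": [{"fieldName": content_field}],
                        "prioritizedKeywordsFields": [],
                    },
                }
            ]
        }
    }
```

Queries opt in to reranking by setting queryType to semantic and naming this configuration in semanticConfiguration.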

Map new fields

Key points about this step:

  • Index schema provides vector and nonvector fields for chunked data.
  • You can add fields, but you can't delete or modify generated fields.
  • Document parsing mode creates chunks (one search document per chunk).

On the Advanced settings page, you can optionally add new fields assuming the data source provides metadata or fields that aren't picked up on the first pass. By default, the wizard generates the following fields with these attributes:

Field | Applies to | Description
chunk_id | Text and image vectors | Generated string field. Searchable, retrievable, sortable. This is the document key for the index.
text_parent_id | Text vectors | Generated string field. Retrievable, filterable. Identifies the parent document from which the chunk originates.
chunk | Text and image vectors | String field. Human readable version of the data chunk. Searchable and retrievable, but not filterable, facetable, or sortable.
title | Text and image vectors | String field. Human readable document title or page title or page number. Searchable and retrievable, but not filterable, facetable, or sortable.
text_vector | Text vectors | Collection(Edm.Single). Vector representation of the chunk. Searchable and retrievable, but not filterable, facetable, or sortable.
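As a rough illustration, the generated text fields could be written as the field definitions that the following Python sketch builds. The attributes come from the preceding table. The function name, the vector profile name, and the keyword analyzer on the key field are assumptions, and the dimensions are the default output sizes of the supported Azure OpenAI models.

```python
# Default output dimensions of the supported Azure OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def default_text_fields(model: str = "text-embedding-ada-002",
                        vector_profile: str = "my-vector-profile") -> list:
    """Approximate the text fields that the wizard generates."""
    return [
        # Assumption: the key field uses the keyword analyzer.
        {"name": "chunk_id", "type": "Edm.String", "key": True, "searchable": True,
         "retrievable": True, "sortable": True, "analyzer": "keyword"},
        {"name": "text_parent_id", "type": "Edm.String",
         "retrievable": True, "filterable": True},
        {"name": "chunk", "type": "Edm.String", "searchable": True, "retrievable": True},
        {"name": "title", "type": "Edm.String", "searchable": True, "retrievable": True},
        # vector_profile is a hypothetical name for a profile defined in the index.
        {"name": "text_vector", "type": "Collection(Edm.Single)", "searchable": True,
         "retrievable": True, "dimensions": EMBEDDING_DIMENSIONS[model],
         "vectorSearchProfile": vector_profile},
    ]
```

The dimensions value must match the embedding model, so switching to text-embedding-3-large changes it to 3072 unless you configure the model for a smaller output.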

You can't modify the generated fields or their attributes, but you can add new fields if your data source provides them. For example, Azure Blob Storage provides a collection of metadata fields.

  1. Select Add new.

  2. Choose a source field from the list of available fields, provide a field name for the index, and accept the default data type or override as needed.

    Metadata fields are searchable, but not retrievable, filterable, facetable, or sortable.

  3. Select Reset if you want to restore the schema to its original version.

Schedule indexing

On the Advanced settings page, you can optionally specify a run schedule for the indexer.

  1. Select Next when you're done with the Advanced settings page.
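A run schedule ends up as a schedule property on the indexer, with the interval written as an ISO 8601 duration. The following Python sketch converts a timedelta into that form. The helper names are hypothetical, and the bounds of five minutes and one day reflect my understanding of the service limits, so treat them as assumptions.

```python
from datetime import timedelta
from typing import Optional

# Assumption: the service accepts intervals between 5 minutes and 1 day.
MIN_INTERVAL = timedelta(minutes=5)
MAX_INTERVAL = timedelta(days=1)


def iso8601_duration(interval: timedelta) -> str:
    """Format a timedelta as an ISO 8601 duration with minute precision."""
    total_minutes = int(interval.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    date_part = f"{days}D" if days else ""
    time_part = (f"{hours}H" if hours else "") + (f"{minutes}M" if minutes else "")
    return "P" + date_part + ("T" + time_part if time_part else "")


def indexer_schedule(interval: timedelta, start_time: Optional[str] = None) -> dict:
    """Build the schedule portion of an indexer definition."""
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        raise ValueError("interval must be between 5 minutes and 1 day")
    schedule = {"interval": iso8601_duration(interval)}
    if start_time is not None:
        schedule["startTime"] = start_time  # a UTC timestamp string
    return {"schedule": schedule}
```

For example, indexer_schedule(timedelta(hours=2)) produces an interval of "PT2H".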

Finish the wizard

  1. On the Review your configuration page, specify a prefix for the objects that the wizard creates. A common prefix helps you stay organized.

  2. Select Create.

When the wizard completes the configuration, it creates the following objects:

  • Data source connection.

  • Index with vector fields, vectorizers, vector profiles, and vector algorithms. You can't design or modify the default index during the wizard workflow. Indexes conform to the 2024-05-01-preview REST API.

  • Skillset with the Text Split skill for chunking and an embedding skill for vectorization. The embedding skill is either the AzureOpenAIEmbedding skill for Azure OpenAI or the AML skill for the Azure AI Foundry model catalog. The skillset also has the index projections configuration that allows data to be mapped from one document in the data source to its corresponding chunks in a "child" index.

  • Indexer with field mappings and output field mappings (if applicable).
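To make the skillset description more concrete, the following Python sketch assembles an approximate skillset for the Azure OpenAI path only. It combines the Text Split skill, the Azure OpenAI embedding skill, and index projections that map each chunk to its own search document. The function name, the "pages" target name, and the source paths are illustrative assumptions, and the definition that the wizard generates can differ in detail.

```python
def build_skillset(name: str, index_name: str, openai_endpoint: str,
                   deployment: str, model_name: str = "text-embedding-ada-002") -> dict:
    """Approximate a chunk-and-embed skillset with index projections."""
    split_skill = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "context": "/document",
        "textSplitMode": "pages",
        "maximumPageLength": 2000,
        "pageOverlapLength": 500,
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "textItems", "targetName": "pages"}],
    }
    embedding_skill = {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        "context": "/document/pages/*",  # runs once per chunk
        "resourceUri": openai_endpoint,
        "deploymentId": deployment,
        "modelName": model_name,
        "inputs": [{"name": "text", "source": "/document/pages/*"}],
        "outputs": [{"name": "embedding", "targetName": "text_vector"}],
    }
    index_projections = {
        "selectors": [{
            "targetIndexName": index_name,
            "parentKeyFieldName": "text_parent_id",
            "sourceContext": "/document/pages/*",
            "mappings": [
                {"name": "text_vector", "source": "/document/pages/*/text_vector"},
                {"name": "chunk", "source": "/document/pages/*"},
                # Assumption: the title comes from a document-level field.
                {"name": "title", "source": "/document/title"},
            ],
        }],
        # Index only the chunks, not the parent documents.
        "parameters": {"projectionMode": "skipIndexingParentDocuments"},
    }
    return {"name": name, "skills": [split_skill, embedding_skill],
            "indexProjections": index_projections}
```

The parentKeyFieldName setting is what populates text_parent_id, which is why you can later filter chunks by their parent document.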

Check results

Search Explorer accepts text strings as input and then vectorizes the text for vector query execution.

  1. In the Azure portal, go to Search Management > Indexes, and then select the index that you created.

  2. Select Query options and hide vector values in search results. This step makes your search results easier to read.

    Screenshot of the button for query options.

  3. On the View menu, select JSON view so that you can enter text for your vector query in the text vector query parameter.

    Screenshot of the menu command for opening the JSON view.

    The default query is an empty search ("*"), but includes parameters for returning the number of matches. It's a hybrid query that runs text and vector queries in parallel. It includes semantic ranking. It specifies which fields to return in the results through the select statement.

     {
       "search": "*",
       "count": true,
       "vectorQueries": [
         {
           "kind": "text",
           "text": "*",
           "fields": "text_vector,image_vector"
         }
       ],
       "queryType": "semantic",
       "semanticConfiguration": "my-demo-semantic-configuration",
       "captions": "extractive",
       "answers": "extractive|count-3",
       "queryLanguage": "en-us",
       "select": "chunk_id,text_parent_id,chunk,title,image_parent_id"
     }
    
  4. Replace both asterisk (*) placeholders with a question related to health plans, such as Which plan has the lowest deductible?.

     {
       "search": "Which plan has the lowest deductible?",
       "count": true,
       "vectorQueries": [
         {
           "kind": "text",
           "text": "Which plan has the lowest deductible?",
           "fields": "text_vector,image_vector"
         }
       ],
       "queryType": "semantic",
       "semanticConfiguration": "my-demo-semantic-configuration",
       "captions": "extractive",
       "answers": "extractive|count-3",
       "queryLanguage": "en-us",
       "select": "chunk_id,text_parent_id,chunk,title"
     }
    
  5. Select Search to run the query.

    Screenshot of search results.

    Each document is a chunk of the original PDF. The title field shows which PDF the chunk comes from. Each chunk is quite long. You can copy and paste one into a text editor to read the entire value.

  6. To see all of the chunks from a specific document, add a filter for the text_parent_id field for a specific PDF. You can check the Fields tab of your index to confirm this field is filterable.

    {
       "select": "chunk_id,text_parent_id,chunk,title",
       "filter": "text_parent_id eq 'aHR0cHM6Ly9oZWlkaXN0c3RvcmFnZWRlbW9lYXN0dXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2hlYWx0aC1wbGFuLXBkZnMvTm9ydGh3aW5kX1N0YW5kYXJkX0JlbmVmaXRzX0RldGFpbHMucGRm0'",
       "count": true,
       "vectorQueries": [
           {
              "kind": "text",
              "text": "*",
              "k": 5,
              "fields": "text_vector"
           }
        ]
    }
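If you prefer to run the same queries from code instead of Search Explorer, the following Python sketch builds equivalent request bodies. The function names are hypothetical. The api-version default matches the version that this article mentions for indexes, and using it for queries is an assumption. Sending the request is left out so that the sketch stays self-contained.

```python
from typing import Optional

DEFAULT_SELECT = "chunk_id,text_parent_id,chunk,title"


def search_url(endpoint: str, index_name: str,
               api_version: str = "2024-05-01-preview") -> str:
    """Build the URL for a POST query against an index."""
    return f"{endpoint.rstrip('/')}/indexes/{index_name}/docs/search?api-version={api_version}"


def hybrid_query(question: str, vector_fields: str = "text_vector",
                 semantic_configuration: Optional[str] = None,
                 select: str = DEFAULT_SELECT) -> dict:
    """Build a hybrid (text plus vector) query, optionally with semantic ranking."""
    body = {
        "search": question,
        "count": True,
        "vectorQueries": [{"kind": "text", "text": question, "fields": vector_fields}],
        "select": select,
    }
    if semantic_configuration:
        body.update({
            "queryType": "semantic",
            "semanticConfiguration": semantic_configuration,
            "captions": "extractive",
            "answers": "extractive|count-3",
        })
    return body


def chunks_of_parent(parent_id: str, select: str = DEFAULT_SELECT) -> dict:
    """Build a filter-only query that returns all chunks of one parent document."""
    escaped = parent_id.replace("'", "''")  # OData escapes a single quote by doubling it
    return {"select": select, "filter": f"text_parent_id eq '{escaped}'", "count": True}
```

You could POST either body as JSON to search_url(...) with an api-key header or a bearer token, depending on how your search service is configured for authentication.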
    

Clean up

Azure AI Search is a billable resource. If you no longer need it, delete it from your subscription to avoid charges.

Next step

This quickstart introduced you to the Import and vectorize data wizard that creates all of the necessary objects for integrated vectorization. If you want to explore each step in detail, try an integrated vectorization sample.