Data import in Azure AI Search
In Azure AI Search, queries execute over user-owned content that's loaded into a search index. This article describes the two basic workflows for populating an index: push your data into the index programmatically, or pull in the data using a search indexer.
Both approaches load documents from an external data source. Although you can create an empty index, it's not queryable until you add the content.
Note
If AI enrichment or integrated vectorization is a solution requirement, you must use the pull model (indexers) to load an index. Skillsets are attached to indexers and don't run independently.
Pushing data to an index
The push model is an approach that uses APIs to upload documents into an existing search index. You can upload documents individually or in batches of up to 1,000 documents or 16 MB per batch, whichever limit is reached first.
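The batch limits can be respected on the client side before any upload call. The following is a minimal sketch, not an official utility: the function name `split_into_batches` is hypothetical, 16 MB is interpreted as 16 * 1024 * 1024 bytes, and each document's size is approximated by its serialized JSON length, which ignores the overhead of the request envelope.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

MAX_DOCS_PER_BATCH = 1000                # document-count limit per batch
MAX_BYTES_PER_BATCH = 16 * 1024 * 1024   # 16 MB, interpreted as binary megabytes (assumption)


def split_into_batches(
    documents: Iterable[Dict[str, Any]],
    max_docs: int = MAX_DOCS_PER_BATCH,
    max_bytes: int = MAX_BYTES_PER_BATCH,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of documents that respect both the count and the size limit."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in documents:
        # Approximate the document's contribution by its UTF-8 JSON size.
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if doc_bytes > max_bytes:
            raise ValueError("A single document exceeds the batch size limit")
        # Close the current batch if adding this document would break either limit.
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


# Illustrative data only: 2,500 small documents hit the count limit first.
docs = [{"HotelId": str(i), "HotelName": f"Hotel {i}"} for i in range(2500)]
batch_sizes = [len(b) for b in split_into_batches(docs)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each yielded batch can then be passed to whichever push API you use. In practice you might leave some headroom below the size limit, because the estimate here doesn't count the surrounding request body.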
Key benefits include:
- No restrictions on data source type. The payload must be composed of JSON documents that map to your index schema, but the data can be sourced from anywhere.
- No restrictions on frequency of execution. You can push changes to an index as often as you like. For applications with low-latency requirements (for example, when the index needs to stay in sync with product inventory fluctuations), the push model is your only option.
- Connectivity and the secure retrieval of documents are fully under your control. In contrast, indexer connections are authenticated using the security features provided in Azure AI Search.
How to push data to an Azure AI Search index
Use the following APIs to load single or multiple documents into an index:
- Index Documents (REST API)
- IndexDocumentsAsync (Azure SDK for .NET) or SearchIndexingBufferedSender
- IndexDocumentsBatch (Azure SDK for Python) or SearchIndexingBufferedSender
- IndexDocumentsBatch (Azure SDK for Java) or SearchIndexingBufferedSender
- IndexDocumentsBatch (Azure SDK for JavaScript) or SearchIndexingBufferedSender
There's no support for pushing data via the Azure portal.
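As a rough illustration of the REST option, the sketch below builds (but doesn't send) an Index Documents request using only the Python standard library. The service name, API key, and sample document are placeholders, and the `api-version` value is an assumption; substitute the version your service supports.

```python
import json
import urllib.request

# Hypothetical values for illustration only.
SERVICE_NAME = "my-search-service"
INDEX_NAME = "hotels-sample-index"
API_VERSION = "2024-07-01"   # assumed; use a version supported by your service
API_KEY = "<admin-api-key>"  # placeholder, never hard-code real keys


def build_index_documents_request(documents, action="mergeOrUpload"):
    """Build a POST request for the Index Documents REST API without sending it."""
    url = (
        f"https://{SERVICE_NAME}.search.windows.net"
        f"/indexes/{INDEX_NAME}/docs/index?api-version={API_VERSION}"
    )
    # Each document carries its own indexing action in "@search.action".
    body = {"value": [{"@search.action": action, **doc} for doc in documents]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )


request = build_index_documents_request(
    [{"HotelId": "50", "HotelName": "Example Hotel"}]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send the request against a real service: urllib.request.urlopen(request)
```

The Azure SDKs wrap the same operation, so this form is mainly useful for understanding the payload shape or for environments where an SDK isn't available.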
For an introduction to the push APIs, see:
- Quickstart: Full text search using the Azure SDKs
- C# Tutorial: Optimize indexing with the push API
- REST Quickstart: Create an Azure AI Search index using PowerShell
Indexing actions: upload, merge, mergeOrUpload, delete
You can control the type of indexing action on a per-document basis, specifying whether the document should be uploaded in full, merged with existing document content, or deleted.
Whether you use the REST API or an Azure SDK, the following document operations are supported for data import:
- upload is similar to an "upsert": the document is inserted if it's new, and updated or replaced if it exists. If the document is missing values that the index requires, the document field's value is set to null.
- merge updates a document that already exists, and fails if the document can't be found. Merge replaces existing values. For this reason, be sure to check for collection fields that contain multiple values, such as fields of type Collection(Edm.String). For example, if a tags field starts with a value of ["budget"] and you execute a merge with ["economy", "pool"], the final value of the tags field is ["economy", "pool"]. It won't be ["budget", "economy", "pool"].
- mergeOrUpload behaves like merge if the document exists, and like upload if the document is new.
- delete removes the entire document from the index. If you want to remove an individual field, use merge instead, setting the field in question to null.
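The semantics of the four actions can be illustrated with a small in-memory simulation. This sketch doesn't call Azure AI Search: the `apply_action` function and the dictionary standing in for an index are hypothetical, the Rating field is invented for the example, and the simulation has no schema, so it doesn't reproduce the null-filling behavior of upload.

```python
from typing import Any, Dict

Index = Dict[str, Dict[str, Any]]  # document key -> stored document


def apply_action(
    index: Index, action: str, doc: Dict[str, Any], key_field: str = "HotelId"
) -> None:
    """Apply one indexing action to an in-memory stand-in for a search index."""
    key = doc[key_field]
    if action == "upload":
        # Insert the document, or fully replace the stored one.
        index[key] = dict(doc)
    elif action == "merge":
        if key not in index:
            raise KeyError(f"merge failed: document {key!r} not found")
        # Each supplied field replaces the stored value, collections included.
        index[key].update(doc)
    elif action == "mergeOrUpload":
        apply_action(index, "merge" if key in index else "upload", doc, key_field)
    elif action == "delete":
        # Remove the whole document.
        index.pop(key, None)
    else:
        raise ValueError(f"unknown action: {action}")


index: Index = {}
apply_action(index, "upload", {"HotelId": "50", "tags": ["budget"], "Rating": 3.5})
apply_action(index, "merge", {"HotelId": "50", "tags": ["economy", "pool"]})
print(index["50"]["tags"])    # ['economy', 'pool'], not ['budget', 'economy', 'pool']

apply_action(index, "merge", {"HotelId": "50", "Rating": None})
print(index["50"]["Rating"])  # None: merging null clears a single field

apply_action(index, "delete", {"HotelId": "50"})
print("50" in index)          # False
```

If you need to append to a collection field rather than replace it, read the current value first and send the combined list in the merge.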
Pulling data into an index
The pull model uses indexers that connect to a supported data source and automatically upload the data into your index. Indexers from Microsoft are available for these platforms:
- Azure Blob storage
- Azure Table storage
- Azure Data Lake Storage Gen2
- Azure Files (preview)
- Azure Cosmos DB
- Azure SQL Database, SQL Managed Instance, and SQL Server on Azure VMs
- OneLake files and shortcuts
- SharePoint Online (preview)
You can also use third-party connectors that are developed and maintained by Microsoft partners. For more information and links, see Data source gallery.
Indexers connect an index to a data source (usually a table, view, or equivalent structure), and map source fields to equivalent fields in the index. During execution, the rowset is automatically transformed to JSON and loaded into the specified index. All indexers support schedules so that you can specify how frequently the data is to be refreshed. Most indexers provide change tracking if the data source supports it. By tracking changes and deletes to existing documents in addition to recognizing new documents, indexers remove the need to actively manage the data in your index.
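To make the moving parts concrete, the sketch below shows what an indexer definition with a schedule and one field mapping might look like, plus a local illustration of how a source row could be reshaped into a JSON document. All resource and column names are hypothetical, and `row_to_document` is only a stand-in for what the service does internally; it isn't part of any SDK.

```python
import json
from typing import Any, Dict, List

# Hypothetical indexer definition, shaped like a Create Indexer request body.
indexer_definition = {
    "name": "hotels-indexer",
    "dataSourceName": "hotels-datasource",    # must refer to an existing data source
    "targetIndexName": "hotels-sample-index",
    "schedule": {"interval": "PT2H"},         # ISO 8601 duration: run every 2 hours
    "fieldMappings": [
        # Map a source column to an index field that has a different name.
        {"sourceFieldName": "hotel_id", "targetFieldName": "HotelId"},
    ],
}


def row_to_document(row: Dict[str, Any], mappings: List[Dict[str, str]]) -> Dict[str, Any]:
    """Local illustration: rename mapped columns and pass the rest through unchanged."""
    rename = {m["sourceFieldName"]: m["targetFieldName"] for m in mappings}
    return {rename.get(column, column): value for column, value in row.items()}


row = {"hotel_id": "50", "HotelName": "Example Hotel"}
document = row_to_document(row, indexer_definition["fieldMappings"])
print(json.dumps(document))  # {"HotelId": "50", "HotelName": "Example Hotel"}
```

Fields whose names already match between source and index need no explicit mapping, which is why only the renamed column appears in `fieldMappings`.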
How to pull data into an Azure AI Search index
Use the following tools and APIs for indexer-based indexing:
- Import data wizard or Import and vectorize data wizard
- REST APIs: Create Indexer (REST), Create Data Source (REST), Create Index (REST)
- Azure SDK for .NET: SearchIndexer, SearchIndexerDataSourceConnection, SearchIndex
- Azure SDK for Python: SearchIndexer, SearchIndexerDataSourceConnection, SearchIndex
- Azure SDK for Java: SearchIndexer, SearchIndexerDataSourceConnection, SearchIndex
- Azure SDK for JavaScript: SearchIndexer, SearchIndexerDataSourceConnection, SearchIndex
Indexer functionality is exposed in the Azure portal, the REST API, and the Azure SDKs.
An advantage to using the Azure portal is that Azure AI Search can usually generate a default index schema by reading the metadata of the source dataset.
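When you use the REST APIs instead of the portal, an indexer has to reference a data source and an index that already exist, so those two are typically created first. The sketch below only builds the request targets in that order; the service name, resource names, and `api-version` value are assumptions for illustration.

```python
# Hypothetical values for illustration only.
SERVICE_NAME = "my-search-service"
API_VERSION = "2024-07-01"  # assumed; use a version supported by your service


def resource_url(collection: str, name: str) -> str:
    """Return the create-or-update URL for a named resource in a collection."""
    return (
        f"https://{SERVICE_NAME}.search.windows.net"
        f"/{collection}/{name}?api-version={API_VERSION}"
    )


# Order matters: the indexer refers to the other two resources by name.
steps = [
    ("datasources", "hotels-datasource"),
    ("indexes", "hotels-sample-index"),
    ("indexers", "hotels-indexer"),
]

urls = [resource_url(collection, name) for collection, name in steps]
for url in urls:
    print("PUT", url)
```

Each PUT would carry the corresponding JSON definition in its body and an authentication header, which are omitted here.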
Verify data import with Search explorer
A quick way to perform a preliminary check on the document upload is to use Search explorer in the Azure portal.
The explorer lets you query an index without having to write any code. The search experience is based on default settings, such as the simple syntax and default searchMode query parameter. Results are returned in JSON so that you can inspect the entire document.
Here's an example query that you can run in Search explorer in JSON view. "HotelId" is the document key of the hotels-sample-index. The filter provides the document ID of a specific document:
{
  "search": "*",
  "filter": "HotelId eq '50'"
}
If you're using REST, a Lookup Document request achieves the same purpose.
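For comparison, the sketch below builds both forms of the check: a lookup URL that retrieves one document by key, and the equivalent filtered search body. The service name and `api-version` value are assumptions, and nothing is sent over the network.

```python
import json
from urllib.parse import quote

# Hypothetical values for illustration only.
SERVICE_NAME = "my-search-service"
INDEX_NAME = "hotels-sample-index"
API_VERSION = "2024-07-01"  # assumed; use a version supported by your service

BASE_URL = f"https://{SERVICE_NAME}.search.windows.net/indexes/{INDEX_NAME}/docs"


def lookup_url(key: str) -> str:
    """URL for a GET request that retrieves a single document by its key."""
    return f"{BASE_URL}/{quote(key, safe='')}?api-version={API_VERSION}"


def filter_query_body(key_field: str, key: str) -> str:
    """JSON body for a POST search request that filters on the document key."""
    return json.dumps({"search": "*", "filter": f"{key_field} eq '{key}'"})


print("GET", lookup_url("50"))
print("POST", f"{BASE_URL}/search?api-version={API_VERSION}")
print(filter_query_body("HotelId", "50"))
```

The lookup form returns the document itself, while the filtered search returns it wrapped in a result set, so the lookup is usually the simpler choice when you already know the key.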