How to use Cohere Embed V3 models with Azure AI Foundry

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In this article, you learn about Cohere Embed V3 models and how to use them with Azure AI Foundry. The Cohere family includes models optimized for different use cases such as chat completions, embeddings, and rerank, as well as for tasks like reasoning, summarization, and question answering.

Important

Models that are in preview are marked as preview on their model cards in the model catalog.

Cohere embedding models

The Cohere family of models for embeddings includes the following models:

Cohere Embed English is a multimodal (text and image) representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace MTEB (Massive Text Embedding Benchmark) and on use cases for various industries, such as finance, legal, and general-purpose corpora. Embed English also has the following attributes:

  • Embeddings have 1,024 dimensions
  • The context window of the model is 512 tokens
  • Embed English accepts images as a base64-encoded data URL

Image embeddings consume a fixed 1,000 tokens per image, which translates to a price of $0.0001 per image embedded. The size or resolution of the image doesn't affect the number of tokens consumed, provided the image is within the accepted dimensions, file size, and formats.
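To illustrate the format, the following is a minimal sketch of building a base64-encoded data URL from a local image with the Python standard library. The file path is hypothetical; see the Cohere documentation for how to submit image inputs to the model.

import base64
import mimetypes

# Hypothetical local image; adjust the path to your own file.
image_path = "images/sample-chart.png"
mime_type = mimetypes.guess_type(image_path)[0] or "image/png"

with open(image_path, "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

# A base64-encoded data URL like the one Embed English accepts for images.
image_data_url = f"data:{mime_type};base64,{encoded_image}"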

Prerequisites

To use Cohere Embed V3 models with Azure AI Foundry, you need the following prerequisites:

A model deployment

Deployment to serverless APIs

Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.

Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Foundry portal, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to deploy the model as a serverless API.

The inference package installed

You can consume predictions from this model by using the azure-ai-inference package with Python. To install this package, you need the following prerequisites:

  • Python 3.8 or later installed, including pip.
  • The endpoint URL. To construct the client, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
  • Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Once you have these prerequisites, install the Azure AI inference package with the following command:

pip install azure-ai-inference

Read more about the Azure AI inference package and its reference documentation.

Tip

Additionally, Cohere supports a tailored API for use with specific features of the model. To use the model-provider-specific API, see the Cohere documentation.

Work with embeddings

In this section, you use the Azure AI model inference API with an embeddings model.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

import os
from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential

model = EmbeddingsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]),
)
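If your deployment is configured for Microsoft Entra ID authentication instead of key-based authentication, a minimal sketch of the client creation looks like the following. It assumes the azure-identity package is installed and that the signed-in identity has access to the deployment; depending on your endpoint, you might also need to pass the appropriate credential scopes.

import os
from azure.ai.inference import EmbeddingsClient
from azure.identity import DefaultAzureCredential

# Assumes the identity running this code has been granted access to the deployment.
model = EmbeddingsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredential(),
)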

Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method:

model_info = model.get_model_info()

The response is as follows:

print("Model name:", model_info.model_name)
print("Model type:", model_info.model_type)
print("Model provider name:", model_info.model_provider)

Model name: Cohere-embed-v3-english
Model type: embeddings
Model provider name: Cohere

Create embeddings

Create an embedding request to see the output of the model.

response = model.embed(
    input=["The ultimate answer to the question of life"],
)

Tip

The context window for Cohere Embed V3 models is 512 tokens. Make sure that you don't exceed this limit when creating embeddings.
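As a rough guard against overly long inputs, you can estimate token counts before calling the model. The following sketch uses an approximate four-characters-per-token heuristic, which is only an assumption and not the model's actual tokenizer.

MAX_TOKENS = 512
APPROX_CHARS_PER_TOKEN = 4  # heuristic only; actual tokenization differs

def likely_exceeds_context(text: str) -> bool:
    return len(text) / APPROX_CHARS_PER_TOKEN > MAX_TOKENS

for text in ["The ultimate answer to the question of life"]:
    if likely_exceeds_context(text):
        print("Warning: input may exceed the 512-token context window")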

The response is as follows, where you can see the model's usage statistics:

import numpy as np

for embed in response.data:
    print("Embeding of size:", np.asarray(embed.embedding).shape)

print("Model:", response.model)
print("Usage:", response.usage)

It can be useful to compute embeddings in input batches. The parameter input can be a list of strings, where each string is a different input. In turn, the response is a list of embeddings, where each embedding corresponds to the input in the same position.

response = model.embed(
    input=[
        "The ultimate answer to the question of life", 
        "The largest planet in our solar system is Jupiter",
    ],
)

The response is as follows, where you can see the model's usage statistics:

import numpy as np

for embed in response.data:
    print("Embeding of size:", np.asarray(embed.embedding).shape)

print("Model:", response.model)
print("Usage:", response.usage)

Tip

Cohere Embed V3 models can take batches of up to 1,024 inputs at a time. When creating batches, make sure that you don't exceed this limit.
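If you have more inputs than fit in a single request, a simple approach is to split them into chunks and embed each chunk separately. The following sketch keeps each request within the limit; the helper name is hypothetical.

def embed_in_batches(texts, batch_size=1024):
    """Embed an arbitrarily long list of inputs in chunks of at most batch_size."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = model.embed(input=batch)
        embeddings.extend(item.embedding for item in response.data)
    return embeddings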

Create different types of embeddings

Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns.

The following example shows how to create embeddings for a document that will be stored in a vector database:

from azure.ai.inference.models import EmbeddingInputType

response = model.embed(
    input=["The answer to the ultimate question of life, the universe, and everything is 42"],
    input_type=EmbeddingInputType.DOCUMENT,
)

When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance.

from azure.ai.inference.models import EmbeddingInputType

response = model.embed(
    input=["What's the ultimate meaning of life?"],
    input_type=EmbeddingInputType.QUERY,
)

Cohere Embed V3 models optimize the embeddings they generate based on their intended use case.
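To illustrate how document and query embeddings work together, the following sketch ranks documents against a query with cosine similarity. In practice, you would typically store the document embeddings in a vector database rather than in memory.

import numpy as np
from azure.ai.inference.models import EmbeddingInputType

documents = [
    "The answer to the ultimate question of life, the universe, and everything is 42",
    "The largest planet in our solar system is Jupiter",
]

# Embed the documents and the query with matching input types.
doc_response = model.embed(input=documents, input_type=EmbeddingInputType.DOCUMENT)
query_response = model.embed(
    input=["What's the ultimate meaning of life?"],
    input_type=EmbeddingInputType.QUERY,
)

doc_vectors = np.asarray([item.embedding for item in doc_response.data])
query_vector = np.asarray(query_response.data[0].embedding)

# Cosine similarity: higher scores indicate closer matches.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print("Best match:", documents[int(np.argmax(scores))])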

Cohere embedding models

The Cohere family of models for embeddings includes the following models:

Cohere Embed English is a multimodal (text and image) representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace MTEB (Massive Text Embedding Benchmark) and on use cases for various industries, such as finance, legal, and general-purpose corpora. Embed English also has the following attributes:

  • Embeddings have 1,024 dimensions
  • The context window of the model is 512 tokens
  • Embed English accepts images as a base64-encoded data URL

Image embeddings consume a fixed 1,000 tokens per image, which translates to a price of $0.0001 per image embedded. The size or resolution of the image doesn't affect the number of tokens consumed, provided the image is within the accepted dimensions, file size, and formats.

Prerequisites

To use Cohere Embed V3 models with Azure AI Foundry, you need the following prerequisites:

A model deployment

Deployment to serverless APIs

Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.

Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Foundry portal, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to deploy the model as a serverless API.

The inference package installed

You can consume predictions from this model by using the @azure-rest/ai-inference package from npm. To install this package, you need the following prerequisites:

  • LTS versions of Node.js with npm.
  • The endpoint URL. To construct the client, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
  • Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command:

npm install @azure-rest/ai-inference

Tip

Additionally, Cohere supports a tailored API for use with specific features of the model. To use the model-provider-specific API, see the Cohere documentation.

Work with embeddings

In this section, you use the Azure AI model inference API with an embeddings model.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";

const client = new ModelClient(
    process.env.AZURE_INFERENCE_ENDPOINT, 
    new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL)
);

Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method:

await client.path("/info").get()

The response is as follows:

console.log("Model name: ", model_info.body.model_name);
console.log("Model type: ", model_info.body.model_type);
console.log("Model provider name: ", model_info.body.model_provider_name);

Model name: Cohere-embed-v3-english
Model type: embeddings
Model provider name: Cohere

Create embeddings

Create an embedding request to see the output of the model.

var response = await client.path("/embeddings").post({
    body: {
        input: ["The ultimate answer to the question of life"],
    }
});

Tip

The context window for Cohere Embed V3 models is 512 tokens. Make sure that you don't exceed this limit when creating embeddings.

The response is as follows, where you can see the model's usage statistics:

if (isUnexpected(response)) {
    throw response.body.error;
}

for (const item of response.body.data) {
    console.log("Embedding of size:", item.embedding.length);
}
console.log(response.body.model);
console.log(response.body.usage);

It can be useful to compute embeddings in input batches. The parameter input can be a list of strings, where each string is a different input. In turn, the response is a list of embeddings, where each embedding corresponds to the input in the same position.

var response = await client.path("/embeddings").post({
    body: {
        input: [
            "The ultimate answer to the question of life", 
            "The largest planet in our solar system is Jupiter",
        ],
    }
});

The response is as follows, where you can see the model's usage statistics:

if (isUnexpected(response)) {
    throw response.body.error;
}

for (const item of response.body.data) {
    console.log("Embedding of size:", item.embedding.length);
}
console.log(response.body.model);
console.log(response.body.usage);

Tip

Cohere Embed V3 models can take batches of up to 1,024 inputs at a time. When creating batches, make sure that you don't exceed this limit.

Create different types of embeddings

Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns.

The following example shows how to create embeddings for a document that will be stored in a vector database:

var response = await client.path("/embeddings").post({
    body: {
        input: ["The answer to the ultimate question of life, the universe, and everything is 42"],
        input_type: "document",
    }
});

When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance.

var response = await client.path("/embeddings").post({
    body: {
        input: ["What's the ultimate meaning of life?"],
        input_type: "query",
    }
});

Cohere Embed V3 models optimize the embeddings they generate based on their intended use case.

Cohere embedding models

The Cohere family of models for embeddings includes the following models:

Cohere Embed English is a multimodal (text and image) representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace MTEB (Massive Text Embedding Benchmark) and on use cases for various industries, such as finance, legal, and general-purpose corpora. Embed English also has the following attributes:

  • Embeddings have 1,024 dimensions
  • The context window of the model is 512 tokens
  • Embed English accepts images as a base64-encoded data URL

Image embeddings consume a fixed 1,000 tokens per image, which translates to a price of $0.0001 per image embedded. The size or resolution of the image doesn't affect the number of tokens consumed, provided the image is within the accepted dimensions, file size, and formats.

Prerequisites

To use Cohere Embed V3 models with Azure AI Foundry, you need the following prerequisites:

A model deployment

Deployment to serverless APIs

Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.

Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Foundry portal, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to deploy the model as a serverless API.

A REST client

Models deployed with the Azure AI model inference API can be consumed using any REST client. To use the REST client, you need the following prerequisites:

  • To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
  • Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Tip

Additionally, Cohere supports a tailored API for use with specific features of the model. To use the model-provider-specific API, see the Cohere documentation.

Work with embeddings

In this section, you use the Azure AI model inference API with an embeddings model.

Create a client to consume the model

When you consume the model through the REST API, you don't create a client. Instead, you send HTTP requests directly to the endpoint URL and authenticate them with your key or a Microsoft Entra ID token, as described in the prerequisites.

Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method:

GET /info HTTP/1.1
Host: <ENDPOINT_URI>
Authorization: Bearer <TOKEN>
Content-Type: application/json

The response is as follows:

{
    "model_name": "Cohere-embed-v3-english",
    "model_type": "embeddings",
    "model_provider_name": "Cohere"
}

Create embeddings

Create an embedding request by sending a POST request to the /embeddings route on the endpoint:

POST /embeddings HTTP/1.1
Host: <ENDPOINT_URI>
Authorization: Bearer <TOKEN>
Content-Type: application/json

{
    "input": [
        "The ultimate answer to the question of life"
    ]
}

Tip

The context window for Cohere Embed V3 models is 512 tokens. Make sure that you don't exceed this limit when creating embeddings.

The response is as follows, where you can see the model's usage statistics:

{
    "id": "0ab1234c-d5e6-7fgh-i890-j1234k123456",
    "object": "list",
    "data": [
        {
            "index": 0,
            "object": "embedding",
            "embedding": [
                0.017196655,
                // ...
                -0.000687122,
                -0.025054932,
                -0.015777588
            ]
        }
    ],
    "model": "Cohere-embed-v3-english",
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 0,
        "total_tokens": 9
    }
}

It can be useful to compute embeddings in input batches. The parameter input can be a list of strings, where each string is a different input. In turn, the response is a list of embeddings, where each embedding corresponds to the input in the same position.

{
    "input": [
        "The ultimate answer to the question of life", 
        "The largest planet in our solar system is Jupiter"
    ]
}

The response is as follows, where you can see the model's usage statistics:

{
    "id": "0ab1234c-d5e6-7fgh-i890-j1234k123456",
    "object": "list",
    "data": [
        {
            "index": 0,
            "object": "embedding",
            "embedding": [
                0.017196655,
                // ...
                -0.000687122,
                -0.025054932,
                -0.015777588
            ]
        },
        {
            "index": 1,
            "object": "embedding",
            "embedding": [
                0.017196655,
                // ...
                -0.000687122,
                -0.025054932,
                -0.015777588
            ]
        }
    ],
    "model": "Cohere-embed-v3-english",
    "usage": {
        "prompt_tokens": 19,
        "completion_tokens": 0,
        "total_tokens": 19
    }
}

Tip

Cohere Embed V3 models can take batches of up to 1,024 inputs at a time. When creating batches, make sure that you don't exceed this limit.

Create different types of embeddings

Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns.

The following example shows how to create embeddings for a document that will be stored in a vector database:

{
    "input": [
        "The answer to the ultimate question of life, the universe, and everything is 42"
    ],
    "input_type": "document"
}

When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance.

{
    "input": [
        "What's the ultimate meaning of life?"
    ],
    "input_type": "query"
}

Cohere Embed V3 models optimize the embeddings they generate based on their intended use case.

More inference examples

  • Web requests (Bash): cohere-embed.ipynb
  • Azure AI Inference package for JavaScript (JavaScript): Link
  • Azure AI Inference package for Python (Python): Link
  • OpenAI SDK (experimental) (Python): Link
  • LangChain (Python): Link
  • Cohere SDK (Python): Link
  • LiteLLM SDK (Python): Link

Retrieval Augmented Generation (RAG) and tool use samples

  • Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - LangChain (packages: langchain, langchain_cohere): cohere_faiss_langchain_embed.ipynb
  • Use Cohere Command R/R+ to answer questions from data in a local FAISS vector index - LangChain (packages: langchain, langchain_cohere): command_faiss_langchain.ipynb
  • Use Cohere Command R/R+ to answer questions from data in an AI Search vector index - LangChain (packages: langchain, langchain_cohere): cohere-aisearch-langchain-rag.ipynb
  • Use Cohere Command R/R+ to answer questions from data in an AI Search vector index - Cohere SDK (packages: cohere, azure_search_documents): cohere-aisearch-rag.ipynb
  • Command R+ tool/function calling, using LangChain (packages: cohere, langchain, langchain_cohere): command_tools-langchain.ipynb

Cost and quota considerations for Cohere family of models deployed as serverless API endpoints

Cohere models deployed as a serverless API are offered by Cohere through the Azure Marketplace and integrated with Azure AI Foundry for use. You can find the Azure Marketplace pricing when deploying the model.

Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently.

For more information on how to track costs, see Monitor costs for models offered through the Azure Marketplace.

Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios.
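If you exceed these limits, the service throttles requests and typically returns HTTP status code 429. A minimal retry-with-backoff sketch, assuming the Python EmbeddingsClient created earlier in this article and a hypothetical wrapper name, might look like the following:

import time
from azure.core.exceptions import HttpResponseError

def embed_with_retry(texts, max_retries=5):
    """Retry embedding requests that are throttled by the per-minute rate limits."""
    for attempt in range(max_retries):
        try:
            return model.embed(input=texts)
        except HttpResponseError as error:
            if error.status_code != 429 or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying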