Azure AI Inference client library for Python - version 1.0.0b9

02/15/2025

Use the Inference client library (in preview) to:

Authenticate against the service
Get information about the AI model
Do chat completions
Get text embeddings
Get image embeddings

The Inference client library supports AI models deployed to the following services:

GitHub Models - Free-tier endpoint for AI models from different providers
Serverless API endpoints and Managed Compute endpoints - AI models from different providers deployed from Azure AI Foundry. See Overview: Deploy models, flows, and web apps with Azure AI Foundry.
Azure OpenAI Service - OpenAI models deployed from Azure AI Foundry. See What is Azure OpenAI Service?. Although we recommend you use the official OpenAI client library in your production code for this service, you can use the Azure AI Inference client library to easily compare the performance of OpenAI models to other models, using the same client library and Python code.

The Inference client library makes service calls using REST API version 2024-05-01-preview, as documented in Azure AI Model Inference API.

Product documentation | Samples | API reference documentation | Package (PyPI) | SDK source code

Reporting issues

To report an issue with the client library, or request additional features, please open a GitHub issue here. Mention the package name "azure-ai-inference" in the title or content.

Getting started

Prerequisites

Python 3.8 or later installed, including pip.
For GitHub models
- The AI model name, such as "gpt-4o" or "mistral-large"
- A GitHub personal access token. Create one here. You do not need to give any permissions to the token. The token is a string that starts with github_pat_.
For Serverless API endpoints or Managed Compute endpoints
- An Azure subscription.
- An AI Model from the catalog deployed through Azure AI Foundry.
- The endpoint URL of your model, in the form https://<your-host-name>.<your-azure-region>.models.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (e.g. eastus2).
- Depending on your authentication preference, you either need an API key to authenticate against the service, or Entra ID credentials.
For Azure OpenAI (AOAI) service
- An Azure subscription.
- An OpenAI Model from the catalog deployed through Azure AI Foundry.
- The endpoint URL of your model, in the form https://<your-resource-name>.openai.azure.com/openai/deployments/<your-deployment-name>, where your-resource-name is your globally unique AOAI resource name, and your-deployment-name is your AI Model deployment name.
- Depending on your authentication preference, you either need an API key to authenticate against the service, or Entra ID credentials.
- An api-version. Latest preview or GA version listed in the Data plane - inference row in the API Specs table. At the time of writing, latest GA version was "2024-06-01".
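
The two endpoint URL shapes listed above can be assembled from their parts. The following is a minimal sketch; the helper names `serverless_endpoint` and `azure_openai_endpoint` are hypothetical and not part of the client library, and the values passed in are placeholders.

```python
def serverless_endpoint(host_name: str, region: str) -> str:
    """Build a Serverless API or Managed Compute endpoint URL."""
    return f"https://{host_name}.{region}.models.ai.azure.com"


def azure_openai_endpoint(resource_name: str, deployment_name: str) -> str:
    """Build an Azure OpenAI endpoint URL, which includes the deployment name."""
    return f"https://{resource_name}.openai.azure.com/openai/deployments/{deployment_name}"


# Placeholder values for illustration only.
print(serverless_endpoint("my-host", "eastus2"))
print(azure_openai_endpoint("my-resource", "my-deployment"))
```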

Install the package

To install the Azure AI Inferencing package use the following command:

pip install azure-ai-inference

To update an existing installation of the package, use:

pip install --upgrade azure-ai-inference

If you want to install the Azure AI Inferencing package with support for OpenTelemetry-based tracing, use the following command:

pip install azure-ai-inference[opentelemetry]

Key concepts

Create and authenticate a client directly, using API key or GitHub token

The package includes three clients: ChatCompletionsClient, EmbeddingsClient and ImageEmbeddingsClient. All are created in a similar manner. For example, assuming endpoint, key and github_token are strings holding your endpoint URL, API key or GitHub token, this Python code creates and authenticates a synchronous ChatCompletionsClient:

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# For GitHub models
client = ChatCompletionsClient(
    endpoint="https://models.inference.ai.azure.com",
    credential=AzureKeyCredential(github_token),
    model="mistral-large"  # Update as needed. Alternatively, you can include this in the `complete` call.
)

# For Serverless API or Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,  # Of the form https://<your-host-name>.<your-azure-region>.models.ai.azure.com
    credential=AzureKeyCredential(key)
)

# For Azure OpenAI endpoint
client = ChatCompletionsClient(
    endpoint=endpoint,  # Of the form https://<your-resource-name>.openai.azure.com/openai/deployments/<your-deployment-name>
    credential=AzureKeyCredential(key),
    api_version="2024-06-01",  # Azure OpenAI api-version. See https://aka.ms/azsdk/azure-ai-inference/azure-openai-api-versions
)
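
The snippets above assume `endpoint` and `key` already hold your values. One common way to supply them without hardcoding secrets is to read environment variables. This is a minimal sketch; the variable names `AZURE_AI_CHAT_ENDPOINT` and `AZURE_AI_CHAT_KEY` and the helper `read_settings` are assumptions chosen for illustration.

```python
import os
from typing import Mapping, Tuple


def read_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (endpoint, key) from the environment, failing early if either is missing."""
    try:
        endpoint = env["AZURE_AI_CHAT_ENDPOINT"]  # assumed variable name
        key = env["AZURE_AI_CHAT_KEY"]  # assumed variable name
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing.args[0]}") from None
    return endpoint, key


# Example with an explicit mapping instead of the real environment.
endpoint, key = read_settings({
    "AZURE_AI_CHAT_ENDPOINT": "https://my-host.eastus2.models.ai.azure.com",
    "AZURE_AI_CHAT_KEY": "placeholder-key",
})
print(endpoint)
```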

A synchronous client supports synchronous inference methods, meaning they will block until the service responds with inference results. For simplicity the code snippets below all use synchronous methods. The client offers equivalent asynchronous methods which are more commonly used in production.

To create an asynchronous client, install the additional package aiohttp:

pip install aiohttp

and update the code above to import asyncio, and import ChatCompletionsClient from the azure.ai.inference.aio namespace instead of azure.ai.inference. For example:

import asyncio
from azure.ai.inference.aio import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# For Serverless API or Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key)
)
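
Once an asynchronous client exists, its inference methods are awaited rather than called directly. The sketch below assumes the asynchronous `complete` method accepts the same `messages` argument as the synchronous one; the helper `ask` is hypothetical, and the client is passed in so the function itself holds no credentials.

```python
import asyncio


async def ask(client, question: str) -> str:
    """Send one user message with an asynchronous client and return the reply text."""
    response = await client.complete(
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


# Typical use, assuming `client` is the asynchronous ChatCompletionsClient created above:
#
#     async def main():
#         print(await ask(client, "How many feet are in a mile?"))
#         await client.close()
#
#     asyncio.run(main())
```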

Create and authenticate a client directly, using Entra ID

Note: At the time of writing, only Managed Compute endpoints and Azure OpenAI endpoints support Entra ID authentication.

To use an Entra ID token credential, first install the azure-identity package:

pip install azure-identity

You will need to provide the desired credential type obtained from that package. A common selection is DefaultAzureCredential and it can be used as follows:

from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

# For Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential(exclude_interactive_browser_credential=False)
)

# For Azure OpenAI endpoint
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential(exclude_interactive_browser_credential=False),
    credential_scopes=["https://cognitiveservices.azure.com/.default"],
    api_version="2024-06-01",  # Azure OpenAI api-version. See https://aka.ms/azsdk/azure-ai-inference/azure-openai-api-versions
)

During application development, you would typically set up the environment for authentication using Entra ID by first installing the Azure CLI, running az login in your console window, then entering your credentials in the browser window that was opened. The call to DefaultAzureCredential() will then succeed. Setting exclude_interactive_browser_credential=False in that call will enable launching a browser window if the user isn't already logged in.

Defining default settings while creating the clients

You can define default chat completions or embeddings configurations while constructing the relevant client. These configurations will be applied to all future service calls.

For example, here we create a ChatCompletionsClient using API key authentication, and apply two settings, temperature and max_tokens:

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# For Serverless API or Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
    temperature=0.5,
    max_tokens=1000
)

Default settings can be overridden in individual service calls.
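
To illustrate the override, the sketch below passes `temperature` and `max_tokens` directly to `complete`, so they apply to that one call in place of the client-level defaults. The helper name `short_factual_reply` and the chosen values are illustrative assumptions.

```python
def short_factual_reply(client, question: str) -> str:
    """Ask one question with per-call settings that override the client defaults."""
    response = client.complete(
        messages=[{"role": "user", "content": question}],
        temperature=0.0,  # replaces the client default (0.5 in the example above)
        max_tokens=100,  # replaces the client default (1000 in the example above)
    )
    return response.choices[0].message.content


# Typical use, assuming `client` is the ChatCompletionsClient created above:
#
#     print(short_factual_reply(client, "How many feet are in a mile?"))
```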

Create and authenticate clients using `load_client`

If you are using Serverless API or Managed Compute endpoints, there is an alternative to creating a specific client directly. You can instead use the function load_client to return the relevant client (of types ChatCompletionsClient or EmbeddingsClient) based on the provided endpoint:

from azure.ai.inference import load_client
from azure.core.credentials import AzureKeyCredential

# For Serverless API or Managed Compute endpoints only.
# This will not work on GitHub Models endpoint or Azure OpenAI endpoint.
client = load_client(
    endpoint=endpoint,
    credential=AzureKeyCredential(key)
)

print(f"Created client of type `{type(client).__name__}`.")

To load an asynchronous client, import the load_client function from azure.ai.inference.aio instead.

Entra ID authentication is also supported by the load_client function. Replace the key authentication above with credential=DefaultAzureCredential(exclude_interactive_browser_credential=False) for example.

Get AI model information

If you are using Serverless API or Managed Compute endpoints, you can call the client method get_model_info to retrieve AI model information. This makes a REST call to the /info route on the provided endpoint, as documented in the REST API reference. This call will not work for GitHub Models or Azure OpenAI endpoints.

model_info = client.get_model_info()

print(f"Model name: {model_info.model_name}")
print(f"Model provider name: {model_info.model_provider_name}")
print(f"Model type: {model_info.model_type}")

AI model information is cached in the client, and further calls to get_model_info will access the cached value and will not result in a REST API call. Note that if you created the client using the load_client function, model information will already be cached in the client.

AI model information is displayed (if available) when you print(client).

Chat Completions

The ChatCompletionsClient has a method named complete. The method makes a REST API call to the /chat/completions route on the provided endpoint, as documented in the REST API reference.

See simple chat completion examples below. More can be found in the samples folder.

Text Embeddings

The EmbeddingsClient has a method named embed. The method makes a REST API call to the /embeddings route on the provided endpoint, as documented in the REST API reference.

See the simple text embedding example below. More can be found in the samples folder.

Image Embeddings

The ImageEmbeddingsClient has a method named embed. The method makes a REST API call to the /images/embeddings route on the provided endpoint, as documented in the REST API reference.

See the simple image embedding example below. More can be found in the samples folder.

Examples

In the following sections you will find simple examples of:

Chat completions
Streaming chat completions
Adding model-specific parameters
Adding HTTP request headers
Text Embeddings
Image Embeddings

The examples create a synchronous client assuming a Serverless API or Managed Compute endpoint. Modify the client construction code as described in Key concepts to have it work with a GitHub Models endpoint or an Azure OpenAI endpoint. Only mandatory input settings are shown for simplicity.

See the Samples folder for full working samples for synchronous and asynchronous clients.

Chat completions example

This example demonstrates how to generate a single chat completion, for a Serverless API or Managed Compute endpoint, with key authentication, assuming endpoint and key are already defined. For Entra ID authentication, GitHub models endpoint or Azure OpenAI endpoint, modify the code to create the client as specified in the above sections.

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(endpoint=endpoint, credential=AzureKeyCredential(key))

response = client.complete(
    messages=[
        SystemMessage("You are a helpful assistant."),
        UserMessage("How many feet are in a mile?"),
    ],
)

print(response.choices[0].message.content)
print(f"\nToken usage: {response.usage}")

The following types of messages are supported: SystemMessage, UserMessage, AssistantMessage, ToolMessage, DeveloperMessage. See also samples:

sample_chat_completions_with_tools.py for usage of ToolMessage.
sample_chat_completions_with_image_url.py for usage of UserMessage that includes sending an image URL.
sample_chat_completions_with_image_data.py for usage of UserMessage that includes sending image data read from a local file.
sample_chat_completions_with_audio_data.py for usage of UserMessage that includes sending audio data read from a local file.
sample_chat_completions_with_structured_output.py and sample_chat_completions_with_structured_output_pydantic.py for configuring the service to respond with a JSON-formatted string, adhering to your schema.

Alternatively, you can provide the full request body as a Python dictionary (dict object) instead of using the strongly typed classes like SystemMessage and UserMessage:

response = client.complete(
    {
        "messages": [
            {
                "role": "system",
                "content": "You are an AI assistant that helps people find information. Your replies are short, no more than two sentences.",
            },
            {
                "role": "user",
                "content": "What year was construction of the International Space Station mostly done?",
            },
            {
                "role": "assistant",
                "content": "The main construction of the International Space Station (ISS) was completed between 1998 and 2011. During this period, more than 30 flights by US space shuttles and 40 by Russian rockets were conducted to transport components and modules to the station.",
            },
            {"role": "user", "content": "And what was the estimated cost to build it?"},
        ]
    }
)

Or you can provide just the messages input argument as a list of Python dict:

response = client.complete(
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant that helps people find information.",
        },
        {
            "role": "user",
            "content": "How many feet are in a mile?",
        },
    ]
)

To generate completions for additional messages, simply call client.complete multiple times using the same client.
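
For a multi-turn conversation, each call needs the earlier turns, as in the dictionary example above that includes a previous assistant reply. A minimal sketch of keeping that history follows; the helper `chat_turn` is hypothetical and uses plain dict messages.

```python
from typing import Dict, List


def chat_turn(client, history: List[Dict[str, str]], user_text: str) -> str:
    """Append the user's message, call the service, and record the assistant's reply."""
    history.append({"role": "user", "content": user_text})
    response = client.complete(messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply


# Typical use, assuming `client` is the ChatCompletionsClient created above:
#
#     history = [{"role": "system", "content": "You are a helpful assistant."}]
#     print(chat_turn(client, history, "How many feet are in a mile?"))
#     print(chat_turn(client, history, "And how many yards?"))
```

Because `history` is mutated in place, the second call sends the first question and its answer along with the new question.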

Streaming chat completions example

This example demonstrates how to generate a single chat completion with a streaming response, for a Serverless API or Managed Compute endpoint, with key authentication, assuming endpoint and key are already defined. You simply need to add stream=True to the complete call to enable streaming.

For Entra ID authentication, GitHub models endpoint or Azure OpenAI endpoint, modify the code to create the client as specified in the above sections.

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(endpoint=endpoint, credential=AzureKeyCredential(key))

response = client.complete(
    stream=True,
    messages=[
        SystemMessage("You are a helpful assistant."),
        UserMessage("Give me 5 good reasons why I should exercise every day."),
    ],
)

for update in response:
    if update.choices and update.choices[0].delta:
        print(update.choices[0].delta.content or "", end="", flush=True)
    if update.usage:
        print(f"\n\nToken usage: {update.usage}")

client.close()

In the above for loop that prints the results you should see the answer progressively get longer as updates get streamed to the client.
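
If you need the full reply as one string rather than printed fragments, the same loop can accumulate the `delta.content` pieces. This is a minimal sketch: `collect_stream` is a hypothetical helper name, and the simulated updates only stand in for a real streaming response.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(updates: Iterable) -> str:
    """Join the text fragments of a streamed chat completion into one string."""
    parts = []
    for update in updates:
        # Guard before indexing, as the loop above does: an update may have no choices.
        if update.choices and update.choices[0].delta:
            parts.append(update.choices[0].delta.content or "")
    return "".join(parts)


def _fake_update(text):
    """Build an object shaped like a streaming update, for offline illustration only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


simulated = [
    _fake_update("Exercise "),
    _fake_update(None),
    _fake_update("helps."),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(simulated)
print(full_text)
```

With a real client, you would pass the object returned by `client.complete(stream=True, ...)` in place of `simulated`.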

To generate completions for additional messages, simply call client.complete multiple times using the same client.

Adding model-specific parameters

In this example, extra JSON elements are inserted at the root of the request body by setting model_extras when calling the complete method of the ChatCompletionsClient. These are intended for AI models that require additional model-specific parameters beyond what is defined in the REST API Request Body table.

response = client.complete(
    messages=[
        SystemMessage("You are a helpful assistant."),
        UserMessage("How many feet are in a mile?"),
    ],
    model_extras={"key1": "value1", "key2": "value2"},  # Optional. Additional parameters to pass to the model.
)

In the above example, this will be the JSON payload in the HTTP request:

{
    "messages":
    [
        {"role":"system","content":"You are a helpful assistant."},
        {"role":"user","content":"How many feet are in a mile?"}
    ],
    "key1": "value1",
    "key2": "value2"
}

Note that by default, the service will reject any request payload that includes extra parameters. In order to change the default service behaviour, when the complete method includes model_extras, the client library will automatically add the HTTP request header "extra-parameters": "pass-through".

Use the same method to add additional parameters in the request of other clients in this package.
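
The effect of `model_extras` on the request can be modelled in a few lines. This sketch only illustrates the behaviour described above and is not the client library's implementation; `build_request` is a hypothetical name.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def build_request(
    messages: List[Dict[str, str]],
    model_extras: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (payload, headers) showing how extras land at the root of the body."""
    payload: Dict[str, Any] = {"messages": messages}
    headers: Dict[str, str] = {}
    if model_extras:
        payload.update(model_extras)  # extras become top-level JSON elements
        headers["extra-parameters"] = "pass-through"  # asks the service to accept them
    return payload, headers


payload, headers = build_request(
    [{"role": "user", "content": "How many feet are in a mile?"}],
    model_extras={"key1": "value1", "key2": "value2"},
)
print(json.dumps(payload, indent=4))
print(headers)
```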

Adding HTTP request headers

To add your own HTTP request headers, include a headers keyword in the client constructor, and specify a dict with your header names and values. For example:

client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
    headers={"header1": "value1", "header2": "value2"}
)

And similarly for the other clients in this package.

Text Embeddings example

This example demonstrates how to get text embeddings, for a Serverless API or Managed Compute endpoint, with key authentication, assuming endpoint and key are already defined. For Entra ID authentication, GitHub models endpoint or Azure OpenAI endpoint, modify the code to create the client as specified in the above sections.

from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential

client = EmbeddingsClient(endpoint=endpoint, credential=AzureKeyCredential(key))

response = client.embed(input=["first phrase", "second phrase", "third phrase"])

for item in response.data:
    length = len(item.embedding)
    print(
        f"data[{item.index}]: length={length}, [{item.embedding[0]}, {item.embedding[1]}, "
        f"..., {item.embedding[length-2]}, {item.embedding[length-1]}]"
    )

The length of the embedding vector depends on the model, but you should see something like this:

data[0]: length=1024, [0.0013399124, -0.01576233, ..., 0.007843018, 0.000238657]
data[1]: length=1024, [0.036590576, -0.0059547424, ..., 0.011405945, 0.004863739]
data[2]: length=1024, [0.04196167, 0.029083252, ..., -0.0027484894, 0.0073127747]

To generate embeddings for additional phrases, simply call client.embed multiple times using the same client.
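
A common next step with text embeddings is comparing two vectors. A minimal sketch using cosine similarity on plain Python lists follows; it assumes each `item.embedding` is a list of floats, as the printed output above suggests, and `cosine_similarity` is a hypothetical helper rather than part of the client library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Tiny hand-made vectors stand in for real embeddings here.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With a real response, you could compare two phrases with `cosine_similarity(response.data[0].embedding, response.data[1].embedding)`.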

Image Embeddings example

This example demonstrates how to get image embeddings, for a Serverless API or Managed Compute endpoint, with key authentication, assuming endpoint and key are already defined. For Entra ID authentication, GitHub models endpoint or Azure OpenAI endpoint, modify the code to create the client as specified in the above sections.

from azure.ai.inference import ImageEmbeddingsClient
from azure.ai.inference.models import ImageEmbeddingInput
from azure.core.credentials import AzureKeyCredential

client = ImageEmbeddingsClient(endpoint=endpoint, credential=AzureKeyCredential(key))

response = client.embed(input=[ImageEmbeddingInput.load(image_file="sample1.png", image_format="png")])

for item in response.data:
    length = len(item.embedding)
    print(
        f"data[{item.index}]: length={length}, [{item.embedding[0]}, {item.embedding[1]}, "
        f"..., {item.embedding[length-2]}, {item.embedding[length-1]}]"
    )

The length of the embedding vector depends on the model, but you should see something like this:

data[0]: length=1024, [0.0103302, -0.04425049, ..., -0.011543274, -0.0009088516]

To generate image embeddings for additional images, simply call client.embed multiple times using the same client.

Troubleshooting

Exceptions

The complete, embed and get_model_info methods on the clients raise an HttpResponseError exception for a non-success HTTP status code response from the service. The exception's status_code will hold the HTTP response status code (with reason showing the friendly name). The exception's error.message contains a detailed message that may be helpful in diagnosing the issue:

from azure.core.exceptions import HttpResponseError

...

try:
    result = client.complete( ... )
except HttpResponseError as e:
    print(f"Status code: {e.status_code} ({e.reason})")
    print(e.message)

For example, when you provide a wrong authentication key:

Status code: 401 (Unauthorized)
Operation returned an invalid status 'Unauthorized'

Or when you create an EmbeddingsClient and call embed on the client, but the endpoint does not support the /embeddings route:

Status code: 405 (Method Not Allowed)
Operation returned an invalid status 'Method Not Allowed'
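
When handling these errors in application code, it can help to turn the status code into an actionable hint. The sketch below reads only the `status_code` and `reason` attributes described above; the function name `describe_failure` and the hint wording are assumptions.

```python
def describe_failure(error) -> str:
    """Return a short hint for an exception that exposes status_code and reason."""
    hints = {
        401: "check the API key or token",
        405: "the endpoint may not support this route; check the client type",
    }
    hint = hints.get(error.status_code, "see the exception message for details")
    return f"{error.status_code} ({error.reason}): {hint}"


# Typical use inside the except block shown above:
#
#     except HttpResponseError as e:
#         print(describe_failure(e))
```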

Logging

The client uses the standard Python logging library. The SDK logs HTTP request and response details, which may be useful in troubleshooting. To log to stdout, add the following:

import sys
import logging

# Acquire the logger for this client library. Use 'azure' to affect both
# 'azure.core' and 'azure.ai.inference' libraries.
logger = logging.getLogger("azure")

# Set the desired logging level. logging.INFO or logging.DEBUG are good options.
logger.setLevel(logging.DEBUG)

# Direct logging output to stdout:
handler = logging.StreamHandler(stream=sys.stdout)
# Or direct logging output to a file:
# handler = logging.FileHandler(filename="sample.log")
logger.addHandler(handler)

# Optional: change the default logging format. Here we add a timestamp.
formatter = logging.Formatter("%(asctime)s:%(levelname)s:%(name)s:%(message)s")
handler.setFormatter(formatter)

By default logs redact the values of URL query strings, the values of some HTTP request and response headers (including Authorization which holds the key or token), and the request and response payloads. To create logs without redaction, do these two things:

Set the method argument logging_enable=True when you construct the client, or when you call the client's complete or embed methods.
```
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
    logging_enable=True
)
```
Set the log level to logging.DEBUG. Logs will be redacted with any other log level.

Be sure to protect non-redacted logs to avoid compromising security.

For more information, see Configure logging in the Azure libraries for Python.

Reporting issues

To report an issue with the client library, or request additional features, please open a GitHub issue here. Mention "azure-ai-inference" in the title or content.

Observability With OpenTelemetry

The Azure AI Inference client library provides experimental support for tracing with OpenTelemetry.

You can capture prompt and completion contents by setting the AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED environment variable to true (case insensitive). By default, prompts, completions, function names, parameters and outputs are not recorded.
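
The case-insensitive rule can be illustrated with a small check. This sketch only mirrors the rule as documented here and is not the library's own code; `content_recording_requested` is a hypothetical helper.

```python
import os
from typing import Mapping

CONTENT_RECORDING_VAR = "AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED"


def content_recording_requested(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when the variable is set to "true" in any letter case."""
    return env.get(CONTENT_RECORDING_VAR, "").lower() == "true"


# Explicit mappings are used here so the example does not depend on your shell.
print(content_recording_requested({CONTENT_RECORDING_VAR: "True"}))
print(content_recording_requested({}))
```

To opt in from Python rather than the shell, you could set `os.environ[CONTENT_RECORDING_VAR] = "true"` before enabling instrumentation.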

Setup with Azure Monitor

When using the Azure AI Inference library with the Azure Monitor OpenTelemetry Distro, distributed tracing for Azure AI Inference calls is enabled by default when using the latest version of the distro.

Setup with OpenTelemetry

Check out your observability vendor documentation on how to configure OpenTelemetry or refer to the official OpenTelemetry documentation.

Installation

Make sure to install OpenTelemetry and the Azure SDK tracing plugin via

pip install opentelemetry-sdk
pip install azure-core-tracing-opentelemetry

You will also need an exporter to send telemetry to your observability backend. You can print traces to the console or use a local viewer such as Aspire Dashboard.

To connect to Aspire Dashboard or another OpenTelemetry-compatible backend, install the OTLP exporter:

pip install opentelemetry-exporter-otlp

Configuration

To enable Azure SDK tracing, set the AZURE_SDK_TRACING_IMPLEMENTATION environment variable to opentelemetry.

Or configure it in the code with the following snippet:

from azure.core.settings import settings

settings.tracing_implementation = "opentelemetry"

Please refer to azure-core-tracing-documentation for more information.

The final step is to enable Azure AI Inference instrumentation with the following code snippet:

from azure.ai.inference.tracing import AIInferenceInstrumentor

# Instrument AI Inference API
AIInferenceInstrumentor().instrument()

It is also possible to uninstrument the Azure AI Inferencing API by using the uninstrument call. After this call, the traces will no longer be emitted by the Azure AI Inferencing API until instrument is called again.

AIInferenceInstrumentor().uninstrument()

Tracing Your Own Functions

The @tracer.start_as_current_span decorator can be used to trace your own functions. This will trace the function parameters and their values. You can also add further attributes to the span in the function implementation as demonstrated below. Note that you will have to set up the tracer in your code before using the decorator. More information is available here.

from opentelemetry import trace
from opentelemetry.trace import get_tracer

tracer = get_tracer(__name__)


# The tracer.start_as_current_span decorator will trace the function call and enable adding additional attributes
# to the span in the function implementation. Note that this will trace the function parameters and their values.
@tracer.start_as_current_span("get_temperature")  # type: ignore
def get_temperature(city: str) -> str:

    # Adding attributes to the current span
    span = trace.get_current_span()
    span.set_attribute("requested_city", city)

    if city == "Seattle":
        return "75"
    elif city == "New York City":
        return "80"
    else:
        return "Unavailable"

Next steps

Have a look at the Samples folder, containing fully runnable Python code for doing inference using synchronous and asynchronous clients.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
