Model inference endpoint in Azure AI Services

Azure AI model inference in Azure AI services allows customers to consume the most powerful models from flagship model providers using a single endpoint and set of credentials. This means that you can switch between models and consume them from your application without changing a single line of code.

This article explains how models are organized inside the service and how to use the inference endpoint to invoke them.

Deployments

Azure AI model inference makes models available using the concept of deployments. A deployment gives a model a name under a specific configuration. You can then invoke that model configuration by indicating the deployment name in your requests.

Deployments capture:

  • A model name
  • A model version
  • A provisioning/capacity type1
  • A content filtering configuration1
  • A rate limiting configuration1

1 Configurations may vary depending on the selected model.
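To make the concept concrete, a deployment's configuration can be sketched as a simple data structure. This is an illustrative sketch only; the field names and values are assumptions for explanation, not the actual Azure resource schema:

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    """Illustrative sketch of what a model deployment captures.

    Field names are assumptions for explanation, not the real
    Azure resource schema.
    """
    name: str                  # the alias you reference in requests
    model_name: str
    model_version: str
    sku: str                   # provisioning/capacity type
    content_filter: str
    rate_limit_tokens_per_minute: int

# The same model can be deployed twice under different configurations
# (names and values here are hypothetical):
fast = Deployment("mistral-large-fast", "Mistral-large", "2407",
                  "GlobalStandard", "default", 200_000)
safe = Deployment("mistral-large-safe", "Mistral-large", "2407",
                  "GlobalStandard", "strict", 50_000)

print(fast.model_name == safe.model_name)  # True: same model, two deployments
```

The key point is that the deployment name, not the model name, is what you reference in requests, which is why one model can exist under several configurations.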

An Azure AI services resource can have as many model deployments as needed, and they don't incur cost unless inference is performed against them. Deployments are Azure resources, and hence they're subject to Azure policies.

To learn more about how to create deployments, see Add and configure model deployments.

Azure AI inference endpoint

The Azure AI inference endpoint allows customers to use a single endpoint, with the same authentication and schema, to generate inference for the deployed models in the resource. This endpoint follows the Azure AI model inference API, which all the models in Azure AI model inference support.

You can see the endpoint URL and credentials in the Overview section:

Screenshot showing how to get the URL and key associated with the resource.

Routing

The inference endpoint routes requests to a given deployment by matching the model parameter inside the request to the name of the deployment. This means that a deployment acts as an alias for a given model under a certain configuration. This flexibility allows you to deploy the same model multiple times in the service, under different configurations if needed.

An illustration showing how routing works for a Meta-llama-3.2-8b-instruct model by indicating such name in the parameter 'model' inside of the payload request.

For example, if you create a deployment named Mistral-large, you can invoke it as follows:

Install the package azure-ai-inference using your package manager, like pip:

pip install "azure-ai-inference>=1.0.0b5"

Warning

Azure AI Services resources require version azure-ai-inference>=1.0.0b5 of the Python package.

Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:

import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZUREAI_ENDPOINT_URL"],
    credential=AzureKeyCredential(os.environ["AZUREAI_ENDPOINT_KEY"]),
)

Then, use the client to make a chat completions request, indicating the target deployment in the model parameter:

from azure.ai.inference.models import SystemMessage, UserMessage

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Explain Riemann's conjecture in 1 paragraph"),
    ],
    model="mistral-large"
)

print(response.choices[0].message.content)

Tip

Deployment routing isn't case sensitive.
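The case-insensitive matching can be illustrated with a small sketch. The helper below is hypothetical and only demonstrates the behavior; it is not the actual service implementation:

```python
def resolve_deployment(model: str, deployments: dict[str, str]) -> str:
    """Illustrative sketch of case-insensitive deployment routing.

    Hypothetical helper, not the actual service implementation:
    matches the request's `model` value against deployment names,
    ignoring case.
    """
    for name, target in deployments.items():
        if name.lower() == model.lower():
            return target
    raise KeyError(f"No deployment named '{model}'")

# Hypothetical deployment table: alias -> deployed model version
deployments = {"Mistral-large": "Mistral-large-2407"}

# Both casings route to the same deployment:
print(resolve_deployment("mistral-large", deployments))
print(resolve_deployment("MISTRAL-LARGE", deployments))
```

This is why the earlier example can pass model="mistral-large" even though the deployment was created as Mistral-large.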

SDKs

The Azure AI model inference endpoint is supported by multiple SDKs, including the Azure AI Inference SDK, the Azure AI Foundry SDK, and the Azure OpenAI SDK, which are available in multiple languages. Integrations are also supported in popular frameworks like LangChain, LangGraph, LlamaIndex, Semantic Kernel, and AG2. See supported programming languages and SDKs for details.

Azure OpenAI inference endpoint

Azure OpenAI models deployed to AI services also support the Azure OpenAI API. This API exposes the full capabilities of OpenAI models and supports additional features like assistants, threads, files, and batch inference.

Azure OpenAI inference endpoints work at the deployment level, and each deployment has its own associated URL. However, the same authentication mechanism can be used to consume them. Learn more in the Azure OpenAI API reference page.

An illustration showing how Azure OpenAI deployments contain a single URL for each deployment.

Each deployment has a URL that is the concatenation of the Azure OpenAI base URL and the route /deployments/<model-deployment-name>.
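For example, the per-deployment URL can be built as a simple string concatenation. The resource name and deployment name below are hypothetical:

```python
# Hypothetical resource and deployment names, for illustration only
base_url = "https://my-resource.openai.azure.com/openai"
deployment_name = "gpt-4o-deployment"

# Each Azure OpenAI deployment is addressed at its own URL:
deployment_url = f"{base_url}/deployments/{deployment_name}"
print(deployment_url)
# https://my-resource.openai.azure.com/openai/deployments/gpt-4o-deployment
```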

Important

There's no routing mechanism for the Azure OpenAI endpoint, as each URL is exclusive to a single model deployment.

SDKs

The Azure OpenAI endpoint is supported by the OpenAI SDK (AzureOpenAI class) and Azure OpenAI SDKs, which are available in multiple languages. See supported languages for details.

Next steps