Model inference endpoint in Azure AI Services
Azure AI model inference in Azure AI services allows customers to consume the most powerful models from flagship model providers using a single endpoint and credentials. This means that you can switch between models and consume them from your application without changing a single line of code.
The article explains how models are organized inside of the service and how to use the inference endpoint to invoke them.
Deployments
Azure AI model inference makes models available using the deployment concept. Deployments are a way to give a model a name under certain configurations. Then, you can invoke such model configuration by indicating its name on your requests.
Deployments capture:
- A model name
- A model version
- A provisioning/capacity type1
- A content filtering configuration1
- A rate limiting configuration1
1 Configurations may vary depending on the selected model.
An Azure AI services resource can have as many model deployments as needed and they don't incur in cost unless inference is performed for those models. Deployments are Azure resources and hence they're subject to Azure policies.
To learn more about how to create deployments see Add and configure model deployments.
Azure AI inference endpoint
The Azure AI inference endpoint allows customers to use a single endpoint with the same authentication and schema to generate inference for the deployed models in the resource. This endpoint follows the Azure AI model inference API which all the models in Azure AI model inference support.
You can see the endpoint URL and credentials in the Overview section:
Routing
The inference endpoint routes requests to a given deployment by matching the parameter name
inside of the request to the name of the deployment. This means that deployments work as an alias of a given model under certain configurations. This flexibility allows you to deploy a given model multiple times in the service but under different configurations if needed.
For example, if you create a deployment named Mistral-large
, then such deployment can be invoked as:
Install the package azure-ai-inference
using your package manager, like pip:
pip install azure-ai-inference>=1.0.0b5
Warning
Azure AI Services resource requires the version azure-ai-inference>=1.0.0b5
for Python.
Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
model = ChatCompletionsClient(
endpoint=os.environ["AZUREAI_ENDPOINT_URL"],
credential=AzureKeyCredential(os.environ["AZUREAI_ENDPOINT_KEY"]),
)
Explore our samples and read the API reference documentation to get yourself started.
from azure.ai.inference.models import SystemMessage, UserMessage
response = client.complete(
messages=[
SystemMessage(content="You are a helpful assistant."),
UserMessage(content="Explain Riemann's conjecture in 1 paragraph"),
],
model="mistral-large"
)
print(response.choices[0].message.content)
Tip
Deployment routing isn't case sensitive.
SDKs
The Azure AI model inference endpoint is supported by multiple SDKs, including the Azure AI Inference SDK, the Azure AI Foundry SDK, and the Azure OpenAI SDK; which are available in multiple languages. Multiple integrations are also supported in popular frameworks like LangChain, LangGraph, Llama-Index, Semantic Kernel, and AG2. See supported programming languages and SDKs for details.
Azure OpenAI inference endpoint
Azure OpenAI models deployed to AI services also support the Azure OpenAI API. This API exposes the full capabilities of OpenAI models and supports additional features like assistants, threads, files, and batch inference.
Azure OpenAI inference endpoints work at the deployment level and they have their own URL that is associated with each of them. However, the same authentication mechanism can be used to consume them. Learn more in the reference page for Azure OpenAI API
Each deployment has a URL that is the concatenations of the Azure OpenAI base URL and the route /deployments/<model-deployment-name>
.
Important
There's no routing mechanism for the Azure OpenAI endpoint, as each URL is exclusive for each model deployment.
SDKs
The Azure OpenAI endpoint is supported by the OpenAI SDK (AzureOpenAI
class) and Azure OpenAI SDKs, which are available in multiple languages. See supported languages for details.