Use the Azure AI model inference endpoint to consume models

Azure AI model inference in Azure AI services allows customers to consume the most powerful models from flagship model providers using a single endpoint and credentials. This means that you can switch between models and consume them from your application without changing a single line of code.

This article explains how to use the inference endpoint to invoke the models deployed in your resource.

Endpoints

Azure AI services expose multiple endpoints depending on the type of work you're looking for:

  • Azure AI model inference endpoint
  • Azure OpenAI endpoint

The Azure AI inference endpoint (usually with the form https://<resource-name>.services.ai.azure.com/models) allows customers to use a single endpoint with the same authentication and schema to generate inference for the deployed models in the resource. All the models support this capability. This endpoint follows the Azure AI model inference API.
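
To illustrate the idea of a single endpoint and credential, the following minimal sketch calls the endpoint directly over REST with the requests package. The /chat/completions path, the api-version value, and the api-key header are assumptions based on the Azure AI model inference API reference; confirm them there before relying on this snippet.

import os
import requests

# Assumed route and API version; verify against the Azure AI model inference API reference.
endpoint = "https://<resource-name>.services.ai.azure.com/models"
url = f"{endpoint}/chat/completions?api-version=2024-05-01-preview"

payload = {
    "model": "<deployment-name>",  # name of a deployment in your resource
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Riemann's conjecture in 1 paragraph"},
    ],
}

response = requests.post(
    url,
    headers={"api-key": os.environ["AZUREAI_ENDPOINT_KEY"]},
    json=payload,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])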

Azure OpenAI models deployed to AI services also support the Azure OpenAI API (usually with the form https://<resource-name>.openai.azure.com). This endpoint exposes the full capabilities of OpenAI models and supports more features like assistants, threads, files, and batch inference.

To learn more about how to use the Azure OpenAI endpoint, see the Azure OpenAI Service documentation.
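
As a brief illustration of the difference between the two endpoints, the following sketch targets the Azure OpenAI endpoint with the openai Python package instead of the azure-ai-inference package. The deployment name, API version, and prompt are placeholders rather than values from this article; substitute the ones that apply to your resource.

import os
from openai import AzureOpenAI

# Client for the Azure OpenAI endpoint of the same resource (placeholder values).
client = AzureOpenAI(
    azure_endpoint="https://<resource-name>.openai.azure.com",
    api_key=os.environ["AZUREAI_ENDPOINT_KEY"],
    api_version="2024-06-01",  # assumed; use an API version supported by your resource
)

response = client.chat.completions.create(
    model="<deployment-name>",  # name of an Azure OpenAI model deployment
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)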

Using the routing capability in the Azure AI model inference endpoint

The inference endpoint routes each request to a deployment by matching the value of the model parameter in the request to the name of a deployment. This means that a deployment acts as an alias for a given model under a particular configuration. This flexibility allows you to deploy the same model multiple times in the service, each under a different configuration, if needed.

Diagram: requests are routed to the Meta-llama-3.2-8b-instruct deployment by specifying that name in the model parameter of the request payload.

For example, if you create a deployment named Mistral-large, you can invoke it as shown in the following example.

Install the package azure-ai-inference using your package manager, like pip:

pip install "azure-ai-inference>=1.0.0b5"

Warning

The Azure AI Services resource requires version azure-ai-inference>=1.0.0b5 for Python.

Then, you can use the package to consume the model. The following example shows how to create a client to consume chat completions:

import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Create a client for the Azure AI model inference endpoint of your resource.
client = ChatCompletionsClient(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential(os.environ["AZUREAI_ENDPOINT_KEY"]),
)

Explore our samples and read the API reference documentation to get started.

For a chat model, you can create a request as follows:

from azure.ai.inference.models import SystemMessage, UserMessage

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Explain Riemann's conjecture in 1 paragraph"),
    ],
    model="Mistral-large",  # name of the deployment to route the request to
)

print(response.choices[0].message.content)
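
Because routing is based only on the model parameter, switching to a different deployment in the same resource doesn't require a new client or endpoint, just a different value for model. The deployment names in the following sketch are examples; replace them with deployments that actually exist in your resource.

# Reuse the same client and messages; only the deployment name changes.
messages = [
    SystemMessage(content="You are a helpful assistant."),
    UserMessage(content="Explain Riemann's conjecture in 1 paragraph"),
]

for deployment in ["Mistral-large", "Meta-llama-3.2-8b-instruct"]:  # example deployment names
    response = client.complete(messages=messages, model=deployment)
    print(f"{deployment}: {response.choices[0].message.content[:80]}")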

If you specify a model name that doesn't match any model deployment, you get an error indicating that the model doesn't exist. You can control which models are available to users by creating model deployments, as explained in add and configure model deployments.
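
If you prefer to handle that case in code, the following sketch catches the service error raised by the client. HttpResponseError comes from azure-core; the deployment name shown is intentionally nonexistent, and the exact status code and message returned by the service may vary.

from azure.core.exceptions import HttpResponseError

try:
    client.complete(
        messages=[UserMessage(content="Say hello")],
        model="deployment-that-does-not-exist",  # hypothetical, unmatched name
    )
except HttpResponseError as ex:
    # No deployment in the resource matches the requested model name.
    print("Request failed:", ex.status_code, ex.message)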

Limitations

  • Azure OpenAI Batch can't be used with the Azure AI model inference endpoint. You have to use the dedicated deployment URL, as explained in Batch API support in the Azure OpenAI documentation.
  • The Realtime API isn't supported in the inference endpoint. Use the dedicated deployment URL.

Next steps