Deploy a fine-tuned model for inferencing

Once your model is fine-tuned, you can deploy it and use it in your own application.

When you deploy the model, you make it available for inferencing, which incurs an hourly hosting charge. Fine-tuned models, however, can be stored in Azure AI Foundry at no cost until you're ready to use them.

Azure OpenAI offers a choice of deployment types for fine-tuned models, so you can pick the hosting structure that fits your business and usage patterns: Standard, Global Standard (preview), and Provisioned Managed (preview). Learn more about deployment types for fine-tuned models and the concepts of all deployment types.

Deploy your fine-tuned model

To deploy your custom model, select the model that you want to deploy, and then select Deploy.

The Deploy model dialog box opens. In the dialog box, enter your Deployment name and then select Create to start the deployment of your custom model.

Screenshot that shows how to deploy a custom model in Azure AI Foundry portal.

You can monitor the progress of your deployment on the Deployments pane in Azure AI Foundry portal.

The Azure AI Foundry portal UI doesn't support cross-region deployment; the Python SDK and the REST API do.
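
As a rough illustration of the Python SDK route, the following minimal sketch creates a deployment with the azure-mgmt-cognitiveservices management package. The resource names, the Standard SKU, and the capacity value are illustrative placeholders, not values from this article; the source property points at the Azure OpenAI resource that owns the fine-tuned model, which is what allows the destination resource to live in a different region.

# Minimal sketch of a cross-region deployment with the Python management SDK.
# Assumes: pip install azure-identity azure-mgmt-cognitiveservices
# All <...> values and the Standard SKU/capacity are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku,
)

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION>",
)

poller = client.deployments.begin_create_or_update(
    resource_group_name="<RESOURCE_GROUP>",
    account_name="<DESTINATION_RESOURCE_NAME>",
    deployment_name="<MODEL_DEPLOYMENT_NAME>",
    deployment=Deployment(
        sku=Sku(name="Standard", capacity=1),
        properties=DeploymentProperties(
            model=DeploymentModel(
                format="OpenAI",
                name="<FINE_TUNED_MODEL_NAME>",
                version="1",
                # Resource that owns the fine-tuned model (may be in another region).
                source="/subscriptions/<SUBSCRIPTION>/resourceGroups/<SOURCE_RESOURCE_GROUP>/providers/Microsoft.CognitiveServices/accounts/<SOURCE_RESOURCE_NAME>",
            )
        ),
    ),
)
poller.result()  # blocks until the deployment operation finishes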

Important

After you deploy a customized model, the deployment is deleted if it remains inactive for more than fifteen (15) days. A deployment is inactive if the model was deployed more than fifteen (15) days ago and no completions or chat completions calls were made to it during a continuous 15-day period.

The deletion of an inactive deployment doesn't delete or affect the underlying customized model, which can be redeployed at any time. As described in Azure OpenAI Service pricing, each customized (fine-tuned) model that's deployed incurs an hourly hosting cost regardless of whether completions or chat completions calls are being made to the model. To learn more about planning and managing costs with Azure OpenAI, refer to the guidance in Plan to manage costs for Azure OpenAI Service.

Use your deployed fine-tuned model

After your custom model deploys, you can use it like any other deployed model. You can use the Playgrounds in the Azure AI Foundry portal to experiment with your new deployment, and you can continue to use the same parameters, such as temperature and max_tokens, that you use with other deployed models.

Screenshot of the Playground pane in Azure AI Foundry portal, with sections highlighted.
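
Outside the Playgrounds, you can call the deployment programmatically. The following minimal sketch uses the openai Python package (v1 or later) against a chat-completions deployment; the environment variable names and the deployment name placeholder are assumptions for illustration.

import os
from openai import AzureOpenAI  # requires the openai Python package, v1+

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# Call the fine-tuned model by its *deployment name*, not the base model name.
response = client.chat.completions.create(
    model="<MODEL_DEPLOYMENT_NAME>",  # your deployment name from the portal
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of fine-tuning."},
    ],
    temperature=0.7,
    max_tokens=200,
)
print(response.choices[0].message.content)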

Prompt caching

Azure OpenAI fine-tuning supports prompt caching with select models. Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. To learn more about prompt caching, see getting started with prompt caching.
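
In practice, caching only helps when successive requests start with an identical prefix (and the prefix is long enough to qualify, typically 1,024 tokens or more). The following minimal sketch shows that request shape; STATIC_SYSTEM_PROMPT and the ask helper are assumed placeholders, and client is the AzureOpenAI client from the earlier example.

# Keep the long, unchanging instructions at the *start* of the prompt so that
# repeated requests share an identical, cacheable prefix. Only the user
# question varies at the end. STATIC_SYSTEM_PROMPT is an assumed placeholder.
STATIC_SYSTEM_PROMPT = "<long, unchanging instructions and reference material>"

def ask(client, deployment_name: str, question: str):
    return client.chat.completions.create(
        model=deployment_name,
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # identical prefix
            {"role": "user", "content": question},                # variable suffix
        ],
    )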

Deployment Types

Azure OpenAI fine-tuning supports the following deployment types.

Standard

Standard deployments provide a pay-per-call billing model. Model availability varies by region, and throughput may be limited.

Model | Regions
GPT-4o-finetune | East US2, North Central US, Sweden Central
gpt-4o-mini-2024-07-18 | North Central US, Sweden Central
GPT-4-finetune | North Central US, Sweden Central
GPT-35-Turbo-finetune | East US2, North Central US, Sweden Central, Switzerland West
GPT-35-Turbo-1106-finetune | East US2, North Central US, Sweden Central, Switzerland West
GPT-35-Turbo-0125-finetune | East US2, North Central US, Sweden Central, Switzerland West

Global Standard (preview)

Model | Regions
GPT-4o-finetune | East US2, North Central US, Sweden Central
GPT-4o-mini-finetune | East US2, North Central US, Sweden Central

Global standard fine-tuned deployments offer cost savings, but custom model weights may temporarily be stored outside the geography of your Azure OpenAI resource.

Screenshot of the global standard deployment user experience with a fine-tuned model.

Global Standard fine-tuning deployments currently don't support vision or structured outputs.

Provisioned Managed (preview)

Model | Regions
GPT-4o-finetune (gpt-4o-2024-08-06) | North Central US, Switzerland West
GPT-4o-mini-finetune (gpt-4o-mini-2024-07-18) | North Central US, Switzerland West

Provisioned managed deployments offer predictable performance for fine-tuned models. As part of the public preview, provisioned managed deployments can be created regionally via the REST API, version 2024-10-01 or newer. See the examples below.

Provisioned Managed fine-tuning deployments currently don't support vision or structured outputs.

Creating a Provisioned Managed deployment

To create a new deployment, make an HTTP PUT call via the Deployments - Create or Update REST API. The approach is similar to performing a cross-region deployment, with the following exceptions:

  • You must provide a SKU name of ProvisionedStandard.
  • The capacity must be declared in PTUs.
  • The api-version must be 2024-10-01 or newer.
  • The HTTP method should be PUT.

For example, to deploy a gpt-4o-mini model:

curl -X PUT "https://management.azure.com/subscriptions/<SUBSCRIPTION>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.CognitiveServices/accounts/<RESOURCE_NAME>/deployments/<MODEL_DEPLOYMENT_NAME>?api-version=2024-10-01" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "sku": {"name": "ProvisionedStandard", "capacity": 25},
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": "gpt-4omini-ft-model-name",
            "version": "1",
            "source": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/{SourceResourceGroupName}/providers/Microsoft.CognitiveServices/accounts/{SourceAOAIAccountName}"
        }
    }
  }'

Scaling a fine-tuned model on Provisioned Managed

To scale a fine-tuned provisioned managed deployment to increase or decrease its PTU capacity, make the same PUT REST API call that you used to create the deployment, and provide an updated capacity value for the sku. Keep in mind that provisioned deployments must scale in minimum PTU increments.

For example, to scale the model deployed in the previous section from 25 to 40 PTUs, make another PUT call and increase the capacity:

curl -X PUT "https://management.azure.com/subscriptions/<SUBSCRIPTION>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.CognitiveServices/accounts/<RESOURCE_NAME>/deployments/<MODEL_DEPLOYMENT_NAME>?api-version=2024-10-01" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "sku": {"name": "ProvisionedStandard", "capacity": 40},
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": "gpt-4omini-ft-model-name",
            "version": "1",
            "source": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/{SourceResourceGroupName}/providers/Microsoft.CognitiveServices/accounts/{SourceAOAIAccountName}"
        }
    }
  }'

Clean up your deployment

To delete a deployment, use the Deployments - Delete REST API and send an HTTP DELETE request to the deployment resource. As when you create a deployment, you must include the following parameters:

  • Azure subscription ID
  • Azure resource group name
  • Azure OpenAI resource name
  • Name of the deployment to delete

The following REST API example deletes a deployment:

curl -X DELETE "https://management.azure.com/subscriptions/<SUBSCRIPTION>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.CognitiveServices/accounts/<RESOURCE_NAME>/deployments/<MODEL_DEPLOYMENT_NAME>?api-version=2024-10-21" \
  -H "Authorization: Bearer <TOKEN>"

You can also delete a deployment in the Azure AI Foundry portal, or by using the Azure CLI.
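
If you're scripting cleanup from Python instead, a minimal sketch that reuses the CognitiveServicesManagementClient from the cross-region example earlier in this article (placeholder names are assumptions):

# Delete the deployment; the underlying fine-tuned model isn't affected.
client.deployments.begin_delete(
    resource_group_name="<RESOURCE_GROUP>",
    account_name="<RESOURCE_NAME>",
    deployment_name="<MODEL_DEPLOYMENT_NAME>",
).result()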

Next steps