How to use the Meta Llama family of models with Azure Machine Learning studio

In this article, you learn about the Meta Llama family of models (LLMs). Meta Llama models and tools are a collection of pretrained and fine-tuned generative AI text and image reasoning models. They range in scale from small language models (SLMs) with 1B and 3B Base and Instruct variants for on-device and edge inferencing, to mid-size LLMs with 7B, 8B, and 70B Base and Instruct variants, to high-performance models like Meta Llama 3.1 405B Instruct for synthetic data generation and distillation use cases.

Tip

See the announcements of Meta's Llama 3.2 family of models, available now in the Azure AI Model Catalog, on Meta's blog and the Microsoft Tech Community Blog.

See the following GitHub samples to explore integrations with LangChain, LiteLLM, OpenAI and the Azure API.

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Meta Llama family of models

The Meta Llama family of models includes the following models:

The Llama 3.2 collection of SLMs and image reasoning models is now available. Llama 3.2 11B Vision Instruct and Llama 3.2 90B Vision Instruct will soon be available as serverless API endpoints via Models-as-a-Service. The following models are available for deployment via managed compute:

  • Llama 3.2 1B
  • Llama 3.2 3B
  • Llama 3.2 1B Instruct
  • Llama 3.2 3B Instruct
  • Llama Guard 3 1B
  • Llama Guard 3 11B Vision
  • Llama 3.2 11B Vision Instruct
  • Llama 3.2 90B Vision Instruct

Prerequisites

  • An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a paid Azure account to begin.

  • An Azure Machine Learning workspace and a compute instance. If you don't have these, use the steps in the Quickstart: Create workspace resources article to create them. The serverless API model deployment offering for Meta Llama 3.1 and Llama 3 is only available with workspaces created in these regions:

    • East US
    • East US 2
    • North Central US
    • South Central US
    • West US
    • West US 3
    • Sweden Central

    For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see Region availability for models in serverless API endpoints.

  • Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the Owner or Contributor role for the Azure subscription. Alternatively, your account can be assigned a custom role that has the following permissions:

    • On the Azure subscription—to subscribe the workspace to the Azure Marketplace offering, once for each workspace, per offering:

      • Microsoft.MarketplaceOrdering/agreements/offers/plans/read
      • Microsoft.MarketplaceOrdering/agreements/offers/plans/sign/action
      • Microsoft.MarketplaceOrdering/offerTypes/publishers/offers/plans/agreements/read
      • Microsoft.Marketplace/offerTypes/publishers/offers/plans/agreements/read
      • Microsoft.SaaS/register/action
    • On the resource group—to create and use the SaaS resource:

      • Microsoft.SaaS/resources/read
      • Microsoft.SaaS/resources/write
    • On the workspace—to deploy endpoints (the Azure Machine Learning data scientist role contains these permissions already):

      • Microsoft.MachineLearningServices/workspaces/marketplaceModelSubscriptions/*
      • Microsoft.MachineLearningServices/workspaces/serverlessEndpoints/*

    For more information on permissions, see Manage access to an Azure Machine Learning workspace.
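
If you prefer to create such a custom role in code, the following is a minimal sketch of a role definition JSON that bundles the actions listed above. The role name, description, and subscription ID are placeholders that you would replace with your own values; you can then create the role with the Azure CLI command az role definition create --role-definition @role.json.

{
    "Name": "Serverless Model Deployment Operator",
    "IsCustom": true,
    "Description": "Subscribe workspaces to Azure Marketplace model offerings and deploy serverless endpoints.",
    "Actions": [
        "Microsoft.MarketplaceOrdering/agreements/offers/plans/read",
        "Microsoft.MarketplaceOrdering/agreements/offers/plans/sign/action",
        "Microsoft.MarketplaceOrdering/offerTypes/publishers/offers/plans/agreements/read",
        "Microsoft.Marketplace/offerTypes/publishers/offers/plans/agreements/read",
        "Microsoft.SaaS/register/action",
        "Microsoft.SaaS/resources/read",
        "Microsoft.SaaS/resources/write",
        "Microsoft.MachineLearningServices/workspaces/marketplaceModelSubscriptions/*",
        "Microsoft.MachineLearningServices/workspaces/serverlessEndpoints/*"
    ],
    "NotActions": [],
    "AssignableScopes": [ "/subscriptions/<subscription-id>" ]
}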

Create a new deployment

To create a deployment:

  1. Go to Azure Machine Learning studio.

  2. Select the workspace in which you want to deploy your models. To use the pay-as-you-go model deployment offering, your workspace must belong to one of the available regions listed in the prerequisites of this article.

  3. Choose Meta-Llama-3.1-405B-Instruct to deploy from the model catalog.

    Alternatively, you can initiate deployment by going to your workspace and selecting Endpoints > Serverless endpoints > Create.

  4. On the Details page for Meta-Llama-3.1-405B-Instruct, select Deploy and then select Serverless API with Azure AI Content Safety.

  5. On the deployment wizard, select the link to Azure Marketplace Terms to learn more about the terms of use. You can also select the Marketplace offer details tab to learn about pricing for the selected model.

  6. If this is your first time deploying the model in the workspace, you have to subscribe your workspace for the particular offering (for example, Meta-Llama-3.1-405B-Instruct) from Azure Marketplace. This step requires that your account has the Azure subscription permissions and resource group permissions listed in the prerequisites. Each workspace has its own subscription to the particular Azure Marketplace offering, which allows you to control and monitor spending. Select Subscribe and Deploy.

    Note

    Subscribing a workspace to a particular Azure Marketplace offering (in this case, Meta-Llama-3.1-405B-Instruct) requires that your account has Contributor or Owner access at the subscription level where the project is created. Alternatively, your user account can be assigned a custom role that has the Azure subscription permissions and resource group permissions listed in the prerequisites.

  7. Once you subscribe the workspace to the particular Azure Marketplace offering, subsequent deployments of the same offering in the same workspace don't require subscribing again. Therefore, you don't need subscription-level permissions for subsequent deployments. If this scenario applies to you, select Continue to deploy.

  8. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region.

  9. Select Deploy. Wait until the deployment is finished and you're redirected to the serverless endpoints page.

  10. Select the endpoint to open its Details page.

  11. Select the Test tab to start interacting with the model.

  12. You can also take note of the Target URL and the Secret Key to call the deployment and generate completions.

  13. You can always find the endpoint's details, URL, and access keys by navigating to Workspace > Endpoints > Serverless endpoints.

To learn about billing for Meta Llama models deployed as a serverless API, see Cost and quota considerations for Meta Llama models deployed as a serverless API.
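
If you prefer to script the deployment instead of using the studio UI, the following is a hedged sketch that uses the azure-ai-ml (v2) Python SDK. It assumes a recent SDK version that exposes the MarketplaceSubscription and ServerlessEndpoint entities; class and operation names can differ between SDK versions, so treat it as an outline rather than a definitive implementation. The subscription, resource group, workspace, and endpoint names are placeholders.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import MarketplaceSubscription, ServerlessEndpoint
from azure.identity import DefaultAzureCredential

# Connect to the workspace (placeholder values).
client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Model ID from the azureml-meta registry in the model catalog.
model_id = "azureml://registries/azureml-meta/models/Meta-Llama-3.1-405B-Instruct"

# Subscribe the workspace to the Azure Marketplace offering (needed only for the first deployment).
subscription = MarketplaceSubscription(model_id=model_id, name="Meta-Llama-3-1-405B-Instruct")
client.marketplace_subscriptions.begin_create_or_update(subscription).result()

# Create the serverless endpoint for the model.
endpoint = ServerlessEndpoint(name="meta-llama-405b-endpoint", model_id=model_id)
endpoint = client.serverless_endpoints.begin_create_or_update(endpoint).result()

# Retrieve the endpoint URL and keys to call the deployment.
keys = client.serverless_endpoints.get_keys(endpoint.name)
print(endpoint.scoring_uri, keys.primary_key)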

Consume Meta Llama models as a service

Models deployed as a service can be consumed using either the chat or the completions API, depending on the type of model you deployed.

  1. In the workspace, select Endpoints > Serverless endpoints.

  2. Find and select the Meta-Llama-3.1-405B-Instruct deployment you created.

  3. Copy the Target URL and the Key token values.

  4. Make an API request based on the type of model you deployed.

    For more information on using the APIs, see the reference section.
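
As an illustration, the following is a minimal sketch that calls a chat model deployment with Python's requests library, using the Llama Chat API route described in the reference section that follows. The endpoint URL and key are placeholders; use the Target URL and Key values that you copied from the deployment.

import requests

# Placeholders: use the Target URL and key copied from the serverless endpoint.
endpoint_url = "https://<your-deployment-name>.<region>.models.ai.azure.com"
api_key = "<your-endpoint-key>"

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the distance to the moon?"}
    ],
    "temperature": 0.8,
    "max_tokens": 512
}

response = requests.post(
    f"{endpoint_url}/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    json=payload,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])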

Reference for Meta Llama 3.1 models deployed as a serverless API

Llama models accept requests through both the Azure AI Model Inference API on the route /chat/completions and a Llama Chat API on /v1/chat/completions. In the same way, text completions can be generated by using either the Azure AI Model Inference API on the route /completions or a Llama Completions API on /v1/completions.

The Azure AI Model Inference API schema can be found in the reference for Chat Completions article and an OpenAPI specification can be obtained from the endpoint itself.

Completions API

Use the method POST to send the request to the /v1/completions route:

Request

POST /v1/completions HTTP/1.1
Host: <DEPLOYMENT_URI>
Authorization: Bearer <TOKEN>
Content-type: application/json

Request schema

Payload is a JSON formatted string containing the following parameters:

Key Type Default Description
prompt string No default. This value must be specified. The prompt to send to the model.
stream boolean False Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available.
max_tokens integer 16 The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens can't exceed the model's context length.
top_p float 1 An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering top_p or temperature, but not both.
temperature float 1 The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly from the distribution of tokens. Zero means greedy sampling. We recommend altering this or top_p, but not both.
n integer 1 How many completions to generate for each prompt.
Note: Because this parameter generates many completions, it can quickly consume your token quota.
stop array null A string or list of strings that cause the API to stop generating further tokens when they're generated. The returned text doesn't contain the stop sequence.
best_of integer 1 Generates best_of completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with n, best_of controls the number of candidate completions and n specifies how many to return—best_of must be greater than n.
Note: Because this parameter generates many completions, it can quickly consume your token quota.
logprobs integer null The number of most likely tokens to return log probabilities for, along with the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. The API always returns the logprob of the sampled token, so there might be up to logprobs+1 elements in the response.
presence_penalty float null Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
ignore_eos boolean True Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
use_beam_search boolean False Whether to use beam search instead of sampling. In that case, best_of must be greater than 1 and temperature must be 0.
stop_token_ids array null List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens.
skip_special_tokens boolean null Whether to skip special tokens in the output.

Example

Body

{
    "prompt": "What's the distance to the moon?",
    "temperature": 0.8,
    "max_tokens": 512
}
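
As a sketch, you might send this body with Python's requests library, where <DEPLOYMENT_URI> and <TOKEN> are placeholders for your endpoint's host and key:

import requests

# Placeholders for the serverless endpoint host and key.
response = requests.post(
    "https://<DEPLOYMENT_URI>/v1/completions",
    headers={
        "Authorization": "Bearer <TOKEN>",
        "Content-Type": "application/json",
    },
    json={
        "prompt": "What's the distance to the moon?",
        "temperature": 0.8,
        "max_tokens": 512,
    },
)
print(response.json()["choices"][0]["text"])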

Response schema

The response payload is a dictionary with the following fields.

Key Type Description
id string A unique identifier for the completion.
choices array The list of completion choices the model generated for the input prompt.
created integer The Unix timestamp (in seconds) of when the completion was created.
model string The model_id used for completion.
object string The object type, which is always text_completion.
usage object Usage statistics for the completion request.

Tip

In streaming mode, finish_reason is null for each chunk of the response except the last one; the stream is terminated by the payload [DONE].

The choices object is a dictionary with the following fields.

Key Type Description
index integer Choice index. When best_of > 1, the index in this array might not be in order and might not be 0 to n-1.
text string Completion result.
finish_reason string The reason the model stopped generating tokens:
- stop: model hit a natural stop point, or a provided stop sequence.
- length: the maximum number of tokens was reached.
- content_filter: RAI moderated the content and CMP forced moderation.
- content_filter_error: an error occurred during moderation, and a decision couldn't be made on the response.
- null: API response still in progress or incomplete.
logprobs object The log probabilities of the generated tokens in the output text.

The usage object is a dictionary with the following fields.

Key Type Value
prompt_tokens integer Number of tokens in the prompt.
completion_tokens integer Number of tokens generated in the completion.
total_tokens integer Total tokens.

The logprobs object is a dictionary with the following fields:

Key Type Value
text_offsets array of integers The position or index of each token in the completion output.
token_logprobs array of float The log probabilities of the selected tokens, taken from the dictionaries in the top_logprobs array.
tokens array of string The selected tokens.
top_logprobs array of dictionary An array of dictionaries. In each dictionary, the key is a token and the value is its log probability.

Example

{
    "id": "12345678-1234-1234-1234-abcdefghijkl",
    "object": "text_completion",
    "created": 217877,
    "choices": [
        {
            "index": 0,
            "text": "The Moon is an average of 238,855 miles away from Earth, which is about 30 Earths away.",
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 7,
        "total_tokens": 23,
        "completion_tokens": 16
    }
}

Chat API

Use the method POST to send the request to the /v1/chat/completions route:

Request

POST /v1/chat/completions HTTP/1.1
Host: <DEPLOYMENT_URI>
Authorization: Bearer <TOKEN>
Content-type: application/json

Request schema

Payload is a JSON formatted string containing the following parameters:

Key Type Default Description
messages array No default. This value must be specified. The message or history of messages to use to prompt the model.
stream boolean False Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available.
max_tokens integer 16 The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens can't exceed the model's context length.
top_p float 1 An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering top_p or temperature, but not both.
temperature float 1 The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly from the distribution of tokens. Zero means greedy sampling. We recommend altering this or top_p, but not both.
n integer 1 How many completions to generate for each prompt.
Note: Because this parameter generates many completions, it can quickly consume your token quota.
stop array null A string or list of strings that cause the API to stop generating further tokens when they're generated. The returned text doesn't contain the stop sequence.
best_of integer 1 Generates best_of completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with n, best_of controls the number of candidate completions and n specifies how many to return—best_of must be greater than n.
Note: Because this parameter generates many completions, it can quickly consume your token quota.
logprobs integer null The number of most likely tokens to return log probabilities for, along with the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. The API always returns the logprob of the sampled token, so there might be up to logprobs+1 elements in the response.
presence_penalty float null Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
ignore_eos boolean True Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
use_beam_search boolean False Whether to use beam search instead of sampling. In that case, best_of must be greater than 1 and temperature must be 0.
stop_token_ids array null List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens.
skip_special_tokens boolean null Whether to skip special tokens in the output.

The messages object has the following fields:

Key Type Value
content string The contents of the message. Content is required for all messages.
role string The role of the message's author. One of system, user, or assistant.

Example

Body

{
    "messages":
    [
        {
            "role": "system",
            "content": "You are a helpful assistant that translates English to Italian."
        },
        {
            "role": "user",
            "content": "Translate the following sentence from English to Italian: I love programming."
        }
    ],
    "temperature": 0.8,
    "max_tokens": 512
}

Response schema

The response payload is a dictionary with the following fields.

Key Type Description
id string A unique identifier for the completion.
choices array The list of completion choices the model generated for the input messages.
created integer The Unix timestamp (in seconds) of when the completion was created.
model string The model_id used for completion.
object string The object type, which is always chat.completion.
usage object Usage statistics for the completion request.

Tip

In streaming mode, finish_reason is null for each chunk of the response except the last one; the stream is terminated by the payload [DONE]. In each choices object, the messages key is replaced by delta.
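
As an illustration, the following is a hedged sketch of reading the streamed response with Python's requests library. It assumes the data-only server-sent events convention, in which each line is prefixed with data: and the stream ends with the [DONE] payload; adjust the parsing if your client library handles this for you. The host and token are placeholders.

import json
import requests

# Placeholders for the serverless endpoint host and key.
response = requests.post(
    "https://<DEPLOYMENT_URI>/v1/chat/completions",
    headers={
        "Authorization": "Bearer <TOKEN>",
        "Content-Type": "application/json",
    },
    json={
        "messages": [{"role": "user", "content": "What's the distance to the moon?"}],
        "max_tokens": 512,
        "stream": True,
    },
    stream=True,
)

for raw_line in response.iter_lines():
    if not raw_line:
        continue
    line = raw_line.decode("utf-8")
    # Server-sent events prefix each payload with "data:".
    if line.startswith("data:"):
        line = line[len("data:"):].strip()
    if line == "[DONE]":
        break
    # In streaming mode, each choice carries a delta object instead of messages.
    delta = json.loads(line)["choices"][0]["delta"]
    print(delta.get("content", ""), end="")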

The choices object is a dictionary with the following fields.

Key Type Description
index integer Choice index. When best_of > 1, the index in this array might not be in order and might not be 0 to n-1.
messages or delta string The chat completion result in the messages object. When streaming mode is used, the delta key is used instead.
finish_reason string The reason the model stopped generating tokens:
- stop: model hit a natural stop point or a provided stop sequence.
- length: the maximum number of tokens was reached.
- content_filter: RAI moderated the content and CMP forced moderation.
- content_filter_error: an error occurred during moderation, and a decision couldn't be made on the response.
- null: API response still in progress or incomplete.
logprobs object The log probabilities of the generated tokens in the output text.

The usage object is a dictionary with the following fields.

Key Type Value
prompt_tokens integer Number of tokens in the prompt.
completion_tokens integer Number of tokens generated in the completion.
total_tokens integer Total tokens.

The logprobs object is a dictionary with the following fields:

Key Type Value
text_offsets array of integers The position or index of each token in the completion output.
token_logprobs array of float The log probabilities of the selected tokens, taken from the dictionaries in the top_logprobs array.
tokens array of string The selected tokens.
top_logprobs array of dictionary An array of dictionaries. In each dictionary, the key is a token and the value is its log probability.

Example

The following is an example response:

{
    "id": "12345678-1234-1234-1234-abcdefghijkl",
    "object": "chat.completion",
    "created": 2012359,
    "model": "",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "Sure, I\'d be happy to help! The translation of ""I love programming"" from English to Italian is:\n\n""Amo la programmazione.""\n\nHere\'s a breakdown of the translation:\n\n* ""I love"" in English becomes ""Amo"" in Italian.\n* ""programming"" in English becomes ""la programmazione"" in Italian.\n\nI hope that helps! Let me know if you have any other sentences you\'d like me to translate."
            }
        }
    ],
    "usage": {
        "prompt_tokens": 10,
        "total_tokens": 40,
        "completion_tokens": 30
    }
}

Deploy Meta Llama models to managed compute

Apart from deploying with the pay-as-you-go managed service, you can also deploy Meta Llama 3.1 models to managed compute in Azure Machine Learning studio. When deployed to managed compute, you can configure all the details of the infrastructure running the model, including the virtual machines to use and the number of instances to handle the load you're expecting. Models deployed to managed compute consume quota from your subscription. The following models from the 3.1 release wave are available on managed compute:

  • Meta-Llama-3.1-8B-Instruct (fine-tuning supported)
  • Meta-Llama-3.1-70B-Instruct (fine-tuning supported)
  • Meta-Llama-3.1-8B (fine-tuning supported)
  • Meta-Llama-3.1-70B (fine-tuning supported)
  • Llama Guard 3 8B
  • Prompt Guard

Create a new deployment

Follow these steps to deploy a model such as Meta-Llama-3.1-70B-Instruct to a managed compute in Azure Machine Learning studio.

  1. Select the workspace in which you want to deploy the model.

  2. Choose the model that you want to deploy from the studio's model catalog.

    Alternatively, you can initiate deployment by going to your workspace and selecting Endpoints > Managed compute > Create.

  3. On the model's overview page, select Deploy and then Managed Compute without Azure AI Content Safety.

  4. On the Deploy with Azure AI Content Safety (preview) page, select Skip Azure AI Content Safety so that you can continue to deploy the model using the UI.

    Tip

    In general, we recommend that you select Enable Azure AI Content Safety (Recommended) for deployment of the Meta Llama model. This deployment option is currently only supported using the Python SDK and it happens in a notebook.

  5. Select Proceed.

    Tip

    If you don't have enough quota available in the selected project, you can use the option I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.

  6. Select the Virtual machine and the Instance count that you want to assign to the deployment.

  7. Select if you want to create this deployment as part of a new endpoint or an existing one. Endpoints can host multiple deployments while keeping resource configuration exclusive for each of them. Deployments under the same endpoint share the endpoint URI and its access keys.

  8. Indicate if you want to enable Inferencing data collection (preview).

  9. Indicate if you want to enable Package Model (preview).

  10. Select Deploy. After a few moments, the endpoint's Details page opens up.

  11. Wait for the endpoint creation and deployment to finish. This step can take a few minutes.

  12. Select the endpoint's Consume page to obtain code samples that you can use to consume the deployed model in your application.

For more information on how to deploy models to managed compute using the studio, see Deploying foundation models to endpoints for inferencing.

Consume Meta Llama models deployed to managed compute

For reference about how to invoke Meta Llama 3 models deployed to managed compute, see the model's card in Azure Machine Learning studio model catalog. Each model's card has an overview page that includes a description of the model, samples for code-based inferencing, fine-tuning, and model evaluation.
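
For illustration only, the following is a minimal sketch of calling a managed online endpoint over REST with Python's requests library. The scoring URI, key, and deployment name are placeholders from the endpoint's Consume tab, and the request body shown here is an assumed text-generation payload; always match the body to the sample request shown on the model card or the endpoint's Consume tab.

import requests

# Placeholders: copy the scoring URI, key, and deployment name from the Consume tab.
scoring_uri = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
api_key = "<your-endpoint-key>"

# Assumed payload shape; confirm it against the sample request on the Consume tab.
payload = {
    "input_data": {
        "input_string": ["What's the distance to the moon?"],
        "parameters": {"temperature": 0.8, "max_new_tokens": 256},
    }
}

response = requests.post(
    scoring_uri,
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        # Optional: route the request to a specific deployment under the endpoint.
        "azureml-model-deployment": "<your-deployment-name>",
    },
    json=payload,
)
print(response.json())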

Additional inference examples

Package Sample Notebook
CLI using CURL and Python web requests webrequests.ipynb
OpenAI SDK (experimental) openaisdk.ipynb
LangChain langchain.ipynb
LiteLLM SDK litellm.ipynb

Cost and quotas

Cost and quota considerations for Meta Llama 3.1 models deployed as a serverless API

Meta Llama 3.1 models deployed as a serverless API are offered by Meta through Azure Marketplace and integrated with Azure Machine Learning studio for use. You can find Azure Marketplace pricing when deploying or fine-tuning models.

Each time a workspace subscribes to a given model offering from Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference and fine-tuning; however, multiple meters are available to track each scenario independently.

For more information on how to track costs, see Monitor costs for models offered through the Azure Marketplace.

A screenshot showing different resources corresponding to different model offerings and their associated meters.

Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios.

Cost and quota considerations for Meta Llama 3.1 models deployed to managed compute

For deployment and inferencing of Meta Llama 3.1 models with managed compute, you consume virtual machine (VM) core quota that is assigned to your subscription on a per-region basis. When you sign up for Azure Machine Learning, you receive a default VM quota for several VM families available in the region. You can continue to create deployments until you reach your quota limit. Once you reach this limit, you can request a quota increase.

Content filtering

Models deployed as a serverless API are protected by Azure AI content safety. When deployed to managed compute, you can opt out of this capability. With Azure AI content safety enabled, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering (preview) system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about Azure AI Content Safety.