Featured models of Azure AI Foundry

The Azure AI model catalog offers a large selection of models from a wide range of providers, with various options for deployment. This article lists featured models in the model catalog that can be deployed and hosted on Microsoft's servers via serverless APIs. Some of these models can also be hosted on your own infrastructure for deployment via managed compute. For the list of models in the catalog that are available for deployment via managed compute or serverless API, see Available models for supported deployment options.

Important

Models that are in preview are marked as preview on their model cards in the model catalog.

Some models, such as Nixtla's TimeGEN-1 and Cohere rerank, require you to use custom APIs from the model providers to perform inferencing. Other models support inferencing by using the Azure AI model inference.

You can find more details about individual models by reviewing their model cards in the model catalog for Azure AI Foundry portal.
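As an illustration, models that support the Azure AI model inference accept an OpenAI-compatible chat-completions request. The following sketch builds such a request body in Python; the model name is just one example from the tables below, and the helper function is illustrative rather than part of any SDK:

```python
def build_chat_request(model: str, system_prompt: str, user_prompt: str,
                       max_tokens: int = 512) -> dict:
    """Illustrative request body for POST {endpoint}/chat/completions
    on an Azure AI model inference endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }

# Example body for one of the chat-completion models listed in this article
body = build_chat_request("AI21-Jamba-1.5-Mini",
                          "You are a helpful assistant.",
                          "What is a serverless API?")
print(body["messages"][1]["role"])
```

In practice you would send this body with your endpoint URL and API key, for example through the Azure AI Inference packages listed in the sample tables later in this article.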

An animation showing the model catalog section of the Azure AI Foundry portal and the available models.

AI21 Labs

The Jamba family models are AI21's production-grade Mamba-based large language models (LLMs), which use AI21's hybrid Mamba-Transformer architecture. They're instruction-tuned versions of AI21's hybrid structured state space model (SSM) Transformer Jamba model. The Jamba family models are built for reliable commercial use with respect to quality and performance.

Model Type Capabilities
AI21-Jamba-1.5-Mini chat-completion - Input: text (262,144 tokens)
- Output: text (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON, structured outputs
AI21-Jamba-1.5-Large chat-completion - Input: text (262,144 tokens)
- Output: text (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON, structured outputs

See this model collection in Azure AI Foundry portal.

Azure OpenAI

Azure OpenAI Service offers a diverse set of models with different capabilities and price points. These models include:

  • State-of-the-art models designed to tackle reasoning and problem-solving tasks with increased focus and capability
  • Models that can understand and generate natural language and code
  • Models that can transcribe and translate speech to text
Model Type Capabilities
o3-mini chat-completion - Input: text and image (200,000 tokens)
- Output: text (100,000 tokens)
- Tool calling: Yes
- Response formats: Text, JSON, structured outputs
o1 chat-completion - Input: text and image (200,000 tokens)
- Output: text (100,000 tokens)
- Tool calling: Yes
- Response formats: Text, JSON, structured outputs
o1-preview chat-completion - Input: text (128,000 tokens)
- Output: (32,768 tokens)
- Tool calling: Yes
- Response formats: Text, JSON, structured outputs
o1-mini chat-completion - Input: text (128,000 tokens)
- Output: (65,536 tokens)
- Tool calling: No
- Response formats: Text
gpt-4o-realtime-preview real-time - Input: control, text, and audio (131,072 tokens)
- Output: text and audio (16,384 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
gpt-4o chat-completion - Input: text and image (131,072 tokens)
- Output: text (16,384 tokens)
- Tool calling: Yes
- Response formats: Text, JSON, structured outputs
gpt-4o-mini chat-completion - Input: text, image, and audio (131,072 tokens)
- Output: (16,384 tokens)
- Tool calling: Yes
- Response formats: Text, JSON, structured outputs
text-embedding-3-large embeddings - Input: text (8,191 tokens)
- Output: Vector (3,072 dim.)
text-embedding-3-small embeddings - Input: text (8,191 tokens)
- Output: Vector (1,536 dim.)

See this model collection in Azure AI Foundry portal.
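The two embeddings models in the preceding table differ mainly in their output vector size. As a quick sketch, the advertised dimensions can be used to sanity-check an embeddings response; the model names and sizes come from the table, while the helper function is illustrative:

```python
# Advertised output dimensions from the table above
EMBEDDING_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}

def matches_advertised_dims(model: str, vector: list) -> bool:
    """Return True if an embedding vector has the model's advertised length."""
    return len(vector) == EMBEDDING_DIMS[model]

# A stand-in vector of the right size for text-embedding-3-small
ok = matches_advertised_dims("text-embedding-3-small", [0.0] * 1536)
print(ok)
```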

Cohere

The Cohere family of models includes various models optimized for different use cases, including rerank, chat completions, and embeddings models.

Cohere command and embed

The following table lists the Cohere models that you can inference via the Azure AI model inference.

Model Type Capabilities
Cohere-command-r-plus-08-2024 chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Cohere-command-r-08-2024 chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Cohere-command-r-plus chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Cohere-command-r chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Cohere-embed-v3-english embeddings, image-embeddings - Input: text (512 tokens)
- Output: Vector (1,024 dim.)
Cohere-embed-v3-multilingual embeddings, image-embeddings - Input: text (512 tokens)
- Output: Vector (1,024 dim.)

Inference examples: Cohere command and embed

For more examples of how to use Cohere models, see the following examples:

Description Language Sample
Web requests Bash Command-R, Command-R+, cohere-embed.ipynb
Azure AI Inference package for C# C# Link
Azure AI Inference package for JavaScript JavaScript Link
Azure AI Inference package for Python Python Link
OpenAI SDK (experimental) Python Link
LangChain Python Link
Cohere SDK Python Command, Embed
LiteLLM SDK Python Link

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

Description Packages Sample
Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain langchain, langchain_cohere cohere_faiss_langchain_embed.ipynb
Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain langchain, langchain_cohere command_faiss_langchain.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain langchain, langchain_cohere cohere-aisearch-langchain-rag.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK cohere, azure_search_documents cohere-aisearch-rag.ipynb
Command R+ tool/function calling, using LangChain cohere, langchain, langchain_cohere command_tools-langchain.ipynb

Cohere rerank

The following table lists the Cohere rerank models. To perform inferencing with these rerank models, you're required to use Cohere's custom rerank APIs that are listed in the table.

Model Type Inference API
Cohere-rerank-v3.5 rerank, text classification Cohere's v2/rerank API
Cohere-rerank-v3-english rerank, text classification Cohere's v2/rerank API; Cohere's v1/rerank API
Cohere-rerank-v3-multilingual rerank, text classification Cohere's v2/rerank API; Cohere's v1/rerank API

Pricing for Cohere rerank models

Queries (not to be confused with a user's query) is a pricing meter that refers to the cost associated with the tokens used as input for inference of a Cohere rerank model. Cohere counts a single search unit as a query with up to 100 documents to be ranked. Documents that are longer than 500 tokens (for Cohere-rerank-v3.5) or longer than 4,096 tokens (for Cohere-rerank-v3-english and Cohere-rerank-v3-multilingual), including the length of the search query, are split into multiple chunks, and each chunk counts as a single document.
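As a rough sketch of the billing rule above (the function names are illustrative and not part of Cohere's API):

```python
import math

def document_chunks(doc_tokens: int, query_tokens: int, limit: int) -> int:
    """A document whose length (including the search query) exceeds the
    token limit is split into chunks; each chunk bills as one document."""
    return max(1, math.ceil((doc_tokens + query_tokens) / limit))

def search_units(billable_documents: int) -> int:
    """One search unit covers a query with up to 100 documents."""
    return math.ceil(billable_documents / 100)

# Cohere-rerank-v3.5 (500-token limit): a 950-token document with a
# 50-token query splits into two billable documents.
chunks = document_chunks(950, 50, limit=500)
print(chunks, search_units(120))
```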

See the Cohere model collection in Azure AI Foundry portal.

Core42

Core42 includes autoregressive bilingual LLMs for Arabic and English with state-of-the-art capabilities in Arabic.

Model Type Capabilities
jais-30b-chat chat-completion - Input: text (8,192 tokens)
- Output: (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON

See this model collection in Azure AI Foundry portal.

Inference examples: Core42

For more examples of how to use Jais models, see the following examples:

Description Language Sample
Azure AI Inference package for C# C# Link
Azure AI Inference package for JavaScript JavaScript Link
Azure AI Inference package for Python Python Link

DeepSeek

The DeepSeek family of models includes DeepSeek-R1, which excels at reasoning tasks by using a step-by-step training process and handles tasks such as language, scientific reasoning, and coding, and DeepSeek-V3, a Mixture-of-Experts (MoE) language model.

Model Type Capabilities
DeepSeek-V3 chat-completion - Input: text (131,072 tokens)
- Output: (131,072 tokens)
- Tool calling: No
- Response formats: Text, JSON
DeepSeek-R1 chat-completion with reasoning content - Input: text (16,384 tokens)
- Output: (163,840 tokens)
- Tool calling: No
- Response formats: Text

See this model collection in Azure AI Foundry portal.
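DeepSeek-R1 returns reasoning content alongside the final answer. In many deployments, the reasoning appears between `<think>` tags in the message content; the following sketch separates the two, assuming that convention (which can vary by deployment):

```python
import re

def split_reasoning(content: str) -> tuple:
    """Split a DeepSeek-R1 style response into (reasoning, answer).
    Assumes the reasoning is wrapped in <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 + 2 is basic arithmetic.</think>The answer is 4."
)
print(answer)
```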

Inference examples: DeepSeek

For more examples of how to use DeepSeek models, see the following examples:

Description Language Sample
Azure AI Inference package for Python Python Link
Azure AI Inference package for JavaScript JavaScript Link
Azure AI Inference package for C# C# Link
Azure AI Inference package for Java Java Link

Meta

Meta Llama models and tools are a collection of pretrained and fine-tuned generative AI text and image reasoning models. The Meta models range in scale to include:

  • Small language models (SLMs) like 1B and 3B Base and Instruct models for on-device and edge inferencing
  • Mid-size large language models (LLMs) like 7B, 8B, and 70B Base and Instruct models
  • High-performant models like Meta Llama 3.1-405B Instruct for synthetic data generation and distillation use cases.
Model Type Capabilities
Llama-3.3-70B-Instruct chat-completion - Input: text (128,000 tokens)
- Output: text (8,192 tokens)
- Tool calling: No
- Response formats: Text
Llama-3.2-90B-Vision-Instruct chat-completion - Input: text and image (128,000 tokens)
- Output: (8,192 tokens)
- Tool calling: No
- Response formats: Text
Llama-3.2-11B-Vision-Instruct chat-completion - Input: text and image (128,000 tokens)
- Output: (8,192 tokens)
- Tool calling: No
- Response formats: Text
Meta-Llama-3.1-8B-Instruct chat-completion - Input: text (131,072 tokens)
- Output: (8,192 tokens)
- Tool calling: No
- Response formats: Text
Meta-Llama-3.1-70B-Instruct chat-completion - Input: text (131,072 tokens)
- Output: (8,192 tokens)
- Tool calling: No
- Response formats: Text
Meta-Llama-3.1-405B-Instruct chat-completion - Input: text (131,072 tokens)
- Output: (8,192 tokens)
- Tool calling: No
- Response formats: Text
Meta-Llama-3-8B-Instruct chat-completion - Input: text (8,192 tokens)
- Output: (8,192 tokens)
- Tool calling: No
- Response formats: Text
Meta-Llama-3-70B-Instruct chat-completion - Input: text (8,192 tokens)
- Output: (8,192 tokens)
- Tool calling: No
- Response formats: Text

See this model collection in Azure AI Foundry portal.

Inference examples: Meta Llama

For more examples of how to use Meta Llama models, see the following examples:

Description Language Sample
CURL request Bash Link
Azure AI Inference package for C# C# Link
Azure AI Inference package for JavaScript JavaScript Link
Azure AI Inference package for Python Python Link
Python web requests Python Link
OpenAI SDK (experimental) Python Link
LangChain Python Link
LiteLLM Python Link

Microsoft

Phi is a family of lightweight, state-of-the-art open models. These models were trained with Phi-3 datasets. The datasets include both synthetic data and filtered, publicly available website data, with a focus on high-quality and reasoning-dense properties. The models underwent a rigorous enhancement process, incorporating supervised fine-tuning, proximal policy optimization, and direct preference optimization, to ensure precise instruction adherence and robust safety measures.

Model Type Capabilities
Phi-4-multimodal-instruct chat-completion (with image and audio content) - Input: text, images, and audio (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-4-mini-instruct chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-4 chat-completion - Input: text (16,384 tokens)
- Output: (16,384 tokens)
- Tool calling: No
- Response formats: Text
Phi-3.5-mini-instruct chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3.5-MoE-instruct chat-completion - Input: text (131,072 tokens)
- Output: text (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3.5-vision-instruct chat-completion - Input: text and image (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3-mini-128k-instruct chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3-mini-4k-instruct chat-completion - Input: text (4,096 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3-small-128k-instruct chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3-small-8k-instruct chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3-medium-128k-instruct chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text
Phi-3-medium-4k-instruct chat-completion - Input: text (4,096 tokens)
- Output: (4,096 tokens)
- Tool calling: No
- Response formats: Text

See this model collection in Azure AI Foundry portal.

Inference examples: Microsoft Phi

For more examples of how to use Phi-3 family models, see the following examples:

Description Language Sample
Azure AI Inference package for C# C# Link
Azure AI Inference package for JavaScript JavaScript Link
Azure AI Inference package for Python Python Link
LangChain Python Link
Llama-Index Python Link

Mistral AI

Mistral AI offers two categories of models: premium models, including Mistral Large and Mistral Small, and open models, including Mistral Nemo.

Model Type Capabilities
Codestral-2501 chat-completion - Input: text (262,144 tokens)
- Output: text (4,096 tokens)
- Tool calling: No
- Response formats: Text
Ministral-3B chat-completion - Input: text (131,072 tokens)
- Output: text (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Mistral-Nemo chat-completion - Input: text (131,072 tokens)
- Output: text (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Mistral-Large-2411 chat-completion - Input: text (128,000 tokens)
- Output: text (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Mistral-large-2407
(legacy)
chat-completion - Input: text (131,072 tokens)
- Output: (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Mistral-large
(deprecated)
chat-completion - Input: text (32,768 tokens)
- Output: (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON
Mistral-small chat-completion - Input: text (32,768 tokens)
- Output: text (4,096 tokens)
- Tool calling: Yes
- Response formats: Text, JSON

See this model collection in Azure AI Foundry portal.
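Several of the Mistral models in the preceding table support tool calling. As a sketch, such models accept OpenAI-compatible function tool definitions like the following; the weather function is a made-up example, not part of any Mistral API:

```python
def weather_tool_definition() -> dict:
    """An OpenAI-compatible function tool definition (illustrative)."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }

# The definition would be passed in the `tools` field of a
# chat-completion request to a tool-calling-capable model.
tool = weather_tool_definition()
print(tool["function"]["name"])
```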

Inference examples: Mistral

For more examples of how to use Mistral models, see the following examples and tutorials:

Description Language Sample
CURL request Bash Link
Azure AI Inference package for C# C# Link
Azure AI Inference package for JavaScript JavaScript Link
Azure AI Inference package for Python Python Link
Python web requests Python Link
OpenAI SDK (experimental) Python Mistral - OpenAI SDK sample
LangChain Python Mistral - LangChain sample
Mistral AI Python Mistral - Mistral AI sample
LiteLLM Python Mistral - LiteLLM sample

Nixtla

Nixtla's TimeGEN-1 is a generative pre-trained forecasting and anomaly detection model for time series data. TimeGEN-1 can produce accurate forecasts for new time series without training, using only historical values and exogenous covariates as inputs.

To perform inferencing, TimeGEN-1 requires you to use Nixtla's custom inference API.

Model Type Capabilities Inference API
TimeGEN-1 Forecasting - Input: Time series data as JSON or dataframes (with support for multivariate input)
- Output: Time series data as JSON
- Tool calling: No
- Response formats: JSON
- Inference API: Forecast client to interact with Nixtla's API

Estimate the number of tokens needed

Before you create a TimeGEN-1 deployment, it's useful to estimate the number of tokens that you plan to consume and be billed for. One token corresponds to one data point in your input dataset or output dataset.

Suppose you have the following input time series dataset:

Unique_id Timestamp Target Variable Exogenous Variable 1 Exogenous Variable 2
BE 2016-10-22 00:00:00 70.00 49593.0 57253.0
BE 2016-10-22 01:00:00 37.10 46073.0 51887.0

To determine the number of tokens, multiply the number of rows (in this example, two) by the number of columns used for forecasting, not counting the unique_id and timestamp columns (in this example, three), to get a total of six tokens.

Given the following output dataset:

Unique_id Timestamp Forecasted Target Variable
BE 2016-10-22 02:00:00 46.57
BE 2016-10-22 03:00:00 48.57

You can also determine the number of tokens by counting the number of data points returned after data forecasting. In this example, the number of tokens is two.
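The two calculations above can be sketched as follows (the helper name is illustrative):

```python
def timegen_tokens(num_rows: int, num_value_columns: int) -> int:
    """One token per data point: rows times the columns used for
    forecasting, excluding the unique_id and timestamp columns."""
    return num_rows * num_value_columns

# Input dataset above: 2 rows x 3 columns (target + 2 exogenous) = 6 tokens
input_tokens = timegen_tokens(2, 3)
# Output dataset above: 2 forecasted data points = 2 tokens
output_tokens = timegen_tokens(2, 1)
print(input_tokens, output_tokens)
```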

Estimate pricing based on tokens

There are four pricing meters that determine the price you pay. These meters are as follows:

Pricing Meter Description
paygo-inference-input-tokens Costs associated with the tokens used as input for inference when finetune_steps = 0
paygo-inference-output-tokens Costs associated with the tokens used as output for inference when finetune_steps = 0
paygo-finetuned-model-inference-input-tokens Costs associated with the tokens used as input for inference when finetune_steps > 0
paygo-finetuned-model-inference-output-tokens Costs associated with the tokens used as output for inference when finetune_steps > 0
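Which pair of meters applies depends only on whether fine-tuning is used, as a small sketch makes explicit (the function name is illustrative):

```python
def inference_meters(finetune_steps: int) -> tuple:
    """Return the (input, output) pricing meters that apply to a
    TimeGEN-1 call, based on whether fine-tuning is used."""
    if finetune_steps > 0:
        return ("paygo-finetuned-model-inference-input-tokens",
                "paygo-finetuned-model-inference-output-tokens")
    return ("paygo-inference-input-tokens",
            "paygo-inference-output-tokens")

print(inference_meters(0)[0])
```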

See the Nixtla model collection in Azure AI Foundry portal.

NTT Data

Tsuzumi is an autoregressive, language-optimized transformer. The tuned versions use supervised fine-tuning (SFT). Tsuzumi handles both Japanese and English with high efficiency.

Model Type Capabilities
Tsuzumi-7b chat-completion - Input: text (8,192 tokens)
- Output: text (8,192 tokens)
- Tool calling: No
- Response formats: Text