Featured models of Azure AI Foundry

Článek
03/10/2025

The Azure AI model catalog offers a large selection of models from a wide range of providers. You have various options for deploying models from the model catalog. This article lists featured models in the model catalog that can be deployed and hosted on Microsoft's servers via serverless APIs. For some of these models, you can also host them on your infrastructure for deployment via managed compute. See Available models for supported deployment options for a list of models in the catalog that are available for deployment via managed compute or serverless API.

Important

Models that are in preview are marked as preview on their model cards in the model catalog.

To perform inferencing with the models, some models such as Nixtla's TimeGEN-1 and Cohere rerank require you to use custom APIs from the model providers. Others that belong to the following model types support inferencing using the Azure AI model inference:

You can find more details about individual models by reviewing their model cards in the model catalog for Azure AI Foundry portal.

AI21 Labs

The Jamba family models are AI21's production-grade Mamba-based large language model (LLM) which uses AI21's hybrid Mamba-Transformer architecture. It's an instruction-tuned version of AI21's hybrid structured state space model (SSM) transformer Jamba model. The Jamba family models are built for reliable commercial use with respect to quality and performance.

Model	Type	Capabilities
AI21-Jamba-1.5-Mini	chat-completion	- Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
AI21-Jamba-1.5-Large	chat-completion	- Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs

See this model collection in Azure AI Foundry portal.

Azure OpenAI

Azure OpenAI Service offers a diverse set of models with different capabilities and price points. These models include:

State-of-the-art models designed to tackle reasoning and problem-solving tasks with increased focus and capability
Models that can understand and generate natural language and code
Models that can transcribe and translate speech to text

Model	Type	Capabilities
o3-mini	chat-completion	- Input: text and image (200,000 tokens) - Output: text (100,000 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
o1	chat-completion	- Input: text and image (200,000 tokens) - Output: text (100,000 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
o1-preview	chat-completion	- Input: text (128,000 tokens) - Output: (32,768 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
o1-mini	chat-completion	- Input: text (128,000 tokens) - Output: (65,536 tokens) - Tool calling: No - Response formats: Text
gpt-4o-realtime-preview	real-time	- Input: control, text, and audio (131,072 tokens) - Output: text and audio (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON
gpt-4o	chat-completion	- Input: text and image (131,072 tokens) - Output: text (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
gpt-4o-mini	chat-completion	- Input: text, image, and audio (131,072 tokens) - Output: (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs
text-embedding-3-large	embeddings	- Input: text (8,191 tokens) - Output: Vector (3,072 dim.)
text-embedding-3-small	embeddings	- Input: text (8,191 tokens) - Output: Vector (1,536 dim.)

See this model collection in Azure AI Foundry portal.

Cohere

The Cohere family of models includes various models optimized for different use cases, including rerank, chat completions, and embeddings models.

Cohere command and embed

The following table lists the Cohere models that you can inference via the Azure AI model Inference.

Model	Type	Capabilities
Cohere-command-r-plus-08-2024	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-command-r-08-2024	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-command-r-plus	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-command-r	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Cohere-embed-v3-english	embeddings image-embeddings	- Input: text (512 tokens) - Output: Vector (1,024 dim.)
Cohere-embed-v3-multilingual	embeddings image-embeddings	- Input: text (512 tokens) - Output: Vector (1,024 dim.)

Inference examples: Cohere command and embed

For more examples of how to use Cohere models, see the following examples:

Description	Language	Sample
Web requests	Bash	Command-R Command-R+ cohere-embed.ipynb
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
OpenAI SDK (experimental)	Python	Link
LangChain	Python	Link
Cohere SDK	Python	Command Embed
LiteLLM SDK	Python	Link

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

Description	Packages	Sample
Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain	`langchain`, `langchain_cohere`	cohere_faiss_langchain_embed.ipynb
Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain	`langchain`, `langchain_cohere`	command_faiss_langchain.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain	`langchain`, `langchain_cohere`	cohere-aisearch-langchain-rag.ipynb
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK	`cohere`, `azure_search_documents`	cohere-aisearch-rag.ipynb
Command R+ tool/function calling, using LangChain	`cohere`, `langchain`, `langchain_cohere`	command_tools-langchain.ipynb

Cohere rerank

The following table lists the Cohere rerank models. To perform inferencing with these rerank models, you're required to use Cohere's custom rerank APIs that are listed in the table.

Model	Type	Inference API
Cohere-rerank-v3.5	rerank text classification	Cohere's v2/rerank API
Cohere-rerank-v3-english	rerank text classification	Cohere's v2/rerank API Cohere's v1/rerank API
Cohere-rerank-v3-multilingual	rerank text classification	Cohere's v2/rerank API Cohere's v1/rerank API

Pricing for Cohere rerank models

Queries, not to be confused with a user's query, is a pricing meter that refers to the cost associated with the tokens used as input for inference of a Cohere Rerank model. Cohere counts a single search unit as a query with up to 100 documents to be ranked. Documents longer than 500 tokens (for Cohere-rerank-v3.5) or longer than 4096 tokens (for Cohere-rerank-v3-English and Cohere-rerank-v3-multilingual) when including the length of the search query are split up into multiple chunks, where each chunk counts as a single document.

See the Cohere model collection in Azure AI Foundry portal.

Core42

Core42 includes autoregressive bi-lingual LLMs for Arabic & English with state-of-the-art capabilities in Arabic.

Model	Type	Capabilities
jais-30b-chat	chat-completion	- Input: text (8,192 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON

See this model collection in Azure AI Foundry portal.

Inference examples: Core42

For more examples of how to use Jais models, see the following examples:

Description	Language	Sample
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link

DeepSeek

DeepSeek family of models includes DeepSeek-R1, which excels at reasoning tasks using a step-by-step training process, such as language, scientific reasoning, and coding tasks, and DeepSeek-V3, a Mixture-of-Experts (MoE) language model.

Model	Type	Capabilities
DeepSeek-V3	chat-completion	- Input: text (131,072 tokens) - Output: (131,072 tokens) - Tool calling: No - Response formats: Text, JSON
DeepSeek-R1	chat-completion with reasoning content	- Input: text (16,384 tokens) - Output: (163,840 tokens) - Tool calling: No - Response formats: Text.

See this model collection in Azure AI Foundry portal.

Inference examples: DeepSeek

For more examples of how to use DeepSeek models, see the following examples:

Description	Language	Sample
Azure AI Inference package for Python	Python	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for Java	Java	Link

Model	Type	Capabilities
Llama-3.3-70B-Instruct	chat-completion	- Input: text (128,000 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text
Llama-3.2-90B-Vision-Instruct	chat-completion	- Input: text and image (128,000 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text
Llama-3.2-11B-Vision-Instruct	chat-completion	- Input: text and image (128,000 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3.1-8B-Instruct	chat-completion	- Input: text (131,072 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3.1-70B-Instruct	chat-completion	- Input: text (131,072 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3.1-405B-Instruct	chat-completion	- Input: text (131,072 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3-8B-Instruct	chat-completion	- Input: text (8,192 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text
Meta-Llama-3-70B-Instruct	chat-completion	- Input: text (8,192 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text

Microsoft

Phi is a family of lightweight, state-of-the-art open models. These models were trained with Phi-3 datasets. The datasets include both synthetic data and the filtered, publicly available websites data, with a focus on high quality and reasoning-dense properties. The models underwent a rigorous enhancement process, incorporating both supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Model	Type	Capabilities
Phi-4-multimodal-instruct	chat-completion (with image and audio content)	- Input: text, images, and audio (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-4-mini-instruct	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-4	chat-completion	- Input: text (16,384 tokens) - Output: (16,384 tokens) - Tool calling: No - Response formats: Text
Phi-3.5-mini-instruct	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3.5-MoE-instruct	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3.5-vision-instruct	chat-completion	- Input: text and image (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-mini-128k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-mini-4k-instruct	chat-completion	- Input: text (4,096 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-small-128k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-small-8k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-medium-128k-instruct	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text
Phi-3-medium-4k-instruct	chat-completion	- Input: text (4,096 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text

See this model collection in Azure AI Foundry portal.

Inference examples: Microsoft Phi

For more examples of how to use Phi-3 family models, see the following examples:

Description	Language	Sample
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
LangChain	Python	Link
Llama-Index	Python	Link

Mistral AI

Mistral AI offers two categories of models: premium models including Mistral Large and Mistral Small and open models including Mistral Nemo.

Model	Type	Capabilities
Codestral-2501	chat-completion	- Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text
Ministral-3B	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-Nemo	chat-completion	- Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-Large-2411	chat-completion	- Input: text (128,000 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-large-2407 (legacy)	chat-completion	- Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-large (deprecated)	chat-completion	- Input: text (32,768 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON
Mistral-small	chat-completion	- Input: text (32,768 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON

See this model collection in Azure AI Foundry portal.

Inference examples: Mistral

For more examples of how to use Mistral models, see the following examples and tutorials:

Description	Language	Sample
CURL request	Bash	Link
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
Python web requests	Python	Link
OpenAI SDK (experimental)	Python	Mistral - OpenAI SDK sample
LangChain	Python	Mistral - LangChain sample
Mistral AI	Python	Mistral - Mistral AI sample
LiteLLM	Python	Mistral - LiteLLM sample

Nixtla

Nixtla's TimeGEN-1 is a generative pre-trained forecasting and anomaly detection model for time series data. TimeGEN-1 can produce accurate forecasts for new time series without training, using only historical values and exogenous covariates as inputs.

To perform inferencing, TimeGEN-1 requires you to use Nixtla's custom inference API.

Model	Type	Capabilities	Inference API
TimeGEN-1	Forecasting	- Input: Time series data as JSON or dataframes (with support for multivariate input) - Output: Time series data as JSON - Tool calling: No - Response formats: JSON	Forecast client to interact with Nixtla's API

Estimate the number of tokens needed

Before you create a TimeGEN-1 deployment, it's useful to estimate the number of tokens that you plan to consume and be billed for. One token corresponds to one data point in your input dataset or output dataset.

Suppose you have the following input time series dataset:

Unique_id	Timestamp	Target Variable	Exogenous Variable 1	Exogenous Variable 2
BE	2016-10-22 00:00:00	70.00	49593.0	57253.0
BE	2016-10-22 01:00:00	37.10	46073.0	51887.0

To determine the number of tokens, multiply the number of rows (in this example, two) and the number of columns used for forecasting—not counting the unique_id and timestamp columns (in this example, three) to get a total of six tokens.

Given the following output dataset:

Unique_id	Timestamp	Forecasted Target Variable
BE	2016-10-22 02:00:00	46.57
BE	2016-10-22 03:00:00	48.57

You can also determine the number of tokens by counting the number of data points returned after data forecasting. In this example, the number of tokens is two.

Estimate pricing based on tokens

There are four pricing meters that determine the price you pay. These meters are as follows:

Pricing Meter	Description
paygo-inference-input-tokens	Costs associated with the tokens used as input for inference when finetune_steps = 0
paygo-inference-output-tokens	Costs associated with the tokens used as output for inference when finetune_steps = 0
paygo-finetuned-model-inference-input-tokens	Costs associated with the tokens used as input for inference when finetune_steps > 0
paygo-finetuned-model-inference-output-tokens	Costs associated with the tokens used as output for inference when finetune_steps > 0

See the Nixtla model collection in Azure AI Foundry portal.

NTT Data

Tsuzumi is an autoregressive language optimized transformer. The tuned versions use supervised fine-tuning (SFT). Tsuzumi is handles both Japanese and English language with high efficiency.

Model	Type	Capabilities
Tsuzumi-7b	chat-completion	- Input: text (8,192 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text

Sdílet prostřednictvím

Featured models of Azure AI Foundry

AI21 Labs

Azure OpenAI

Cohere

Cohere command and embed

Inference examples: Cohere command and embed

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

Cohere rerank

Pricing for Cohere rerank models

Core42

Inference examples: Core42

DeepSeek

Inference examples: DeepSeek

Meta

Inference examples: Meta Llama

Microsoft

Inference examples: Microsoft Phi

Mistral AI

Inference examples: Mistral

Nixtla

Estimate the number of tokens needed

Estimate pricing based on tokens

NTT Data

Váš názor

Další materiály

Sdílet prostřednictvím

Featured models of Azure AI Foundry

AI21 Labs

Azure OpenAI

Cohere

Cohere command and embed

Inference examples: Cohere command and embed

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

Cohere rerank

Pricing for Cohere rerank models

Core42

Inference examples: Core42

DeepSeek

Inference examples: DeepSeek

Meta

Inference examples: Meta Llama

Microsoft

Inference examples: Microsoft Phi

Mistral AI

Inference examples: Mistral

Nixtla

Estimate the number of tokens needed

Estimate pricing based on tokens

NTT Data

Related content

Váš názor

Další materiály