Featured models of Azure AI Foundry
The Azure AI model catalog offers a large selection of models from a wide range of providers. You have various options for deploying models from the model catalog. This article lists featured models in the model catalog that can be deployed and hosted on Microsoft's servers via serverless APIs. For some of these models, you can also host them on your infrastructure for deployment via managed compute. See Available models for supported deployment options for a list of models in the catalog that are available for deployment via managed compute or serverless API.
Important
Models that are in preview are marked as preview on their model cards in the model catalog.
To perform inferencing with the models, some models such as Nixtla's TimeGEN-1 and Cohere rerank require you to use custom APIs from the model providers. Others that belong to the following model types support inferencing using the Azure AI model inference:
- Chat completion
- Chat completion (with reasoning content)
- Chat completion (with image and audio content)
- Embeddings
- Image embeddings
You can find more details about individual models by reviewing their model cards in the model catalog for Azure AI Foundry portal.
AI21 Labs
The Jamba family models are AI21's production-grade Mamba-based large language model (LLM) which uses AI21's hybrid Mamba-Transformer architecture. It's an instruction-tuned version of AI21's hybrid structured state space model (SSM) transformer Jamba model. The Jamba family models are built for reliable commercial use with respect to quality and performance.
Model | Type | Capabilities | |
---|---|---|---|
AI21-Jamba-1.5-Mini | chat-completion | - Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs |
|
AI21-Jamba-1.5-Large | chat-completion | - Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs |
See this model collection in Azure AI Foundry portal.
Azure OpenAI
Azure OpenAI Service offers a diverse set of models with different capabilities and price points. These models include:
- State-of-the-art models designed to tackle reasoning and problem-solving tasks with increased focus and capability
- Models that can understand and generate natural language and code
- Models that can transcribe and translate speech to text
Model | Type | Capabilities |
---|---|---|
o3-mini | chat-completion | - Input: text and image (200,000 tokens) - Output: text (100,000 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs |
o1 | chat-completion | - Input: text and image (200,000 tokens) - Output: text (100,000 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs |
o1-preview | chat-completion | - Input: text (128,000 tokens) - Output: (32,768 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs |
o1-mini | chat-completion | - Input: text (128,000 tokens) - Output: (65,536 tokens) - Tool calling: No - Response formats: Text |
gpt-4o-realtime-preview | real-time | - Input: control, text, and audio (131,072 tokens) - Output: text and audio (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON |
gpt-4o | chat-completion | - Input: text and image (131,072 tokens) - Output: text (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs |
gpt-4o-mini | chat-completion | - Input: text, image, and audio (131,072 tokens) - Output: (16,384 tokens) - Tool calling: Yes - Response formats: Text, JSON, structured outputs |
text-embedding-3-large | embeddings | - Input: text (8,191 tokens) - Output: Vector (3,072 dim.) |
text-embedding-3-small | embeddings | - Input: text (8,191 tokens) - Output: Vector (1,536 dim.) |
See this model collection in Azure AI Foundry portal.
Cohere
The Cohere family of models includes various models optimized for different use cases, including rerank, chat completions, and embeddings models.
Cohere command and embed
The following table lists the Cohere models that you can inference via the Azure AI model Inference.
Model | Type | Capabilities |
---|---|---|
Cohere-command-r-plus-08-2024 | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Cohere-command-r-08-2024 | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Cohere-command-r-plus | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Cohere-command-r | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Cohere-embed-v3-english | embeddings image-embeddings |
- Input: text (512 tokens) - Output: Vector (1,024 dim.) |
Cohere-embed-v3-multilingual | embeddings image-embeddings |
- Input: text (512 tokens) - Output: Vector (1,024 dim.) |
Inference examples: Cohere command and embed
For more examples of how to use Cohere models, see the following examples:
Description | Language | Sample |
---|---|---|
Web requests | Bash | Command-R Command-R+ cohere-embed.ipynb |
Azure AI Inference package for C# | C# | Link |
Azure AI Inference package for JavaScript | JavaScript | Link |
Azure AI Inference package for Python | Python | Link |
OpenAI SDK (experimental) | Python | Link |
LangChain | Python | Link |
Cohere SDK | Python | Command Embed |
LiteLLM SDK | Python | Link |
Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed
Description | Packages | Sample |
---|---|---|
Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain | langchain , langchain_cohere |
cohere_faiss_langchain_embed.ipynb |
Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain | langchain , langchain_cohere |
command_faiss_langchain.ipynb |
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain | langchain , langchain_cohere |
cohere-aisearch-langchain-rag.ipynb |
Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK | cohere , azure_search_documents |
cohere-aisearch-rag.ipynb |
Command R+ tool/function calling, using LangChain | cohere , langchain , langchain_cohere |
command_tools-langchain.ipynb |
Cohere rerank
The following table lists the Cohere rerank models. To perform inferencing with these rerank models, you're required to use Cohere's custom rerank APIs that are listed in the table.
Model | Type | Inference API |
---|---|---|
Cohere-rerank-v3.5 | rerank text classification |
Cohere's v2/rerank API |
Cohere-rerank-v3-english | rerank text classification |
Cohere's v2/rerank API Cohere's v1/rerank API |
Cohere-rerank-v3-multilingual | rerank text classification |
Cohere's v2/rerank API Cohere's v1/rerank API |
Pricing for Cohere rerank models
Queries, not to be confused with a user's query, is a pricing meter that refers to the cost associated with the tokens used as input for inference of a Cohere Rerank model. Cohere counts a single search unit as a query with up to 100 documents to be ranked. Documents longer than 500 tokens (for Cohere-rerank-v3.5) or longer than 4096 tokens (for Cohere-rerank-v3-English and Cohere-rerank-v3-multilingual) when including the length of the search query are split up into multiple chunks, where each chunk counts as a single document.
See the Cohere model collection in Azure AI Foundry portal.
Core42
Core42 includes autoregressive bi-lingual LLMs for Arabic & English with state-of-the-art capabilities in Arabic.
Model | Type | Capabilities |
---|---|---|
jais-30b-chat | chat-completion | - Input: text (8,192 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
See this model collection in Azure AI Foundry portal.
Inference examples: Core42
For more examples of how to use Jais models, see the following examples:
Description | Language | Sample |
---|---|---|
Azure AI Inference package for C# | C# | Link |
Azure AI Inference package for JavaScript | JavaScript | Link |
Azure AI Inference package for Python | Python | Link |
DeepSeek
DeepSeek family of models includes DeepSeek-R1, which excels at reasoning tasks using a step-by-step training process, such as language, scientific reasoning, and coding tasks, and DeepSeek-V3, a Mixture-of-Experts (MoE) language model.
Model | Type | Capabilities |
---|---|---|
DeepSeek-V3 | chat-completion | - Input: text (131,072 tokens) - Output: (131,072 tokens) - Tool calling: No - Response formats: Text, JSON |
DeepSeek-R1 | chat-completion with reasoning content | - Input: text (16,384 tokens) - Output: (163,840 tokens) - Tool calling: No - Response formats: Text. |
See this model collection in Azure AI Foundry portal.
Inference examples: DeepSeek
For more examples of how to use DeepSeek models, see the following examples:
Description | Language | Sample |
---|---|---|
Azure AI Inference package for Python | Python | Link |
Azure AI Inference package for JavaScript | JavaScript | Link |
Azure AI Inference package for C# | C# | Link |
Azure AI Inference package for Java | Java | Link |
Meta
Meta Llama models and tools are a collection of pretrained and fine-tuned generative AI text and image reasoning models. Meta models range is scale to include:
- Small language models (SLMs) like 1B and 3B Base and Instruct models for on-device and edge inferencing
- Mid-size large language models (LLMs) like 7B, 8B, and 70B Base and Instruct models
- High-performant models like Meta Llama 3.1-405B Instruct for synthetic data generation and distillation use cases.
Model | Type | Capabilities |
---|---|---|
Llama-3.3-70B-Instruct | chat-completion | - Input: text (128,000 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text |
Llama-3.2-90B-Vision-Instruct | chat-completion | - Input: text and image (128,000 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text |
Llama-3.2-11B-Vision-Instruct | chat-completion | - Input: text and image (128,000 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text |
Meta-Llama-3.1-8B-Instruct | chat-completion | - Input: text (131,072 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text |
Meta-Llama-3.1-70B-Instruct | chat-completion | - Input: text (131,072 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text |
Meta-Llama-3.1-405B-Instruct | chat-completion | - Input: text (131,072 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text |
Meta-Llama-3-8B-Instruct | chat-completion | - Input: text (8,192 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text |
Meta-Llama-3-70B-Instruct | chat-completion | - Input: text (8,192 tokens) - Output: (8,192 tokens) - Tool calling: No - Response formats: Text |
See this model collection in Azure AI Foundry portal.
Inference examples: Meta Llama
For more examples of how to use Meta Llama models, see the following examples:
Description | Language | Sample |
---|---|---|
CURL request | Bash | Link |
Azure AI Inference package for C# | C# | Link |
Azure AI Inference package for JavaScript | JavaScript | Link |
Azure AI Inference package for Python | Python | Link |
Python web requests | Python | Link |
OpenAI SDK (experimental) | Python | Link |
LangChain | Python | Link |
LiteLLM | Python | Link |
Microsoft
Phi is a family of lightweight, state-of-the-art open models. These models were trained with Phi-3 datasets. The datasets include both synthetic data and the filtered, publicly available websites data, with a focus on high quality and reasoning-dense properties. The models underwent a rigorous enhancement process, incorporating both supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures.
Model | Type | Capabilities |
---|---|---|
Phi-4-multimodal-instruct | chat-completion (with image and audio content) | - Input: text, images, and audio (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-4-mini-instruct | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-4 | chat-completion | - Input: text (16,384 tokens) - Output: (16,384 tokens) - Tool calling: No - Response formats: Text |
Phi-3.5-mini-instruct | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3.5-MoE-instruct | chat-completion | - Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3.5-vision-instruct | chat-completion | - Input: text and image (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3-mini-128k-instruct | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3-mini-4k-instruct | chat-completion | - Input: text (4,096 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3-small-128k-instruct | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3-small-8k-instruct | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3-medium-128k-instruct | chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
Phi-3-medium-4k-instruct | chat-completion | - Input: text (4,096 tokens) - Output: (4,096 tokens) - Tool calling: No - Response formats: Text |
See this model collection in Azure AI Foundry portal.
Inference examples: Microsoft Phi
For more examples of how to use Phi-3 family models, see the following examples:
Description | Language | Sample |
---|---|---|
Azure AI Inference package for C# | C# | Link |
Azure AI Inference package for JavaScript | JavaScript | Link |
Azure AI Inference package for Python | Python | Link |
LangChain | Python | Link |
Llama-Index | Python | Link |
Mistral AI
Mistral AI offers two categories of models: premium models including Mistral Large and Mistral Small and open models including Mistral Nemo.
Model | Type | Capabilities |
---|---|---|
Codestral-2501 | chat-completion | - Input: text (262,144 tokens) - Output: text (4,096 tokens) - Tool calling: No - Response formats: Text |
Ministral-3B | chat-completion | - Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Mistral-Nemo | chat-completion | - Input: text (131,072 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Mistral-Large-2411 | chat-completion | - Input: text (128,000 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Mistral-large-2407 (legacy) |
chat-completion | - Input: text (131,072 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Mistral-large (deprecated) |
chat-completion | - Input: text (32,768 tokens) - Output: (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
Mistral-small | chat-completion | - Input: text (32,768 tokens) - Output: text (4,096 tokens) - Tool calling: Yes - Response formats: Text, JSON |
See this model collection in Azure AI Foundry portal.
Inference examples: Mistral
For more examples of how to use Mistral models, see the following examples and tutorials:
Description | Language | Sample |
---|---|---|
CURL request | Bash | Link |
Azure AI Inference package for C# | C# | Link |
Azure AI Inference package for JavaScript | JavaScript | Link |
Azure AI Inference package for Python | Python | Link |
Python web requests | Python | Link |
OpenAI SDK (experimental) | Python | Mistral - OpenAI SDK sample |
LangChain | Python | Mistral - LangChain sample |
Mistral AI | Python | Mistral - Mistral AI sample |
LiteLLM | Python | Mistral - LiteLLM sample |
Nixtla
Nixtla's TimeGEN-1 is a generative pre-trained forecasting and anomaly detection model for time series data. TimeGEN-1 can produce accurate forecasts for new time series without training, using only historical values and exogenous covariates as inputs.
To perform inferencing, TimeGEN-1 requires you to use Nixtla's custom inference API.
Model | Type | Capabilities | Inference API |
---|---|---|---|
TimeGEN-1 | Forecasting | - Input: Time series data as JSON or dataframes (with support for multivariate input) - Output: Time series data as JSON - Tool calling: No - Response formats: JSON |
Forecast client to interact with Nixtla's API |
Estimate the number of tokens needed
Before you create a TimeGEN-1 deployment, it's useful to estimate the number of tokens that you plan to consume and be billed for. One token corresponds to one data point in your input dataset or output dataset.
Suppose you have the following input time series dataset:
Unique_id | Timestamp | Target Variable | Exogenous Variable 1 | Exogenous Variable 2 |
---|---|---|---|---|
BE | 2016-10-22 00:00:00 | 70.00 | 49593.0 | 57253.0 |
BE | 2016-10-22 01:00:00 | 37.10 | 46073.0 | 51887.0 |
To determine the number of tokens, multiply the number of rows (in this example, two) and the number of columns used for forecasting—not counting the unique_id and timestamp columns (in this example, three) to get a total of six tokens.
Given the following output dataset:
Unique_id | Timestamp | Forecasted Target Variable |
---|---|---|
BE | 2016-10-22 02:00:00 | 46.57 |
BE | 2016-10-22 03:00:00 | 48.57 |
You can also determine the number of tokens by counting the number of data points returned after data forecasting. In this example, the number of tokens is two.
Estimate pricing based on tokens
There are four pricing meters that determine the price you pay. These meters are as follows:
Pricing Meter | Description |
---|---|
paygo-inference-input-tokens | Costs associated with the tokens used as input for inference when finetune_steps = 0 |
paygo-inference-output-tokens | Costs associated with the tokens used as output for inference when finetune_steps = 0 |
paygo-finetuned-model-inference-input-tokens | Costs associated with the tokens used as input for inference when finetune_steps > 0 |
paygo-finetuned-model-inference-output-tokens | Costs associated with the tokens used as output for inference when finetune_steps > 0 |
See the Nixtla model collection in Azure AI Foundry portal.
NTT Data
Tsuzumi is an autoregressive language optimized transformer. The tuned versions use supervised fine-tuning (SFT). Tsuzumi is handles both Japanese and English language with high efficiency.
Model | Type | Capabilities |
---|---|---|
Tsuzumi-7b | chat-completion | - Input: text (8,192 tokens) - Output: text (8,192 tokens) - Tool calling: No - Response formats: Text |