Quickstart: Create a cluster for Batch AI training jobs using the Azure CLI
This quickstart shows how to use the Azure CLI to create a Batch AI cluster you can use for training AI and machine learning models. Batch AI is a managed service for data scientists and AI researchers to train AI and machine learning models at scale on clusters of Azure virtual machines.
The cluster initially has a single GPU node. After completing this quickstart, you'll have a cluster you can scale up and use to train your models. Submit training jobs to the cluster using Batch AI, Azure Machine Learning tools, or the Visual Studio Tools for AI.
Open Azure Cloud Shell
Azure Cloud Shell is a free, interactive shell that you can use to run the steps in this article. Common Azure tools are preinstalled and configured in Cloud Shell for you to use with your account. Just select the Copy button to copy the code, paste it in Cloud Shell, and then press Enter to run it. There are a few ways to open Cloud Shell:
Select Try It in the upper-right corner of a code block. | ![]() |
Open Cloud Shell in your browser. | ![]() |
Select the Cloud Shell button on the menu in the upper-right corner of the Azure portal. | ![]() |
If you choose to install and use the CLI locally, this quickstart requires that you are running the Azure CLI version 2.0.38 or later. Run az --version
to find the version. If you need to install or upgrade, see Install Azure CLI.
This quickstart assumes you're running commands in a Bash shell, either in Cloud Shell or on your local computer.
Create a resource group
Create a resource group with the az group create
command. An Azure resource group is a logical container into which Azure resources are deployed and managed.
The following example creates a resource group named myResourceGroup in the eastus2 location. Be sure to choose a location such as East US 2 in which the Batch AI service is available.
az group create \
--name myResourceGroup \
--location eastus2
Create a Batch AI cluster
First, use the az batchai workspace create
command to create a Batch AI workspace. You need a workspace to organize your Batch AI clusters and other resources.
az batchai workspace create \
--workspace myworkspace \
--resource-group myResourceGroup
To create a Batch AI cluster, use the az batchai cluster create
command. The following example creates a cluster with the following properties:
- Contains a single node in the NC6 VM size, which has one NVIDIA Tesla K80 GPU.
- Runs a default Ubuntu Server image designed to host container-based applications, which you can use for most training workloads.
- Adds a user account named myusername, and generates SSH keys if they don't already exist in the default key location (~/.ssh) in your local environment.
az batchai cluster create \
--name mycluster \
--workspace myworkspace \
--resource-group myResourceGroup \
--vm-size Standard_NC6 \
--target 1 \
--user-name myusername \
--generate-ssh-keys
The command output shows the cluster properties. It takes a few minutes to create and start the node. To see the status of the cluster, run the az batchai cluster show
command.
az batchai cluster show \
--name mycluster \
--workspace myworkspace \
--resource-group myResourceGroup \
--output table
Early in cluster creation, output is similar to the following, showing the cluster is in the resizing
state:
Name Resource Group Workspace VM Size State Idle Running Preparing Leaving Unusable
--------- ---------------- ----------- ------------ ------- ------ --------- ----------- --------- ----------
mycluster myResourceGroup myworkspace STANDARD_NC6 resizing 0 0 0 0 0
The cluster is ready to use when the state is steady
and the single node is Idle
.
List cluster nodes
If you need to connect to the cluster nodes (in this case, a single node) to install applications or perform maintenance, get connection information by running the az batchai cluster node list
command:
az batchai cluster node list \
--cluster mycluster \
--workspace myworkspace \
--resource-group myResourceGroup
JSON output is similar to:
[
{
"ipAddress": "40.68.254.143",
"nodeId": "tvm-1816144089_1-20180626t233430z",
"port": 50000.0
}
]
Use this information to make an SSH connection to the node. For example, substitute the correct IP address of your node in the following command:
ssh myusername@40.68.254.143 -p 50000
Exit the SSH session to continue.
Resize the cluster
When you use your cluster to run a training job, you might need more compute resources. For example, to increase the size to 2 nodes for a distributed training job, run the batch ai cluster resize command:
az batchai cluster resize \
--name mycluster \
--workspace myworkspace \
--resource-group myResourceGroup \
--target 2
It takes a few minutes for the cluster to resize.
Clean up resources
If you want to continue with Batch AI tutorials and samples, use the Batch AI workspace created in this quickstart.
You're charged for the Batch AI cluster while the nodes are running. If you want to keep the cluster configuration when you have no jobs to run, resize the cluster to 0 nodes.
az batchai cluster resize \
--name mycluster \
--workspace myworkspace \
--resource-group myResourceGroup \
--target 0
Later, resize it to 1 or more nodes to run your jobs. When you no longer need a cluster, delete it with the az batchai cluster delete
command:
az batchai cluster delete \
--name mycluster \
--workspace myworkspace \
--resource-group myResourceGroup \
When no longer needed, you can use the az group delete
command to remove the resource group for the Batch AI resources.
az group delete --name myResourceGroup
Next steps
In this quickstart, you learned how to create a Batch AI cluster, using the Azure CLI. To learn about how to use a Batch AI cluster to train a model, continue to the quickstart for training a deep learning model.