Tips & tricks for using the Data Science Virtual Machine (DSVM) with GPU support for machine learning on Azure
Wow, what a wordy title. In this post, I want to share the tricks I’ve learned for using the Data Science Virtual Machine on Azure with GPU hardware.
Rationale
To begin, why would you want to do this? Here’s the value prop:
- Data Science Virtual Machine: the DSVM enables rapid development for data scientists. Say that you usually use { TensorFlow | CNTK | PyTorch | etc } for your deep learning framework, but you found a great sample in one of the other deep learning frameworks. You’d like to quickly try your data with that sample code, but don’t want the overhead of getting a new deep learning framework with all of its dependencies set up on your machine. Use the DSVM! It has a huge list of popular machine learning tools already installed and configured. It is also great for scaling out training of your models on VMs.
- GPU: I’m not going to do an in-depth processor comparison here, but essentially GPUs have great parallel-processing capabilities and can perform faster than CPUs in many batch-processing/data-intensive scenarios. For example, I ran the following two commands to train a simple machine learning model in Torch (the th command is the Lua Torch launcher, not PyTorch), the first with GPU support and the second with CPU only. Training time dropped from about 56 minutes on CPU to less than 16 minutes on GPU.
GPU: 15m38.556s
th train.lua -input_h5 data/tiny-shakespeare.h5 -input_json data/tiny-shakespeare.json
CPU: 55m51.655s
th train.lua -input_h5 data/tiny-shakespeare.h5 -input_json data/tiny-shakespeare.json -gpu -1
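To put a number on that difference, here is a small Python sketch that converts the two timings above into seconds and divides them. The helper name parse_time is my own, and it assumes the timings use the XmY.YYYs format shown above (which looks like the output of the shell's time command).

```python
def parse_time(text: str) -> float:
    """Convert a timing string such as '15m38.556s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Timings copied from the two runs above.
gpu_seconds = parse_time("15m38.556s")
cpu_seconds = parse_time("55m51.655s")

# How many times longer the CPU-only run took.
speedup = cpu_seconds / gpu_seconds
print(f"The GPU run was about {speedup:.1f}x faster than the CPU run")
```

For these two runs the ratio works out to roughly 3.6x. This is a single measurement on one small model, so treat it as an illustration rather than a benchmark.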
Data Science Virtual Machine versions
There are multiple versions of the Data Science Virtual Machine. They all include similar tools, but there are differences: for example, the Ubuntu image contains additional deep learning frameworks that aren’t supported on Windows.
- Windows Server 2016 DSVM
- Windows Server 2012 DSVM
- Ubuntu DSVM
- CentOS DSVM
- Deep Learning VM (DLVM): this is a variant of the DSVM with GPU support ready to go (so you don’t need to understand the tips & tricks below!). It is available with a Windows Server 2016 or Ubuntu base image. The DLVM actually uses the same core VM images as the DSVM, but the main differences are that the setup wizard is optimized for easy provisioning on GPU and the DLVM auto-downloads a set of end-to-end deep learning samples from GitHub when the VM instance is created.
- Geo AI DSVM: this is a Windows Server 2016 DSVM with extra support for geospatial analytics. It comes preinstalled with ESRI’s ArcGIS Pro software and several geospatial code samples.
Tips & tricks
To get GPU support, you need both hardware with GPUs in a datacenter and the right software: namely, a virtual machine image that includes GPU drivers so you can use the GPU.
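Once a VM is up, one way to check that both pieces are in place is to ask the NVIDIA driver which GPUs it can see. The sketch below is a minimal example, assuming an NVIDIA GPU and that the driver's nvidia-smi tool is on the PATH; list_gpus is a hypothetical helper name of my own, not something that ships with the DSVM.

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one line per GPU reported by nvidia-smi, or [] if none are visible."""
    # No nvidia-smi on the PATH usually means no NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    # `nvidia-smi -L` prints one line per GPU the driver can see.
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
if gpus:
    print("GPUs visible to the driver:")
    for gpu in gpus:
        print("  " + gpu)
else:
    print("No GPU visible: check the VM size and the image's GPU drivers")
```

An empty result does not tell you which half is missing, only that the hardware and driver are not both available to the VM.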
The biggest tip is to use the Deep Learning Virtual Machine! The provisioning experience has been optimized to filter to the options that support GPU (the NC series; see below), which makes it easier to set up correctly.
Outside of the Deep Learning Virtual Machine, the big gotchas to creating a vanilla Data Science Virtual Machine for deep learning on GPU are:
- The virtual machines with GPU support are listed at https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu. For machine learning, the various versions of the NC or ND series hardware are your best bet, but they are not available everywhere. To see which datacenters have which hardware, check the extremely useful matrix on the Azure Products by Region page under Compute, Virtual Machines.
NOTE: this is a screenshot, so it might not be accurate for you, future reader! I also only had the US and Canada regions selected, and there are many more datacenters available. Click here to change the region filters and get the latest data.
- You need to use an image with GPU drivers installed. As documented here, GPU drivers are provided on the following machines: Linux (Ubuntu), Linux (CentOS), Windows 2016, and the Deep Learning VM. The Windows 2012 DSVM does not have GPU support.
- You need to know whether you need HDD or SSD. Even though solid state drives seem “better”, not all GPU machines support them. The different VM series have different requirements for their Azure storage disk support. NC and NV VMs only support VM disks that are backed by Standard Disk Storage (HDD). NCv2, ND, and NCv3 VMs only support VM disks that are backed by Premium Disk Storage (SSD).
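The disk rule in the last bullet is easy to get backwards, so here is a small Python sketch that encodes it as a lookup table. The table contains only the series named above and reflects the rules as described in this post; DISK_SUPPORT and required_disk are illustrative names of my own, and Azure may have changed these rules since.

```python
# VM series -> disk type it supports, as described in this post.
# This is a snapshot and may not match what Azure offers today.
DISK_SUPPORT = {
    "NC": "Standard (HDD)",
    "NV": "Standard (HDD)",
    "NCv2": "Premium (SSD)",
    "NCv3": "Premium (SSD)",
    "ND": "Premium (SSD)",
}


def required_disk(series: str) -> str:
    """Return the disk type a GPU VM series supports, per the table above."""
    try:
        return DISK_SUPPORT[series]
    except KeyError:
        # Fail loudly rather than guess for a series the post does not cover.
        raise ValueError(f"No disk rule recorded for series {series!r}") from None


for name in ("NC", "NCv3"):
    print(f"{name}: {required_disk(name)}")
```

Raising an error for unknown series is deliberate: a wrong default here would only surface later as a failed provisioning step.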
Connecting to the DSVM
If you are using a Windows data science virtual machine, once the DSVM is provisioned, you can remote desktop into it.
If you are using a Linux data science virtual machine, once the DSVM is provisioned, you have a couple of choices on how to connect to it. More details are here, but the quick summary is that you can use any of these options:
- For terminal/console sessions: use SSH. In the Azure portal, after provisioning your VM, you can click on the “Connect” button to get the exact ssh command. If you have Bash on Windows support*, you can use ssh right from the Bash app in Windows. Otherwise, you can download a third-party tool like PuTTY.
- For graphical sessions: use the X2Go client.
- For Jupyter notebooks, use JupyterHub by browsing to https://your-vm-ip:8000 or JupyterLab by browsing to https://your-vm-ip:8000/lab (fill in the appropriate IP address for your virtual machine).
* In Windows 10, you can enable the Windows Subsystem for Linux by running this PowerShell command as administrator and rebooting: “Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux”.
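If you connect to the same Linux VM often, it can help to generate the connection targets from the list above in one place. The following Python sketch is only an illustration: connection_targets is a hypothetical helper, the IP address and username are placeholders, and the ssh line assumes the common user@host form (the “Connect” button in the portal remains the source for the exact command).

```python
def connection_targets(vm_ip: str, username: str) -> dict:
    """Build the connection strings for a Linux DSVM from its IP address."""
    # Port 8000 and the /lab path are the Jupyter addresses given above.
    base = f"https://{vm_ip}:8000"
    return {
        # Assumed form; the portal's Connect button gives the exact command.
        "ssh": f"ssh {username}@{vm_ip}",
        "jupyterhub": base,
        "jupyterlab": f"{base}/lab",
    }


# Placeholder values: substitute your own VM's IP address and login name.
targets = connection_targets("203.0.113.10", "dsvmuser")
for name, target in targets.items():
    print(f"{name}: {target}")
```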
Resources
Azure Data Science Virtual Machine documentation: don’t forget to explore the whole tree in the left-hand pane
Data Science Virtual Machine Plans and Pricing: note that this is for the Windows Server 2016 version specifically
Virtual Machines with GPU support
Get to know your DSVM: shows all of the tools, platforms, utilities, and samples that are included in the Data Science Virtual Machine, neatly organized by category
Data Science Virtual Machine product webpage: this is more of a high-level overview