Azure GPU Tensorflow Step-by-Step Setup
The following guide has been developed in collaboration with my Microsoft colleague Christine Matheney and draws on our work with Oxford and Stanford University.
- This guide will walk you through running your code on GPUs in Azure.
- Before we start, it cannot be stressed enough: do not leave the VM running when you are not using it. See the following blog for tips on automating and shutting down VMs to save costs.
- The expected time from start to finish is 1-2 hours.
- The most time-consuming part will be downloading and installing the NVIDIA drivers, CUDA, and TensorFlow; this guide and its accompanying repo install TensorFlow 1.0.
FAQ
- As an administrator (Lead TA/RA or Academic), if you need to grant or remove access for an individual (student), follow the directions here on setting up Azure at your institution.
- Do not install updates using sudo apt-get upgrade. This might break the CUDA driver installation if the kernel is updated.
- If you need to attach additional storage or a larger disk to your VM, see /en-us/azure/virtual-machines/virtual-machines-linux-classic-attach-disk
- To check available disk space, run df -h to see which disks have free space.
- Please store your data only on the attached disk. The temporary disk provided on Azure VMs is not suitable for storing persistent data.
- Problems connecting (e.g., via SSH) to the VM? Try the following checks in order (example commands are shown after this list):
- Try ping <vm’s ip address>
- Try ssh to the VM
- Try restarting the VM and/or your local machine
- If all of the previous steps fail, file an Azure support ticket via https://portal.azure.com
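For example, from your local machine (the user name and IP address below are placeholders; use your VM's public IP from the portal and the admin user name you chose when creating the VM):
ping <your-vm-ip-address>
ssh azureuser@<your-vm-ip-address>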
Creating a Microsoft account
- You should have received an email to your inbox with an invitation to join the Azure subscription from your Azure Administrator.
- Please follow the instructions using the email address that received this invitation.
Getting started
Logging into Azure portal
- Once you have created your account, log in to Azure at: https://portal.azure.com
- After logging in, you should reach the dashboard page.
- If you have multiple subscriptions (e.g., you previously signed up for a free one), you must select your institution's subscription by clicking your account name in the top right corner. If no such option appears, please contact your Azure Administrator.
Create a VM
- Once you are logged in, click on the + on the left. Select Ubuntu Server 16.04 LTS.
- You will be presented with the VM image details; simply click Create.
Fill in the name, user name, etc. for your VM. You must change the storage type from SSD to HDD. You must also use a region in which NC or NV sizes are available.
For information, NV and NC sizes are available in the regions below:
Region            | SKU
East US           | NV
North Central US  | NV
South Central US  | NV
South East Asia   | NV
West Europe       | NV
South Central US  | NC
East US           | NC
Regarding the question of running GPU compute for deep learning on the NV-Series, the GPU team has indicated that it is not recommended. Bottom line: big GPU compute workloads (like deep learning) should only be run on the NC-Series; NV is for visualization and graphics. See this blog for more details on the NV vs NC series.
- Click View all to see all of the options, then scroll through the list and select an appropriate NV or NC Series server for your workload. If NV or NC sizes do not show up, then you probably chose the wrong region or selected SSD rather than HDD on the previous page (Step 1 Basics).
- If you do not select NV/NC options, then you are not using a GPU instance and the setup scripts later will fail.
- Select the appropriate VM Size and Click OK.
- Wait for the configuration to validate and then click OK.
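As an alternative to the portal, the same VM can be created from the command line. This is a minimal sketch assuming you have Azure CLI 2.0 installed and are logged in with az login; the resource group name, VM name, and user name below are placeholders (create the resource group first if it does not exist), and Standard_NC6 is one example NC-series size:
az vm create --resource-group my-gpu-rg --name my-gpu-vm --image UbuntuLTS --size Standard_NC6 --admin-username azureuser --generate-ssh-keys --location eastus
Whichever route you take, make sure the size you pick is an NC (or NV) SKU in one of the regions listed above.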
Using the VM
Finding your VM
Log in to https://portal.azure.com, click All resources, and select your VM. Our subscription has many resources, but yours will only have one if you just followed the setup instructions.
Spinning up your VM
If you just completed the previous part and the VM has finished deploying, then your VM should be running already.
Connecting (SSH) to your VM
Once your VM has started (it may take a few minutes), click Connect and follow the instructions.
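From a terminal on your local machine, the connection typically looks like the following (the user name and IP address are placeholders; use the values shown on the Connect blade for your VM):
ssh azureuser@<your-vm-ip-address>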
Stopping your VM
- Once you are done working, stop your VM; see this blog for tips on stopping/shutting down VMs. A command-line alternative is shown after this list.
- Make sure your VM is fully stopped (deallocated). If the portal shows the VM as stopped but still incurring compute charges, you must hit Stop again so that it is fully deallocated.
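If you prefer the command line, a deallocated stop can also be done with Azure CLI 2.0. This is a sketch with placeholder resource group and VM names, assuming the CLI is installed and you are logged in:
az vm deallocate --resource-group my-gpu-rg --name my-gpu-vm
Deallocating releases the compute resources so you stop paying for the VM cores (you still pay for the attached storage).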
Completing CUDA/Tensorflow setup
- You will need to SSH into your VM.
Installing CUDA and Tensorflow dependencies
There are two scripts that you will need to run, and your VM will need to reboot in between running them.
[Step 1]
First, in your VM do:
git clone https://github.com/leestott/Azure-GPU-Setup.git
cd Azure-GPU-Setup
You should see the repository contents, including the two setup scripts (gpu-setup-part1.sh and gpu-setup-part2.sh) and the test script gpu-test.py used below, if you run:
ls -all
Run gpu-setup-part1.sh using the following command:
./gpu-setup-part1.sh
This will install some libraries, fetch and install NVIDIA drivers, and trigger a reboot. (The command will take some time to run.)
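If running the script gives a Permission denied error (execute bits are sometimes lost when files are copied around), you can mark both setup scripts executable first:
chmod +x gpu-setup-part1.sh gpu-setup-part2.sh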
Wait for your VM to finish restarting.
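Once the machine is back up, you can optionally confirm that the NVIDIA driver installed correctly before continuing to Step 2; nvidia-smi should list the Tesla GPU (a K80 on NC-series VMs):
nvidia-smi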
[Step 2]
SSH into the VM again. Navigate to the Azure-GPU-Setup directory again. Run the command:
./gpu-setup-part2.sh
This script installs the CUDA toolkit, cuDNN, and TensorFlow. It also sets the required environment variables. Once the script finishes, run:
source ~/.bashrc
This ensures that the shell will use the updated environment variables. Now, to test that TensorFlow and the GPU are properly configured, run the GPU test script by executing:
python gpu-test.py
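If you want an additional sanity check beyond the repo's gpu-test.py script, the following one-liners are a minimal sketch using TensorFlow 1.0's standard APIs (they are not part of the repo). They print the installed TensorFlow version and the devices TensorFlow can see; a correctly configured VM should report a /gpu:0 device:
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
You can also run echo $LD_LIBRARY_PATH to check that a CUDA library path is visible in your current shell (the exact variables set depend on the setup script).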
Filing a support ticket
- Click on the help icon in the left sidebar and select new support request.
- Follow the on-screen instructions.
General recommendations
We highly suggest the following for using the GPU instances:
- Develop and debug your code locally, and use scp to copy your code to the VM for the long training runs (see the example command after this list).
- Save your work often and keep a local copy.
- Be mindful of when your instance is running and shut it off when you are not actively using it.
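A typical copy from your local machine to the VM looks like this (the user name, IP address, and paths are placeholders):
scp -r ./my_project azureuser@<your-vm-ip-address>:~/my_project
After training, swap the source and destination in the same command to copy results back to your local machine.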
Comments
- Anonymous
March 28, 2017
If you're interested in running TensorFlow on Docker, see https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/09/26/tensorflow-on-docker-with-microsoft-azure/ Microsoft also has TensorFlow images available as Azure Batch Shipyard recipes, see https://github.com/Azure/batch-shipyard/tree/master/recipes/TensorFlow-GPU and more details at https://blogs.msdn.microsoft.com/uk_faculty_connection/2017/02/13/deep-learning-using-cntk-caffe-keras-theanotorch-tensorflow-on-docker-with-microsoft-azure-batch-shipyard/
- Anonymous
March 30, 2017
If you require more than 18 servers you can request a quota extension:
Go to https://ms.portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/overview
Click on + New support request
Issue type = Quota
Choose your subscription (if you have more than one)
Follow the prompts / fill in the fields
- Anonymous
April 05, 2017
Hi Lee, thanks a lot for the excellent guide!
- Anonymous
April 19, 2017
If you get the following error: "ERROR: Unable to load the 'nvidia-drm' kernel module. ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available", check the following: ensure you have selected an NC-series virtual machine. This error is presented when you try to install on an NV machine; as described above, NV is for visualisation.
- Anonymous
April 20, 2017
If you're interested in running TensorFlow from a container/Docker solution infrastructure, the following tutorial and GitHub resources are a perfect starting point: http://wbuchwalter.github.io/container/docker/machine/learning/kubernetes/gpu/training/2016/03/23/gpu-ml-training-cluster/
- Anonymous
May 03, 2017
A new Ubuntu based Data Science Virtual Machine (DSVM) was released in mid April. This has everything built in, including NVIDIA drivers, CUDA, TensorFlow, Microsoft CNTK, Caffe2, Torch, Theano, Keras, etc. You can find the Ubuntu based DSVM at: http://aka.ms/dsvm/ubuntu
- Anonymous
May 04, 2017
Gopi, thanks for the comment; the Ubuntu DSVM is awesome! If you're an academic and interested in finding out more about this, see https://blogs.msdn.microsoft.com/uk_faculty_connection/2017/04/19/now-available-on-azure-marketplace-ubuntu-data-science-virtual-machine/
- Anonymous
August 21, 2017
In Microsoft Azure I am not getting NV/NC options; it is disabled for my machine.
- Anonymous
August 21, 2017
Hi Shweta, if you wish to see the current availability of Azure services on your account, you can run the following command using the Azure cloud console or Azure CLI 2.0:
az vm list-usage --location eastus -o table
See all az vm commands at https://docs.microsoft.com/en-us/cli/azure/vm
The output from the list-usage command shows the current limits you have on Azure resources; see the output below:
Name                              CurrentValue  Limit
--------------------------------  ------------  -----
Availability Sets                 0             2000
Total Regional Cores              0             100
Virtual Machines                  0             10000
Virtual Machine Scale Sets        0             2000
Basic A Family Cores              0             100
Standard A0-A7 Family Cores       0             100
Standard A8-A11 Family Cores      0             100
Standard D Family Cores           0             100
Standard Dv2 Family Cores         0             100
Standard G Family Cores           0             100
Standard DS Family Cores          0             100
Standard DSv2 Family Cores        0             100
Standard GS Family Cores          0             100
Standard F Family Cores           0             100
Standard FS Family Cores          0             100
Standard NV Family Cores          0             24
Standard NC Family Cores          0             48
Standard H Family Cores           0             8
Standard Av2 Family Cores         0             100
Standard LS Family Cores          0             100
Standard Dv2 Promo Family Cores   0             100
Standard DSv2 Promo Family Cores  0             100
Standard MS Family Cores          0             0
Standard Dv3 Family Cores         0             100
Standard DSv3 Family Cores        0             100
Standard Ev3 Family Cores         0             100
Standard ESv3 Family Cores        0             100
Standard Storage Managed Disks    0             10000
Premium Storage Managed Disks     0             10000
To request a quota increase, you must open a Support Request with Microsoft: https://docs.microsoft.com/en-us/azure/azure-supportability/resource-manager-core-quotas-request
- Anonymous
August 21, 2017
They are disabled for the Free Trial Subscription. Please check.
- Anonymous
August 25, 2017
Hi, I see a difference in output... though the installation is successful, it does not use its GPU capabilities. How do I go about it? See the output below:
~/Azure-GPU-Setup$ python gpu-test.py
...loaded python test [now attempting to list GPUs]
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID b978:00:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: b978:00:00.0)
[u'/gpu:0']
- Anonymous
August 25, 2017
Ensure you have downloaded cudnn-8.0-linux-x64-v5.1.tgz as required by script 2. You can obtain the cuDNN file from the NVIDIA site at https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/cudnn-8.0-linux-x64-v5.1-tgz (you will need to register to download this file). See the instructions in the script: "You need to download cudnn-8.0 manually. This can be downloaded from https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/cudnn-8.0-linux-x64-v5.1-tgz you will need to create a NVIDIA Account! Specifically, place it at: $SETUP_DIR/cudnn-8.0-linux-x64-v5.1.tgz"