TensorFlow on Azure: Enabling Blob Storage via Alluxio
Many of the customers that the Cloud AI Ecosystem team at Microsoft works with choose Azure Blob Storage as their data store. For those customers who want to use TensorFlow to develop deep learning models, unfortunately TensorFlow does not support Azure Blob storage out of the box as a custom file system plugin [1], and there is no easy way to feed data from Azure block blobs directly into TensorFlow's input pipeline [2]. When setting up a Kubernetes cluster for TensorFlow workloads, Azure Blob Storage is also not among Kubernetes' supported types of volumes [3].
Given the above, the options are either to mount an Azure File share to the Kubernetes pods and read the remote Azure files through the mount path, or to manually copy data to a local SSD on each pod. These approaches are summarized in 'Deep Learning Toolkits with Kubernetes Clusters', published at https://aka.ms/deeplearningk8s. However, many people prefer Azure Blob to the Azure File service because of their different performance, scale and pricing options [5], and manually copying large amounts of training data to local SSDs does not scale.
In this blog we introduce the FUSE feature [6] newly released in Alluxio 1.7, which mounts Azure Blob storage into the local file system namespace and thereby solves the integration between TensorFlow and Azure blobs. Alluxio [7] aims to bridge high-computation workloads, including TensorFlow jobs, with the underlying storage system through its unified data access layer. The Alluxio-FUSE feature opens up new opportunities for Azure blobs to be fed directly into your tensors. Moreover, with the current progress in GPU computation, the input pipeline can become a bottleneck if the storage is not performant enough, and the effort Alluxio has put into optimizing its data access layer has a positive impact on the deep learning input pipeline. For more details, please refer to https://alluxio.com/blog/flexible-and-fast-storage-for-deep-learning-with-alluxio.
Follow the simple steps below to see how to enable Azure blobs via Alluxio-FUSE and run TensorFlow jobs on Azure.
Set Up a Kubernetes Cluster
A sample k8s cluster on Azure Container Service is deployed using the open source toolkit DLWorkspace [8]; the documentation can be found at https://microsoft.github.io/DLWorkspace/. The sample setup includes one master node on a Standard D2 v2 Azure VM (2 vCPUs, 7 GB memory) and two agent nodes on Standard NC12 Azure VMs (12 vCPUs, 112 GB memory). To check that the GPU driver is correctly installed after deployment, run 'nvidia-smi' on each agent node and confirm the driver information is reported.
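For example, assuming you can SSH into the agent nodes (the node names below are placeholders for your own), a quick check looks like:
ssh $youragentnode1$ nvidia-smi
ssh $youragentnode2$ nvidia-smi
Each call should list the node's Tesla K80 GPUs along with the installed driver version.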
Create Alluxio-FUSE Enabled Pods
For ease of use, Alluxio provides Docker integration [9] and has published its 1.7 Docker images [10] on Docker Hub, so we can pull the images into the k8s cluster and create Alluxio-FUSE enabled k8s pods. Alluxio servers consist of two architectural components [11]: a master and workers. The master is responsible for managing global metadata, while workers manage the local storage resources allocated to Alluxio. For better data locality, we co-locate the Alluxio master with the TensorFlow parameter server on one pod, and each Alluxio worker with a TensorFlow worker on the same pod.
Sample pod configuration files are posted at https://github.com/jichang1/TensorFlowonAzure/tree/master/Alluxio; use them to create your k8s pods, first the tf-ps pod and then the tf-worker pod. Note that you will need to replace $yourcontainername$, $yourstorageaccountname$, and $yourstorageaccountkey$, and replace $yourpsserverip$ with the IP found in /etc/hosts of the tf-ps pod, as sketched below.
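If you prefer to script those substitutions, the sketch below assumes the file names from the sample repo and a tf-ps pod named tf-ps0; adjust both to your setup:
# Substitute the storage placeholders in both pod files (example values shown).
sed -i 's|\$yourcontainername\$|mycontainer|g; s|\$yourstorageaccountname\$|mystorageaccount|g; s|\$yourstorageaccountkey\$|mystoragekey|g' alluxio-fuse-tfgpu-psserver0.yaml alluxio-fuse-tfgpu-worker0.yaml
# After the tf-ps pod has been created, read its IP for $yourpsserverip$:
sudo kubectl exec tf-ps0 -- cat /etc/hosts
# or
sudo kubectl get pod tf-ps0 -o wide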
The sample container configuration below shows that the Docker image runs /entrypoint.sh with the argument "worker" upon initialization; that the worker pod communicates with the master pod over port 19998; that a few environment variables, such as the master host name and the storage account, must be defined at initialization; and that the container runs in privileged mode with the 'SYS_ADMIN' capability.
containers:
- name: tf-worker0
  image: alluxio/alluxio-tensorflow:1.7.0-1.3.0-gpu
  command: ["/entrypoint.sh"]
  args: ["worker"]
  ports:
  - containerPort: 19998
    name: alluxioport
  env:
  - name: ALLUXIO_MASTER_HOSTNAME
    value: "$yourpsserverip$"
  - name: ALLUXIO_RAM_FOLDER
    value: "/opt/ramdisk"
  - name: ALLUXIO_WORKER_MEMORY_SIZE
    value: "10GB"
  - name: ALLUXIO_UNDERFS_ADDRESS
    value: "wasb://$yourcontainername$@$yourstorageaccountname$.blob.core.windows.net/"
  - name: FS_AZURE_ACCOUNT_KEY_$yourstorageaccountname$_BLOB_CORE_WINDOWS_NET
    value: $yourstorageaccountkey$
  resources:
    limits:
      alpha.kubernetes.io/nvidia-gpu: 1
  volumeMounts:
  - mountPath: /usr/local/nvidia
    name: nvidia-driver
  - mountPath: /opt/ramdisk
    name: ramdisk
  - mountPath: /etc/resolv.conf
    name: resolv
  securityContext:
    privileged: true
    capabilities:
      add: ["SYS_ADMIN"]
nodeSelector:
  FragmentGPUJob: active
volumes:
- name: nvidia-driver
  hostPath:
    path: /opt/nvidia-driver/current
- name: ramdisk
  hostPath:
    path: /mnt/ramdisk
- name: resolv
  hostPath:
    path: /etc/resolv.conf
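One detail worth noting in the volumes section: the worker's ALLUXIO_RAM_FOLDER (/opt/ramdisk) is backed by the hostPath /mnt/ramdisk, so each agent node needs a ramdisk at that path that is at least as large as ALLUXIO_WORKER_MEMORY_SIZE. If your nodes do not already provide one, a minimal sketch (run on each agent node) is:
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=10G tmpfs /mnt/ramdisk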
After executing:
sudo kubectl apply -f ./alluxio-fuse-tfgpu-psserver0.yaml
sudo kubectl apply -f ./alluxio-fuse-tfgpu-worker0.yaml
Run
kubectl get pods
to check that the pods are up and in a healthy state.
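If a pod does not reach a healthy state, the standard kubectl tools help narrow down the cause; the pod name below (tf-worker0) follows the sample files and may differ in your cluster:
sudo kubectl describe pod tf-worker0
sudo kubectl logs tf-worker0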
Connect to the tf-ps or tf-worker pod and access your blob storage via 'ls /alluxio-fuse'; the Alluxio-FUSE path is already mounted into the pod's local file system.
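For example, assuming the pod names from the sample files:
sudo kubectl exec -it tf-worker0 -- /bin/bash
ls /alluxio-fuse
# or, without opening an interactive shell:
sudo kubectl exec tf-worker0 -- ls /alluxio-fuse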
Run TensorFlow Jobs
We take the TensorFlow benchmark jobs [12] as an example.
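The tf_cnn_benchmarks.py script used below comes from the tensorflow/benchmarks repository; assuming it is not already included in the image and that git is available in the container, you can fetch it inside each pod (you may need a revision compatible with TensorFlow 1.3):
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks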
On the parameter server pod, run the command below:
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 --batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=50 --cross_replica_sync=False --data_name=imagenet --data_dir=file:///alluxio-fuse/ --job_name=ps --ps_hosts=10.244.2.2:2222 --worker_hosts=10.244.0.2:2222 --task_index=0
On the worker pod, run the command below:
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 --batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=50 --cross_replica_sync=False --data_name=imagenet --data_dir=file:///alluxio-fuse/ --job_name=worker --ps_hosts=10.244.2.2:2222 --worker_hosts=10.244.0.2:2222 --task_index=0
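Note that the --ps_hosts and --worker_hosts values are the pod IPs of the tf-ps and tf-worker pods followed by the port the TensorFlow servers listen on; they will differ in your cluster and can be looked up with:
sudo kubectl get pods -o wide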
You should observe output similar to the following:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0e7c:00:00.0
Total memory: 11.17GiB
Free memory: 11.09GiB
……
2018-01-11 02:22:52.188017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0e7c:00:00.0)
2018-01-11 02:22:52.188038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 27e6:00:00.0)
2018-01-11 02:22:52.402467: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.244.2.2:2222}
2018-01-11 02:22:52.402510: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
2018-01-11 02:22:52.405246: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2222
TensorFlow: 1.3
Model: googlenet
Mode: training
Batch size: 256 global
128 per device
Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
Sync: False
==========
Generating model
2018-01-11 02:24:26.461062: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session b26ef5a1286e9840 with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
Running warm up
Done warm up
Waiting for other replicas to finish warm up
Starting real work at step 10 at time Thu Jan 11 02:25:57 2018
Step Img/sec loss
1 images/sec: 200.3 +/- 0.0 (jitter = 0.0) 7.093
10 images/sec: 189.8 +/- 1.6 (jitter = 4.7) 7.093
20 images/sec: 186.3 +/- 1.3 (jitter = 6.3) 7.093
30 images/sec: 186.6 +/- 1.1 (jitter = 6.0) 7.093
40 images/sec: 186.8 +/- 0.9 (jitter = 5.5) 7.093
Finishing real work at step 59 at time Thu Jan 11 02:27:04 2018
50 images/sec: 187.4 +/- 0.8 (jitter = 5.3) 7.093
----------------------------------------------------------------
total images/sec: 186.67
----------------------------------------------------------------
We hope this blog has shown you a new way of running TensorFlow jobs on Azure with Azure Blob storage as the underlying store.
References
[1] https://www.tensorflow.org/extend/add_filesys
[2] https://www.tensorflow.org/programmers_guide/datasets
[3] https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes
[4] https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets
[5] https://azure.microsoft.com/en-us/pricing/details/storage/
[6] https://www.alluxio.org/docs/master/en/Mounting-Alluxio-FS-with-FUSE.html
[7] https://www.alluxio.org/docs/master/en/index.html
[8] https://github.com/Microsoft/DLWorkspace
[9] https://github.com/Alluxio/alluxio/tree/master/integration/docker
[10] https://hub.docker.com/r/alluxio/alluxio-tensorflow/