NEW REFERENCE ARCHITECTURE: Distributed training of deep learning models on Azure
The AzureCAT blog has moved! Find this blog post over on our new blog at the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/AzureCAT/NEW-REFERENCE-ARCHITECTURE-Distributed-training-of-deep-learning/ba-p/333652
========================
Our sixth AI reference architecture (on the Azure Architecture Center) is authored by AzureCAT's Mathew Salvaris, edited by Nanette Ray, and published by Mike Wasson.
Reference architectures provide a consistent approach and best practices for a given solution. Each architecture includes recommended practices, along with considerations for scalability, availability, manageability, security, and more. This architecture includes a deployable solution as well. The full array of reference architectures is available on the Azure Architecture Center.
This reference architecture shows how to conduct distributed training of deep learning models across clusters of GPU-enabled virtual machines (VMs). The scenario is image classification, but the solution can be generalized for other deep-learning scenarios, such as segmentation and object detection.
This architecture consists of the following components:
- Azure Batch AI plays the central role in this architecture by scaling resources up and down according to need.
- Blob storage is used to stage the data.
- Azure Files is used to store the scripts, logs, and the final results from the training.
- Batch AI file server is a single-node NFS share used in this architecture to store the training data.
- Docker Hub is used to store the Docker image that Batch AI uses to run the training. Azure Container Registry can also be used.
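To make the idea of training across several GPU workers more concrete, here is a minimal sketch of synchronous data-parallel training, which is one common way to distribute training. This post does not say which strategy the reference architecture uses, so treat the choice as an assumption. The sketch runs in a single process and only simulates workers; it calls no Azure or Batch AI API, and every name in it (`shard`, `local_gradient`, `all_reduce_mean`, `train`) is hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a toy linear model y ~ w*x + b


def shard(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the dataset round-robin so each simulated worker gets one shard."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w: float, b: float, samples: List[Sample]) -> Tuple[float, float]:
    """Gradient of mean squared error computed on one worker's shard only."""
    n = len(samples)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in samples) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in samples) / n
    return grad_w, grad_b


def all_reduce_mean(grads: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Stand-in for an all-reduce: average the gradients from all workers."""
    k = len(grads)
    return sum(g[0] for g in grads) / k, sum(g[1] for g in grads) / k


def train(data: List[Sample], num_workers: int = 4,
          lr: float = 0.5, steps: int = 500) -> Tuple[float, float]:
    """Synchronous data-parallel SGD: every worker applies the same averaged update."""
    shards = shard(data, num_workers)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # On a real cluster each worker would compute this in parallel on its own GPU.
        grads = [local_gradient(w, b, s) for s in shards]
        grad_w, grad_b = all_reduce_mean(grads)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 3x + 1, with x scaled into [0, 1).
    toy = [(i / 8.0, 3.0 * (i / 8.0) + 1.0) for i in range(8)]
    print(train(toy))
```

When the shards are the same size, as in this toy data, the averaged gradient equals the full-batch gradient, so adding workers changes how fast a step completes rather than what the step computes. In a real deployment the averaging would be a network operation between VMs, which is part of why storage and network throughput feature in the performance and scalability topics below.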
Topics covered include:
- Performance considerations
- Scalability considerations
- Storage considerations
- Security considerations
- Monitoring considerations
- Deployment
Head over to the Azure Architecture Center to learn more about the "Distributed training of deep learning models on Azure" reference architecture.
See Also
Additional related AI reference architectures:
- Batch scoring on Azure for deep learning models
- Batch scoring of Python models on Azure
- Real-time scoring of Python Scikit-Learn and deep learning models on Azure
- Real-time scoring of R machine learning models
- Build a real-time recommendation API on Azure
Find all our reference architectures on the Azure Architecture Center.
AzureCAT Guidance
"Hands-on solutions, with our heads in the Cloud!"
Comments
- Anonymous
February 13, 2019
Hey, since Batch AI is being retired in, like .... a month, maybe you want to revise this to show that Batch AI is available for provisioning in the Azure ML Services instead?