[Azure HPC] Intro to HPC and steps to set up CycleCloud in Azure
<update:9/2/2018>
Aug 31: Azure CycleCloud, from our CycleCloud team, is now generally available. It is a tool for creating, managing, operating, and optimizing HPC clusters of any scale in Azure. Azure CycleCloud is available in the Microsoft Download Center, Azure Marketplace, and Azure Container Registry:
• Azure CycleCloud announcement
• Azure CycleCloud product page
• Documentation
• Azure CycleCloud download
• Azure Marketplace offering for Azure CycleCloud
• Azure Container Registry container
CycleCloud covers the following key scenarios:
• Running Linux and Windows HPC clusters with traditional schedulers, including Slurm, PBS Pro, HPC Pack, Spectrum LSF and Symphony, Grid Engine, or HTCondor
• Easy management of HPC clusters with multiple VM families and sizes to get capacity for critical runs
• Customizable workload templates that serve as best-practice starting points for Azure deployments
• Active Directory integration for access to and management of compute environments
</update>
As part of the Microsoft internal MOOC “Big Compute: Uncovering and Landing Hyperscale Solutions in Azure”, I was introduced to CycleCloud and learned how to set it up in my Azure subscription. In this post I share some of what I learned about HPC, along with the steps I followed to set up CycleCloud.
What is HPC? High-performance computing (HPC) is a parallel processing technique for solving complex computational problems. HPC applications can scale to thousands of compute cores. We can run these workloads on premises by setting up clusters, burst the extra volume to the cloud, or run them as a 100% cloud-native solution.
Where is Big Compute used? Compute-intensive operations are usually the best fit for this kind of workload.
How can HPC be achieved in Microsoft Azure?
1) Azure Batch -> a managed service (“cluster” as a service, HPC as a service) for running jobs; developers can write applications that submit jobs using the SDK; cloud native; pay-as-you-go billing
2) CycleCloud -> acquired by Microsoft; “cluster” management (orchestration) software for managing and running clusters; supports hybrid clusters and multi-cloud; one-time license; you have complete control of the cluster and nodes
3) Cray -> partnership with Cray, whose computers are well known for weather forecasting
4) HPC Pack on Azure infrastructure -> Marketplace offerings {HPC applications, HPC VM images, HPC storage}
Azure Batch needs no introduction, as it has been around for quite some time, and setting up Batch is very easy. Tools like Batch Labs help us monitor and control Batch jobs effortlessly. The Batch SDK makes it easy to integrate with an existing legacy application to submit jobs, or to manage the entire Batch operation from a custom-developed application. End users do not need to log in to the Azure portal to submit jobs.
[embed] https://twitter.com/MahesKBlr/status/948924709885771776 [/embed]
[embed] https://twitter.com/MahesKBlr/status/779008726908833792 [/embed]
What is CycleCloud? CycleCloud provides a simple, secure, and scalable way to manage compute and storage resources for HPC and Big Compute/Data workloads in the cloud. CycleCloud enables users to create environments in Azure. It supports workloads ranging from distributed and parallel jobs to tightly coupled applications such as MPI jobs on InfiniBand/RDMA. By managing resource provisioning, configuration, and monitoring, CycleCloud allows users and IT staff to focus on business needs instead of infrastructure.
How to set it up in Azure? The steps are already documented here; I am presenting the same steps with screenshots for easy reference.
1) Download the JSON files to your local drive, say c:\temp
2) Generate the service principal
3) Generate an SSH public and private key pair
4) Clone the repo to your local drive, say c:\temp: git clone https://github.com/azurebigcompute/Labs.git
5) Edit the vms-params.json file to set the rsaPublicKey parameter to the public key generated in step 3. The cycleDownloadUri and cycleLicenseSas parameters have been pre-configured, but if you procure a license then you need to update these two parameters as well. For now, I am leaving them as is.
6) Now log in with the Azure CLI, create a resource group and a storage account, deploy the VNET, and finally create the VMs. Note that storage account names may contain only lowercase letters and numbers.
C:\temp>az login
C:\temp>az group create --name "cycle-rg" --location "southeastasia"
C:\temp>az storage account create --name "mikkyccstorage" --resource-group "cycle-rg" --location "southeastasia" --sku "Standard_LRS"
C:\temp>az group deployment create --name "vnet_deployment" --resource-group "cycle-rg" --template-uri https://raw.githubusercontent.com/azurebigcompute/Labs/master/CycleCloud/deploy-vnet.json --parameters vnet-params.json
C:\temp>az group deployment create --name "vms_deployment" --resource-group "cycle-rg" --template-uri https://raw.githubusercontent.com/azurebigcompute/Labs/master/CycleCloud/deploy-vms.json --parameters vms-params.json
7) After the deployment, you will find a set of resources created in the resource group (“cycle-rg” here). Select the Cycleserver VM and copy its IP address to check that you can browse the CycleCloud setup page.
8) Please note that the installation uses a self-signed SSL certificate, which may trigger a warning in your browser. Here it is safe to ignore the warning and add an exception to reach the page. Then configure the server (refer to the “Configure CycleCloud Server” section on this page). If you see the CycleCloud page after completing the setup, we are ready to create a new cluster and submit jobs.
9) Follow section 5.1, “Creating a Grid Engine Cluster”, from here as is.
10) After the cluster is created, we need to start it and check that it shows as running.
11) Now our Grid Engine cluster is ready for job submission. For security reasons, the CycleCloud VM (CycleServer) is behind a jump box/bastion host. To access CycleServer, we must first log on to the jump box and then ssh to the CycleServer instance. To do this, we add a second host to jump through to the ssh command.
From the Azure portal, retrieve the admin box DNS name and construct the SSH command as shown. The idea is to “ssh -J” to our CycleServer through the CycleAdmin box; for security, one cannot ssh into CycleServer directly.
$ ssh -J cycleadmin@{JUMPBOX PUBLIC HOSTNAME} cycleadmin@cycleserver -i {SSH PRIVATE KEY}
12) Once we are logged in as cycleadmin@cycleserver, first switch to the root user and run the CycleCloud initialize command. You need to enter the username and password for that machine.
13) Connect to the Grid Engine master:
[root@cycleserver ~]$ cyclecloud connect master -c <clustername>
14) Now we are ready to submit our first job. qstat queries the status of Grid Engine jobs and queues, and qsub submits batch jobs.
15) On successful submission, we should see the job start executing on our nodes.
The master takes the batch job, which is executed on 3 nodes spun up from the execute node template.
By the way, if we log in to the Azure portal and navigate to the resource group, we will see that a VMSS was created for the execute worker nodes.
We can also enable autoscaling in the CycleCloud cluster settings, so that Azure VMs are created on demand and go away once the job is completed. The command below submits 100 tasks, so it will request 100 cores. Based on the cluster core limit, CycleCloud decides whether to scale that far. For example, if we have set 100 cores as the cluster scale limit, we will see many more VMs being created to complete the tasks in parallel.
[cyclecloud@ip-0A000404 ~]$ qsub -t 1:100 -V -b y -cwd hostname
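To make the scaling arithmetic above a bit more concrete, here is a rough shell sketch. It is only an illustration: it assumes one core per task and identical execute VMs, and the function name vms_needed and the 4-core VM size are my own assumptions, not something CycleCloud provides.

```shell
#!/usr/bin/env bash
# Rough estimate of how many execute VMs an autoscaling run would need.
# Assumptions for this sketch (not taken from CycleCloud): each task uses
# one core, and every execute VM has the same number of cores.

vms_needed() {
  tasks=$1; core_limit=$2; cores_per_vm=$3
  # The cores requested can never exceed the cluster core limit.
  cores=$tasks
  if [ "$cores" -gt "$core_limit" ]; then
    cores=$core_limit
  fi
  # Round up to whole VMs.
  echo $(( (cores + cores_per_vm - 1) / cores_per_vm ))
}

vms_needed 100 100 4   # core limit of 100: prints 25
vms_needed 100 32 4    # core limit of 32: prints 8
```

With a 100-core limit the 100 tasks can all run at once on 25 such VMs; with a 32-core limit only 8 VMs are created and the remaining tasks wait in the queue.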
Once the job is completed, we can terminate the cluster and, as our last step, delete the resource group if we don't want to retain it. I know it is a lot to learn and a bit confusing the first time, but once you get hands-on, it is easy to set up whenever you need it and to dispose of after your jobs complete.
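For the clean-up step, deleting the resource group from the Azure CLI could look roughly like the sketch below. The resource group name is the one used earlier in this post; the DRY_RUN switch and the function name are assumptions I added so that nothing is deleted by accident. By default the script only prints the command.

```shell
#!/usr/bin/env bash
# Delete the lab resource group after the cluster has been terminated.
# RESOURCE_GROUP matches the name used earlier in this post.
# DRY_RUN is a hypothetical safety switch for this sketch: with the
# default of 1 the command is only printed, with DRY_RUN=0 it is run.
RESOURCE_GROUP="${RESOURCE_GROUP:-cycle-rg}"
DRY_RUN="${DRY_RUN:-1}"

delete_lab_group() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "az group delete --name $RESOURCE_GROUP --yes --no-wait"
  else
    # --yes skips the confirmation prompt; --no-wait returns immediately
    # while Azure removes the resources in the background.
    az group delete --name "$RESOURCE_GROUP" --yes --no-wait
  fi
}

delete_lab_group
```

Run it once as is to check the printed command, then again with DRY_RUN=0 to actually delete the group and everything in it.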
Happy learning!