Problem getting GPU solving to work with our Azure CycleCloud / Slurm HPC cluster system

Gary Mansell 131 Reputation points
2023-12-20T10:30:36.0433333+00:00

I am using the Azure CycleCloud 8.4 Marketplace image and it is fully updated, along with Slurm version 22.05.8-1.

I have configured a GPU-enabled Slurm partition consisting of some NC24s_v3 VMs (each of which has 4x NVIDIA Tesla V100 GPUs), but the Slurm scheduler is showing the partition as "Invalid" when the compute nodes are created at job run time:

[Screenshot: Slurm showing the GPU partition as "Invalid"]

The Azure Slurm configuration seems to detect and configure the GPU-enabled VMs:

[Screenshot: Azure Slurm configuration output showing the GPU-enabled VMs]

But the Slurm node is showing as invalid because the GPU count reported (0) is lower than the count configured (4):

[Screenshot: Slurm reporting the node invalid because the reported GPU count is lower than configured]
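For reference, the reason Slurm records for the invalid node can be checked with the standard node-state commands (a generic sketch; the node name is just a placeholder for one of the GPU nodes, and the log path may differ on your install):

    sinfo -R                        # list down/drained/invalid nodes with the reason Slurm recorded
    scontrol show node hpc-gpu-1    # placeholder node name; check the Reason= and Gres= fields
    grep -i gres /var/log/slurmd/slurmd.log   # on the node itself; adjust the log path for your install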

The GPU nodes are built from the Azure HPC AlmaLinux8 image, and when I log in to them, the NVIDIA driver is installed and reporting the 4x GPUs:

[Screenshot: nvidia-smi output on a GPU node showing the four Tesla V100 GPUs]
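(The node-side check amounts to confirming that the driver and the /dev/nvidia* device files are present - a generic sketch:)

    nvidia-smi -L      # should list the four Tesla V100 devices
    ls /dev/nvidia*    # the device files that gres.conf normally points at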

The Azure cloud-init configuration for the VMs is as follows:

[Screenshot: cloud-init configuration used for the GPU VMs]

So I am a bit stuck - can anyone offer any help or suggestions to get the GPU Slurm partition to initialise correctly, please?

Azure CycleCloud

1 answer

  1. Gary Mansell 131 Reputation points
    2024-01-12T15:33:31.5166667+00:00

    Microsoft Support helped me get to a fix for this issue - it was NOT a problem with the AlmaLinux8 HPC image, as I had first surmised...

    There were two issues. Firstly, yum had updated my CycleCloud installation from version 8.4 to 8.5, which meant my Slurm 3.0.1 cluster templates were out of date with respect to the CycleCloud version. I first needed to download the cyclecloud-slurm 3.0.5 default template (https://github.com/Azure/cyclecloud-slurm/blob/master/templates/slurm.txt), merge it with my custom template, and then create a new 8.5 cluster using the correct Slurm template version.
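    For anyone following along, the rough shape of that step looks like this (a sketch; the cluster and file names are placeholders, and the exact import flags are whatever your CycleCloud CLI version documents):

    # Download the raw version of the default template linked above
    wget https://raw.githubusercontent.com/Azure/cyclecloud-slurm/master/templates/slurm.txt
    # ...manually merge the custom partition/nodearray settings into the downloaded template...
    cyclecloud import_cluster my-gpu-cluster -c Slurm -f slurm.txt -p params.json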

    This was the main cause of the problem: for some reason, only Ampere GPU jobs ran with the out-of-date configuration, which confused things (we don't really know why). This broken configuration also meant that no gres.conf (listing the GPUs available on each node) was present in /sched, nor linked into /etc/slurm.
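    For reference, a healthy gres.conf for a 4x V100 node looks roughly like this (a sketch; the node name range is a placeholder); in a CycleCloud cluster it should sit in /sched with a symlink from /etc/slurm:

    ls -l /etc/slurm/gres.conf    # should be a symlink into /sched
    cat /etc/slurm/gres.conf
    # Expected shape (four device files imply a GPU count of 4):
    #   NodeName=hpc-gpu-[1-2] Name=gpu File=/dev/nvidia[0-3]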

    Once the cluster template was fixed and the new cluster built, I just had to ensure that my Slurm job submission script included the following option so that GPUs were available to my job:

    ## Specify the number of GPUs for the task
    #SBATCH --gres=gpu:4
    

    Then the GPU nodes initialised correctly in Slurm and I could run jobs against the GPUs.
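    For completeness, a minimal job script using that option looks like this (a sketch; the partition name "gpu" is a placeholder for your GPU partition):

    #!/bin/bash
    ## Minimal GPU job sketch - request one node with all four GPUs
    #SBATCH --partition=gpu
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:4

    # Sanity check that the job can see the four V100s
    nvidia-smi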

    1 person found this answer helpful.
