Issue with Autoscaling Nodes in Azure CycleCloud SLURM Cluster

Parham Pourbozorgi 20 Reputation points
2024-11-05T09:45:21.0033333+00:00

I have created a SLURM cluster on Azure CycleCloud and enabled autoscaling with a maximum of 20 nodes for HPC. I have verified that there is enough quota for at least 10 HPC and 2 HTC nodes. However, upon booting the cluster, only 5 nodes are available in HPC, and the number does not increase beyond that.

I have tried several measures, including updating the configuration file to adjust the number of HPC nodes, using scontrol reconfigure, and running azslurm scale with root privileges, but none have been successful in increasing the total number of nodes.

If anyone has insights or solutions to resolve this issue, it would be appreciated. Let me know if you need more info on this. Here is the output of the sinfo command for reference:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dynamic      up   infinite      0    n/a
hpc*         up   infinite      5  idle~ ccw-hpc-[1-5]
htc          up   infinite      2  idle~ ccw-htc-[1-2]
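
A note on reading this output: in Slurm, a `~` suffix on a node state (as in `idle~`) marks nodes that are powered down, so the 5 HPC nodes listed are nodes Slurm knows about, not running VMs. The following is a minimal sketch, not part of any CycleCloud or Slurm tooling, that tallies nodes per partition from the default `sinfo` column layout shown above. The function name `summarize_sinfo` is hypothetical.

```python
# Sample text copied from the sinfo output in the question.
SINFO_OUTPUT = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dynamic      up   infinite      0    n/a
hpc*         up   infinite      5  idle~ ccw-hpc-[1-5]
htc          up   infinite      2  idle~ ccw-htc-[1-2]
"""


def summarize_sinfo(text):
    """Count total and powered-down nodes per partition.

    Assumes the default sinfo columns:
    PARTITION AVAIL TIMELIMIT NODES STATE [NODELIST]
    """
    summary = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # not a data row
        partition = fields[0].rstrip("*")  # "*" marks the default partition
        nodes = int(fields[3])
        state = fields[4]
        entry = summary.setdefault(partition, {"total": 0, "powered_down": 0})
        entry["total"] += nodes
        if state.endswith("~"):  # "~" suffix: node is powered down
            entry["powered_down"] += nodes
    return summary


if __name__ == "__main__":
    for name, counts in summarize_sinfo(SINFO_OUTPUT).items():
        print(name, counts)
```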

User's image

Azure CycleCloud
A Microsoft tool for creating, managing, operating, and optimizing high-performance computing (HPC) and big compute clusters in Azure.

Accepted answer
  1. Prrudram-MSFT 27,251 Reputation points
    2024-11-10T17:02:39.6833333+00:00

    Hello @Parham Pourbozorgi

    Thanks for sharing the details. It appears that you’re using a customized cluster creation UI form (the default Slurm cluster creation form expresses limits in cores rather than “Nodes”).

    Since the GUI comes from a custom cluster template, it’s hard to know for sure what it is showing, but I’d guess that “Max HPC Nodes” maps to the MaxCount cluster parameter for the HPC nodearray. In CycleCloud, both MaxCount and MaxCoreCount are user-imposed limits on the autoscaler and do not reflect your Azure quota (so your quota could be lower than the limit set in Max HPC Nodes).

    An easy way to check the remaining available quota for your cluster is to go to the Clusters page and click “Actions -> Add” in the nodes table. That dialog shows the remaining cores available for use by the cluster.

    Here’s an example from one of my CycleCloud installs. It shows that I can create 39 more F2s_v2 VMs based on my current available Regional Quota of 100 cores (I must be using the other 22 cores for some other cluster). Note that three limits constrain the number of nodes I can create: two quotas (Regional and per-Family) plus my current MaxCoreCount limit.

    The number shown in that dialog is what determines how far the Slurm cluster can scale.
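
To make the arithmetic concrete, here is a rough sketch of how these limits combine. It is a simplified model under stated assumptions, not CycleCloud's actual autoscaler logic: the function name `max_additional_nodes` and its parameters are hypothetical, and the per-Family quota value in the usage example is made up for illustration. The tightest core limit is divided by the VM's core count (F2s_v2 has 2 vCPUs), then capped by any remaining MaxCount.

```python
def max_additional_nodes(
    cores_per_vm,
    regional_cores_available,
    family_cores_available,
    max_core_count_remaining=None,
    max_count_remaining=None,
):
    """Estimate how many more VMs of one size can be added.

    All arguments are remaining (unused) amounts. The two Azure quotas
    and the optional MaxCoreCount limit are measured in cores; the
    optional MaxCount limit is measured in nodes.
    """
    core_limits = [regional_cores_available, family_cores_available]
    if max_core_count_remaining is not None:
        core_limits.append(max_core_count_remaining)

    # The tightest core-based limit decides how many whole VMs fit.
    nodes = min(core_limits) // cores_per_vm

    # A node-based limit, if set, caps the result further.
    if max_count_remaining is not None:
        nodes = min(nodes, max_count_remaining)

    return max(nodes, 0)


if __name__ == "__main__":
    # Regional quota of 100 cores with 22 in use leaves 78 cores.
    # The per-Family value of 200 is a hypothetical placeholder.
    print(max_additional_nodes(2, 100 - 22, 200))
```

With 78 regional cores left and 2 cores per VM, the sketch gives 39 nodes, matching the example above. If the result for your HPC VM size comes out at 5, then a quota or a MaxCoreCount setting, rather than Max HPC Nodes, is likely the binding limit.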

    Can you check what your CycleCloud shows here?

    User's image

    Tag me in your comments with the required details please.

