AKS Cluster takes very long to scale up

Question

Hi,

I have an AKS cluster that takes too long to scale up.

I can see the scale-up activity on the node pools, but the new nodes can take more than 15 minutes to become available.

What could be the issue?

Answer

Thank you for reaching back again with further information. I have investigated further on this issue and hope the below findings will help you.

Scaling up isn't immediate. It may take some time before the newly created nodes appear in Kubernetes. How long depends almost entirely on the cloud provider and the speed of node provisioning, including the TLS bootstrapping process. Cluster Autoscaler expects requested nodes to appear within 15 minutes (configured by the --max-node-provision-time flag). After this time, if they are still unregistered, it stops considering them in simulations and may attempt to scale up a different group if the pods are still pending. It will also attempt to remove any nodes left unregistered after this time.

However, if you're consistently seeing scale-up times of over 15 minutes, there may be an issue that needs to be addressed. Note that the cluster autoscaler doesn't take actual CPU/GPU/memory usage into account, only resource requests and limits. https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler?tabs=azure-cli#using-the-autoscaler-profile
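To illustrate the point about requests versus usage, here is a minimal Python sketch. It is not the Cluster Autoscaler's actual code. The `Node` class, the `fits_somewhere` helper, and all the numbers are hypothetical, and only CPU requests are modelled for brevity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_cpu_m: int  # millicores the node can hand out to pods
    requested_cpu_m: int    # sum of CPU requests of pods already scheduled
    used_cpu_m: int         # actual usage; deliberately ignored below


def fits_somewhere(pod_request_cpu_m: int, nodes: list[Node]) -> bool:
    """Return True if any node has enough unrequested CPU for the pod."""
    return any(
        n.allocatable_cpu_m - n.requested_cpu_m >= pod_request_cpu_m
        for n in nodes
    )


# Hypothetical node: almost idle in practice, but almost fully requested.
nodes = [Node(allocatable_cpu_m=4000, requested_cpu_m=3800, used_cpu_m=500)]

# A pending pod requesting 500m does not fit, so a scale-up would be needed
# even though the node is only using 500m of its 4000m.
needs_scale_up = not fits_somewhere(500, nodes)
print(needs_scale_up)
```

In this sketch a mostly idle node still triggers a scale-up, because the decision looks only at what pods have requested. Oversized requests can therefore cause scale-ups you don't expect.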

Have you examined your quotas? Are they high enough?

https://learn.microsoft.com/en-us/azure/aks/quotas-skus-regions#service-quotas-and-limits

If you use Azure CNI, do you have enough IPs available in your VNET? https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl
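As a rough way to check the IP question, here is a hedged Python sketch. It assumes the traditional Azure CNI mode, where each node takes one IP from the subnet plus one pre-allocated IP per potential pod, and that Azure reserves 5 addresses in every subnet. With Azure CNI Overlay, pods draw from a separate CIDR, so this calculation would not apply as written. The subnet, node count, and max-pods values are made-up examples, and extra nodes created during upgrades are not counted.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # addresses Azure reserves in every subnet


def usable_ips(subnet_cidr: str) -> int:
    """Addresses in the subnet that can be assigned to nodes and pods."""
    return ipaddress.ip_network(subnet_cidr).num_addresses - AZURE_RESERVED_IPS


def ips_needed(node_count: int, max_pods_per_node: int) -> int:
    """One IP per node plus one pre-allocated IP per potential pod."""
    return node_count * (1 + max_pods_per_node)


def max_nodes(subnet_cidr: str, max_pods_per_node: int) -> int:
    """Largest node count the subnet can hold under the assumptions above."""
    return usable_ips(subnet_cidr) // (1 + max_pods_per_node)


# Example values (assumptions): a /24 node subnet and 30 pods per node.
subnet = "10.240.0.0/24"
print(usable_ips(subnet))     # 251
print(ips_needed(10, 30))     # 310
print(max_nodes(subnet, 30))  # 8
```

Under these assumptions a /24 subnet cannot hold 10 nodes, so a scale-up beyond 8 nodes would stall for lack of addresses.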

As a workaround, consider using virtual nodes to handle bursts better. To rapidly scale application workloads in an AKS cluster, you can use virtual nodes. With virtual nodes, pods are provisioned quickly and you pay per second only for their execution time. You don't need to wait for the Kubernetes cluster autoscaler to deploy VM compute nodes to run more pods. Virtual nodes are only supported with Linux pods and nodes.

Another workaround is to keep some capacity idle so that new pods don't have to wait for a scale-up. You can overprovision by using "pause pods" with low priority to "reserve" space for pods of higher priority, but this requires a fair amount of configuration. And again, the inputs to the scaling algorithm here are resource requests, not actual utilization.
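For the "pause pods" approach, the following Python sketch builds the two objects usually involved, a low-priority PriorityClass and a Deployment of placeholder pods, and prints them as JSON that `kubectl apply -f` accepts. The `overprovisioning` name, the priority value of -10, the `registry.k8s.io/pause:3.9` image tag, and the replica and resource figures are all assumptions for illustration. Check them against your cluster's autoscaler settings before using anything like this.

```python
import json


def overprovisioning_manifests(replicas: int, cpu: str, memory: str) -> dict:
    """Build a PriorityClass and a Deployment of placeholder pause pods."""
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},  # hypothetical name
        "value": -10,  # assumed value; lower than normal workloads
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "overprovisioning"}},
            "template": {
                "metadata": {"labels": {"app": "overprovisioning"}},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",  # assumed tag
                        # Requests, not usage, are what reserve the space.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }
    # A v1 List lets both objects be applied from a single file.
    return {"apiVersion": "v1", "kind": "List", "items": [priority_class, deployment]}


manifest_json = json.dumps(overprovisioning_manifests(2, "500m", "512Mi"), indent=2)
print(manifest_json)
```

The total of the placeholder requests (here 2 x 500m CPU) is the headroom kept free. When a higher-priority pod needs the space, the scheduler evicts a placeholder, which then becomes pending and prompts the autoscaler to add a node in the background.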

If you're still having trouble identifying the root cause and you have a support plan, I suggest filing a support ticket for deeper investigation.

If you have any further queries, do let us know.

If the comment is helpful, please click upvote on this post.
