Does AKS have a retry mechanism if a node exceeds drain timeout during node image upgrade?

Question

AKS supports configuring a drain timeout. Per the documentation, the default value is 30 minutes. As I understand it, this means that if a long-running pod doesn't terminate within 30 minutes, AKS does not retry; the node upgrade simply fails, and with it the node pool and cluster upgrade.

If a node upgrade fails, does AKS still continue with the other nodes?
Since the documentation doesn't mention a retry count anywhere, I'd like to know whether AKS supports retries and, if so, what the count is.

Accepted Answer

Hi Srinath NS,
Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

Based on your question: when AKS upgrades a node, it allows a default of 30 minutes to drain the node, that is, to safely move the running pods off it, before upgrading it. If a pod takes longer than 30 minutes to terminate, the upgrade stops for that specific node. In that case, you need to increase the drain timeout so that pods get more time to finish before the upgrade moves ahead. This has to be set manually with the commands below. Please also refer to this document: Set node drain timeout value

# Set drain timeout (in minutes) for a new node pool
az aks nodepool add --name mynodepool --resource-group MyResourceGroup --cluster-name MyManagedCluster --drain-timeout 100
# Update drain timeout (in minutes) for an existing node pool
az aks nodepool update --name mynodepool --resource-group MyResourceGroup --cluster-name MyManagedCluster --drain-timeout 45
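
After changing the timeout, it can help to confirm the value the node pool actually carries and then re-run the node image upgrade. The following is a minimal sketch, not an official procedure. The function names `show_drain_timeout` and `rerun_node_image_upgrade` are hypothetical helpers, the resource names are the same placeholders used above, and the `upgradeSettings.drainTimeoutInMinutes` query path is an assumption about the shape of the `az aks nodepool show` output that you should verify against your CLI version.

```shell
# Hypothetical helpers; nothing runs until you call one of them.

# Print the drain timeout (in minutes) currently stored on a node pool.
# Assumption: the value is exposed as upgradeSettings.drainTimeoutInMinutes.
show_drain_timeout() {
  local rg="$1" cluster="$2" pool="$3"
  az aks nodepool show \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --name "$pool" \
    --query "upgradeSettings.drainTimeoutInMinutes" \
    --output tsv
}

# Re-run a node image upgrade with an explicit drain timeout (in minutes).
rerun_node_image_upgrade() {
  local rg="$1" cluster="$2" pool="$3" timeout_minutes="$4"
  az aks nodepool upgrade \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --name "$pool" \
    --node-image-only \
    --drain-timeout "$timeout_minutes"
}
```

For example, `show_drain_timeout MyResourceGroup MyManagedCluster mynodepool` followed by `rerun_node_image_upgrade MyResourceGroup MyManagedCluster mynodepool 45`.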

If a node upgrade fails for some reason, AKS will pause the entire process and won't continue upgrading other nodes until the problem is fixed. Separately, AKS has a feature called node auto-repair that detects unhealthy nodes and attempts to fix them without manual involvement; please check the linked document for how it applies to your cluster.
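
Since this answer does not give a built-in retry count, one option is to retry on the client side. The sketch below is a generic wrapper, not an AKS feature: `retry` is a hypothetical function name, and the attempt count and delay are arbitrary values you would choose yourself. It re-runs a command until it succeeds or the attempts are used up.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
# Hypothetical client-side helper; AKS itself is not involved in the retry.
retry() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempt(s)" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
}

# Example (not executed here): up to 3 attempts, 5 minutes apart.
# retry 3 300 az aks nodepool upgrade --resource-group MyResourceGroup \
#   --cluster-name MyManagedCluster --name mynodepool --node-image-only
```

Retrying blindly only helps if the blocking pod eventually terminates; if the pod always needs more time than the drain timeout allows, raising the timeout is the more direct fix.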

For more information, please refer to this document: Azure Kubernetes Service (AKS) node auto-repair

I hope this clarifies the topic.

If you found this information helpful, please accept the answer and upvote it for other community members' reference.
