Does AKS have a retry mechanism if a node exceeds drain timeout during node image upgrade?

Srinath NS 20 Reputation points Microsoft Employee
2024-10-17T20:19:32.3433333+00:00

AKS supports configuring a drain timeout. Per the documentation, the default value is 30 minutes. This means that if a long-running pod doesn't terminate within 30 minutes, AKS does not retry; the node upgrade simply fails, and with it the node pool and cluster upgrade.

  1. If a node upgrade fails, do we still continue to other nodes?
  2. Since the documentation doesn't mention a retry count anywhere, I'd like to know whether AKS supports retries and, if so, how many.
Azure Kubernetes Service (AKS)

Accepted answer
  1. Akshay kumar Mandha 2,665 Reputation points Microsoft Vendor
    2024-10-18T00:41:08.44+00:00

    Hi Srinath NS,
    Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

    When you upgrade AKS nodes, the system allows a default of 30 minutes to drain each node, that is, to safely evict its running pods before the node is upgraded. If a pod takes longer than 30 minutes to terminate, the upgrade stops for that node. In that case, increase the drain timeout so that pods get more time to complete before the upgrade moves ahead. The value has to be set manually, in minutes, with the commands below. For details, see the documentation: Set node drain timeout value.

    # Set drain timeout (in minutes) for a new node pool
    az aks nodepool add --name mynodepool --resource-group MyResourceGroup --cluster-name MyManagedCluster --drain-timeout 100
    # Update drain timeout (in minutes) for an existing node pool
    az aks nodepool update --name mynodepool --resource-group MyResourceGroup --cluster-name MyManagedCluster --drain-timeout 45
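
    To confirm which drain timeout a node pool is actually using, something like the sketch below may help. The function name get_drain_timeout is hypothetical, and the query path upgradeSettings.drainTimeoutInMinutes is an assumption about the JSON that az aks nodepool show returns; adjust it if your CLI version reports the setting differently.

    ```shell
    # Hypothetical helper: print the drain timeout (in minutes) configured on a node pool.
    # Assumption: the setting is exposed as upgradeSettings.drainTimeoutInMinutes.
    get_drain_timeout() {
        # $1 = resource group, $2 = cluster name, $3 = node pool name
        az aks nodepool show \
            --resource-group "$1" \
            --cluster-name "$2" \
            --name "$3" \
            --query "upgradeSettings.drainTimeoutInMinutes" \
            --output tsv
    }
    ```

    For example, get_drain_timeout MyResourceGroup MyManagedCluster mynodepool should print the configured number of minutes. An empty result most likely means no explicit value has been set on that pool.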
    
    

    If a node upgrade fails for any reason, AKS stops the upgrade and does not continue to the remaining nodes until the problem is fixed. Separately, AKS has a node auto-repair feature that monitors node health and attempts to repair unhealthy nodes without manual intervention.
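
    Because the upgrade stops on failure, any retry has to be driven from the client side. The sketch below is one way to do that. It is not an AKS feature, and the retry helper, the attempt count, and the delay are illustrative assumptions. Fixing whatever blocks the drain (for example, a restrictive PodDisruptionBudget or a long-running pod) matters more than retrying blindly.

    ```shell
    # Hypothetical helper: run a command up to max_attempts times, pausing between tries.
    # This is a client-side sketch, not something AKS does for you.
    retry() {
        max_attempts=$1
        delay_seconds=$2
        shift 2
        attempt=1
        while true; do
            # Run the command; stop as soon as it succeeds.
            "$@" && return 0
            if [ "$attempt" -ge "$max_attempts" ]; then
                echo "giving up after $attempt attempt(s)" >&2
                return 1
            fi
            echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
            attempt=$((attempt + 1))
            sleep "$delay_seconds"
        done
    }

    # Example usage (placeholder names; 3 attempts and a 300 s pause are arbitrary):
    # retry 3 300 az aks nodepool upgrade --resource-group MyResourceGroup \
    #     --cluster-name MyManagedCluster --name mynodepool --node-image-only
    ```

    Whether re-running the upgrade command picks up where the failed attempt left off depends on the state of the node pool, so check its provisioning state before and after each attempt.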

    For more information, see the documentation: Azure Kubernetes Service (AKS) node auto-repair

    I hope this clarifies the topic.

    If you found this information helpful, please accept the answer and upvote my post so that other community members can benefit from it.

    1 person found this answer helpful.
