Troubleshoot UpgradeFailed errors due to eviction failures caused by PDBs
This article discusses how to identify and resolve UpgradeFailed errors due to eviction failures caused by Pod Disruption Budgets (PDBs) that occur when you try to upgrade an Azure Kubernetes Service (AKS) cluster.
Prerequisites
This article requires Azure CLI version 2.67.0 or a later version. To find the version number, run az --version. If you have to install or upgrade Azure CLI, see How to install the Azure CLI.
For more detailed information about the upgrade process, see the "Upgrade an AKS cluster" section in Upgrade an Azure Kubernetes Service (AKS) cluster.
Symptoms
An AKS cluster upgrade operation fails with one of the following error messages:
- (UpgradeFailed) Drain node aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx failed when evicting pod <pod-name> failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: <namespace>/<pod-name> blocked by pdb <pdb-name> with 0 unready pods.
- Code: UpgradeFailed
  Message: Drain node aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx failed when evicting pod <pod-name> failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: <namespace>/<pod-name> blocked by pdb <pdb-name> with 0 unready pods.
Cause
This error might occur if a pod is protected by a Pod Disruption Budget (PDB) policy. In this situation, the pod can't be evicted during the node drain. After several attempts, the upgrade operation fails, and the cluster or node pool enters a Failed state.
Check the ALLOWED DISRUPTIONS value in the PDB configuration. The value should be 1 or greater. For more information, see Plan for availability using pod disruption budgets. For example, you can check the workload and its PDB as follows. Notice that the ALLOWED DISRUPTIONS column doesn't allow any disruption. If the ALLOWED DISRUPTIONS value is 0, the pods can't be evicted and the node drain fails during the upgrade process:
$ kubectl get deployments.apps nginx
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   2/2     2            2           62s
$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7854ff8877-gbr4m   1/1     Running   0          68s
nginx-7854ff8877-gnltd   1/1     Running   0          68s
$ kubectl get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   2               N/A               0                     24s
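In this example, the ALLOWED DISRUPTIONS value is 0 because the PDB requires at least two available pods (minAvailable: 2) while only two replicas are running. You can confirm the PDB settings by inspecting its spec. The following is a sketch based on the nginx example above; the app: nginx selector label is an assumption, and the output is trimmed:
$ kubectl get pdb nginx-pdb -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx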
You can also check for related entries in the Kubernetes events by using the command kubectl get events | grep -i drain. An output similar to the following shows the message "Eviction blocked by Too Many Requests (usually a pdb)":
$ kubectl get events | grep -i drain
LAST SEEN TYPE REASON OBJECT MESSAGE
(...)
32m Normal Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Draining node: aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx
2m57s Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
12m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
31m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
To resolve this issue, use one of the following solutions.
Solution 1: Enable pods to drain
Adjust the PDB to enable pod draining. Generally, the allowed disruptions value is determined by the minAvailable / maxUnavailable setting of the PDB together with the number of running pods relative to the replica count. You can either modify the minAvailable / maxUnavailable setting at the PDB level or increase the number of running replicas to push the ALLOWED DISRUPTIONS value to 1 or greater (a minimal sketch is shown after the upgrade command below).
Then, try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation:
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
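For reference, here's a minimal sketch of the two possible adjustments described in the first step, using the nginx example from earlier. The names and values are illustrative; choose the approach that fits your availability requirements:
# Option A: relax the PDB so that one pod can be evicted
$ kubectl patch pdb nginx-pdb --type merge -p '{"spec":{"minAvailable":1}}'
# Option B: add a replica so that minAvailable: 2 still leaves room for one disruption
$ kubectl scale deployment nginx --replicas=3
# Verify that ALLOWED DISRUPTIONS is now 1 or greater
$ kubectl get pdb nginx-pdb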
Solution 2: Back up, delete, and redeploy the PDB
Take a backup of the PDB(s) by using the command kubectl get pdb <pdb-name> -n <pdb-namespace> -o yaml > pdb-name-backup.yaml, and then delete the PDB by using the command kubectl delete pdb <pdb-name> -n <pdb-namespace>. After the new upgrade attempt is finished, you can redeploy the PDB by applying the backup file with the command kubectl apply -f pdb-name-backup.yaml.
Then, try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation:
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
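Putting these steps together, the sequence might look like the following sketch. It assumes the nginx-pdb example in the default namespace; substitute your own PDB name and namespace:
# Back up the PDB, then delete it so that the node drain can proceed
$ kubectl get pdb nginx-pdb -n default -o yaml > pdb-name-backup.yaml
$ kubectl delete pdb nginx-pdb -n default
# Rerun the upgrade (reconciliation), then restore the PDB from the backup
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
$ kubectl apply -f pdb-name-backup.yaml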
Solution 3: Delete the pods that can't be drained or scale the workload down to zero (0)
Delete the pods that can't be drained.
Note
If the pods are created by a Deployment or a StatefulSet, they're managed by a controller (a ReplicaSet for a Deployment, or the StatefulSet itself), so deleted pods are recreated automatically. In that case, you might have to delete the workload or scale its replicas to zero (0) at the Deployment or StatefulSet level. Before you do that, we recommend that you make a backup: kubectl get <deployment.apps -or- statefulset.apps> <name> -n <namespace> -o yaml > backup.yaml.
To scale down, you can run kubectl scale --replicas=0 <deployment.apps -or- statefulset.apps> <name> -n <namespace> before the reconciliation.
Then, try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation:
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
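As a concrete sketch of this solution for a Deployment named nginx with two replicas (taken from the earlier example; use statefulset.apps instead if the pods belong to a StatefulSet):
# Back up the workload, then scale it to zero so that no pods block the drain
$ kubectl get deployment.apps nginx -n default -o yaml > backup.yaml
$ kubectl scale --replicas=0 deployment.apps nginx -n default
# Rerun the upgrade (reconciliation), then scale the workload back to its original replica count
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
$ kubectl scale --replicas=2 deployment.apps nginx -n default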
Contact us for help
If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.