Troubleshoot Node Not Ready failures that are followed by recoveries
This article provides a guide to troubleshoot and resolve Node Not Ready" issues in Azure Kubernetes Service (AKS) clusters. When a node enters a "NotReady" state, it can disrupt the application's functionality and cause it to stop responding. Typically, the node recovers automatically after a short period. However, to prevent recurring issues and maintain a stable environment, it's important to understand the underlying causes to be able to implement effective resolutions.
Cause
There are several scenarios that could cause a "NotReady" state to occur:
The unavailability of the API server. This causes the readiness probe to fail. This prevents the pod from being attached to the service so that traffic is no longer forwarded to the pod instance.
Virtual machine (VM) host faults. To determine whether VM host faults occurred, check the following information sources:
- AKS diagnostics
- Azure status
- Azure notifications (for any recent outages or maintenance periods)
Resolution
To resolve this issue, follow these steps:
- Run
kubectl describe node <node-name>
to review detail information about the node's status. Look for any error messages or warnings that might indicate the root cause of the issue. - Check the API server availability by running the
kubectl get apiservices
command. Make sure that the readiness probe is correctly configured in the deployment YAML file. - Verify the node's network configuration to make sure that there are no connectivity issues.
- Check the node's resource usage, such as CPU, memory, and disk, to identify potential constraints. For more informations see Monitor your Kubernetes cluster performance with Container insights
For further steps, see Basic troubleshooting of Node Not Ready failures.
Prevention
To prevent this issue from occurring in the future, take one or more of the following actions:
- Make sure that your service tier is fully paid for.
- Reduce the number of
watch
andget
requests to the API server.