Proactive monitoring best practices for Azure Kubernetes Service (AKS)

11/10/2024

This article covers best practices for proactive monitoring on Azure Kubernetes Service (AKS) and lists the key signals that AKS recommends you monitor.

Proactively monitoring your AKS clusters is crucial for reducing downtime and preventing business interruptions for your applications. This process involves identifying and monitoring key indicators of abnormal behavior in your cluster that might lead to major issues or downtime.

Monitoring and alerting overview

Monitoring on AKS involves using metrics, logs, and events to ensure the health and performance of your cluster. Common scenarios to monitor include node performance, pod status, and overall resource utilization in your cluster. Logs provide insights into system events and cluster operations and activity. For more information about the methods and signals AKS provides for monitoring, see Monitor Azure Kubernetes Service (AKS).

The best way to proactively monitor your cluster is to configure Azure Monitor alerts. Alerts notify you of potential issues or anomalies before they escalate into critical problems. When you define thresholds for key metrics and logs, you receive an alert as soon as a signal crosses its limit, which can indicate problems such as resource exhaustion or application failures. We highly recommend defining service-level objectives (SLOs) for your application to measure the performance and reliability of your service. Configuring alerts on the key signals for your SLOs lets you quickly detect any degradation in the quality of service your customers receive. Timely alerts let you investigate and remediate problems quickly, minimizing downtime and helping keep the applications on your AKS cluster highly available.
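
As a rough illustration of alerting on an SLO signal, the following Python sketch compares a measured success ratio against an SLO target. The names Slo, success_ratio, and slo_breached, the 99.9% target, and the request counts are all hypothetical and are not part of any Azure SDK. In practice, Azure Monitor alert rules perform this kind of evaluation for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability SLO, for example 99.9% of requests succeed."""
    name: str
    target_ratio: float  # required fraction of successful requests, 0..1


def success_ratio(successful: int, total: int) -> float:
    """Return the fraction of successful requests; treat no traffic as healthy."""
    if total == 0:
        return 1.0
    return successful / total


def slo_breached(slo: Slo, successful: int, total: int) -> bool:
    """Return True when the measured success ratio is below the SLO target."""
    return success_ratio(successful, total) < slo.target_ratio


# Illustrative values only (assumptions, not measurements).
checkout_slo = Slo(name="checkout-availability", target_ratio=0.999)
healthy = slo_breached(checkout_slo, successful=99_950, total=100_000)
degraded = slo_breached(checkout_slo, successful=99_800, total=100_000)
print(healthy, degraded)
```

With these sample counts, the first window stays within the target (0.9995) and the second falls below it (0.998), so only the second would warrant an alert.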

How to configure alerts on specific metric types

Metric type	Where to find these metrics	How to configure alerts
AKS Platform Metric	View platform metrics through the Metrics blade in the Azure portal.	You can create, update, and delete metric alerts through the Azure portal. For more information, see Create a metric alert for an Azure resource.
Azure Managed Prometheus Metric	To access Prometheus metrics, you need to enable Managed Prometheus. For details on how to enable and view Prometheus metrics, see Azure Monitor and Prometheus.	For guidance on configuring Prometheus alerts, see Azure Monitor managed service for Prometheus rule groups.
Azure Activity Logs	View activity logs through the Azure portal. For more information, see Azure activity logs for AKS.	Configure alerts on activity logs through the Azure portal. For more information, see Activity log alerts.
Azure Virtual Machine Scale Set Metric	View the Virtual Machine Scale Set metrics through the Azure portal.	1. To find the Virtual Machine Scale Set instance associated with your node pool, navigate to the Settings > Properties blade for your AKS cluster in the Azure portal. 2. Select your infrastructure resource group to view the infrastructure resources associated with your cluster. 3. Select the Virtual Machine Scale Set instance that matches the name of the node pool you're creating alerts for. 4. Navigate to the Alerts blade to create your metric alert.
Load Balancer Metric	View load balancer metrics through the Load Balancer page in Azure portal.	1. To find the load balancer instance associated with your node pool, navigate to the Settings > Properties blade for your AKS cluster in the Azure portal. 2. Select your infrastructure resource group to view the infrastructure resources associated with your cluster. 3. Select the load balancer instance to bring up the Azure portal page for load balancer. 4. Navigate to the Alerts page to create your load balancer metric alert.
Logs and Events	To alert on logs and events, you need to enable Container Insights. For more information, see Azure Monitor resource logs.	For guidance on creating alerts on logs and events, see Create log search alerts from Container insights.

Critical signals for configuring alerts

To get holistic coverage of your AKS environment, you need to configure alerts on the three main components of your cluster:

Cluster infrastructure: Alerts targeting the underlying infrastructure of your cluster such as nodes, disks, and networking.
Application health: Alerts for monitoring the health of your pods and applications. Common indicators of unhealthy applications include out-of-memory kills (OOMKills) of your pods and pods in a not-ready state.
Kubernetes control plane: Alerts on the AKS control plane to monitor the health and performance of the API server, etcd, and other components.

The following sections contain the key signals that we recommend all AKS customers monitor closely. The AKS team is working to add all critical signals to the existing Recommended Alerts feature, which lets you enable alerts for all signals with a one-click experience. The Prometheus metrics alerts are available in Public Preview today, and the remaining alerts are estimated to be available in early 2025. For now, you can manually configure alerts on the critical signals.

Cluster infrastructure alerts

Alert scenario	Source	Signal	Recommended threshold
Cluster is in a failed state	Azure Activity Logs	Create or update managed cluster	Status of the log is Failed, indicating that the cluster upgrade or creation action failed.
Node pool is in a failed state	Azure Activity Logs	Create or update agent pool	Status of the log is Failed, indicating that the node pool is in a Failed state due to a failed Create, Read, Update, or Delete (CRUD) operation.
High Node OS Disk Bandwidth Usage	Virtual Machine Scale Set Metric	OS Disk Bandwidth Consumed Percentage	Node OS disk bandwidth utilization is above 95%.
High Node OS Disk IOPS Usage	Virtual Machine Scale Set Metric	OS Disk IOPS Consumed Percentage	Node OS disk IOPS utilization is above 95%.
High Node OS Disk Space Usage	AKS Platform Metric	Disk Used Percentage	Node OS disk space percentage utilization is above 90%.
High Node CPU Usage	AKS Platform Metric	CPU Usage Percentage	Node CPU Usage is greater than 90%.
High Node Memory Usage	AKS Platform Metric	Memory Working Set Percentage	Node Memory Usage is greater than 90%.
Node is in NotReady state	AKS Platform Metric	Status for various node conditions	Node is in NotReady state for more than 20 minutes.
SNAT port exhaustion	Load Balancer (LB) Metric	SNAT Connection Count	Filter for Connection State = "Failed"
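
The percentage and duration thresholds in the table above can be sketched in Python as follows. This is only an illustration of the comparison logic. The dictionary keys and function names are hypothetical short names, not Azure metric identifiers, and the sample values are made up.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Percentage thresholds copied from the table above. The keys are
# hypothetical short names, not Azure metric identifiers.
THRESHOLDS_PERCENT: Dict[str, float] = {
    "os_disk_bandwidth": 95.0,
    "os_disk_iops": 95.0,
    "os_disk_space": 90.0,
    "cpu": 90.0,
    "memory_working_set": 90.0,
}

# A node must stay NotReady longer than this before an alert fires.
NOT_READY_GRACE = timedelta(minutes=20)


def breached_signals(samples: Dict[str, float]) -> List[str]:
    """Return the sorted names of signals whose value is above its threshold."""
    return sorted(
        name
        for name, value in samples.items()
        if name in THRESHOLDS_PERCENT and value > THRESHOLDS_PERCENT[name]
    )


def not_ready_too_long(not_ready_since: Optional[datetime], now: datetime) -> bool:
    """Return True when a node has been NotReady for more than 20 minutes."""
    if not_ready_since is None:  # the node is currently Ready
        return False
    return now - not_ready_since > NOT_READY_GRACE


# Illustrative sample for a single node (made-up values).
sample = {"cpu": 93.5, "memory_working_set": 71.0, "os_disk_space": 90.0}
breached = breached_signals(sample)
print(breached)
```

In this sample only the CPU value is above its threshold. The disk space value of exactly 90% isn't flagged because the comparison is strictly greater than the threshold.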

Application health alerts

Alert scenario	Source	Signal	Recommended threshold
High number of unhealthy pods	Azure Managed Prometheus Metric	Alert name: KubePodReadyStateLow	Available as an AKS recommended alert. To enable this alert, see Recommended alert rules for Kubernetes clusters.
One or more pods are restarting	Azure Managed Prometheus Metric	Alert name: KubePodContainerRestart	Available as an AKS recommended alert. To enable this alert, see Recommended alert rules for Kubernetes clusters.
One or more pods are in CrashLoop status	Azure Managed Prometheus Metric	Alert name: KubePodCrashLooping	Available as an AKS recommended alert. To enable this alert, see Recommended alert rules for Kubernetes clusters.

Kubernetes control plane alerts

Alert scenario	Source	Signal	Recommended threshold
etcd is filled up	Azure Managed Prometheus Metric	etcd_mvcc_db_total_size_in_use_in_bytes	etcd database size in use is greater than 2 GB
API Server Too Many Requests Errors	Azure Managed Prometheus Metric	apiserver_request_total	Filter for error code 429
API Server Webhook and Tunnel Errors	Azure Managed Prometheus Metric	apiserver_request_total	Filter for error codes 500 and 503
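
To make the filters in the table above concrete, the following Python sketch applies them to already-collected samples. It assumes that the apiserver_request_total counter has been reduced to a per-status-code increase over some window, keyed by the value of its code label, and it reads the 2 GB limit as 2 GiB. Both are assumptions, as are the function names and sample numbers. It doesn't query Prometheus or Azure Monitor.

```python
from typing import Dict, Iterable

# 2 GB threshold from the table above, interpreted here as 2 GiB (assumption).
ETCD_SIZE_LIMIT_BYTES = 2 * 1024**3


def etcd_over_limit(db_size_in_use_bytes: float) -> bool:
    """Return True when the etcd database size in use exceeds the limit."""
    return db_size_in_use_bytes > ETCD_SIZE_LIMIT_BYTES


def requests_with_codes(increase_by_code: Dict[str, float], codes: Iterable[str]) -> float:
    """Sum the request-count increase for the given HTTP status codes."""
    wanted = set(codes)
    return sum(count for code, count in increase_by_code.items() if code in wanted)


# Made-up increase of apiserver_request_total per status code over one window.
window = {"200": 1200.0, "429": 30.0, "500": 2.0, "503": 5.0}

throttled = requests_with_codes(window, ["429"])             # too many requests
server_errors = requests_with_codes(window, ["500", "503"])  # webhook and tunnel errors
print(throttled, server_errors, etcd_over_limit(1.5 * 1024**3))
```

The table doesn't give numeric thresholds for the error-code filters, so the sketch only computes the counts. What count or rate should trigger an alert depends on your cluster's normal traffic.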

Next steps

For more information about monitoring on AKS, see the following articles:
