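Node Problem Detector (NPD) in Azure Kubernetes Service (AKS) nodes
Node Problem Detector (NPD) is an open source Kubernetes component that detects node-related problems and reports on them. It runs as a systemd service on each node in the cluster and collects various metrics and system information, such as CPU usage, disk usage, and network connectivity. When it detects a problem, it generates events, node conditions, or both. Azure Kubernetes Service (AKS) uses NPD to monitor and manage nodes in Kubernetes clusters running on the Azure cloud platform. The AKS Linux extension enables NPD by default.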
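Note
Upgrades to NPD are independent of the node image and Kubernetes version upgrade processes. If a node pool is unhealthy (that is, in a failed state), new NPD versions aren't installed on it.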
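Node conditions
Node conditions indicate a permanent problem that makes the node unavailable. AKS uses the following node conditions from NPD to expose permanent problems on the node. NPD also emits a corresponding Kubernetes event for each condition.
Problem daemon type | NodeCondition | Reason |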
---|---|---|
CustomPluginMonitor | FilesystemCorruptionProblem | FilesystemCorruptionDetected |
CustomPluginMonitor | KubeletProblem | KubeletIsDown |
CustomPluginMonitor | ContainerRuntimeProblem | ContainerRuntimeIsDown |
CustomPluginMonitor | VMEventScheduled | VMEventScheduled |
CustomPluginMonitor | FrequentUnregisterNetDevice | UnregisterNetDevice |
CustomPluginMonitor | FrequentKubeletRestart | FrequentKubeletRestart |
CustomPluginMonitor | FrequentContainerdRestart | FrequentContainerdRestart |
CustomPluginMonitor | FrequentDockerRestart | FrequentDockerRestart |
SystemLogMonitor | KernelDeadlock | DockerHung |
SystemLogMonitor | ReadonlyFilesystem | FilesystemIsReadOnly |
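The condition types in the table above can be checked programmatically. The following is a minimal sketch, assuming you have the node object as JSON (for example, from `kubectl get node <name> -o json`). The function name `active_npd_conditions` and the sample data, including its message strings, are hypothetical and only illustrate the shape of the check.

```python
import json

# Condition types taken from the NodeCondition column of the table above.
NPD_CONDITION_TYPES = {
    "FilesystemCorruptionProblem",
    "KubeletProblem",
    "ContainerRuntimeProblem",
    "VMEventScheduled",
    "FrequentUnregisterNetDevice",
    "FrequentKubeletRestart",
    "FrequentContainerdRestart",
    "FrequentDockerRestart",
    "KernelDeadlock",
    "ReadonlyFilesystem",
}


def active_npd_conditions(node):
    """Return (type, reason, message) for each NPD condition whose status is "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        (c["type"], c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("type") in NPD_CONDITION_TYPES and c.get("status") == "True"
    ]


# Hypothetical, hand-written sample standing in for real node JSON.
SAMPLE_NODE_JSON = """
{
  "status": {
    "conditions": [
      {"type": "KubeletProblem", "status": "True",
       "reason": "KubeletIsDown", "message": "kubelet service is down"},
      {"type": "ReadonlyFilesystem", "status": "False",
       "reason": "FilesystemIsNotReadOnly", "message": "Filesystem is not read-only"},
      {"type": "Ready", "status": "True",
       "reason": "KubeletReady", "message": "kubelet is posting ready status"}
    ]
  }
}
"""

if __name__ == "__main__":
    node = json.loads(SAMPLE_NODE_JSON)
    for ctype, reason, message in active_npd_conditions(node):
        print(f"{ctype}: {reason} ({message})")
```

The `Ready` condition is ignored on purpose: it is reported by the kubelet, not by NPD, and a status of `True` there means the node is healthy.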
Events
NPD emits events with relevant information to help you diagnose underlying issues.
Problem daemon type | Reason |
---|---|
CustomPluginMonitor | EgressBlocked |
CustomPluginMonitor | FilesystemCorruptionDetected |
CustomPluginMonitor | KubeletIsDown |
CustomPluginMonitor | ContainerRuntimeIsDown |
CustomPluginMonitor | FreezeScheduled |
CustomPluginMonitor | RebootScheduled |
CustomPluginMonitor | RedeployScheduled |
CustomPluginMonitor | TerminateScheduled |
CustomPluginMonitor | PreemptScheduled |
CustomPluginMonitor | DNSProblem |
CustomPluginMonitor | PodIPProblem |
SystemLogMonitor | OOMKilling |
SystemLogMonitor | TaskHung |
SystemLogMonitor | UnregisterNetDevice |
SystemLogMonitor | KernelOops |
SystemLogMonitor | DockerSocketCannotConnect |
SystemLogMonitor | KubeletRPCDeadlineExceeded |
SystemLogMonitor | KubeletRPCNoSuchContainer |
SystemLogMonitor | CNICannotStatFS |
SystemLogMonitor | PLEGUnhealthy |
SystemLogMonitor | KubeletStart |
SystemLogMonitor | DockerStart |
SystemLogMonitor | ContainerdStart |
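In certain instances, AKS automatically cordons and drains the node to minimize disruption to workloads. For more information about the events and actions, see Node auto-drain.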
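The event reasons listed in the table above can be used to pick NPD-related events out of a larger event list. The following is a rough sketch, assuming the input is an event list as JSON (for example, from `kubectl get events -o json`). It matches on the reason string only, which is an assumption: another component could emit an event with the same reason. The function name `count_npd_events` and the sample data are hypothetical.

```python
import json
from collections import Counter

# Reasons taken from the events table above.
NPD_EVENT_REASONS = {
    "EgressBlocked", "FilesystemCorruptionDetected", "KubeletIsDown",
    "ContainerRuntimeIsDown", "FreezeScheduled", "RebootScheduled",
    "RedeployScheduled", "TerminateScheduled", "PreemptScheduled",
    "DNSProblem", "PodIPProblem", "OOMKilling", "TaskHung",
    "UnregisterNetDevice", "KernelOops", "DockerSocketCannotConnect",
    "KubeletRPCDeadlineExceeded", "KubeletRPCNoSuchContainer",
    "CNICannotStatFS", "PLEGUnhealthy", "KubeletStart", "DockerStart",
    "ContainerdStart",
}


def count_npd_events(event_list):
    """Count node events per (node name, reason) for the reasons NPD emits."""
    counts = Counter()
    for event in event_list.get("items", []):
        reason = event.get("reason")
        obj = event.get("involvedObject", {})
        if reason in NPD_EVENT_REASONS and obj.get("kind") == "Node":
            # An event may aggregate repeats in "count"; treat a missing value as 1.
            counts[(obj.get("name"), reason)] += event.get("count") or 1
    return counts


# Hypothetical, hand-written sample standing in for a real event list.
SAMPLE_EVENTS_JSON = """
{
  "items": [
    {"reason": "OOMKilling", "count": 3,
     "involvedObject": {"kind": "Node", "name": "aks-agentpool-40622340-vmss000009"}},
    {"reason": "NodeHasSufficientMemory", "count": 176,
     "involvedObject": {"kind": "Node", "name": "aks-agentpool-40622340-vmss000009"}},
    {"reason": "Started",
     "involvedObject": {"kind": "Pod", "name": "my-pod"}}
  ]
}
"""

if __name__ == "__main__":
    for (node, reason), n in count_npd_events(json.loads(SAMPLE_EVENTS_JSON)).items():
        print(f"{node}: {reason} x{n}")
```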
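Check the node conditions and events
Check the node conditions and events using the kubectl describe node command:
kubectl describe node my-aks-node
Your output should look similar to the following condensed example output: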
...
...
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  VMEventScheduled              False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoVMEventScheduled              VM has no scheduled event
  FrequentContainerdRestart     False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentContainerdRestart     containerd is functioning properly
  FrequentDockerRestart         False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentDockerRestart         docker is functioning properly
  FilesystemCorruptionProblem   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsOK                  Filesystem is healthy
  FrequentUnregisterNetDevice   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
  ContainerRuntimeProblem       False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:40 +0000   ContainerRuntimeIsUp            container runtime service is up
  KernelDeadlock                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KernelHasNoDeadlock             kernel has no deadlock
  FrequentKubeletRestart        False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
  KubeletProblem                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KubeletIsUp                     kubelet service is up
  ReadonlyFilesystem            False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
  NetworkUnavailable            False   Thu, 01 Jun 2023 03:58:39 +0000   Thu, 01 Jun 2023 03:58:39 +0000   RouteCreated                    RouteController created a route
  MemoryPressure                True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 19:16:50 +0000   KubeletHasInsufficientMemory    kubelet has insufficient memory available
  DiskPressure                  False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:23 +0000   KubeletReady                    kubelet is posting ready status. AppArmor enabled
...
...
...
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  NodeHasSufficientMemory  94s (x176 over 15h)  kubelet  Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory
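These events are also available in Container Insights through KubeEvents.
Metrics
NPD also exposes Prometheus metrics based on node problems, which you can use for monitoring and alerting. These metrics are exposed on port 20257 of the node IP, and Prometheus can scrape them.
The following example YAML shows a scrape config that runs as a DaemonSet and that you can use with the Azure Managed Prometheus add-on: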
kind: ConfigMap
apiVersion: v1
metadata:
  name: ama-metrics-prometheus-config-node
  namespace: kube-system
data:
  prometheus-config: |-
    global:
      scrape_interval: 1m
    scrape_configs:
    - job_name: node-problem-detector
      scrape_interval: 1m
      scheme: http
      metrics_path: /metrics
      relabel_configs:
      - source_labels: [__metrics_path__]
        regex: (.*)
        target_label: metrics_path
      - source_labels: [__address__]
        replacement: '$NODE_NAME'
        target_label: instance
      static_configs:
      - targets: ['$NODE_IP:20257']
The following example shows the scraped metrics:
problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0
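To illustrate how these gauge values might be consumed, here is a minimal sketch that reads metrics text in the format shown above and lists the problems whose gauge is nonzero. It assumes the text was already fetched from the node's metrics endpoint, and it is not a full Prometheus exposition-format parser: it doesn't handle escaped quotes or trailing timestamps. The function name `active_problems` is hypothetical, and the sample sets one gauge to 1 purely for demonstration.

```python
import re

# Matches lines such as: problem_gauge{reason="X",type="Y"} 0
_GAUGE_LINE = re.compile(r'^problem_gauge\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def active_problems(metrics_text):
    """Return (type, reason) for every problem_gauge sample with a nonzero value."""
    active = []
    for line in metrics_text.splitlines():
        match = _GAUGE_LINE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and other metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        if float(match.group("value")) != 0:
            active.append((labels.get("type"), labels.get("reason")))
    return active


# Sample based on the metrics above; the second value is changed to 1
# (hypothetical) so that there is something to report.
SAMPLE_METRICS = """\
# TYPE problem_gauge gauge
problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 1
"""

if __name__ == "__main__":
    for problem_type, reason in active_problems(SAMPLE_METRICS):
        print(f"active problem: type={problem_type} reason={reason}")
```

In a Prometheus-based setup, the same check is usually expressed as an alert rule on `problem_gauge` rather than in client code; the sketch only shows how the label and value pairs relate to the node conditions.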
Next steps
For more information on NPD, see kubernetes/node-problem-detector.