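Node Problem Detector (NPD) in Azure Kubernetes Service (AKS) nodes

Article
08/02/2024

Node Problem Detector (NPD) is an open-source Kubernetes component that detects node-related problems and reports them. It runs as a systemd service on each node in the cluster and collects various metrics and system information, such as CPU usage, disk usage, and network connectivity. When it detects a problem, it generates events, node conditions, or both. Azure Kubernetes Service (AKS) uses NPD to monitor and manage the nodes of Kubernetes clusters running on the Azure cloud platform. The AKS Linux extension enables NPD by default.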

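Note

Upgrades to NPD are independent of the node image and Kubernetes version upgrade processes. If a node pool is unhealthy (that is, in a failed state), new NPD versions aren't installed on it.

Node conditions

Node conditions indicate a permanent problem that makes the node unavailable. AKS uses the following node conditions from NPD to expose permanent problems on the node. NPD also emits corresponding Kubernetes events.

Problem Daemon type	NodeCondition	Reason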
CustomPluginMonitor	FilesystemCorruptionProblem	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletProblem	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeProblem	ContainerRuntimeIsDown
CustomPluginMonitor	VMEventScheduled	VMEventScheduled
CustomPluginMonitor	FrequentUnregisterNetDevice	UnregisterNetDevice
CustomPluginMonitor	FrequentKubeletRestart	FrequentKubeletRestart
CustomPluginMonitor	FrequentContainerdRestart	FrequentContainerdRestart
CustomPluginMonitor	FrequentDockerRestart	FrequentDockerRestart
SystemLogMonitor	KernelDeadlock	DockerHung
SystemLogMonitor	ReadonlyFilesystem	FilesystemIsReadOnly

Events

NPD emits events with relevant information to help you diagnose underlying issues.

Problem Daemon type	Reason
CustomPluginMonitor	EgressBlocked
CustomPluginMonitor	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeIsDown
CustomPluginMonitor	FreezeScheduled
CustomPluginMonitor	RebootScheduled
CustomPluginMonitor	RedeployScheduled
CustomPluginMonitor	TerminateScheduled
CustomPluginMonitor	PreemptScheduled
CustomPluginMonitor	DNSProblem
CustomPluginMonitor	PodIPProblem
SystemLogMonitor	OOMKilling
SystemLogMonitor	TaskHung
SystemLogMonitor	UnregisterNetDevice
SystemLogMonitor	KernelOops
SystemLogMonitor	DockerSocketCannotConnect
SystemLogMonitor	KubeletRPCDeadlineExceeded
SystemLogMonitor	KubeletRPCNoSuchContainer
SystemLogMonitor	CNICannotStatFS
SystemLogMonitor	PLEGUnhealthy
SystemLogMonitor	KubeletStart
SystemLogMonitor	DockerStart
SystemLogMonitor	ContainerdStart

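In certain instances, AKS automatically cordons and drains the node to minimize disruption to workloads. For more information about the events and actions, see Node auto-drain.

Check the node conditions and events

Use the kubectl describe node command to check the node conditions and events.

kubectl describe node my-aks-node

Your output should look similar to the following condensed example output: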

...
...

Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  VMEventScheduled              False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoVMEventScheduled              VM has no scheduled event
  FrequentContainerdRestart     False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentContainerdRestart     containerd is functioning properly
  FrequentDockerRestart         False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentDockerRestart         docker is functioning properly
  FilesystemCorruptionProblem   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsOK                  Filesystem is healthy
  FrequentUnregisterNetDevice   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
  ContainerRuntimeProblem       False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:40 +0000   ContainerRuntimeIsUp            container runtime service is up
  KernelDeadlock                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KernelHasNoDeadlock             kernel has no deadlock
  FrequentKubeletRestart        False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
  KubeletProblem                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KubeletIsUp                     kubelet service is up
  ReadonlyFilesystem            False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
  NetworkUnavailable            False   Thu, 01 Jun 2023 03:58:39 +0000   Thu, 01 Jun 2023 03:58:39 +0000   RouteCreated                    RouteController created a route
  MemoryPressure                True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 19:16:50 +0000   KubeletHasInsufficientMemory    kubelet has insufficient memory available
  DiskPressure                  False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:23 +0000   KubeletReady                    kubelet is posting ready status. AppArmor enabled
...
...
...
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  NodeHasSufficientMemory  94s (x176 over 15h)  kubelet  Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory

These events are also available in Container Insights through KubeEvents.
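If you want to check the same conditions from a script rather than by reading the describe output, the following is a minimal Python sketch. It assumes you have saved or piped the output of kubectl get node my-aks-node -o json, which carries the conditions under status.conditions. The function name problem_conditions and the SAMPLE_NODE data are hypothetical and exist only for illustration; the sample mirrors three rows of the example output above.

```python
import json

# Conditions where Status "True" is the healthy state. For every other
# condition type in the example output above, "True" signals a problem.
HEALTHY_WHEN_TRUE = {"Ready"}


def problem_conditions(node: dict) -> list[tuple[str, str, str]]:
    """Return (type, reason, message) for each condition that is not healthy.

    A status of "Unknown" is also reported, because it differs from the
    healthy value for every condition type.
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        healthy_status = "True" if cond["type"] in HEALTHY_WHEN_TRUE else "False"
        if cond["status"] != healthy_status:
            problems.append(
                (cond["type"], cond.get("reason", ""), cond.get("message", ""))
            )
    return problems


# Hypothetical, trimmed sample in the shape of `kubectl get node -o json`.
SAMPLE_NODE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "KubeletProblem", "status": "False",
       "reason": "KubeletIsUp", "message": "kubelet service is up"},
      {"type": "MemoryPressure", "status": "True",
       "reason": "KubeletHasInsufficientMemory",
       "message": "kubelet has insufficient memory available"},
      {"type": "Ready", "status": "True",
       "reason": "KubeletReady",
       "message": "kubelet is posting ready status. AppArmor enabled"}
    ]
  }
}
""")

if __name__ == "__main__":
    for ctype, reason, message in problem_conditions(SAMPLE_NODE):
        print(f"{ctype}: {reason} ({message})")
```

With the sample data, only MemoryPressure is reported, which matches the single unhealthy row in the example output. To use it against a real node, replace SAMPLE_NODE with json.load(sys.stdin) and pipe the kubectl output into the script.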

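Metrics

NPD also exposes Prometheus metrics based on the node problems, which you can use for monitoring and alerting. These metrics are exposed on port 20257 of the node IP, where Prometheus can scrape them.

The following example YAML shows a scrape configuration that you can use with the Azure Managed Prometheus add-on as a DaemonSet: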

kind: ConfigMap
apiVersion: v1
metadata:
  name: ama-metrics-prometheus-config-node
  namespace: kube-system
data:
  prometheus-config: |-
    global:
      scrape_interval: 1m
    scrape_configs:
    - job_name: node-problem-detector
      scrape_interval: 1m
      scheme: http
      metrics_path: /metrics
      relabel_configs:
      - source_labels: [__metrics_path__]
        regex: (.*)
        target_label: metrics_path
      - source_labels: [__address__]
        replacement: '$NODE_NAME'
        target_label: instance
      static_configs:
      - targets: ['$NODE_IP:20257']

The following example shows the scraped metrics:

problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0
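To act on these metrics outside of Prometheus, for example in a quick health script, the sketch below parses the text exposition lines and lists the gauges with a nonzero value. It assumes that a nonzero problem_gauge means the problem is currently affecting the node; verify that against the NPD documentation for your version. The function name active_problems is hypothetical, and the label parser is deliberately simple: it doesn't handle escaped quotes inside label values.

```python
import re

# Matches lines such as: problem_gauge{reason="X",type="Y"} 0
# An optional trailing timestamp, allowed by the exposition format, is ignored.
GAUGE_LINE = re.compile(
    r'^problem_gauge\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def active_problems(metrics_text: str) -> list[dict[str, str]]:
    """Return the label sets of every problem_gauge sample with a nonzero value."""
    active = []
    for line in metrics_text.splitlines():
        match = GAUGE_LINE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, and other metrics
        if float(match.group("value")) != 0:
            active.append(dict(LABEL.findall(match.group("labels"))))
    return active


# The two sample lines shown above; both gauges are 0.
SAMPLE_METRICS = """\
problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0
"""

if __name__ == "__main__":
    problems = active_problems(SAMPLE_METRICS)
    if not problems:
        print("no active problems")
    for labels in problems:
        print(f'{labels["type"]}: {labels["reason"]}')
```

For alerting in production, a Prometheus alert rule on problem_gauge is likely the better fit; this script is only meant for ad hoc checks against the /metrics endpoint.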

Next steps

For more information on NPD, see kubernetes/node-problem-detector.
