你当前正在访问 Microsoft Azure Global Edition 技术文档网站。 如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站,请访问 https://docs.azure.cn

Azure Kubernetes 服务 (AKS) 节点中的节点问题检测器 (NPD)

节点问题检测器 (NPD) 是一个开源 Kubernetes 组件,用于检测与节点相关的问题并报告这些问题。 它作为系统服务在群集中的每个节点上运行,并收集各种指标和系统信息,例如 CPU 使用率、磁盘使用率和网络连接性。 检测到问题时,它会生成事件和/或节点条件。 Azure Kubernetes 服务 (AKS) 使用 NPD 监视和管理在 Azure 云平台上运行的 Kubernetes 群集中的节点。 默认情况下,AKS Linux 扩展启用 NPD。

注意

升级到 NPD 的操作独立于节点映像和 Kubernetes 版本升级过程。 如果节点池运行不正常(即处于失败状态),则不会安装新的 NPD 版本。

节点条件

节点条件指示导致节点不可用的永久性问题。 AKS 使用 NPD 中的以下节点条件来公开节点上的永久性问题。 NPD 还会发出相应的 Kubernetes 事件。

问题守护程序类型 NodeCondition 原因
CustomPluginMonitor FilesystemCorruptionProblem FilesystemCorruptionDetected
CustomPluginMonitor KubeletProblem KubeletIsDown
CustomPluginMonitor ContainerRuntimeProblem ContainerRuntimeIsDown
CustomPluginMonitor VMEventScheduled VMEventScheduled
CustomPluginMonitor FrequentUnregisterNetDevice UnregisterNetDevice
CustomPluginMonitor FrequentKubeletRestart FrequentKubeletRestart
CustomPluginMonitor FrequentContainerdRestart FrequentContainerdRestart
CustomPluginMonitor FrequentDockerRestart FrequentDockerRestart
SystemLogMonitor KernelDeadlock DockerHung
SystemLogMonitor ReadonlyFilesystem FilesystemIsReadOnly

事件

NPD 发出包含相关信息的事件,以帮助诊断基础问题。

问题守护程序类型 原因
CustomPluginMonitor EgressBlocked
CustomPluginMonitor FilesystemCorruptionDetected
CustomPluginMonitor KubeletIsDown
CustomPluginMonitor ContainerRuntimeIsDown
CustomPluginMonitor FreezeScheduled
CustomPluginMonitor RebootScheduled
CustomPluginMonitor RedeployScheduled
CustomPluginMonitor TerminateScheduled
CustomPluginMonitor PreemptScheduled
CustomPluginMonitor DNSProblem
CustomPluginMonitor PodIPProblem
SystemLogMonitor OOMKilling
SystemLogMonitor TaskHung
SystemLogMonitor UnregisterNetDevice
SystemLogMonitor KernelOops
SystemLogMonitor DockerSocketCannotConnect
SystemLogMonitor KubeletRPCDeadlineExceeded
SystemLogMonitor KubeletRPCNoSuchContainer
SystemLogMonitor CNICannotStatFS
SystemLogMonitor PLEGUnhealthy
SystemLogMonitor KubeletStart
SystemLogMonitor DockerStart
SystemLogMonitor ContainerdStart

在某些情况下,AKS 会自动封锁并排出节点,以最大程度地减少工作负载的中断。 有关事件和操作的详细信息,请参阅节点自动排出

检查节点条件和事件

  • 使用 kubectl describe node 命令检查节点条件和事件。

    kubectl describe node my-aks-node
    

    输出应该类似于以下示例简洁输出:

    ...
    ...
    
    Conditions:
      Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
      ----                          ------  -----------------                 ------------------                ------                          -------
      VMEventScheduled              False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoVMEventScheduled              VM has no scheduled event
      FrequentContainerdRestart     False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentContainerdRestart     containerd is functioning properly
      FrequentDockerRestart         False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentDockerRestart         docker is functioning properly
      FilesystemCorruptionProblem   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsOK                  Filesystem is healthy
      FrequentUnregisterNetDevice   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
      ContainerRuntimeProblem       False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:40 +0000   ContainerRuntimeIsUp            container runtime service is up
      KernelDeadlock                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KernelHasNoDeadlock             kernel has no deadlock
      FrequentKubeletRestart        False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
      KubeletProblem                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KubeletIsUp                     kubelet service is up
      ReadonlyFilesystem            False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
      NetworkUnavailable            False   Thu, 01 Jun 2023 03:58:39 +0000   Thu, 01 Jun 2023 03:58:39 +0000   RouteCreated                    RouteController created a route
      MemoryPressure                True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 19:16:50 +0000   KubeletHasInsufficientMemory    kubelet has insufficient memory available
      DiskPressure                  False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
      PIDPressure                   False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
      Ready                         True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:23 +0000   KubeletReady                    kubelet is posting ready status. AppArmor enabled
    ...
    ...
    ...
    Events:
      Type    Reason                   Age                  From     Message
      ----    ------                   ----                 ----     -------
      Normal  NodeHasSufficientMemory  94s (x176 over 15h)  kubelet  Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory
    

这些事件也通过 KubeEvents容器见解中提供。

指标

NPD 还根据节点问题公开 Prometheus 指标,以用于监视和警报。 节点 IP 的 20257 端口公开了了这些指标,Prometheus 可以抓取它们。

以下示例 YAML 演示了一个抓取配置,可以作为 DaemonSet 与 Azure 托管 Prometheus 加载项配合使用

kind: ConfigMap
apiVersion: v1
metadata:
  name: ama-metrics-prometheus-config-node
  namespace: kube-system
data:
  prometheus-config: |-
    global:
      scrape_interval: 1m
    scrape_configs:
    - job_name: node-problem-detector
      scrape_interval: 1m
      scheme: http
      metrics_path: /metrics
      relabel_configs:
      - source_labels: [__metrics_path__]
        regex: (.*)
        target_label: metrics_path
      - source_labels: [__address__]
        replacement: '$NODE_NAME'
        target_label: instance
      static_configs:
      - targets: ['$NODE_IP:20257']

以下示例显示了已抓取的指标:

problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0

后续步骤

有关 NPD 的详细信息,请参阅 kubernetes/node-problem-detector