Node Problem Detector (NPD) in Azure Kubernetes Service (AKS) nodes

Article
11/09/2024

Node Problem Detector (NPD) is an open source Kubernetes component that detects node-related problems and reports on them. It runs as a systemd service on each node in the cluster and collects various metrics and system information, such as CPU usage, disk usage, and network connectivity. When it detects a problem, it generates events, node conditions, or both. Azure Kubernetes Service (AKS) uses NPD to monitor and manage nodes in a Kubernetes cluster running on the Azure cloud platform. The AKS Linux extension enables NPD by default.

Note

NPD upgrades are independent of the node image and Kubernetes version upgrade processes. If a node pool is unhealthy (that is, in a failed state), new NPD versions aren't installed on it.

Node conditions

A node condition indicates a permanent problem that makes the node unavailable. AKS uses the following node conditions from NPD to expose permanent problems on the node. NPD also emits corresponding Kubernetes events.

Problem daemon type	NodeCondition	Reason
CustomPluginMonitor	FilesystemCorruptionProblem	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletProblem	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeProblem	ContainerRuntimeIsDown
CustomPluginMonitor	VMEventScheduled	VMEventScheduled
CustomPluginMonitor	FrequentUnregisterNetDevice	UnregisterNetDevice
CustomPluginMonitor	FrequentKubeletRestart	FrequentKubeletRestart
CustomPluginMonitor	FrequentContainerdRestart	FrequentContainerdRestart
CustomPluginMonitor	FrequentDockerRestart	FrequentDockerRestart
SystemLogMonitor	KernelDeadlock	DockerHung
SystemLogMonitor	ReadonlyFilesystem	FilesystemIsReadOnly
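
As a rough illustration of how the table can be used, the following Python sketch filters a node object for NPD conditions that are currently raised. It assumes the input is the parsed JSON of a Kubernetes Node object (for example, from `kubectl get node my-aks-node -o json`); the function name `active_npd_conditions` and the sample data are hypothetical and not part of NPD or AKS.

```python
# Condition types based on the node conditions table above.
NPD_CONDITION_TYPES = {
    "FilesystemCorruptionProblem",
    "KubeletProblem",
    "ContainerRuntimeProblem",
    "VMEventScheduled",
    "FrequentUnregisterNetDevice",
    "FrequentKubeletRestart",
    "FrequentContainerdRestart",
    "FrequentDockerRestart",
    "KernelDeadlock",
    "ReadonlyFilesystem",
}


def active_npd_conditions(node: dict) -> list:
    """Return (type, reason, message) for every NPD condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("type") in NPD_CONDITION_TYPES and c.get("status") == "True"
    ]


if __name__ == "__main__":
    # Made-up sample shaped like a Node object's status.conditions list.
    sample_node = {
        "status": {
            "conditions": [
                {"type": "KubeletProblem", "status": "True",
                 "reason": "KubeletIsDown", "message": "sample message"},
                {"type": "ReadonlyFilesystem", "status": "False",
                 "reason": "FilesystemIsNotReadOnly",
                 "message": "Filesystem is not read-only"},
            ]
        }
    }
    for cond_type, reason, message in active_npd_conditions(sample_node):
        print(f"{cond_type}: {reason} ({message})")
```

Conditions reported by the kubelet itself, such as MemoryPressure or Ready, are deliberately ignored here because they aren't in the NPD table.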

Events

NPD emits events with relevant information to help you diagnose underlying problems.

Problem daemon type	Reason
CustomPluginMonitor	EgressBlocked
CustomPluginMonitor	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeIsDown
CustomPluginMonitor	FreezeScheduled
CustomPluginMonitor	RebootScheduled
CustomPluginMonitor	RedeployScheduled
CustomPluginMonitor	TerminateScheduled
CustomPluginMonitor	PreemptScheduled
CustomPluginMonitor	DNSProblem
CustomPluginMonitor	PodIPProblem
SystemLogMonitor	OOMKilling
SystemLogMonitor	TaskHung
SystemLogMonitor	UnregisterNetDevice
SystemLogMonitor	KernelOops
SystemLogMonitor	DockerSocketCannotConnect
SystemLogMonitor	KubeletRPCDeadlineExceeded
SystemLogMonitor	KubeletRPCNoSuchContainer
SystemLogMonitor	CNICannotStatFS
SystemLogMonitor	PLEGUnhealthy
SystemLogMonitor	KubeletStart
SystemLogMonitor	DockerStart
SystemLogMonitor	ContainerdStart
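
To show how these reasons might be used when triaging a cluster, the following Python sketch counts node events per node and reason. It assumes the input is the parsed JSON list returned by a command such as `kubectl get events -A --field-selector involvedObject.kind=Node -o json`; the function name `count_npd_events` is hypothetical, and the reason set is based on the events table above.

```python
from collections import Counter

# Event reasons based on the events table above.
NPD_EVENT_REASONS = {
    "EgressBlocked", "FilesystemCorruptionDetected", "KubeletIsDown",
    "ContainerRuntimeIsDown", "FreezeScheduled", "RebootScheduled",
    "RedeployScheduled", "TerminateScheduled", "PreemptScheduled",
    "DNSProblem", "PodIPProblem", "OOMKilling", "TaskHung",
    "UnregisterNetDevice", "KernelOops", "DockerSocketCannotConnect",
    "KubeletRPCDeadlineExceeded", "KubeletRPCNoSuchContainer",
    "CNICannotStatFS", "PLEGUnhealthy", "KubeletStart", "DockerStart",
    "ContainerdStart",
}


def count_npd_events(event_list: dict) -> Counter:
    """Count occurrences per (node name, reason) for reasons in the table."""
    counts = Counter()
    for event in event_list.get("items", []):
        involved = event.get("involvedObject", {})
        reason = event.get("reason")
        # Skip events that aren't about a node or don't match a listed reason.
        if involved.get("kind") != "Node" or reason not in NPD_EVENT_REASONS:
            continue
        # "count" records how often the event was observed; fall back to 1.
        counts[(involved.get("name", ""), reason)] += event.get("count") or 1
    return counts
```

Matching on the reason alone is a simplification: another component could in principle emit an event with the same reason, so treat the result as a starting point for investigation rather than proof that NPD raised it.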

In some cases, AKS automatically cordons and drains the node to minimize disruption to workloads. For more information about the events and actions, see Node auto-drain.

Check the node conditions and events

Check the node conditions and events using the kubectl describe node command.

kubectl describe node my-aks-node

Your output should look similar to the following condensed example output:

...
...

Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  VMEventScheduled              False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoVMEventScheduled              VM has no scheduled event
  FrequentContainerdRestart     False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentContainerdRestart     containerd is functioning properly
  FrequentDockerRestart         False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentDockerRestart         docker is functioning properly
  FilesystemCorruptionProblem   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsOK                  Filesystem is healthy
  FrequentUnregisterNetDevice   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
  ContainerRuntimeProblem       False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:40 +0000   ContainerRuntimeIsUp            container runtime service is up
  KernelDeadlock                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KernelHasNoDeadlock             kernel has no deadlock
  FrequentKubeletRestart        False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
  KubeletProblem                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KubeletIsUp                     kubelet service is up
  ReadonlyFilesystem            False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
  NetworkUnavailable            False   Thu, 01 Jun 2023 03:58:39 +0000   Thu, 01 Jun 2023 03:58:39 +0000   RouteCreated                    RouteController created a route
  MemoryPressure                True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 19:16:50 +0000   KubeletHasInsufficientMemory    kubelet has insufficient memory available
  DiskPressure                  False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:23 +0000   KubeletReady                    kubelet is posting ready status. AppArmor enabled
...
...
...
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  NodeHasSufficientMemory  94s (x176 over 15h)  kubelet  Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory

These events are also available in Container Insights through KubeEvents.

Metrics

NPD also exposes Prometheus metrics based on the node problems, which you can use for monitoring and alerting. These metrics are exposed on port 20257 of the node IP, and Prometheus can scrape them.

The following example YAML shows a scrape configuration that you can use with the Azure Managed Prometheus add-on as a DaemonSet:

kind: ConfigMap
apiVersion: v1
metadata:
  name: ama-metrics-prometheus-config-node
  namespace: kube-system
data:
  prometheus-config: |-
    global:
      scrape_interval: 1m
    scrape_configs:
    - job_name: node-problem-detector
      scrape_interval: 1m
      scheme: http
      metrics_path: /metrics
      relabel_configs:
      - source_labels: [__metrics_path__]
        regex: (.*)
        target_label: metrics_path
      - source_labels: [__address__]
        replacement: '$NODE_NAME'
        target_label: instance
      static_configs:
      - targets: ['$NODE_IP:20257']

The following example shows the scraped metrics:

problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0
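
As an illustration of how these samples could be consumed outside Prometheus, the following Python sketch extracts the `problem_gauge` series whose value is greater than zero. It is a minimal hand-rolled parser that assumes the simple `name{label="value",...} number` form shown above, not a full implementation of the Prometheus exposition format; the function name `active_problems` is hypothetical.

```python
import re

# Matches lines such as: problem_gauge{reason="X",type="Y"} 0
_SAMPLE_RE = re.compile(r'^problem_gauge\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
# Matches individual label pairs such as: reason="X"
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def active_problems(metrics_text: str) -> list:
    """Return (type, reason) for each problem_gauge sample with a value above 0."""
    active = []
    for line in metrics_text.splitlines():
        match = _SAMPLE_RE.match(line.strip())
        if not match:
            continue  # Comments, blank lines, and other metrics are ignored.
        if float(match.group("value")) > 0:
            labels = dict(_LABEL_RE.findall(match.group("labels")))
            active.append((labels.get("type", ""), labels.get("reason", "")))
    return active
```

The input text would typically be the response body from `http://<node IP>:20257/metrics`, fetched from somewhere with network access to the node.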

Next steps

For more information on NPD, see kubernetes/node-problem-detector.
