Node Problem Detector (NPD) in Azure Kubernetes Service (AKS) nodes

Article
08/02/2024

Node Problem Detector (NPD) is an open source Kubernetes component that detects node-related problems and reports them. It runs as a systemd service on each node in the cluster and collects metrics and system information such as CPU usage, disk usage, and network connectivity. When it detects a problem, it generates events, node conditions, or both. Azure Kubernetes Service (AKS) uses NPD to monitor and manage the nodes of Kubernetes clusters running on Azure. The AKS Linux extension enables NPD by default.

Note

NPD upgrades are independent of the node image and Kubernetes version upgrade processes. If a node pool is unhealthy (that is, in a failed state), new NPD versions aren't installed.

Node conditions

Node conditions indicate a permanent problem that makes the node unavailable. AKS uses the following node conditions from NPD to expose permanent problems on the node. NPD also emits corresponding Kubernetes events.

Problem daemon type	NodeCondition	Reason
CustomPluginMonitor	FilesystemCorruptionProblem	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletProblem	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeProblem	ContainerRuntimeIsDown
CustomPluginMonitor	VMEventScheduled	VMEventScheduled
CustomPluginMonitor	FrequentUnregisterNetDevice	UnregisterNetDevice
CustomPluginMonitor	FrequentKubeletRestart	FrequentKubeletRestart
CustomPluginMonitor	FrequentContainerdRestart	FrequentContainerdRestart
CustomPluginMonitor	FrequentDockerRestart	FrequentDockerRestart
SystemLogMonitor	KernelDeadlock	DockerHung
SystemLogMonitor	ReadonlyFilesystem	FilesystemIsReadOnly

Events

NPD emits events with relevant information to help you diagnose underlying issues.

Problem daemon type	Reason
CustomPluginMonitor	EgressBlocked
CustomPluginMonitor	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeIsDown
CustomPluginMonitor	FreezeScheduled
CustomPluginMonitor	RebootScheduled
CustomPluginMonitor	RedeployScheduled
CustomPluginMonitor	TerminateScheduled
CustomPluginMonitor	PreemptScheduled
CustomPluginMonitor	DnsProblem
CustomPluginMonitor	PodIPProblem
SystemLogMonitor	OOMKilling
SystemLogMonitor	TaskHung
SystemLogMonitor	UnregisterNetDevice
SystemLogMonitor	KernelOops
SystemLogMonitor	DockerSocketCannotConnect
SystemLogMonitor	KubeletRPCDeadlineExceeded
SystemLogMonitor	KubeletRPCNoSuchContainer
SystemLogMonitor	CNICannotStatFS
SystemLogMonitor	PLEGUnhealthy
SystemLogMonitor	KubeletStart
SystemLogMonitor	DockerStart
SystemLogMonitor	ContainerdStart

In some cases, AKS automatically cordons and drains the node to minimize disruption to workloads. For more information about the events and actions, see Node auto-drain.
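As a rough illustration of how these reasons can be used, the following Python sketch filters the JSON produced by `kubectl get events -o json` down to NPD events for one node. The function name `npd_events_for_node`, the reason subset, and the sample data are assumptions made for this example; they are not part of NPD or AKS.

```python
# Subset of the NPD event reasons listed in the table above (not exhaustive).
NPD_EVENT_REASONS = {
    "FilesystemCorruptionDetected",
    "KubeletIsDown",
    "ContainerRuntimeIsDown",
    "OOMKilling",
    "DockerSocketCannotConnect",
    "KubeletRPCDeadlineExceeded",
    "KubeletRPCNoSuchContainer",
    "CNICannotStatFS",
}


def npd_events_for_node(event_list, node_name):
    """Return (reason, message) pairs for NPD events that involve the given node."""
    matches = []
    for event in event_list.get("items", []):
        involved = event.get("involvedObject", {})
        # Only keep events attached to the requested Node object.
        if involved.get("kind") != "Node" or involved.get("name") != node_name:
            continue
        if event.get("reason") in NPD_EVENT_REASONS:
            matches.append((event["reason"], event.get("message", "")))
    return matches


# Hypothetical, hand-written sample shaped like `kubectl get events -o json` output.
sample_events = {
    "items": [
        {"involvedObject": {"kind": "Node", "name": "my-aks-node"},
         "reason": "OOMKilling", "message": "sample message"},
        {"involvedObject": {"kind": "Node", "name": "my-aks-node"},
         "reason": "NodeHasSufficientMemory", "message": "not an NPD reason"},
        {"involvedObject": {"kind": "Pod", "name": "my-pod"},
         "reason": "KubeletIsDown", "message": "not a node event"},
    ]
}

for reason, message in npd_events_for_node(sample_events, "my-aks-node"):
    print(f"{reason}: {message}")
```

With real data, the dictionary would come from `json.load` applied to the output of `kubectl get events -o json`; extend the reason set to cover whichever rows of the table matter for your cluster.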

Check node conditions and events

Check node conditions and events by using the kubectl describe node command.

kubectl describe node my-aks-node

Your output should look similar to the following condensed example output:

...
...

Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  VMEventScheduled              False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoVMEventScheduled              VM has no scheduled event
  FrequentContainerdRestart     False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentContainerdRestart     containerd is functioning properly
  FrequentDockerRestart         False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentDockerRestart         docker is functioning properly
  FilesystemCorruptionProblem   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsOK                  Filesystem is healthy
  FrequentUnregisterNetDevice   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
  ContainerRuntimeProblem       False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:40 +0000   ContainerRuntimeIsUp            container runtime service is up
  KernelDeadlock                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KernelHasNoDeadlock             kernel has no deadlock
  FrequentKubeletRestart        False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
  KubeletProblem                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KubeletIsUp                     kubelet service is up
  ReadonlyFilesystem            False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
  NetworkUnavailable            False   Thu, 01 Jun 2023 03:58:39 +0000   Thu, 01 Jun 2023 03:58:39 +0000   RouteCreated                    RouteController created a route
  MemoryPressure                True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 19:16:50 +0000   KubeletHasInsufficientMemory    kubelet has insufficient memory available
  DiskPressure                  False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:23 +0000   KubeletReady                    kubelet is posting ready status. AppArmor enabled
...
...
...
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  NodeHasSufficientMemory  94s (x176 over 15h)  kubelet  Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory

These events are also available in Container Insights through KubeEvents.
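For scripted checks, the same conditions can be read from `kubectl get node my-aks-node -o json` instead of the human-readable describe output. The following Python sketch assumes the standard Node object layout (`status.conditions` with `type`, `status`, `reason`, and `message` fields) and reports the NPD conditions whose status is True. The helper name `active_npd_conditions` and the sample data are hypothetical.

```python
# NPD condition types as they appear in the sample output above.
NPD_CONDITION_TYPES = {
    "FilesystemCorruptionProblem",
    "KubeletProblem",
    "ContainerRuntimeProblem",
    "VMEventScheduled",
    "FrequentUnregisterNetDevice",
    "FrequentKubeletRestart",
    "FrequentContainerdRestart",
    "FrequentDockerRestart",
    "KernelDeadlock",
    "ReadonlyFilesystem",
}


def active_npd_conditions(node):
    """Return (type, reason, message) for each NPD condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        (c["type"], c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("type") in NPD_CONDITION_TYPES and c.get("status") == "True"
    ]


# Hypothetical sample shaped like `kubectl get node <name> -o json` output.
sample_node = {
    "status": {
        "conditions": [
            {"type": "KubeletProblem", "status": "False",
             "reason": "KubeletIsUp", "message": "kubelet service is up"},
            {"type": "FrequentKubeletRestart", "status": "True",
             "reason": "FrequentKubeletRestart", "message": "hypothetical message"},
            # Reported by the kubelet, not NPD, so it is ignored here.
            {"type": "MemoryPressure", "status": "True",
             "reason": "KubeletHasInsufficientMemory",
             "message": "kubelet has insufficient memory available"},
        ]
    }
}

for ctype, reason, message in active_npd_conditions(sample_node):
    print(f"{ctype}: {reason} ({message})")
```

Kubelet-owned conditions such as MemoryPressure or Ready are deliberately excluded so that the result reflects only what NPD reports.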

Metrics

NPD also exposes Prometheus metrics based on the node problems, which you can use for monitoring and alerting. These metrics are exposed on port 20257 of the node IP, and Prometheus can scrape them.

The following example YAML shows a scrape configuration that you can use with the Azure managed Prometheus add-on as a DaemonSet:

kind: ConfigMap
apiVersion: v1
metadata:
  name: ama-metrics-prometheus-config-node
  namespace: kube-system
data:
  prometheus-config: |-
    global:
      scrape_interval: 1m
    scrape_configs:
    - job_name: node-problem-detector
      scrape_interval: 1m
      scheme: http
      metrics_path: /metrics
      relabel_configs:
      - source_labels: [__metrics_path__]
        regex: (.*)
        target_label: metrics_path
      - source_labels: [__address__]
        replacement: '$NODE_NAME'
        target_label: instance
      static_configs:
      - targets: ['$NODE_IP:20257']

The following example shows the scraped metrics:

problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0
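To show how these lines might be consumed outside Prometheus, the following Python sketch parses `problem_gauge` samples and lists the problems whose value is greater than 0, which is taken here to mean that the problem currently affects the node. It assumes the label order shown above (`reason` first, then `type`); the function names are hypothetical, and the second sample value is changed to 1 purely for illustration.

```python
import re

# Matches lines such as: problem_gauge{reason="X",type="Y"} 0
# Assumes the label order seen in the example output (reason, then type).
GAUGE_PATTERN = re.compile(
    r'^problem_gauge\{reason="(?P<reason>[^"]*)",type="(?P<type>[^"]*)"\}\s+(?P<value>\S+)$'
)


def parse_problem_gauges(text):
    """Map (type, reason) to the gauge value for every problem_gauge line."""
    gauges = {}
    for line in text.splitlines():
        match = GAUGE_PATTERN.match(line.strip())
        if match:
            gauges[(match["type"], match["reason"])] = float(match["value"])
    return gauges


def active_problems(gauges):
    """Return the sorted (type, reason) pairs whose gauge is greater than 0."""
    return sorted(key for key, value in gauges.items() if value > 0)


# Sample text; the value 1 on the second line is hypothetical.
sample_metrics = """\
problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 1
"""

gauges = parse_problem_gauges(sample_metrics)
for problem_type, reason in active_problems(gauges):
    print(f"active problem: {problem_type} ({reason})")
```

A regular expression keeps the sketch dependency-free; for anything beyond a quick check, a full Prometheus exposition-format parser is the safer choice because label order and extra labels are not guaranteed.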

Next steps

For more information about NPD, see kubernetes/node-problem-detector.
