Cluster Restart Issues w/Kyverno Installed

Golub, Gary (RIS-KOP) 0 Reputation points
2024-12-10T21:32:20.36+00:00

We shut down our development clusters at night to save cost. Recently we deployed Kyverno and now the clusters don't start up correctly. We end up with numerous deployments with ReplicaFailure status. Many pods that do start up have timeout issues such as these:

linkerd:
[   327.709562s]  WARN ThreadId(01) watch{port=4191}: linkerd_app_inbound::policy::api: Unexpected policy controller response; retrying with a backoff grpc.status=Deadline expired before operation could complete grpc.message="initial item not received within timeout"
[   327.799936s]  WARN ThreadId(02) identity:identity{server.addr=localhost:8080}:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8080: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
 
cert-manager:
{"ts":1733843601094.5713,"caller":"app/controller.go:218","msg":"error checking if certificates.cert-manager.io CRD is installed","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"ts":1733843601094.646,"caller":"app/controller.go:225","msg":"error retrieving certificate.cert-manager.io CRDs","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"ts":1733843601094.6948,"caller":"./main.go:43","msg":"error executing command","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}

What appears to be happening is that the Kyverno pods aren't yet available, so the webhook calls fail. Once that happens, I can manually remediate by removing the validatingwebhookconfiguration and restarting the failed deployments.
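
To make the remediation concrete, the steps look roughly like this (the webhook configuration name is the default from a standard Kyverno install, and cert-manager is just an example; adjust names to your cluster):

    # Find the webhook configurations Kyverno registered
    kubectl get validatingwebhookconfigurations | grep kyverno

    # Remove the resource webhook so the API server stops calling Kyverno;
    # Kyverno re-creates it once its admission controller is healthy again
    kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg

    # Restart the deployments that failed admission during startup
    kubectl rollout restart deployment cert-manager -n cert-manager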

I can avoid the issue entirely by changing the policies' failure policy from "Fail" to "Ignore".
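
For reference, that setting lives on each policy. A sketch with a hypothetical policy (the field location can differ between Kyverno versions):

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-team-label        # hypothetical example policy
    spec:
      failurePolicy: Ignore           # admit requests if the webhook is unreachable
      validationFailureAction: Audit
      rules:
        - name: check-team-label
          match:
            any:
              - resources:
                  kinds:
                    - Pod
          validate:
            message: "The label `team` is required."
            pattern:
              metadata:
                labels:
                  team: "?*"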

Why does the Kyverno failure cause problems in other namespaces? I'd like to understand the flow of what is happening. Is something "not ready" when the admission controller tries to call the Kyverno pods?

Azure Kubernetes Service (AKS)

1 answer

  1. Lijitha B 510 Reputation points Microsoft Vendor
    2024-12-11T12:04:50.09+00:00

    Hi Golub, Gary (RIS-KOP),

    Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

    According to the documentation, Kyverno uses Kustomize-style overlays for validation, supports JSON patch and strategic merge patch for mutation, and can clone resources across namespaces based on flexible triggers. You can deploy policies individually by using their YAML manifests, or package and deploy them by using Helm charts. See the overview in the Kyverno documentation: Kyverno

    You need to ensure that Kyverno is fully initialized before it starts receiving admission requests from the Kubernetes API server. Kyverno registers cluster-wide validating and mutating webhook configurations, and when a webhook has failurePolicy: Fail, the API server rejects any matching admission request it cannot deliver to the webhook. After a cluster restart, the Kyverno pods may not be ready in time, so create and update requests in every namespace the webhooks match are blocked until Kyverno comes up. That is why unrelated components such as linkerd and cert-manager fail: their pods go through the same admission path. This failure mode is described in the Kyverno troubleshooting guide under "API server is blocked".
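
    One common mitigation is to exclude critical system namespaces from Kyverno's webhooks so core infrastructure can start even while Kyverno is down. A minimal sketch of Helm values for the Kyverno chart, assuming a 1.x chart where webhook selectors are configured under config.webhooks (the exact layout of these values differs between chart versions, so please verify against your chart's values.yaml):

        config:
          webhooks:
            # Skip admission for namespaces that must start before Kyverno is ready
            - namespaceSelector:
                matchExpressions:
                  - key: kubernetes.io/metadata.name
                    operator: NotIn
                    values:
                      - kube-system
                      - cert-manager
                      - linkerd

    Keep in mind that excluding a namespace also means your policies are no longer enforced there, so this trades enforcement coverage for startup availability.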

    For troubleshooting and recovery procedures, see the Kyverno documentation: Troubleshooting

    If the answer is helpful, please click "Accept Answer" and upvote it.

    If you have any further queries, do let us know.

