We shut down our development clusters at night to save cost. Recently we deployed Kyverno, and now the clusters no longer start up correctly. We end up with numerous Deployments reporting a ReplicaFailure condition, and many pods that do start have timeout issues such as these:
linkerd:
[ 327.709562s] WARN ThreadId(01) watch{port=4191}: linkerd_app_inbound::policy::api: Unexpected policy controller response; retrying with a backoff grpc.status=Deadline expired before operation could complete grpc.message="initial item not received within timeout"
[ 327.799936s] WARN ThreadId(02) identity:identity{server.addr=localhost:8080}:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8080: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
cert-manager:
{"ts":1733843601094.5713,"caller":"app/controller.go:218","msg":"error checking if certificates.cert-manager.io CRD is installed","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"ts":1733843601094.646,"caller":"app/controller.go:225","msg":"error retrieving certificate.cert-manager.io CRDs","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"ts":1733843601094.6948,"caller":"./main.go:43","msg":"error executing command","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
What appears to be happening is that the Kyverno pods aren't available yet when the API server calls Kyverno's admission webhook, so the webhook call fails. Once that happens, I can manually remediate by deleting the validatingwebhookconfiguration and restarting the failed deployments.
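For reference, this is roughly what the manual remediation looks like (the webhook configuration name is the one from my Kyverno install and may differ in other setups):

# Delete Kyverno's resource validating webhook so the API server stops calling it
kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg

# Restart the deployments that ended up in ReplicaFailure, e.g. in the linkerd and cert-manager namespaces
kubectl rollout restart deployment -n linkerd
kubectl rollout restart deployment -n cert-manager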
I can avoid the issue entirely by changing the webhook failure policy from "Fail" to "Ignore".
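This is how I check which failure policy the webhooks currently carry (again, the webhook configuration name is from my install and may differ):

# Print each webhook in Kyverno's resource validating configuration together with its failurePolicy
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'

My understanding is that with Fail the API server rejects any request the webhook cannot answer, while with Ignore it lets the request through.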
Why does the Kyverno failure cause problems in other namespaces? I'd like to understand the flow of what is happening. Is something "not ready" when the API server's admission machinery tries to call the Kyverno pods?
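If it helps, I assume the "not ready" side of this could be confirmed by checking whether the service behind the webhook has any endpoints while the problem is occurring (service name and namespace are from my Kyverno install and may differ):

# If Kyverno's admission pods aren't ready yet, the service the webhook points at has no endpoints
kubectl -n kyverno get endpoints kyverno-svc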