We shut down our development clusters at night to save cost. Recently we deployed Kyverno, and now the clusters no longer start up correctly. We end up with numerous Deployments reporting a ReplicaFailure condition, and many pods that do start have timeout issues such as these:
linkerd:
[ 327.709562s] WARN ThreadId(01) watch{port=4191}: linkerd_app_inbound::policy::api: Unexpected policy controller response; retrying with a backoff grpc.status=Deadline expired before operation could complete grpc.message="initial item not received within timeout"
[ 327.799936s] WARN ThreadId(02) identity:identity{server.addr=localhost:8080}:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8080: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
cert-manager:
{"ts":1733843601094.5713,"caller":"app/controller.go:218","msg":"error checking if certificates.cert-manager.io CRD is installed","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"ts":1733843601094.646,"caller":"app/controller.go:225","msg":"error retrieving certificate.cert-manager.io CRDs","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
{"ts":1733843601094.6948,"caller":"./main.go:43","msg":"error executing command","logger":"cert-manager","err":"failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://172.20.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 172.20.0.1:443: i/o timeout"}
What appears to be happening is that the Kyverno pods aren't available yet when the API server calls Kyverno's admission webhook, so the webhook call fails. Once that happens, I can manually remediate by deleting the validatingwebhookconfiguration and restarting the failed deployments.
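For reference, this is roughly what the manual remediation looks like (the webhook configuration name is the one from my Kyverno install and may differ in other setups):

# Delete Kyverno's resource validating webhook so the API server stops calling it
kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg

# Restart the deployments that ended up in ReplicaFailure, e.g. in the linkerd and cert-manager namespaces
kubectl rollout restart deployment -n linkerd
kubectl rollout restart deployment -n cert-manager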
I can avoid the issue entirely by changing the webhook failure policy from "Fail" to "Ignore".
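This is how I check which failure policy the webhooks currently carry (again, the webhook configuration name is from my install and may differ):

# Print each webhook in Kyverno's resource validating configuration together with its failurePolicy
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'

My understanding is that with Fail the API server rejects any request the webhook cannot answer, while with Ignore it lets the request through.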
Why does the Kyverno failure cause problems in other namespaces? I'd like to understand the flow of what is happening. Is something "not ready" when the API server's admission machinery tries to call the Kyverno pods?
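If it helps, I assume the "not ready" side of this could be confirmed by checking whether the service behind the webhook has any endpoints while the problem is occurring (service name and namespace are from my Kyverno install and may differ):

# If Kyverno's admission pods aren't ready yet, the service the webhook points at has no endpoints
kubectl -n kyverno get endpoints kyverno-svc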