When a validating webhook blocks the ConfigMap that would fix it

The kubectl patch came back as a failed webhook call, not a credentials error. That was the moment the incident stopped being about a rotated MongoDB password and started being about the admission layer. A ValidatingWebhookConfiguration with failurePolicy: Fail was pointed at a webhook pod that was crash-looping on a bad liveness probe, and the only way to fix the webhook pod was to patch a ConfigMap that the webhook itself was supposed to validate. The safety mechanism had become the outage. Our profile service was down because its database credentials were stale, the fix for the credentials was a one-line patch, and that one-line patch could not be applied because the thing that was supposed to keep ConfigMaps safe was rejecting every write in the namespace.

Problem signals:

kubectl patch or apply on a ConfigMap returns a "failed calling webhook" error (a timeout or a refused connection) instead of a normal validation error
An admission webhook pod is in CrashLoopBackOff while its ValidatingWebhookConfiguration is set to failurePolicy: Fail
ArgoCD shows sync pending or OutOfSync on resources in the affected namespace and the sync will not progress
A workload reads stale config from a ConfigMap that was supposedly already updated, and a Deployment-level env var is shadowing the ConfigMap-sourced value
Compliance requires that admission webhooks remain failurePolicy: Fail in production, so flipping to Ignore as a workaround is itself an audit event

We thought it was a credentials incident for the first 20 minutes

The patch that came back as a webhook error

The page that started the call was a profile service returning 500s on /health. The cause looked obvious. The data layer team had rotated the MongoDB credentials the day before, and the live ConfigMap in the application namespace still held the old password. There was a backup ConfigMap sitting next to it with the rotated values, labelled exactly the way the runbook described. The fix was supposed to be a thirty-second kubectl patch.

It was not. The patch came back with this:

$ kubectl -n app patch configmap profile-mongodb-config \
 --type merge --patch-file rotated.yaml
Error from server (InternalError): Internal error occurred:
failed calling webhook "configmap-validator.app.svc":
Post "https://configmap-validator.app.svc:443/validate?timeout=10s":
dial tcp 10.42.7.83:443: connect: connection refused

The actual error. Not a credentials problem, an admission control problem.

That error string is the whole story. The cluster had a ValidatingWebhookConfiguration named configmap-validator that intercepted every ConfigMap write in the namespace. The webhook pod was supposed to enforce a schema policy that the compliance team owned. Right now the webhook pod was not answering on its service IP, which meant every ConfigMap write was failing closed, which meant our credential fix was failing closed, which meant the profile service stayed down.
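The distinction in that error string can be checked mechanically. The following is a minimal Python sketch, not part of the incident tooling: it separates "the webhook could not be reached" from "the webhook answered and said no", based on the two message fragments the API server emits. The function name and the category labels are illustrative assumptions.

```python
import re
from typing import Optional, Tuple

# Message fragments the API server uses for the two admission outcomes.
CALL_FAILED = re.compile(r'failed calling webhook "([^"]+)"')
DENIED = re.compile(r'admission webhook "([^"]+)" denied the request')


def classify_admission_error(stderr: str) -> Tuple[str, Optional[str]]:
    """Classify kubectl stderr as an unreachable webhook, a policy denial, or other.

    Returns (category, webhook_name). The category strings are hypothetical
    labels chosen for this sketch.
    """
    text = " ".join(stderr.split())  # collapse wrapped lines into one string
    match = CALL_FAILED.search(text)
    if match:
        # The webhook never answered: an availability problem, not a policy one.
        return ("webhook-unreachable", match.group(1))
    match = DENIED.search(text)
    if match:
        # The webhook answered and rejected the object: a normal validation error.
        return ("webhook-denied", match.group(1))
    return ("other", None)
```

An "unreachable" result is the cue to look at the webhook pod and its failurePolicy before spending any more time on the payload being patched.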

We had walked into this kind of shape before, but usually on the cert-manager side. This time the trap was tighter: the webhook was supposed to validate the very ConfigMaps that controlled the workloads in its own namespace, and one of those workloads happened to be down for an unrelated reason. Two independent failures had stacked into a deadlock.

Why the pod was crash-looping and why nothing could fix it in place

The webhook was guarding its own ConfigMap

kubectl describe on the configmap-validator pod told us the liveness probe was failing. The pod was getting killed every 30 seconds, restarting, getting killed again. The probe was hitting /healthz on port 8443. The actual application served its health endpoint on /health, no z. Someone had copy-pasted a probe spec from an older service months ago and nobody had noticed because the webhook had been running fine until a recent image bump shifted the health route.

Fixing a Deployment in Kubernetes is normally a kubectl edit or a kubectl patch and you are done. That shortcut was not available to us. The webhook configuration intercepted ConfigMap writes, not Deployment writes, so technically we could have patched the Deployment directly. But the Deployment mounted a ConfigMap for its own startup arguments, and our platform team had a hard rule against in-cluster edits that drifted from the GitOps source: ArgoCD would self-heal the Deployment back to the broken probe spec inside 90 seconds.

We needed the correct health path, and the compliance team kept the canonical values in a separate namespace. Their ConfigMap held the approved liveness path, the approved annotation policy, and the compliance acknowledgement token that any incident response was required to reference. We pulled it:

$ kubectl -n compliance get configmap webhook-standards -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: webhook-standards
  namespace: compliance
data:
  liveness-path: "/health"
  readiness-path: "/ready"
  failure-policy: "Fail"
  ack-token: "COMP-ACK-7c3f9a-2024Q4"
  required-annotations: |
    incident.compliance/id
    incident.compliance/services
    incident.compliance/ack-token

The compliance source of truth. We read these values; we did not retype them.
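Reading the values instead of retyping them also makes the drift checkable. As a rough sketch in Python, assuming the Deployment and the ConfigMap's data section have already been loaded into dictionaries (for example from kubectl get -o json), a comparison of probe paths against the approved ones could look like this. The function name and the finding format are illustrative, not the team's tooling.

```python
def probe_path_drift(deployment: dict, standards: dict) -> list:
    """List containers whose HTTP probe paths differ from the approved values.

    `standards` is the `data` section of the compliance ConfigMap.
    """
    expected = {
        "livenessProbe": standards.get("liveness-path"),
        "readinessProbe": standards.get("readiness-path"),
    }
    findings = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for probe_name, approved in expected.items():
            path = container.get(probe_name, {}).get("httpGet", {}).get("path")
            # Skip probes that are absent, not HTTP, or have no approved value.
            if path is None or approved is None:
                continue
            if path != approved:
                findings.append(
                    f"{container['name']}: {probe_name} uses {path}, approved is {approved}"
                )
    return findings
```

Run against the webhook Deployment from this incident, a check like this would have reported the /healthz versus /health mismatch before the image bump turned it into a crash loop.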

Flip failurePolicy to Ignore, or delete the ValidatingWebhookConfiguration?

Choosing the smaller blast radius

There were two ways to unblock ConfigMap writes. We could delete the ValidatingWebhookConfiguration entirely, fix everything, and recreate it from the GitOps source. Or we could patch failurePolicy from Fail to Ignore for the length of the recovery and patch it back when we were done.

Option	What it does and what it risks
Option A. Delete the webhook configuration	Cleanest cut. ConfigMap writes unblock instantly. Risk: an unrelated team applies an out-of-policy ConfigMap during the window and we do not catch it. Also generates a louder audit event because the object disappears from etcd.
Option B. Patch failurePolicy to Ignore	Webhook is still called; if the pod is up it still validates; if it is down, writes pass. Smaller blast radius. Audit log shows a field change, not a delete. We picked this one.

Option B won because of the audit trail. The compliance team would rather see one field flip and one field flip back on the same object, with the same metadata.uid throughout the incident, than see a delete and a recreate that leaves a new object with a new UID and creationTimestamp. That is the kind of preference you only learn by sitting through an audit. We have written more about this kind of constraint in our infrastructure audit readiness work.

# Step 1. Snapshot the current webhook config so we can prove what we changed.
kubectl get validatingwebhookconfiguration configmap-validator \
 -o yaml > /tmp/vwc-before.yaml

# Step 2. Flip failurePolicy to Ignore, scoped to this single webhook entry.
kubectl patch validatingwebhookconfiguration configmap-validator \
 --type='json' \
 -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Step 3. Confirm the change before touching any ConfigMap.
kubectl get validatingwebhookconfiguration configmap-validator \
 -o jsonpath='{.webhooks[0].failurePolicy}'
# expect: Ignore

The unblock. Three commands, one of which is a snapshot for the post-incident review.

We patched the ConfigMap and the service was still broken

The env var that made the credential fix invisible

With failurePolicy on Ignore, the credential patch went through. We pulled the rotated values from the backup ConfigMap, applied them to the live one, and watched the profile service pods restart. /health still returned 500. The MongoDB connection error in the application logs still showed the old username.

That was the second moment in the incident where the model of the world had to change. The ConfigMap held the new credentials. The application was not using them. Something else was supplying the MongoDB connection string to the container at runtime, and it was winning.

$ kubectl -n app get deploy profile-service \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | jq .
[
  {
    "name": "MONGODB_URI",
    "valueFrom": {
      "configMapKeyRef": {
        "name": "profile-mongodb-config",
        "key": "uri"
      }
    }
  },
  {
    "name": "PROFILE_MONGODB_URI_OVERRIDE",
    "value": "mongodb://oldapp:oldpw@mongo-old.app.svc:27017/profiles"
  }
]

The override. Set during a migration test six weeks earlier and never removed.

The application code read PROFILE_MONGODB_URI_OVERRIDE if it was set and otherwise read MONGODB_URI. The override had been added during a migration drill six weeks ago, never cleaned up, and was now silently shadowing every ConfigMap update we tried to apply. We have stopped accepting break-glass env overrides on production Deployments for this exact reason. If the override is worth setting, it is worth its own ConfigMap with an expiry annotation that a controller cleans up. Naked env values on the Deployment spec are invisible to the operators who do not know to look for them.
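Overrides like this one can be surfaced before an incident does it for you. Here is a small Python sketch that scans a Deployment (loaded as a dictionary) for literal env values that look like break-glass overrides. The two heuristics, a name containing OVERRIDE and a value that is a URI with inline credentials, are assumptions made for illustration and will need tuning per codebase.

```python
import re

# Matches scheme://user:password@..., i.e. a URI carrying inline credentials.
INLINE_CREDENTIALS = re.compile(
    r"^[a-z][a-z0-9+.-]*://[^/@\s:]+:[^/@\s]+@", re.IGNORECASE
)


def suspicious_literal_env(deployment: dict) -> list:
    """Return (container, env name) pairs for literal env values worth reviewing."""
    findings = []
    for container in deployment["spec"]["template"]["spec"]["containers"]:
        for entry in container.get("env", []):
            value = entry.get("value")
            if value is None:
                # valueFrom entries are sourced from a ConfigMap or Secret; skip them.
                continue
            if "OVERRIDE" in entry["name"].upper() or INLINE_CREDENTIALS.search(value):
                findings.append((container["name"], entry["name"]))
    return findings
```

A check of this kind does not know which variable the application prefers at runtime; it only narrows down where a human should look when a ConfigMap fix appears to have no effect.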

We removed the env var, the pod rolled, and /health came back as 200 on the third pod we curled.

Putting the webhook back, and making sure this never happens the same way again

Restoring failurePolicy: Fail without re-creating the trap

Before we restored failurePolicy to Fail, we fixed the webhook pod. The liveness probe path went from /healthz to /health, the value we had read from the compliance ConfigMap. The pod came up healthy and stayed up. We confirmed the webhook was actually answering by sending a deliberately invalid ConfigMap and watching the validation rejection come back cleanly. Only then did we flip failurePolicy back.
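The flip back can be made defensive. The sketch below, in Python, builds a JSON Patch for kubectl patch --type=json that looks the webhook entry up by name instead of assuming index 0, and adds RFC 6902 test operations so the patch is rejected if the entry or its current policy is not what the operator expects. It assumes the API server honors test operations in JSON patches, which is worth confirming on your cluster version. The function name is hypothetical.

```python
import json


def failure_policy_patch(vwc: dict, webhook_name: str, expected: str, new: str) -> str:
    """Build a JSON Patch that flips failurePolicy only if it currently equals `expected`.

    `vwc` is a ValidatingWebhookConfiguration loaded as a dictionary.
    """
    for index, webhook in enumerate(vwc.get("webhooks", [])):
        if webhook.get("name") == webhook_name:
            base = f"/webhooks/{index}"
            ops = [
                # The test ops make the whole patch fail if the object changed underneath us.
                {"op": "test", "path": f"{base}/name", "value": webhook_name},
                {"op": "test", "path": f"{base}/failurePolicy", "value": expected},
                {"op": "replace", "path": f"{base}/failurePolicy", "value": new},
            ]
            return json.dumps(ops)
    raise LookupError(f"no webhook entry named {webhook_name!r}")
```

The returned string is what would be passed to -p. Calling it with expected="Ignore" and new="Fail" produces the restore patch; swapping the two arguments produces the relaxing one.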

The harder problem was structural. A failurePolicy: Fail webhook that gates ConfigMap writes in a namespace is fine. A failurePolicy: Fail webhook that gates ConfigMap writes in a namespace that also contains the webhook itself, where the webhook's own Deployment depends on ConfigMaps in that namespace, is a bootstrap hazard. The first time something goes wrong, you cannot fix it without breaking your own policy.

The deadlock in one picture. Every fix path routes through a ConfigMap write the webhook itself is blocking.

We added two things to the ValidatingWebhookConfiguration before we left. A namespaceSelector that excludes the webhook's own namespace from validation, so the webhook can be rebuilt from in-cluster ConfigMaps even when it is the thing that is broken. And an objectSelector that excludes ConfigMaps carrying a specific break-glass label, so an on-call engineer with the right RBAC can apply a labelled emergency patch without flipping failurePolicy at all. Both changes were reviewed by the compliance team before we merged them, because relaxing the scope of a Fail-policy webhook is itself an audit decision.
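To make the effect of the two selectors concrete, here is a Python sketch that models how a LabelSelector is evaluated and when the webhook would be called. The selector fragment is written as a Python dictionary mirroring the YAML fields. The break-glass label key incident.example.com/break-glass is a hypothetical name, and excluding the namespace by its kubernetes.io/metadata.name label is one possible way to express "the webhook's own namespace", not necessarily the one used here.

```python
def selector_matches(selector, labels: dict) -> bool:
    """Evaluate a Kubernetes LabelSelector against a set of labels.

    An empty or missing selector matches everything.
    """
    selector = selector or {}
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, op, values = expr["key"], expr["operator"], expr.get("values", [])
        if op == "In" and labels.get(key) not in values:
            return False
        if op == "NotIn" and labels.get(key) in values:
            return False
        if op == "Exists" and key not in labels:
            return False
        if op == "DoesNotExist" and key in labels:
            return False
    return True


# Illustrative selectors for the webhook entry (label key below is hypothetical).
WEBHOOK_SELECTORS = {
    "namespaceSelector": {"matchExpressions": [
        {"key": "kubernetes.io/metadata.name", "operator": "NotIn", "values": ["app"]},
    ]},
    "objectSelector": {"matchExpressions": [
        {"key": "incident.example.com/break-glass", "operator": "DoesNotExist"},
    ]},
}


def webhook_is_called(webhook: dict, namespace_labels: dict, object_labels: dict) -> bool:
    """The webhook is invoked only when both selectors match."""
    return (selector_matches(webhook.get("namespaceSelector"), namespace_labels)
            and selector_matches(webhook.get("objectSelector"), object_labels))
```

Because both selectors must match for the webhook to be called, either exclusion alone is enough to let a write through, which is exactly why each one needs its own RBAC and audit story.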

The recovery script we left behind reads every value it needs (the health path, the ack token, the affected service list) from cluster state rather than hardcoding. Hardcoded recovery scripts go stale within a quarter; scripts that read from a compliance-owned ConfigMap stay correct as long as the source of truth is maintained. The script is idempotent: rerunning it on an already-recovered cluster is a no-op, which matters because the on-call engineer who runs it at 3 am should not have to think about whether they are the first or the third person to run it that night.
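The idempotency property is easiest to see when the script is split into "observe" and "plan". The following Python sketch is not the actual recovery script. It only illustrates the shape under stated assumptions: the approved values come from the compliance ConfigMap's data, the observed state has already been read from the cluster into a dictionary with hypothetical keys, and the step names are made up for this example.

```python
def plan_recovery(standards: dict, observed: dict) -> list:
    """Return the ordered steps still needed; an empty list means nothing to do.

    `standards` is the data section of the compliance ConfigMap. `observed`
    uses hypothetical keys: liveness_path, failure_policy, webhook_ready.
    """
    steps = []
    approved_path = standards["liveness-path"]
    approved_policy = standards["failure-policy"]

    probe_ok = observed["liveness_path"] == approved_path
    if not probe_ok:
        steps.append(("set-liveness-path", approved_path))

    if observed["failure_policy"] != approved_policy:
        # Restore the approved policy only once the webhook is healthy and answering,
        # otherwise the restore would re-create the deadlock.
        if probe_ok and observed["webhook_ready"]:
            steps.append(("set-failure-policy", approved_policy))
        else:
            steps.append(("wait-for-webhook", None))
    return steps
```

Since every step is derived from the difference between approved and observed state, a second run against a recovered cluster produces an empty plan, and the third on-call engineer of the night changes nothing.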

When to call us, and what we will look at first

If your admission layer is gating its own recovery

The thing that makes this incident shape hard is not the webhook itself. It is that the recovery path is non-obvious, the audit consequences of the obvious workaround (flipping policy or deleting the webhook config) are real, and the second-order trap (an env var on a Deployment shadowing the ConfigMap you just fixed) only shows up after you have already burned the credibility from the first workaround. Teams who hit this for the first time usually solve the immediate outage but leave the structural deadlock in place, and then it happens again on a different webhook six months later.

We run these recovery engagements every week. The admission-webhook-blocks-its-own-config shape has come up four times this year for us: once with cert-manager involved, twice with policy webhooks like this one, and once with a service mesh sidecar injector that depended on a ConfigMap in its own namespace. The env-override-shadowing-a-ConfigMap-fix pattern is even more common; we see some version of it in roughly half of the credential rotation incidents we are called into.

If your cluster has a Fail-policy admission webhook today and you have never tested what happens when its pod is down, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week. We will walk your webhook configurations, identify the ones that gate their own dependencies, and give you a concrete plan to break the loop before an incident finds it for you.

Originally published at https://infraforge.agency/insights/admission-webhook-configmap-deadlock-recovery/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

URL: https://dev.to/infraforge/when-a-validating-webhook-blocks-the-configmap-that-would-fix-it-4oe9
