

The Last Service Account Key

$ git log --all --oneline -- '**/service-account.json' | wc -l
47

$ git log --all --oneline -- '**/service-account.json' | head -1
a3f8c2e delete: remove production service account key

That commit sits in your history like a monument. Not because of what it added, but because of what it finally took away. Forty-seven commits that existed only to move secrets around, rotate them, revoke them, apologize for them, and eventually eliminate them.

That last deletion was the sound of the door closing on an entire class of infrastructure vulnerability.
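The diff only shows the deletion, not the replacement, but on GKE the usual way to make a JSON key unnecessary is Workload Identity: bind the Kubernetes ServiceAccount to a Google service account so pods get short-lived tokens at runtime and there is never a key to commit, rotate, or revoke. A minimal sketch, with placeholder names rather than anything from this history:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api                 # hypothetical workload
  namespace: payments
  annotations:
    # Binds this Kubernetes ServiceAccount to a Google service account.
    # Assumes Workload Identity is enabled on the cluster and the GSA has
    # an IAM binding granting roles/iam.workloadIdentityUser to this KSA.
    iam.gke.io/gcp-service-account: payments-api@my-project.iam.gserviceaccount.com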

The Architecture That Couldn't Be Breached

Container escape achieved. Privilege beyond the container: none. Why?

The breach happened. The forensics confirmed it. Shellcode executed inside the container. Root user, a full filesystem, a live network interface. Everything the attacker needed to pivot seemed to be there.

None of it worked.

The escaped container had no network path to other services. Secrets were never mounted into the pod, so there was no credential to steal. The host firewall blocked outbound connections. Network policies denied access to the control plane. RBAC granted the pod's service account nothing.

The container was compromised. The architecture was not.

This is what defense in depth looks like when it actually works.
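Two of the controls above, sketched as manifests. This is a minimal illustration under assumed names, not the cluster's actual configuration: a default-deny egress NetworkPolicy, and a pod that never receives an API credential.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: workloads               # hypothetical namespace
spec:
  podSelector: {}                    # selects every pod in the namespace
  policyTypes:
    - Egress                         # no egress rules listed, so all egress is denied
---
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: workloads
spec:
  automountServiceAccountToken: false   # no service account token inside the container
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image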

The GKE Cluster That Nobody Could Break

Day 1 of the pentest. Security firm arrives with methodology, tools, and confidence. The plan is simple: find gaps in the Kubernetes cluster, prove impact, deliver a detailed report of findings.

Day 2. They're quiet. Too quiet.

Day 3. Meeting request. Not the kind where they show you their findings.

"We found nothing. Well, nothing critical. Actually, we found nothing at all. This is the best-hardened cluster we've tested. Want to know what you did right?"

That's not how pentest reports usually end.

The 3am Incident That Followed The Playbook

3:17am. The pager vibrates on the nightstand. Half asleep, hand fumbles for phone. The message is three lines. Pod restart storms. API latency spiking. Customers seeing timeouts.

The engineer's first thought isn't "oh god, what now." It's automatic: "open the runbook."

Muscle memory takes over. Hands pull up a laptop still warm from yesterday. The playbook is right there: decision tree, diagnostic steps, escalation paths. No thinking required. Just follow the checklist.

Twenty-three minutes later, the incident is closed. Every step documented. The postmortem writes itself.

This is what happens when you stop improvising and start codifying the response.

The Policy That Wrote Itself

12 teams. 47 namespaces. 1 security requirement. 0 teams wanted to write policies.

The mandate came down: all workloads need pod security policies. No root containers. No privilege escalation. No host volumes. Standard stuff. Every team got the requirement. Then the work stalled.

Policy-as-Code is powerful. Enforcement at admission time stops bad deployments before they reach etcd. But power has a price: someone has to write YAML.
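What that YAML looks like depends on the engine; the post doesn't name one. A hedged sketch with an admission controller like Kyverno, using placeholder names and covering two of the three rules, not any team's actual policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-pod-security        # hypothetical shared policy
spec:
  validationFailureAction: Enforce   # reject at admission, before the object reaches etcd
  rules:
    - name: require-run-as-non-root
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Containers must not run as root."
        pattern:
          spec:
            containers:
              - securityContext:
                  runAsNonRoot: "true"
    - name: deny-privilege-escalation
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Privilege escalation is not allowed."
        pattern:
          spec:
            containers:
              - securityContext:
                  allowPrivilegeEscalation: "false"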

Team A wrote a policy. 34 lines. Solid.

Team B copy-pasted it. Forgot to update the label selectors. Now it applies to everything, including system services. Everything gets rejected. Team B spends four hours debugging why their monitoring won't deploy.

Team C started from scratch. Different syntax. Nested conditions. Hard to read. Works, mostly.

Team D went with "we'll do it next sprint." Still waiting.

The pattern was obvious: enforcement is easy. Enforcement at scale isn't. Every team writing their own policies means every team makes the same mistakes.

Same mistakes repeated 12 times is an incident waiting to happen.

The CLI That Replaced 47 Shell Scripts

47 shell scripts. 12 CronJobs. Zero test coverage. One production incident that forced a rebuild.

The kubectl plugin pattern couldn't handle our complexity. Shell scripts worked until they didn't. No type safety. No testing. Debugging meant reading logs after failures.

Then came the rewrite: One CLI. 89% test coverage. Runs in CronJobs and Argo Workflows. Distroless container. Multi-arch binaries. Production deployments via Helm.
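What one of those CronJobs might look like once the CLI exists — a hedged sketch with placeholder names, image, and arguments, not the real chart output:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-janitor              # hypothetical name for the CLI
  namespace: ops
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cluster-janitor
              # distroless, multi-arch image built from the CLI
              image: registry.example.com/cluster-janitor:1.4.2
              args: ["prune", "--dry-run=false"]   # hypothetical subcommand and flag
              securityContext:
                runAsNonRoot: true
                allowPrivilegeEscalation: false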