The Chaos That Proved We Were Ready
You can test incident response in only two ways: during an actual incident (catastrophically late), or before one happens (the entire point of chaos engineering).
We chose the latter. And it saved us.
The OpenSSF Best Practices Passing badge doesn't mandate a specific coverage percentage.
We set our bar at 95% minimum. Above even the Gold badge's 90%. Self-imposed. Strategic.
We started at 0%. We wanted the Passing badge. But we knew something important: it's easier to build high standards into a young project than to retrofit them later. By the time we go for Gold, 95% will already be habit.
The test suite was comprehensive. Table-driven tests. Error paths covered. Race detector enabled.
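If you haven't seen the pattern, here's a minimal sketch of what a table-driven test looks like in Go. The function under test is the standard library's time.ParseDuration, purely so the sketch is self-contained; it stands in for project code, not our actual suite.

```go
package example

import (
	"testing"
	"time"
)

// A minimal table-driven test: one loop body, many cases, error paths
// included. time.ParseDuration is used only to keep this sketch
// self-contained. Run with `go test -race ./...` to enable the race
// detector.
func TestParseDuration(t *testing.T) {
	tests := []struct {
		name    string
		input   string
		want    time.Duration
		wantErr bool
	}{
		{"seconds", "30s", 30 * time.Second, false},
		{"minutes", "5m", 5 * time.Minute, false},
		{"empty input", "", 0, true},
		{"garbage", "soon-ish", 0, true},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := time.ParseDuration(tt.input)
			if (err != nil) != tt.wantErr {
				t.Fatalf("ParseDuration(%q) error = %v, wantErr %v", tt.input, err, tt.wantErr)
			}
			if !tt.wantErr && got != tt.want {
				t.Errorf("ParseDuration(%q) = %v, want %v", tt.input, got, tt.want)
			}
		})
	}
}
```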
Coverage: 85.8%. Stuck.
We needed 95%. The team wrote more tests. Coverage barely moved.
The problem wasn't the tests. It was the code.
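Here's the shape of that problem, shown with a hypothetical example rather than our real code: when an error branch exits the process, no test can ever record coverage for it. The fix is structural, not more tests.

```go
package storage

import (
	"fmt"
	"log"
	"os"
)

// Hypothetical illustration, not our actual code. Writing more tests
// can't cover the Fatalf branch: reaching it kills the test binary,
// so the line stays red in the coverage report forever.
func mustLoadState(path string) []byte {
	data, err := os.ReadFile(path)
	if err != nil {
		log.Fatalf("load state: %v", err) // uncoverable from a test
	}
	return data
}

// The testable shape: return the error and let the caller decide.
// Both branches are now reachable from an ordinary table-driven test.
func loadState(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("load state: %w", err)
	}
	return data, nil
}
```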
The deployment worked in dev. Unit tests passed. Code review approved. Merged to main.
Production exploded.
The config used a dev secret. Migration locked tables at scale. Feature flag on in staging. Off in prod.
Environmental differences killed us.
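One cheap guard against the first of those failures is a startup check that refuses to boot when production is configured with dev-only values. A hedged sketch: the APP_ENV and APP_SECRET names and the "dev-" prefix are assumptions for illustration, not our actual configuration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// validateEnv is a hypothetical startup guard: it fails fast when a
// production process is configured with values that only make sense
// in dev. The variable names and the "dev-" prefix convention are
// assumptions for illustration.
func validateEnv() error {
	env := os.Getenv("APP_ENV")
	secret := os.Getenv("APP_SECRET")

	if secret == "" {
		return fmt.Errorf("refusing to start: APP_SECRET is not set")
	}
	if env == "production" && strings.HasPrefix(secret, "dev-") {
		return fmt.Errorf("refusing to start: dev secret in production")
	}
	return nil
}

func main() {
	if err := validateEnv(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// ... start the service ...
}
```

The design choice is fail-fast: a process that dies loudly at startup is far cheaper than one that runs with the wrong secret until something downstream notices.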
"The code looks correct" is not the same as "I ran the code and it works."
This distinction costs engineering teams hours every week. Failed CI runs. Broken deployments. Embarrassing rollbacks. All because someone pattern-matched instead of verified.
Here's a framework for building verification into your workflow.