Blast Radius Control¶
Uncontrolled chaos destroys production. Controlled chaos reveals fragility.
Real-World Blast Radius
A SaaS company started chaos with 1 pod for 30 seconds in staging. After 2 weeks of validation, they progressed to 10% of production pods for 5 minutes. After 6 weeks, they ran continuous steady-state chaos at 5% intensity.
Targeting Strategies¶
Namespace Isolation¶
Run experiments in dedicated chaos namespaces:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
namespace: chaos-testing
name: pod-deletion-staging
spec:
action: pod-kill
mode: fixed
value: 1
selector:
namespaces:
- staging
- staging-services
labelSelectors:
app: api-gateway
Label-Based Scoping¶
Tag experiments with blast radius constraints:
Time Windows¶
Restrict chaos to low-impact periods:
Progressive Intensity¶
Start small, escalate based on observability:
# Day 1: Single pod, 30 seconds
mode: fixed
value: 1
duration: 30s
# Day 2: Two pods, 1 minute
mode: fixed
value: 2
duration: 1m
# Day 7: 25% during business hours
mode: percentage
value: 25
duration: 5m
Automatic Rollback¶
Configure automated rollback to prevent cascade failures:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-deletion-with-rollback
spec:
action: pod-kill
duration: "2m"
selector:
namespaces:
- production
labelSelectors:
tier: web
scheduling:
# Auto-rollback if error rate exceeds threshold
backoffBase: 1m
maxAttempts: 3
Monitor error rates and auto-cancel:
# Example monitoring script
ERROR_RATE=$(curl -s prometheus:9090/api/v1/query \
--data-urlencode 'query=rate(http_requests_total{status=~"5.."}[1m])' | \
jq '.data.result[0].value[1] | tonumber')
if (( $(echo "$ERROR_RATE > 0.02" | bc -l) )); then
kubectl delete podchaos -n chaos-testing pod-deletion-with-rollback
echo "ROLLBACK: Error rate exceeded 2%"
fi
Blast Radius Constraints¶
Development Environment¶
- Target: Individual services, unlimited blast radius
- Duration: No restrictions
- Approval: Self-service by engineers
Staging Environment¶
- Target: Up to 50% of instances
- Duration: Up to 1 hour
- Approval: Team lead approval
Production Environment¶
- Target: Start with 1 instance, max 10% after validation
- Duration: 5 minutes max initially
- Approval: Security and SRE approval required
Production Chaos Rules¶
Production Chaos Rules
- Never run chaos during high-traffic periods
- Always have rollback plan tested in staging first
- Start with smallest blast radius possible
- Monitor continuously during experiment
- Stop immediately if metrics exceed thresholds
Pre-Production Validation¶
Before running chaos in production:
- [ ] Experiment succeeds in staging 3+ times
- [ ] Rollback procedure tested and documented
- [ ] On-call team notified with exact time window
- [ ] Monitoring dashboards prepared
- [ ] Alert thresholds configured
- [ ] Escalation path established
Production Safeguards¶
- Time Window: Low-traffic periods only (2-6 AM)
- Blast Radius: Maximum 10% of capacity
- Duration: 5 minutes maximum
- Monitoring: Real-time SLI tracking
- Rollback: Automated based on error rate
Blast Radius Calculation¶
Formula¶
User Impact Factor:
- 1.0: No customer-facing impact
- 2.0: Indirect customer impact (internal API)
- 5.0: Direct customer impact (web frontend)
- 10.0: Payment or authentication systems
Example Calculations¶
Scenario 1: Background Worker¶
Affected: 2 of 10 worker pods
Total: 10 workers
User Impact: 1.0 (async processing)
Blast Radius: (2/10) × 1.0 = 0.2 (20%, low risk)
Scenario 2: API Gateway¶
Affected: 1 of 3 API pods
Total: 3 pods
User Impact: 5.0 (customer-facing)
Blast Radius: (1/3) × 5.0 = 1.67 (167%, high risk)
Scenario 3: Payment Processor¶
Affected: 1 of 5 payment pods
Total: 5 pods
User Impact: 10.0 (critical path)
Blast Radius: (1/5) × 10.0 = 2.0 (200%, critical risk)
Related Topics¶
- Hypothesis Formation - Define experiment scope
- Success Criteria - Define acceptable impact
- SLI Monitoring - Track impact in real-time
Controlled blast radius is the difference between learning and breaking. Start small, validate, then scale.