Blast Radius Control¶
Uncontrolled chaos destroys production. Controlled chaos reveals fragility.
Targeting Strategies¶
Namespace Isolation¶
Run experiments in dedicated chaos namespaces:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
namespace: chaos-testing
name: pod-deletion-staging
spec:
action: pod-kill
mode: fixed
value: 1
selector:
namespaces:
- staging
- staging-services
labelSelectors:
app: api-gateway
Label-Based Scoping¶
Tag experiments with blast radius constraints:
Time Windows¶
Restrict chaos to low-impact periods:
Progressive Intensity¶
Start small, escalate based on observability:
# Day 1: Single pod, 30 seconds
mode: fixed
value: 1
duration: 30s
# Day 2: Two pods, 1 minute
mode: fixed
value: 2
duration: 1m
# Day 7: 25% during business hours
mode: percentage
value: 25
duration: 5m
Real-World Progression
A SaaS company started chaos with 1 pod for 30 seconds in staging. After 2 weeks of validation, they progressed to 10% of production pods for 5 minutes. After 6 weeks, they ran continuous steady-state chaos at 5% intensity.
Automatic Rollback¶
Configure automated rollback to prevent cascade failures:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-deletion-with-rollback
spec:
action: pod-kill
duration: "2m"
selector:
namespaces:
- production
labelSelectors:
tier: web
scheduling:
# Auto-rollback if error rate exceeds threshold
backoffBase: 1m
maxAttempts: 3
Monitor error rates and auto-cancel:
# Example monitoring script
ERROR_RATE=$(curl -s prometheus:9090/api/v1/query \
--data-urlencode 'query=rate(http_requests_total{status=~"5.."}[1m])' | \
jq '.data.result[0].value[1] | tonumber')
if (( $(echo "$ERROR_RATE > 0.02" | bc -l) )); then
kubectl delete podchaos -n chaos-testing pod-deletion-with-rollback
echo "ROLLBACK: Error rate exceeded 2%"
fi
Execution Best Practices¶
Start small¶
# Test in staging first
mode: fixed
value: 1
duration: 30s
# Only move to production after staging validation
mode: percentage
value: 10
duration: 5m
Observe continuously¶
# Open observability dashboard before starting
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit http://localhost:9090 with key queries pre-loaded
# Open logs aggregator
kubectl port-forward -n logging svc/loki 3100:3100
# Monitor chaos status live
watch kubectl get podchaos,networkchaos -n chaos-testing
Stop at the first sign of trouble¶
# If alert fires unexpectedly, delete chaos resource immediately
kubectl delete podchaos -n chaos-testing <name>
# Verify rollback worked
kubectl get pods -n production
# Expected: All replicas running and ready
Safety Checklist¶
Before running any experiment:
- [ ] Experiment documented with owner and rollback procedure
- [ ] On-call team notified of chaos window
- [ ] Blast radius validated (namespace, labels, pod count)
- [ ] Rollback tested in staging
- [ ] SLI dashboards visible and alert thresholds set
- [ ] No ongoing production incidents
- [ ] Low-traffic window selected
- [ ] Escalation path established
Related Documentation¶
- Back to Overview - Chaos engineering introduction
- Validation Patterns - SLI monitoring and testing
- Experiment Catalog - Ready-to-use scenarios