# SLI Monitoring During Chaos
Link chaos experiments to Service Level Indicators (SLIs) for measurable validation.
**SLI Validation Checklist**
- [ ] Baseline metrics captured before chaos
- [ ] Live metrics tracked during chaos
- [ ] Recovery metrics measured after chaos
- [ ] Comparison shows system returned to baseline
- [ ] No degradation persists after experiment ends
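Wired together, the checklist maps onto a short sequence of workflow steps, sketched below (assuming Argo Workflows; the `inject-chaos` template name is hypothetical, and the `query-slis` steps are detailed in the sections that follow):

```yaml
# Sketch: the three measurement steps wrapped around chaos injection
steps:
  - - name: get-baseline              # pre-chaos snapshot
      template: query-slis
  - - name: inject-chaos              # hypothetical chaos template
      template: inject-chaos
    - name: validate-during-chaos     # runs in parallel with injection
      template: query-slis
  - - name: measure-recovery          # post-chaos comparison
      template: query-slis
```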
## Baseline Measurement
Capture pre-chaos metrics:
```yaml
- name: get-baseline
  template: query-slis
  arguments:
    parameters:
      - name: window
        value: "5m"
      - name: metrics
        value: "error_rate,p99_latency,availability"
```
An example baseline query for the error-rate SLI (a sketch, assuming standard `http_requests_total` counters):
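```promql
# Error-rate SLI over the 5m baseline window: failed requests as a
# fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```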
## Live Monitoring
Track metrics during chaos injection:
```yaml
- name: validate-during-chaos
  template: query-slis
  arguments:
    parameters:
      - name: window
        value: "live"
      - name: comparison
        value: "baseline"
```
Alert if metrics exceed thresholds:
```promql
# Alert on excessive error rate, but only while an experiment is running.
# on() is needed because the two series carry different label sets.
(
  rate(http_requests_total{status=~"5.."}[1m]) > 0.02
) and on() (
  chaos_experiment_active == 1
)
```
## Recovery Validation
Measure post-chaos recovery:
```yaml
- name: measure-recovery
  template: query-slis
  arguments:
    parameters:
      - name: window
        value: "5m-post"
      - name: baseline
        value: "{{steps.get-baseline.outputs.parameters.metrics}}"
```
Verify metrics return to baseline:
```promql
# Recovery check: error rate back to within 0.001 of baseline.
# The left side is aggregated so it matches the single
# baseline_error_rate series (see the recording rules below).
abs(
  sum(rate(http_requests_total{status=~"5.."}[1m]))
  - baseline_error_rate
) < 0.001
```
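The check above assumes a `baseline_error_rate` series exists. One way to provide it (and a latency counterpart used in the thresholds below) is a recording rule that reads the pre-chaos window via `offset`; the 10m offset is an assumption to be matched to your experiment lead time:

```yaml
groups:
  - name: chaos-baselines
    rules:
      # Aggregate error rate from the 5m window before chaos began
      - record: baseline_error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m] offset 10m))
      # Same pattern for latency, used by the "< 2x baseline" threshold
      - record: baseline_p99_latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m] offset 10m)))
```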
## Key Metrics to Track

### Service Level Indicators

| Metric | Query | Threshold |
|---|---|---|
| Error Rate | `rate(http_requests_total{status=~"5.."}[1m])` | < 0.5% during chaos |
| P99 Latency | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))` | < 2× baseline |
| Availability | `avg_over_time(up[5m])` | > 99.5% |
| Throughput | `rate(http_requests_total[1m])` | > 80% of baseline |
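The relative thresholds ("< 2× baseline", "> 80% of baseline") can be written directly in PromQL against the recorded baselines sketched in the previous section:

```promql
# P99 latency must stay below twice the recorded pre-chaos baseline
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[1m]))
) < 2 * scalar(baseline_p99_latency)
```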
### Infrastructure Metrics

| Metric | Query | Purpose |
|---|---|---|
| Pod Restarts | `kube_pod_container_status_restarts_total` | Detect crash loops |
| CPU Usage | `rate(container_cpu_usage_seconds_total[1m])` | Resource saturation |
| Memory Usage | `container_memory_usage_bytes` | Memory leaks or pressure |
| Network Errors | `rate(node_network_receive_errs_total[1m])` | Network degradation |
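Restarts during the chaos window, for example, can be caught with an `increase()` check (a sketch; the 5m window is an assumption):

```promql
# Any container restart during the experiment window suggests a crash loop
increase(kube_pod_container_status_restarts_total[5m]) > 0
```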
## Monitoring Workflow

### Pre-Chaos (T-5m to T0)
- Query baseline metrics
- Store results for comparison
- Verify metrics are stable
- Confirm no ongoing incidents
```bash
# Capture baseline (assumes Prometheus at http://prometheus:9090 and a
# single-series result; awk extracts the value from promtool's output)
BASELINE_ERROR_RATE=$(promtool query instant \
  http://prometheus:9090 \
  'rate(http_requests_total{status=~"5.."}[1m])' \
  --time="$(date -u +%s)" | awk '{print $3}')
echo "Baseline error rate: $BASELINE_ERROR_RATE"
```
### During Chaos (T0 to T+duration)
- Track live metrics every 10s
- Compare to baseline thresholds
- Alert if thresholds exceeded
- Trigger rollback if critical
```bash
# Monitor during chaos; chaos_active and trigger_rollback are placeholders
# for your experiment-status check and rollback hook
while chaos_active; do
  ERROR_RATE=$(promtool query instant \
    http://prometheus:9090 \
    'rate(http_requests_total{status=~"5.."}[1m])' | awk '{print $3}')
  # [ "$x" > y ] is a string test in sh; use bc for float comparison
  if [ "$(echo "$ERROR_RATE > 0.02" | bc -l)" -eq 1 ]; then
    echo "ALERT: Error rate exceeded threshold"
    trigger_rollback
  fi
  sleep 10
done
```
### Post-Chaos (T+duration to T+duration+5m)
- Measure recovery metrics
- Compare to baseline
- Verify full recovery
- Document results
```bash
# Verify recovery: compare against baseline within a tolerance rather
# than exact equality (rates rarely match to full precision)
RECOVERY_ERROR_RATE=$(promtool query instant \
  http://prometheus:9090 \
  'rate(http_requests_total{status=~"5.."}[1m])' \
  --time="$(date -u +%s)" | awk '{print $3}')
DIFF=$(echo "$RECOVERY_ERROR_RATE - $BASELINE_ERROR_RATE" | bc -l)
if [ "$(echo "${DIFF#-} < 0.001" | bc -l)" -eq 1 ]; then
  echo "SUCCESS: System recovered to baseline"
else
  echo "WARNING: System has not fully recovered"
fi
```
## Metric Collection Patterns

### Pattern 1: Prometheus Queries
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-queries
data:
  error-rate.prom: |
    rate(http_requests_total{status=~"5.."}[1m])
  p99-latency.prom: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[1m]))
  availability.prom: |
    avg_over_time(up{job="api-server"}[5m])
```
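Mounted into a pod, the stored queries can be fed straight to promtool (a sketch; the mount path and Prometheus address are assumptions):

```bash
# Run a stored query against Prometheus (ConfigMap mounted at /etc/chaos-queries)
promtool query instant http://prometheus:9090 \
  "$(cat /etc/chaos-queries/error-rate.prom)"
```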
### Pattern 2: Grafana Dashboard
Create a dedicated chaos experiment dashboard:
- Panel 1: Error rate (baseline vs. current; sketched below)
- Panel 2: Latency percentiles (P50, P95, P99)
- Panel 3: Throughput (requests/second)
- Panel 4: Infrastructure metrics (CPU, memory, restarts)
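A minimal panel definition for the baseline-vs-current comparison might look like the following sketch of Grafana's dashboard JSON (the `baseline_error_rate` series comes from the recording rule sketched earlier; all values are illustrative):

```json
{
  "title": "Error rate (baseline vs. current)",
  "type": "timeseries",
  "targets": [
    { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m]))", "legendFormat": "current" },
    { "expr": "baseline_error_rate", "legendFormat": "baseline" }
  ]
}
```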
### Pattern 3: Alerting Rules
```yaml
groups:
  - name: chaos-monitoring
    interval: 10s
    rules:
      - alert: ChaosErrorRateExceeded
        expr: |
          rate(http_requests_total{status=~"5.."}[1m]) > 0.02
          and on() chaos_experiment_active == 1
        for: 30s
        annotations:
          summary: "Chaos experiment causing excessive errors"
```
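Rule files can be validated before deployment (the filename is illustrative):

```bash
# Syntax-check the rule file before loading it into Prometheus
promtool check rules chaos-monitoring-rules.yaml
```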
## Related Topics
- Success Criteria - Define metric thresholds
- Validation Patterns - Verify experiment outcomes
- Observability - Complete monitoring setup
Metrics are the language of chaos engineering. Without measurement, chaos is just noise.