Observability Patterns
Key Metrics to Instrument
Track Chaos Experiment State
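Expose a gauge that reports whether a chaos experiment is active (`chaos_experiment_active`, set as shown under Chaos Experiment Tracking) and surface it in dashboards and alerts. When debugging a 3 a.m. incident, you need to know immediately whether chaos injection is running. The metrics below record how the application responds while failures are injected.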
from prometheus_client import Counter, Gauge, Histogram

# Application-level circuit breaker state
circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half-open)',
    labelnames=['service', 'dependency']
)
# Fallback activation
fallback_activated = Counter(
    'fallback_activated_total',
    'Total times fallback logic was activated',
    labelnames=['feature', 'reason']
)
# Graceful degradation level
degradation_level = Gauge(
    'degradation_level',
    'Current degradation level (0=full, 1=partial, 2=fallback)',
    labelnames=['service']
)
# Recovery time after failure
recovery_time_seconds = Histogram(
    'recovery_time_seconds',
    'Time to recover from failure injection',
    labelnames=['failure_type']
)
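The state gauge is only trustworthy if every transition updates it. The following is a minimal sketch of that wiring, not a production breaker: the `CircuitBreaker` class, the `BreakerState` enum, and the `failure_threshold` parameter are hypothetical names introduced for illustration, and cool-down timing is omitted. The `gauge` argument can be any object with a `set()` method, for example the child returned by `circuit_breaker_state.labels(...)`.

```python
from enum import IntEnum


class BreakerState(IntEnum):
    """Numeric values match the circuit_breaker_state help text."""
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2


class CircuitBreaker:
    """Hypothetical breaker that mirrors every state change into a gauge."""

    def __init__(self, gauge, failure_threshold=5):
        self._gauge = gauge
        self._failure_threshold = failure_threshold
        self._failures = 0
        self._transition(BreakerState.CLOSED)

    def _transition(self, state):
        # Single choke point, so the gauge cannot drift from the real state.
        self.state = state
        self._gauge.set(int(state))

    def record_failure(self):
        self._failures += 1
        # A failed probe in half-open reopens the breaker immediately.
        if (self.state is BreakerState.HALF_OPEN
                or self._failures >= self._failure_threshold):
            self._transition(BreakerState.OPEN)

    def record_success(self):
        self._failures = 0
        self._transition(BreakerState.CLOSED)

    def allow_probe(self):
        # Call after a cool-down period (timer logic omitted in this sketch).
        if self.state is BreakerState.OPEN:
            self._transition(BreakerState.HALF_OPEN)
```

Routing all changes through `_transition` keeps the metric and the in-memory state in one place, which is what lets an alert such as CircuitBreakerDidNotOpen be trusted during an experiment.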
Alert Rules for Chaos Validation
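groups:
  - name: chaos_validation
    interval: 30s
    rules:
      - alert: UnexpectedErrorRateDuringChaos
        # `and` matches on identical label sets by default. on() drops that
        # requirement, because chaos_experiment_active carries no `status` label.
        # The threshold is in 5xx responses per second, per series.
        expr: |
          (rate(http_requests_total{status=~"5.."}[1m]) > 0.02)
          and on()
          (chaos_experiment_active == 1)
        for: 1m
        labels:
          severity: critical
          component: observability
        annotations:
          summary: "Error rate exceeded threshold during chaos experiment"
      - alert: CircuitBreakerDidNotOpen
        # Assumes both metrics carry a `service` label; adjust on(...) to the
        # labels they actually share.
        expr: |
          (increase(external_api_errors_total[5m]) > 10)
          and on(service)
          (circuit_breaker_state == 0)
        for: 30s
        labels:
          severity: warning
          component: reliability
        annotations:
          summary: "Circuit breaker should have opened but didn't"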
Monitoring Integration
Prometheus Queries
# Error rate during chaos
rate(http_requests_total{status=~"5.."}[1m])
# Latency impact
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))
# Pod replacement time
min(time() - pod_created_timestamp_seconds) by (pod)
# Circuit breaker state
circuit_breaker_state{service="api"}
# Cache hit rate during degradation
increase(cache_hits_total[5m]) / increase(cache_requests_total[5m])
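These queries can also be run from an experiment script through the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The sketch below uses only the standard library. The helper names and the `http://prometheus:9090` address in the usage note are assumptions, and the parser handles vector results only.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload):
    """Map each series' sorted label tuple to its float value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    results = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the API returns values as strings
        results[labels] = float(value)
    return results


def query_prometheus(base_url, promql, timeout=5.0):
    """Run an instant query; needs a reachable Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `query_prometheus("http://prometheus:9090", 'circuit_breaker_state{service="api"}')` would return one entry per dependency, which an experiment can check before and after injection.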
Grafana Dashboards
Key panels to track during chaos:
- Error Rate Timeline - Shows spike during chaos injection
- Latency Percentiles - P50, P95, P99 over experiment duration
- Pod Status - Running, pending, failed pods
- Circuit Breaker State - Open/closed/half-open timeline
- Recovery Metrics - Time to baseline after chaos ends
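The "time to baseline" figure behind the Recovery Metrics panel can be computed from sampled data. The sketch below is one possible definition, not the only one: it treats the metric as recovered once it stays within a tolerance of the baseline for several consecutive samples. The function name and the `tolerance` and `stable_points` parameters are assumptions.

```python
def time_to_baseline(samples, chaos_end, baseline, tolerance=0.1, stable_points=3):
    """Seconds from chaos_end until the metric stays near its baseline.

    samples: iterable of (unix_timestamp, value) pairs, sorted by timestamp.
    Returns None if the metric never stabilises within the samples.
    """
    limit = baseline * (1 + tolerance)
    streak_start = None
    streak = 0
    for ts, value in samples:
        if ts < chaos_end:
            continue  # ignore samples taken while chaos was still running
        if value <= limit:
            if streak == 0:
                streak_start = ts
            streak += 1
            if streak >= stable_points:
                # Recovery is dated to the first sample of the stable run.
                return streak_start - chaos_end
        else:
            streak = 0  # a relapse resets the run
    return None
```

Requiring a run of stable samples avoids declaring recovery on a single lucky data point. The result can be fed into `recovery_time_seconds.labels(failure_type=...).observe(...)`.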
Chaos Experiment Tracking
Mark experiment windows in metrics:
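import time
from prometheus_client import Gauge, Histogram

# Metric definitions; names other than chaos_experiment_active are assumed
chaos_experiment_active = Gauge('chaos_experiment_active', 'Whether a chaos experiment is running (1) or not (0)')
chaos_experiment_info = Gauge('chaos_experiment_info', 'Metadata of the current chaos experiment', labelnames=['name', 'blast_radius', 'namespace'])
chaos_experiment_duration = Histogram('chaos_experiment_duration_seconds', 'Duration of completed chaos experiments')

# Set experiment active flag
started_at = time.monotonic()
chaos_experiment_active.set(1)
# Record experiment metadata
chaos_experiment_info.labels(
    name="pod-deletion",
    blast_radius="1/3",
    namespace="production"
).set(1)
# At completion
chaos_experiment_active.set(0)
experiment_duration_seconds = time.monotonic() - started_at
chaos_experiment_duration.observe(experiment_duration_seconds)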
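If the experiment code raises between the two `set()` calls, the active flag stays at 1 and every later alert is misattributed to chaos. One way to guard against that is a context manager, sketched below. The `chaos_window` name and the injectable `clock` parameter are assumptions; the two metric arguments are any objects with `set()` and `observe()` methods, such as the gauge and histogram used above.

```python
import time
from contextlib import contextmanager


@contextmanager
def chaos_window(active_gauge, duration_histogram, clock=time.monotonic):
    """Flag an experiment as active for the duration of the with-block."""
    started = clock()
    active_gauge.set(1)
    try:
        yield
    finally:
        # Runs on success and on exception, so the flag cannot stay stuck at 1.
        active_gauge.set(0)
        duration_histogram.observe(clock() - started)
```

Usage would look like `with chaos_window(chaos_experiment_active, chaos_experiment_duration): run_pod_deletion()`, where `run_pod_deletion` is a placeholder for the actual injection step.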
Related Documentation
- Back to Overview - Chaos engineering introduction
- Validation Patterns - SLI monitoring and testing
- Operations - Running experiments safely