Resource Chaos Experiments

These experiments apply memory pressure and CPU stress to validate resource limit enforcement and graceful degradation.

OOMKill Is Not Graceful Shutdown

When a container exceeds its memory limit, the kernel OOM killer terminates it with SIGKILL, not SIGTERM. SIGKILL cannot be caught or handled, so your application has zero time to clean up. Test memory limits under realistic load.
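To make the difference concrete, here is a minimal sketch (Linux assumed) showing that a process can register a handler for SIGTERM but not for SIGKILL. The helper names `install_sigterm_handler` and `can_install_handler` are hypothetical and exist only for this illustration.

```python
import signal


def install_sigterm_handler(state: dict) -> None:
    """Record a shutdown request when SIGTERM arrives (the graceful path)."""
    def _on_sigterm(signum, frame):
        state["shutdown_requested"] = True

    signal.signal(signal.SIGTERM, _on_sigterm)


def can_install_handler(sig: signal.Signals) -> bool:
    """Return True if the OS lets this process install a handler for sig."""
    try:
        # Re-register whatever is currently installed. This changes nothing
        # for catchable signals and raises OSError for SIGKILL.
        signal.signal(sig, signal.getsignal(sig))
        return True
    except OSError:
        return False
```

Because no handler can run on SIGKILL, any cleanup that matters (flushing buffers, committing offsets, releasing locks) has to happen continuously or be recoverable at startup rather than in a shutdown hook.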

Experiment 3: Memory Pressure (Resource Exhaustion Testing)

Purpose: Verify that the application gracefully handles memory pressure and that Kubernetes properly evicts or restarts memory-intensive containers.

Setup: Service with memory limits and OOM kill guards configured.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  namespace: chaos-testing
  name: memory-pressure-worker
spec:
  stressors:
    memory:
      workers: 1
      size: "256MB"
  duration: 3m
  selector:
    namespaces:
      - production
    labelSelectors:
      app: background-worker
      memory-stress-target: "true"
  mode: fixed
  value: "1"

Expected behavior

0-10s: Memory consumption increases gradually
10-30s: Application detects memory pressure, triggers cleanup
30-60s: If memory exceeds soft limits, shed non-critical operations
60-120s: Container memory usage approaches cgroup limit
120-180s: If memory exceeds the hard limit, the OOM killer terminates the container and the kubelet restarts it
180s+: Stress ends; the restarted container recovers and passes its health checks
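The detection step in the timeline above could be sketched as follows. This is an illustration, not the service's actual implementation: it assumes cgroup v2 (where the container sees `memory.current` and `memory.max` under `/sys/fs/cgroup`), and the 75% and 90% thresholds and the level names are hypothetical values chosen for the example.

```python
from pathlib import Path
from typing import Optional, Tuple

CGROUP_DIR = Path("/sys/fs/cgroup")  # cgroup v2 unified hierarchy (assumption)


def read_cgroup_memory(cgroup_dir: Path = CGROUP_DIR) -> Tuple[int, Optional[int]]:
    """Return (current usage in bytes, limit in bytes or None if unlimited)."""
    current = int((cgroup_dir / "memory.current").read_text())
    raw_max = (cgroup_dir / "memory.max").read_text().strip()
    limit = None if raw_max == "max" else int(raw_max)
    return current, limit


def pressure_level(usage: int, limit: Optional[int],
                   soft: float = 0.75, hard: float = 0.90) -> str:
    """Classify memory pressure; soft/hard are hypothetical thresholds."""
    if limit is None or limit <= 0:
        return "normal"  # no limit configured, nothing to compare against
    ratio = usage / limit
    if ratio >= hard:
        return "critical"  # close to OOMKill: reject new work
    if ratio >= soft:
        return "elevated"  # shed non-critical operations, trigger cleanup
    return "normal"
```

A worker might call `pressure_level(*read_cgroup_memory())` on a timer and defer background tasks while the result is not `"normal"`.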

Success criteria

Memory pressure doesn't cause thread exhaustion
Garbage collection frequency increases appropriately
Non-critical background tasks are deferred or cancelled
Pod remains responsive (health checks pass) until the hard limit is reached
If OOMKilled, the container restarts and is running again within 30 seconds
No hung processes or deadlocks during recovery
Persistent state is not corrupted (database checks pass)
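The 30-second restart criterion can be checked from the pod's container status. The sketch below assumes you pass in one entry of `status.containerStatuses` from `kubectl get pod -o json`; the function names and the `budget` parameter are hypothetical. It measures the time from the OOMKill to the restarted container starting, not to readiness.

```python
from datetime import datetime, timezone
from typing import Optional


def _parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes RFC 3339 UTC timestamp such as 2024-01-01T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def oom_restart_seconds(container_status: dict) -> Optional[float]:
    """Seconds between an OOMKill and the restarted container starting.

    Returns None if the last termination was not an OOMKill or the
    container is not currently running.
    """
    terminated = container_status.get("lastState", {}).get("terminated")
    running = container_status.get("state", {}).get("running")
    if not terminated or not running or terminated.get("reason") != "OOMKilled":
        return None
    finished = _parse_k8s_time(terminated["finishedAt"])
    started = _parse_k8s_time(running["startedAt"])
    return (started - finished).total_seconds()


def meets_restart_budget(container_status: dict, budget: float = 30.0) -> bool:
    """True if there was no OOM restart or it completed within the budget."""
    elapsed = oom_restart_seconds(container_status)
    return elapsed is None or elapsed <= budget
```

Repeated OOMKills put the container into restart backoff, so this latency will likely grow on later restarts within the same run.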

Validation queries

# Memory consumption stays below limit (PromQL regex matches are anchored, so match the full pod name prefix)
container_memory_usage_bytes{pod=~"background-worker.*"} < 256e6

# GC pause time increases during stress
rate(jvm_gc_duration_seconds_sum[1m]) > 0.01

# Task queue backlog shrinks over the window, indicating load shedding
delta(background_task_queue_length[5m]) < 0

# Pod restart count increases only if memory exhausted
increase(kube_pod_container_status_restarts_total{pod=~"background-worker.*"}[5m]) <= 1

# After stress is removed, the stressed pod's memory returns to within 50 MB of the average across worker pods
max(container_memory_usage_bytes{pod=~"background-worker.*"}) - avg(container_memory_usage_bytes{pod=~"background-worker.*"}) < 50e6
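Queries like these can be run during the experiment through Prometheus's HTTP API (`/api/v1/query`). The following is a minimal sketch: the `PROMETHEUS_URL` address and the function names are assumptions for illustration. Note that a PromQL comparison acts as a filter, so the result contains only the series for which the comparison holds.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical address


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def matching_series(response: dict) -> list:
    """Return the series in an instant-query response, or raise on API error."""
    if response.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {response.get('error')}")
    return response["data"]["result"]


def run_check(promql: str, base_url: str = PROMETHEUS_URL, timeout: float = 10.0) -> list:
    """Run an instant query and return the series that satisfy it."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return matching_series(json.load(resp))
```

To detect a failure rather than a pass, invert the comparison and expect an empty result, for example `run_check('increase(kube_pod_container_status_restarts_total{namespace="production"}[5m]) > 1') == []`.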

Rollback procedure

# Remove memory stress chaos
kubectl delete stresschaos -n chaos-testing memory-pressure-worker

# Verify pod stability
kubectl get pods -n production -l app=background-worker
kubectl describe pod -n production -l app=background-worker | grep -A 5 "Last State"

# Check memory normalizes
kubectl top pods -n production -l app=background-worker

# If pod is stuck, force restart
kubectl rollout restart deployment/background-worker -n production

# Validate data integrity
kubectl exec -n production deployment/background-worker -- \
  sql-validate-schema.sh