High Availability
Production event systems require redundancy. Single points of failure cause outages. High availability (HA) configurations ensure events continue flowing even when components fail.
HA Architecture
Redundant deployments across all components:
```mermaid
flowchart TB
    subgraph EventSources
        ES1[EventSource Pod 1]
        ES2[EventSource Pod 2]
    end
    subgraph EventBus
        EB1[JetStream Pod 1]
        EB2[JetStream Pod 2]
        EB3[JetStream Pod 3]
    end
    subgraph Sensors
        S1[Sensor Pod 1]
        S2[Sensor Pod 2]
    end
    ES1 --> EB1
    ES2 --> EB2
    EB1 <--> EB2
    EB2 <--> EB3
    EB3 <--> EB1
    EB1 --> S1
    EB2 --> S2
    %% Ghostty Hardcore Theme
    style ES1 fill:#fd971e,color:#1b1d1e
    style ES2 fill:#fd971e,color:#1b1d1e
    style EB1 fill:#515354,color:#f8f8f3
    style EB2 fill:#515354,color:#f8f8f3
    style EB3 fill:#515354,color:#f8f8f3
    style S1 fill:#f92572,color:#1b1d1e
    style S2 fill:#f92572,color:#1b1d1e
```
EventBus HA
JetStream clustering provides EventBus redundancy:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata:
  name: default
spec:
  jetstream:
    version: "2.9.11"
    replicas: 3
    persistence:
      accessMode: ReadWriteOnce
      storageClassName: standard
      volumeSize: 20Gi
    # Anti-affinity for node distribution
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                eventbus-name: default
            topologyKey: kubernetes.io/hostname
```
Key HA settings:
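| Setting | Purpose |
|---|---|
| `replicas: 3` | Raft needs a majority (quorum) to make progress; 3 replicas is the minimum that tolerates one failure. Prefer odd counts. |
| `persistence` | Stream data survives pod restarts |
| `podAntiAffinity` | Spreads pods across nodes |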
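The quorum arithmetic behind the `replicas: 3` recommendation can be checked quickly. The sketch below is illustrative only: the helper names `quorum` and `tolerated_failures` are made up for this example and are not part of Argo Events or NATS.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not part of any tool).

# quorum N: smallest majority of an N-member Raft group.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: members that can fail while a majority remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the figures for a few cluster sizes.
for n in 1 2 3 4 5; do
  echo "replicas=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Going from 3 to 4 replicas raises the quorum without tolerating any additional failure, which is why odd replica counts are preferred.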
EventSource HA
Scale EventSources for redundancy:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: webhook
spec:
  replicas: 2
  template:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  eventsource-name: webhook
              topologyKey: kubernetes.io/hostname
  webhook:
    endpoint:            # event name
      port: "12000"
      endpoint: /events
```
With multiple replicas, a Service load balances incoming webhooks across pods.
Sensor HA
Sensors can also run multiple replicas:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: processor
spec:
  replicas: 2
  template:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  sensor-name: processor
              topologyKey: kubernetes.io/hostname
  dependencies:
    - name: event
      eventSourceName: source
      eventName: event
  triggers:
    - template:
        name: process
        argoWorkflow:
          # ...
```
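Sensor replicas run in an active-passive arrangement: one replica is elected leader and processes events, and a standby takes over if the leader fails. Extra replicas therefore shorten failover time rather than add throughput. Do not rely on exactly-once processing across a failover; design triggers to be idempotent.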
Pod Disruption Budgets
Prevent all pods from being evicted simultaneously:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: eventbus-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      eventbus-name: default
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sensor-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      sensor-name: processor
```
PDBs ensure cluster operations (node drains, upgrades) don't cause outages.
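For the EventBus, `minAvailable` should be at least the Raft majority, so that a node drain cannot break quorum. The following sketch checks that relationship, assuming all replicas are healthy when the disruption starts; `pdb_check` is a hypothetical helper written for this example, not a kubectl or Argo Events command.

```shell
#!/bin/sh
# Illustrative check (hypothetical helper, not part of kubectl or Argo Events).

# pdb_check REPLICAS MIN_AVAILABLE
# Prints how many pods a voluntary disruption may remove at once and whether
# the pods that remain still form a Raft majority.
pdb_check() {
  replicas=$1
  min_available=$2
  allowed=$(( replicas - min_available ))
  majority=$(( replicas / 2 + 1 ))
  if [ "$min_available" -ge "$majority" ]; then
    echo "allowed_disruptions=$allowed quorum=kept"
  else
    echo "allowed_disruptions=$allowed quorum=at-risk"
  fi
}

pdb_check 3 2   # the eventbus-pdb values above
pdb_check 3 1   # too permissive for a 3-node cluster
```

On a live cluster, `kubectl get pdb -n argo-events` reports the current allowed disruptions for each budget.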
Cross-Zone Deployment
Spread components across availability zones:
```yaml
spec:
  template:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: eventbus   # must match the labels on your pods
            topologyKey: topology.kubernetes.io/zone
```
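A single zone failure won't take down the entire event system. Because the rule is `required`, each replica needs its own zone: three replicas need at least three zones, otherwise the extra pods stay Pending.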
Health Checks
Configure proper health probes:
```yaml
spec:
  template:
    container:
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
```
Kubernetes automatically replaces unhealthy pods and removes them from service load balancing.
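Graceful Shutdown
Ensure events aren't lost during pod termination by giving pods time to drain.
With a termination grace period of 60 seconds (the Kubernetes pod field `terminationGracePeriodSeconds`), pods get 60 seconds to finish processing before forced termination. In-flight events are acknowledged or returned to the queue.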
HA Costs
High availability requires more resources. The configuration above runs three EventBus replicas (each with its own 20Gi volume), two EventSource replicas, and two Sensor replicas: seven pods in total. Size your cluster accordingly and monitor resource usage.
Testing HA
Validate your HA setup:
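```shell
# Kill one random EventBus pod (deleting by label alone would remove every matching pod)
kubectl get pod -n argo-events -l eventbus-name=default -o name \
  | shuf -n 1 \
  | xargs kubectl delete -n argo-events --wait=false
# Verify events still flow
kubectl logs -n argo-events -l sensor-name=processor --tail=10
# Check EventBus cluster health (assumes the nats CLI is available in the
# container; adjust the pod name to match `kubectl get pods -n argo-events`)
kubectl exec -n argo-events eventbus-default-0 -- nats server check cluster
```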
If events stop flowing when a single pod dies, your HA configuration isn't working.
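During such a drill it helps to count how many pods are still fully ready. The sketch below parses `kubectl get pods --no-headers` output; `count_ready` is a hypothetical helper, and the sample rows (including the pod names) are made up so the block runs without a cluster.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name). Reads `kubectl get pods --no-headers`
# output on stdin and prints how many pods are Running with all containers ready.
count_ready() {
  awk '{
    split($2, ready, "/")    # READY column, e.g. "3/3"
    if ($3 == "Running" && ready[1] == ready[2]) n++
  } END { print n + 0 }'
}

# Canned sample standing in for real kubectl output (pod names are made up).
sample='eventbus-a   3/3   Running       0   5m
eventbus-b   3/3   Running       0   5m
eventbus-c   0/3   Terminating   0   5m'

printf '%s\n' "$sample" | count_ready
```

Against a live cluster, pipe `kubectl get pods -n argo-events -l eventbus-name=default --no-headers` into `count_ready`. For a three-replica EventBus the count should stay at 2 or higher while the deleted pod is being replaced.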
Related
- EventBus Configuration - Basic EventBus setup
- Backpressure Handling - Handle high load
- Retry Strategies - Recover from transient failures