Monitoring and Rollback
Monitor both systems in parallel, compare their metrics, and roll back instantly if the new system degrades.
Migration Pattern
This guide covers the Strangler Fig pattern for incremental system migration. Review all sections for complete implementation strategy.
Monitoring During Migration
Track both systems:
# Prometheus queries
# Legacy request rate
rate(http_requests_total{service="api-legacy"}[5m])
# New system request rate
rate(http_requests_total{service="api-v2"}[5m])
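# Error rate comparison
# sum() aggregates away the status label so both sides of each division match;
# status=~"5.." counts every 5xx response, not only 500.
# Legacy error ratio
sum(rate(http_requests_total{service="api-legacy",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-legacy"}[5m]))
# New system error ratio (compare against the legacy ratio above)
sum(rate(http_requests_total{service="api-v2",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-v2"}[5m]))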
Alert if the new system's error rate exceeds the legacy error rate, and roll back the feature flag immediately.
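The comparison above can also be sketched outside Prometheus. This is a minimal illustration, assuming request and error counts for the same 5-minute window are already available for each service; the names `WindowCounts`, `error_ratio`, and `should_roll_back` are hypothetical and not part of any tool mentioned in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts for one service over a single comparison window."""
    total: int
    errors: int


def error_ratio(counts: WindowCounts) -> float:
    """Fraction of requests that failed; 0.0 when the service saw no traffic."""
    if counts.total == 0:
        return 0.0
    return counts.errors / counts.total


def should_roll_back(legacy: WindowCounts, new: WindowCounts) -> bool:
    """True when the new system's error ratio exceeds the legacy ratio."""
    return error_ratio(new) > error_ratio(legacy)
```

Comparing ratios rather than raw error counts matters here because the two systems receive different shares of traffic during the migration.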
Key Metrics to Track
| Metric | Why It Matters | Threshold |
|---|---|---|
| Error Rate | New system breaking requests | New ≤ Legacy |
| P95 Latency | Performance regression | New ≤ Legacy × 1.2 |
| Throughput | Capacity handling | New ≥ Legacy |
| Resource Usage | Cost implications | Monitor CPU/memory |
| Database Lag | Dual write consistency | < 100ms between DBs |
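The numeric thresholds in the table can be checked together before each traffic increase. The sketch below is one possible shape, assuming the metrics have already been collected into a hypothetical `Snapshot` record; resource usage is left out because the table gives it no numeric threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One service's metrics for the comparison window (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float
    throughput_rps: float


def threshold_violations(legacy: Snapshot, new: Snapshot, db_lag_ms: float) -> list:
    """Return the names of the table thresholds the new system currently breaks."""
    violations = []
    if new.error_rate > legacy.error_rate:              # New <= Legacy
        violations.append("error_rate")
    if new.p95_latency_ms > legacy.p95_latency_ms * 1.2:  # New <= Legacy x 1.2
        violations.append("p95_latency")
    if new.throughput_rps < legacy.throughput_rps:      # New >= Legacy
        violations.append("throughput")
    if db_lag_ms >= 100:                                # < 100ms between DBs
        violations.append("database_lag")
    return violations
```

An empty list means every numeric threshold holds; any other result names the metrics to investigate before shifting more traffic.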
Rollback Strategy
Feature flags enable instant rollback:
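# Illustrative request shape only; consult the LaunchDarkly REST API reference
# for the exact endpoint, authentication header, and patch format.
# Increase traffic to the new system (weights are out of 100000)
curl -X PATCH https://api.launchdarkly.com/flags/new-api-v2 \
  -d '{"variations": [{"value": true, "weight": 50000}]}' # 50%
# Monitor metrics...
# Roll back if issues appear
curl -X PATCH https://api.launchdarkly.com/flags/new-api-v2 \
  -d '{"variations": [{"value": true, "weight": 0}]}' # 0%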
No deployment needed. Instant traffic shift.
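To see why changing a weight shifts traffic without a deployment, it helps to look at how a percentage rollout can be evaluated per request. The following is a generic sketch of hash-based bucketing, not LaunchDarkly's actual algorithm; `bucket`, `routes_to_new_system`, and the user keys are hypothetical.

```python
import hashlib

TOTAL_WEIGHT = 100_000  # same scale as the flag weights above: 50000 means 50%


def bucket(user_key: str, flag_key: str) -> int:
    """Map a user to a stable bucket in [0, TOTAL_WEIGHT)."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TOTAL_WEIGHT


def routes_to_new_system(user_key: str, flag_key: str, weight: int) -> bool:
    """True when the user's bucket falls inside the new system's weight."""
    return bucket(user_key, flag_key) < weight
```

Because the bucket depends only on the user and flag keys, each user keeps the same assignment while the weight is unchanged, and setting the weight to 0 sends every request back to the legacy system on the next evaluation.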
Rollback Decision Tree
flowchart TD
A[Deploy New System] --> B{Error Rate Increase?}
B -->|Yes| C[Rollback Immediately]
B -->|No| D{Latency Increase?}
D -->|">20%"| C
D -->|No| E[Monitor 24 Hours]
E --> F{Stable?}
F -->|No| C
F -->|Yes| G[Increase Traffic]
%% Ghostty Hardcore Theme
style A fill:#65d9ef,color:#1b1d1e
style B fill:#fd971e,color:#1b1d1e
style C fill:#f92572,color:#1b1d1e
style D fill:#fd971e,color:#1b1d1e
style E fill:#a7e22e,color:#1b1d1e
style F fill:#fd971e,color:#1b1d1e
style G fill:#a7e22e,color:#1b1d1e
Error rate spike: roll back immediately. Latency degradation above 20%: roll back. Stable for 24 hours: increase traffic.
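The decision tree can be expressed as a small function, which is handy for runbook tooling or tests. This is a sketch under stated assumptions: the inputs are computed elsewhere, stability is judged only once the 24-hour window has elapsed (as in the flowchart), and the `Action` enum and `next_action` function are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    ROLL_BACK = "roll_back"
    KEEP_MONITORING = "keep_monitoring"
    INCREASE_TRAFFIC = "increase_traffic"


def next_action(error_rate_increased: bool, latency_increase_pct: float,
                hours_observed: float, stable: bool) -> Action:
    """Walk the rollback decision tree in the same order as the flowchart."""
    if error_rate_increased:
        return Action.ROLL_BACK
    if latency_increase_pct > 20:
        return Action.ROLL_BACK
    if hours_observed < 24:
        return Action.KEEP_MONITORING
    # The 24-hour window has elapsed: proceed only if the system stayed stable.
    return Action.INCREASE_TRAFFIC if stable else Action.ROLL_BACK
```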
Automated Rollback
Automate rollback based on metrics:
# Prometheus alerting rule
groups:
  - name: strangler_rollback
    rules:
      - alert: NewSystemHighErrorRate
        # sum() drops per-series labels so the two ratios can be compared;
        # status=~"5.." counts every 5xx response, not only 500.
        expr: |
          (
            sum(rate(http_requests_total{service="api-v2",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="api-v2"}[5m]))
          )
          >
          (
            sum(rate(http_requests_total{service="api-legacy",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="api-legacy"}[5m]))
          )
        for: 5m
        annotations:
          summary: "New system error rate exceeds legacy"
          description: "Roll back the new-api-v2 feature flag"
The alert triggers the rollback runbook. The rollback is then initiated automatically or manually, depending on severity.
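One way to wire the automatic path is a small handler behind an Alertmanager webhook receiver. The sketch below assumes the standard Alertmanager webhook JSON body (a top-level `alerts` list whose entries carry `status` and `labels`); `handle_alertmanager_webhook` and the `roll_back` callback are hypothetical, and the callback stands in for whatever sets the flag weight to 0.

```python
from typing import Callable

ROLLBACK_ALERT = "NewSystemHighErrorRate"  # alert name from the rule above


def handle_alertmanager_webhook(payload: dict, roll_back: Callable[[], None]) -> bool:
    """Call roll_back once if the payload contains the firing rollback alert.

    Returns True when a rollback was triggered.
    """
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        if name == ROLLBACK_ALERT and alert.get("status") == "firing":
            roll_back()
            return True
    return False
```

Resolved notifications and unrelated alerts are ignored, so the handler never rolls back twice for one payload and never reacts to an alert clearing.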
Observability Checklist
Before increasing traffic:
- [ ] Logs aggregated for both systems
- [ ] Metrics dashboards compare legacy vs new
- [ ] Error tracking (Sentry, Rollbar) configured
- [ ] Alerting rules deployed
- [ ] Runbook documented for rollback
- [ ] On-call team notified of migration phase
Don't proceed without observability in place.
Related Patterns
- Traffic Routing - Control traffic flow
- Implementation Strategies - Feature flags and shadow mode
- Migration Guide - Phased rollout timeline
Monitoring caught the latency spike at 10% traffic. Feature flag rolled back to 0% in 30 seconds. Root cause identified: database connection pool too small. Fixed, redeployed, resumed migration. Zero customer impact.