Platform Component Replacement¶
In platform engineering, the strangler fig pattern replaces entire components (databases, service meshes, operators, control planes) without routing traffic between old and new systems.
The key difference: You replace the component, not route around it.
Component Replacement vs Traffic Routing
- Traffic routing: Gradually shift user traffic from the old API to the new API
- Component replacement: Build the new component, ensure compatibility, swap it in, remove the old component
This is the strangler fig pattern for infrastructure.
When Component Replacement Applies¶
| Scenario | Approach | Why |
|---|---|---|
| Migrating databases | Component replacement | PostgreSQL → PostgreSQL cluster (no app changes) |
| Replacing service mesh | Component replacement | Linkerd → Istio (same abstraction) |
| Upgrading operators | Component replacement | Old CRD version → New CRD version (API compatibility) |
| Swapping storage backends | Component replacement | EBS → EFS (mount point stays the same) |
| Replacing user-facing APIs | Traffic routing | REST v1 → REST v2 (gradual user migration) |
| Feature rollout | Traffic routing | Old checkout → New checkout (A/B testing) |
Decision rule: If end users don't directly interact with the component, use component replacement. If they do, use traffic routing.
The Build-Replace-Remove Pattern¶
flowchart TD
subgraph phase1[Phase 1: Build New]
A[Old Component] --> B[Build New Component]
B --> C{Compatible?}
C -->|No| D[Add Compatibility Layer]
C -->|Yes| E[Ready]
D --> E
end
subgraph phase2[Phase 2: Run Parallel]
E --> F[Deploy New Component]
F --> G[Old + New Both Running]
G --> H[Validate New Component]
end
subgraph phase3[Phase 3: Swap]
H --> I[Update References]
I --> J[Point to New Component]
J --> K[Monitor]
end
subgraph phase4[Phase 4: Remove]
K --> L{New Component Stable?}
L -->|Yes| M[Remove Old Component]
L -->|No| N[Rollback to Old]
N --> G
end
%% Ghostty Hardcore Theme
style A fill:#f92672,color:#1b1d1e
style B fill:#fd971e,color:#1b1d1e
style E fill:#a7e22e,color:#1b1d1e
style G fill:#9e6ffe,color:#1b1d1e
style J fill:#65d9ef,color:#1b1d1e
style M fill:#a7e22e,color:#1b1d1e
style N fill:#f92672,color:#1b1d1e
No traffic routing. No percentage-based rollout. Just: build, deploy, swap, remove.
Real-World Example: Database Migration¶
Scenario: PostgreSQL Single Instance → HA Cluster¶
The application connects to PostgreSQL. You want high availability without changing the application.
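To make that constraint concrete, here is a minimal sketch of the application side, with hypothetical names throughout: the app reads a connection string that references only the stable "postgres" hostname, so nothing in it knows which implementation sits behind that name.

# Hypothetical Deployment: the app depends only on the "postgres" hostname
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.0
          env:
            - name: DATABASE_URL
              value: postgres://app:secret@postgres:5432/mydb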
Traditional Approach (Downtime)¶
# Stop application
# Export database
# Deploy new cluster
# Import database
# Update connection string
# Start application
# Hope nothing broke
Downtime: 2-4 hours. Rollback: Restore from backup.
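For reference, the dump-and-restore version of this cutover might look like the following sketch (hypothetical database and host names). Everything written after the dump starts is lost, which is why the application has to stay down for the whole window:

# Take a compressed logical dump of the old instance (app already stopped)
pg_dump -h old-postgres -Fc -d mydb -f mydb.dump
# Restore into the new cluster, then repoint the app and restart it
pg_restore -h postgres-ha -d mydb mydb.dump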
Component Replacement Approach (Zero Downtime)¶
Phase 1: Build¶
# Deploy new PostgreSQL cluster
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-ha
spec:
  instances: 3
  storage:
    size: 100Gi
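Assuming the CloudNativePG operator is installed (the manifest above uses its Cluster CRD), applying it and watching the instances come up looks roughly like this; the cnpg.io/cluster pod label is the one current CloudNativePG releases apply:

# Apply the Cluster manifest and watch the three instances start
kubectl apply -f postgres-ha.yaml
kubectl get cluster postgres-ha
kubectl get pods -l cnpg.io/cluster=postgres-ha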
Phase 2: Replicate¶
# Seed the new cluster from the old instance and keep it in sync.
# The old database is the source; the new cluster replays its WAL in real time.
# Note: pg_basebackup sets up physical (streaming) replication, not logical
# replication, so the PostgreSQL major versions must match.
pg_basebackup -h old-postgres -D /pgdata -X stream
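With CloudNativePG specifically, the same seeding can be declared on the Cluster resource itself rather than running pg_basebackup by hand. A sketch, assuming the old instance allows a streaming_replica user whose password lives in a hypothetical Secret:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-ha
spec:
  instances: 3
  storage:
    size: 100Gi
  bootstrap:
    pg_basebackup:
      source: old-postgres          # seed from the external cluster below
  externalClusters:
    - name: old-postgres
      connectionParameters:
        host: old-postgres.default.svc
        user: streaming_replica
        dbname: postgres
      password:
        name: old-postgres-replica  # hypothetical Secret holding the password
        key: password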
Phase 3: Compatibility Layer¶
# Create a Service that points to the old database.
# The application keeps using the stable "postgres" hostname.
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres-old   # Still pointing to old
  ports:
    - port: 5432
      targetPort: 5432
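A quick sanity check that the compatibility layer is wired correctly: the Endpoints object behind the stable name should list exactly the old database's pod IPs.

# Which pod IPs sit behind the stable "postgres" name right now?
kubectl get endpoints postgres
kubectl get pods -l app=postgres-old -o wide   # should show the same IPs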
Phase 4: Validate¶
# Monitor replication lag
# Ensure new cluster is in sync
# Run read-only queries against new cluster
# Compare results with old database
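With the streaming replication set up in Phase 2, lag can be read off the old primary with a standard catalog query; the connection details here are hypothetical:

# Run on the old primary: bytes of WAL the new cluster has yet to replay
psql -h old-postgres -U postgres -c \
  "SELECT application_name,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
   FROM pg_stat_replication;"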
Phase 5: Swap¶
# Update the Service to point to the new cluster.
# Application code doesn't change; the connection string stays "postgres:5432".
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres-ha   # Now pointing to new cluster
  ports:
    - port: 5432
      targetPort: 5432
Swap happens in seconds. Application reconnects automatically.
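One way to perform the swap as a single atomic update, sketched with kubectl patch against the Service above:

# Repoint the stable Service at the new cluster in one update
kubectl patch service postgres \
  -p '{"spec":{"selector":{"app":"postgres-ha"}}}'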
Phase 6: Monitor¶
# Watch application metrics
# Check database query latency
# Monitor error rates
# If issues: revert Service to old database
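Rollback is the same operation in reverse, which is what makes this phase cheap; a sketch matching the patch above:

# If error rates spike, point the stable Service back at the old database
kubectl patch service postgres \
  -p '{"spec":{"selector":{"app":"postgres-old"}}}'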
Phase 7: Remove Old¶
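Once the monitoring window passes, removal is just deleting the old resources; the names below assume the old single-instance database ran as a StatefulSet labeled app=postgres-old:

# Decommission the old instance after the agreed soak period
kubectl delete statefulset postgres-old
kubectl delete pvc -l app=postgres-old   # reclaims the old volume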
Downtime: Zero. Rollback: Change Service selector back to old database.
Additional Resources¶
- Platform Component Examples - Service mesh, operator, and storage migrations
- Compatibility Layers - Service abstraction, API gateway, conversion webhooks, database views
- Validation and Rollback - Validation checklists, rollback strategies, monitoring
- Edge Cases and Comparison - When NOT to use, comparison with traffic routing, gotchas
The new PostgreSQL cluster ran for 3 weeks in parallel. Replication lag stayed under 100ms. The Service selector changed at 2 AM on a Friday. Applications reconnected within 30 seconds. Error rates stayed flat. After 2 weeks of monitoring, the old cluster was decommissioned. Total downtime: zero minutes.