Design Principles
Architectural guidance for building resilient automation.
Principles answer the why and when questions. They're decision frameworks, not code snippets.
Overview
| Principle | When to Apply | Trade-off |
|---|---|---|
| Graceful Degradation | System has fallback options | Complexity vs availability |
| Fail Fast | Early detection prevents cascading failure | Speed vs thoroughness |
| Prerequisite Checks | Operations have preconditions | Latency vs correctness |
Graceful Degradation
Principle
When the optimal path fails, fall back to progressively slower or costlier alternatives that are more likely to succeed.
The Tiered Fallback Pattern
```mermaid
flowchart LR
    A[Request] --> B{Tier 1?}
    B -->|Success| C[Fast Response]
    B -->|Fail| D{Tier 2?}
    D -->|Success| E[Slower Response]
    D -->|Fail| F[Tier 3]
    F --> G[Guaranteed Response]

    style A fill:#65d9ef,color:#1b1d1e
    style B fill:#fd971e,color:#1b1d1e
    style C fill:#a7e22e,color:#1b1d1e
    style D fill:#fd971e,color:#1b1d1e
    style E fill:#a7e22e,color:#1b1d1e
    style F fill:#f92572,color:#1b1d1e
    style G fill:#a7e22e,color:#1b1d1e
```
Real-World Example
From the deployment automation blog post:
| Tier | Method | Latency | Fallback Trigger |
|---|---|---|---|
| 1 | Volume mount | 1-5ms | Mount not available |
| 2 | API call | 50-200ms | API error |
| 3 | Full rebuild | 5-10s | None (final tier, always succeeds) |
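The three tiers in the table can be sketched as a small shell function. This is an illustration only: `read_from_mount`, `read_from_api`, `rebuild_config`, and the `CONFIG_MOUNT` variable are hypothetical stand-ins, not names from the blog post, and the API tier is stubbed to fail so the fallback path is visible.

```shell
# Tier 1: read from a mounted volume (fast, but the mount may be missing).
# CONFIG_MOUNT is a hypothetical variable pointing at the mounted file.
read_from_mount() {
  local path="${CONFIG_MOUNT:-}"
  [ -n "$path" ] && [ -r "$path" ] && cat "$path"
}

# Tier 2: fetch from an API (slower, may error). Stubbed to always fail here.
read_from_api() {
  return 1
}

# Tier 3: rebuild from scratch (slowest, but must not fail).
rebuild_config() {
  echo "rebuilt-config"
}

# Try the cheap tiers in order and log every fallback to stderr,
# so degradation is never silent. The final tier runs unconditionally.
get_config() {
  local tier
  for tier in read_from_mount read_from_api; do
    if "$tier"; then
      return 0
    fi
    echo "WARN: $tier failed, falling back to next tier" >&2
  done
  rebuild_config
}

get_config
```

Writing the warnings to stderr keeps the function's stdout clean for callers that capture the result, while still leaving a trace each time a tier is skipped.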
When to Apply
- System has multiple ways to get the same result
- Availability is more important than consistency
- Degraded service is better than no service
Anti-Patterns
- Silent degradation - Falling back without logging
- No final tier - Every tier can fail, so no response is guaranteed
- Expensive default - Using Tier 3 as the happy path
Fail Fast
Principle
Detect and report problems as early as possible, before they cascade into larger failures.
When to Apply
- Invalid input would cause downstream failures
- Resources are expensive to allocate
- Partial execution leaves inconsistent state
When NOT to Apply
- Fallback options exist (use graceful degradation instead)
- Transient failures are expected (use retry instead)
- Partial success is acceptable
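For the transient-failure case in the list above, a retry wrapper is the usual alternative to failing on the first error. The following is a minimal sketch with exponential backoff; the function name `retry` and the `RETRY_DELAY` variable are assumptions made for this example.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# RETRY_DELAY (hypothetical, default 1) is the initial delay in seconds;
# it doubles after every failed attempt.
retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  local delay="${RETRY_DELAY:-1}"

  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR: '$*' failed after $attempt attempts" >&2
      return 1
    fi
    echo "WARN: attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

A call such as `retry 3 some_flaky_command` gives up after the third failure and returns non-zero, so the caller can still fail fast once retries are exhausted.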
Example
```yaml
# Fail fast: check permissions before starting
- name: Validate access
  env:
    GH_TOKEN: ${{ github.token }}  # gh needs a token inside Actions
  run: |
    gh auth status || exit 1
    gh repo view ${{ github.repository }} || exit 1

# Now proceed with actual work
- name: Create release
  env:
    GH_TOKEN: ${{ github.token }}
  run: gh release create v1.0.0
```
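The same idea applies to inputs: rejecting a malformed value before any work starts is cheaper than discovering it mid-release. Here is a small sketch, assuming tags follow a `vMAJOR.MINOR.PATCH` convention; both that convention and the `validate_release_tag` name are assumptions for illustration.

```shell
# validate_release_tag TAG
# Returns non-zero (with a message on stderr) if TAG is empty or does not
# match the assumed vMAJOR.MINOR.PATCH format.
validate_release_tag() {
  local tag="${1:-}"

  if [ -z "$tag" ]; then
    echo "ERROR: release tag is required" >&2
    return 1
  fi

  if ! printf '%s\n' "$tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "ERROR: '$tag' is not a vMAJOR.MINOR.PATCH tag" >&2
    return 1
  fi
}

# Stop immediately on bad input, before any expensive step runs.
validate_release_tag "v1.0.0" || exit 1
```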
Prerequisite Checks
Principle
Validate all preconditions before executing expensive or irreversible operations.
When to Apply
- Operations are expensive (time, money, resources)
- Operations are irreversible (deletes, deployments)
- Multiple preconditions must all be true
Example
```shell
# Check all prerequisites before deployment
check_prerequisites() {
  # Required tools
  command -v kubectl >/dev/null || { echo "kubectl not found" >&2; return 1; }
  command -v helm >/dev/null || { echo "helm not found" >&2; return 1; }

  # Required access (can-i prints yes/no; only the exit code matters here)
  kubectl auth can-i create deployments >/dev/null || { echo "No deploy permission" >&2; return 1; }

  # Required state
  helm status my-release >/dev/null 2>&1 || { echo "Release not found" >&2; return 1; }

  echo "All prerequisites met"
}

check_prerequisites || exit 1
# Now safe to proceed
```
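`check_prerequisites` stops at the first failed check. When several preconditions must all hold, it can be friendlier to report every missing one in a single pass. The variant below is a sketch covering only the tool checks; the `check_all_prerequisites` name and the `REQUIRED_TOOLS` variable are hypothetical.

```shell
# Reports every missing tool instead of stopping at the first one.
# REQUIRED_TOOLS is a hypothetical space-separated list; it defaults to the
# tools used in the example above.
check_all_prerequisites() {
  local failures=0
  local tool

  for tool in ${REQUIRED_TOOLS:-kubectl helm}; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "MISSING: $tool" >&2
      failures=$((failures + 1))
    fi
  done

  if [ "$failures" -gt 0 ]; then
    echo "$failures prerequisite(s) not met" >&2
    return 1
  fi

  echo "All prerequisites met"
}
```

The trade-off is a slightly longer check in exchange for fewer fix-and-rerun cycles when more than one prerequisite is missing.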
Principle Interactions
Principles sometimes conflict. Here's how to choose:
| Scenario | Choose | Because |
|---|---|---|
| Recoverable error with fallback | Graceful degradation | Better UX than failing |
| Unrecoverable error | Fail fast | Prevent cascade |
| Expensive operation | Prerequisite check | Avoid wasted work |
| User-facing service | Graceful degradation | Availability matters |
| Data integrity operation | Fail fast | Consistency matters |
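The first two rows of the table can be combined in one wrapper that degrades on recoverable errors and fails fast on everything else. This is a sketch under an invented convention: the primary command is assumed to exit with 2 for a recoverable error and any other non-zero code for an unrecoverable one. The `run_with_policy` name and the exit-code scheme are assumptions, not part of any tool mentioned here.

```shell
# run_with_policy PRIMARY FALLBACK
# Hypothetical convention: PRIMARY exits 0 on success, 2 on a recoverable
# error, and any other non-zero code on an unrecoverable error.
run_with_policy() {
  local primary="$1"
  local fallback="$2"
  local status=0

  "$primary" || status=$?

  case "$status" in
    0)
      return 0
      ;;
    2)
      # Recoverable and a fallback exists: degrade gracefully, but log it.
      echo "WARN: $primary failed (recoverable), degrading to $fallback" >&2
      "$fallback"
      ;;
    *)
      # Unrecoverable: fail fast and propagate the original exit code.
      echo "ERROR: $primary failed (unrecoverable, exit $status)" >&2
      return "$status"
      ;;
  esac
}
```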
Principles are guardrails, not rules. Context determines which one wins.