Design Principles

Architectural guidance for building resilient automation.

Principles answer the why and when questions. They're decision frameworks, not code snippets.

Overview

Principle	When to Apply	Trade-off
Graceful Degradation	System has fallback options	Complexity vs availability
Fail Fast	Early detection prevents cascading failure	Speed vs thoroughness
Prerequisite Checks	Operations have preconditions	Latency vs correctness

Graceful Degradation

Principle

When the optimal path fails, fall back to progressively more expensive but more reliable alternatives.

The Tiered Fallback Pattern

flowchart LR
    A[Request] --> B{Tier 1?}
    B -->|Success| C[Fast Response]
    B -->|Fail| D{Tier 2?}
    D -->|Success| E[Slower Response]
    D -->|Fail| F[Tier 3]
    F --> G[Guaranteed Response]

    style A fill:#65d9ef,color:#1b1d1e
    style B fill:#fd971e,color:#1b1d1e
    style C fill:#a7e22e,color:#1b1d1e
    style D fill:#fd971e,color:#1b1d1e
    style E fill:#a7e22e,color:#1b1d1e
    style F fill:#f92572,color:#1b1d1e
    style G fill:#a7e22e,color:#1b1d1e

Real-World Example

From the deployment automation blog post:

Tier	Method	Latency	Fallback Trigger
1	Volume mount	1-5ms	Mount not available
2	API call	50-200ms	API error
3	Full rebuild	5-10s	None (final tier, always succeeds)
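
The tiers above can be sketched in shell roughly as follows. This is a minimal illustration, not the blog post's implementation: the three tier functions, the `CONFIG_MOUNT` path, and the JSON payload are all hypothetical stand-ins for the volume mount, API call, and full rebuild. Each fallback is logged to stderr so degradation is never silent.

```shell
#!/usr/bin/env bash
# Sketch of a tiered fallback. All names below are illustrative assumptions.

CONFIG_MOUNT="${CONFIG_MOUNT:-/nonexistent/config.json}"  # assumed mount path

tier1_mount() {
  # Tier 1: read from a mounted file if it is present and readable.
  [ -r "$CONFIG_MOUNT" ] && cat "$CONFIG_MOUNT"
}

tier2_api() {
  # Tier 2: placeholder for an API call; it always fails here to show the fallback.
  return 1
}

tier3_rebuild() {
  # Tier 3: slowest path, regenerate the value locally.
  echo '{"source":"rebuild"}'
}

fetch_with_fallback() {
  local tier result
  for tier in tier1_mount tier2_api tier3_rebuild; do
    if result="$("$tier")"; then
      echo "served by $tier" >&2
      printf '%s\n' "$result"
      return 0
    fi
    # Log every fallback so a degraded path is visible to operators.
    echo "WARN: $tier failed, falling back" >&2
  done
  echo "ERROR: all tiers failed" >&2
  return 1
}
```

Calling `fetch_with_fallback` tries the cheapest tier first and only pays for the rebuild when both faster tiers fail. The final `return 1` covers the case where even the last tier breaks, so the caller still gets a clear signal.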

When to Apply

System has multiple ways to get the same result
Availability is more important than consistency
Degraded service is better than no service

Anti-Patterns

Silent degradation - Falling back without logging
No final tier - No tier is guaranteed to succeed, so every fallback can fail
Expensive default - Using Tier 3 as the happy path

Fail Fast

Principle

Detect and report problems as early as possible, before they cascade into larger failures.

When to Apply

Invalid input would cause downstream failures
Resources are expensive to allocate
Partial execution leaves inconsistent state

When NOT to Apply

Fallback options exist (use graceful degradation instead)
Transient failures are expected (use retry instead)
Partial success is acceptable
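
For the transient-failure case, a retry wrapper is the usual alternative to failing on the first error. The sketch below is one possible shape, assuming a doubling delay between attempts; the `retry` helper and its argument order are hypothetical, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch of retry with a doubling delay. `retry` is an illustrative helper.

retry() {
  # Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
  local max="$1" delay="$2" attempt=1
  shift 2
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))  # back off so a struggling service is not hammered
  done
}
```

A call such as `retry 3 1 some_command` would run `some_command` up to three times, waiting 1s and then 2s between attempts. The attempt limit keeps the retry bounded, so a persistent failure still surfaces instead of looping forever.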

Example

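# Fail fast: check authentication and repository access before starting
- name: Validate access
  env:
    GH_TOKEN: ${{ github.token }}  # gh needs a token when run in Actions
  run: |
    gh auth status || exit 1
    gh repo view "${{ github.repository }}" >/dev/null || exit 1

# Now proceed with the actual work
- name: Create release
  env:
    GH_TOKEN: ${{ github.token }}
  run: gh release create v1.0.0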
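
The same idea applies to input validation inside a script. The sketch below rejects a malformed version tag before any work starts; `validate_version`, `release`, and the `vMAJOR.MINOR.PATCH` format are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: fail fast on invalid input. Function names and the tag format are illustrative.

validate_version() {
  # Accept only tags shaped like v1.2.3 and reject everything else up front.
  local version="$1"
  if ! printf '%s\n' "$version" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "invalid version: '$version' (expected vMAJOR.MINOR.PATCH)" >&2
    return 1
  fi
}

release() {
  local version="${1:-}"
  # Validate before doing anything expensive or irreversible.
  validate_version "$version" || return 1
  echo "releasing $version"  # placeholder for the real release steps
}
```

Because the check runs first, a bad tag costs one failed function call rather than a half-finished release.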

Prerequisite Checks

Principle

Validate all preconditions before executing expensive or irreversible operations.

When to Apply

Operations are expensive (time, money, resources)
Operations are irreversible (deletes, deployments)
Multiple preconditions must all be true

Example

# Check all prerequisites before deployment
check_prerequisites() {
  # Required tools
  command -v kubectl >/dev/null || { echo "kubectl not found" >&2; return 1; }
  command -v helm >/dev/null || { echo "helm not found" >&2; return 1; }

  # Required access (can-i exits non-zero when the answer is "no")
  kubectl auth can-i create deployments >/dev/null || { echo "No deploy permission" >&2; return 1; }

  # Required state: the release must already exist
  helm status my-release >/dev/null 2>&1 || { echo "Release not found" >&2; return 1; }

  echo "All prerequisites met"
}

check_prerequisites || exit 1
# Now safe to proceed
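
When several preconditions must all hold, it can help to run every check and report all failures together, so the operator fixes them in one pass rather than one per run. The sketch below is one way to do that; the `require` helper, the `DEPLOY_ENV` variable, and the specific checks are hypothetical, and in practice the `kubectl` and `helm` checks above would slot in the same way.

```shell
#!/usr/bin/env bash
# Sketch: collect every failed prerequisite instead of stopping at the first.
# `require`, `check_all`, and DEPLOY_ENV are illustrative assumptions.

failures=0

require() {
  # Usage: require "description" command [args...]
  local description="$1"
  shift
  if ! "$@" >/dev/null 2>&1; then
    echo "MISSING: $description" >&2
    failures=$((failures + 1))
  fi
}

check_all() {
  failures=0
  require "sh on PATH" command -v sh
  require "current directory is readable" test -r .
  require "DEPLOY_ENV is set" test -n "${DEPLOY_ENV:-}"
  # Succeed only if every check passed.
  [ "$failures" -eq 0 ]
}
```

A caller would use it like the example above: `check_all || exit 1`. The trade-off is slightly more latency, since later checks run even after an earlier one has failed.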

Principle Interactions

Principles sometimes conflict. Here's how to choose:

Scenario	Choose	Because
Recoverable error with fallback	Graceful degradation	Better UX than failing
Unrecoverable error	Fail fast	Prevent cascade
Expensive operation	Prerequisite check	Avoid wasted work
User-facing service	Graceful degradation	Availability matters
Data integrity operation	Fail fast	Consistency matters

Principles are guardrails, not rules. Context determines which one wins.