The Art of Failing Gracefully: Tiered Fallbacks for CI/CD

My workflow reads deployment mappings from a ConfigMap mounted as a volume. Five milliseconds, zero API calls. But what happens when the mount fails?

The answer isn't "return an error." It's "try the next thing."


The Three-Tier Pattern

The Core Principle

Degrade performance, not availability. Every operation should have a guaranteed fallback.

Every robust system I've built follows the same structure:

flowchart TD
    subgraph request[Request]
        A[Operation Requested]
    end

    subgraph tiers[Fallback Tiers]
        T1[Tier 1: Optimal]
        T2[Tier 2: Acceptable]
        T3[Tier 3: Guaranteed]
    end

    subgraph result[Result]
        Success[Success]
    end

    A --> T1
    T1 -->|Works| Success
    T1 -->|Fails| T2
    T2 -->|Works| Success
    T2 -->|Fails| T3
    T3 --> Success

    %% Ghostty Hardcore Theme
    style A fill:#65d9ef,color:#1b1d1e
    style T1 fill:#a7e22e,color:#1b1d1e
    style T2 fill:#fd971e,color:#1b1d1e
    style T3 fill:#f92572,color:#1b1d1e
    style Success fill:#a7e22e,color:#1b1d1e

  • Tier 1: Fast, cheap, preferred
  • Tier 2: Slower, costlier, reliable
  • Tier 3: Expensive but always works

The key insight: degrade performance, not availability.
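The same shape can be captured once and reused. Here is a minimal, generic sketch of the chain in Go (my own illustration, not code from the workflow): try each tier in order, return the first success, and rely on the caller to make the last tier the guaranteed one.

// Tier is one rung of the chain: a label for observability plus the lookup itself.
type Tier[T any] struct {
    Name string
    Run  func() (T, error)
}

// Fallback tries each tier in order and returns the first success.
// The final tier is expected to be the guaranteed one that cannot fail.
func Fallback[T any](tiers []Tier[T]) (T, error) {
    var zero T
    var lastErr error
    for _, tier := range tiers {
        v, err := tier.Run()
        if err == nil {
            return v, nil // record tier.Name here for observability
        }
        lastErr = err
    }
    return zero, lastErr
}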


Real Numbers

In From 5 Seconds to 5 Milliseconds, I documented how a ConfigMap cache transformed deployment automation. But the story doesn't end at "use a cache." The real pattern is the fallback chain:

| Tier | Method       | Latency  | API Calls |
|------|--------------|----------|-----------|
| 1    | Volume mount | 1-5ms    | 0         |
| 2    | API call     | 50-200ms | 1         |
| 3    | Cluster scan | 5-10s    | 100+      |

When Tier 1 works (99% of the time), the system flies. When it doesn't, the system still works. That's the difference between a cache optimization and a reliability pattern.


The Code Pattern

func GetDeployments(image string) ([]Deployment, error) {
    // Tier 1: Try the volume mount (fast, zero API calls).
    // A missing or corrupt cache file falls through to the next tier.
    if data, err := os.ReadFile("/etc/cache/deployments.json"); err == nil {
        if deps, err := parseDeployments(data, image); err == nil {
            metrics.RecordTier("mount")
            return deps, nil
        }
    }

    // Tier 2: Try the API call for the cached ConfigMap.
    if data, err := k8s.GetConfigMap("deployment-cache"); err == nil {
        if deps, err := parseDeployments(data, image); err == nil {
            metrics.RecordTier("api")
            return deps, nil
        }
    }

    // Tier 3: Rebuild from a cluster scan. This tier always works.
    metrics.RecordTier("rebuild")
    return scanClusterForImage(image)
}

Notice the metrics.RecordTier() calls. You need to know which tier is serving traffic. If Tier 1 starts failing, you want to know before your users notice the latency spike.
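The post doesn't show RecordTier itself, but a minimal sketch with Prometheus's client_golang could look like this; the metric name and label are my own choice, not from the original.

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// tierServed counts how many requests each fallback tier answered.
// A rising "api" or "rebuild" rate means Tier 1 is failing.
var tierServed = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "fallback_tier_served_total",
        Help: "Requests served, labelled by the fallback tier that answered.",
    },
    []string{"tier"},
)

// RecordTier increments the counter for the tier that handled the request.
func RecordTier(tier string) {
    tierServed.WithLabelValues(tier).Inc()
}

A dashboard or alert on the rate of the api and rebuild labels turns a silent Tier 1 failure into a page instead of a surprise.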


Fail Fast vs Degrade Gracefully

Decision Rule

Fail fast on precondition failures. Degrade gracefully on runtime failures.

These patterns aren't opposites. They solve different problems:

| Scenario        | Pattern              | Why                             |
|-----------------|----------------------|---------------------------------|
| Invalid input   | Fail Fast            | User error, report immediately  |
| Missing config  | Fail Fast            | Can't continue safely           |
| Cache miss      | Graceful Degradation | Expensive path still works      |
| Network timeout | Graceful Degradation | Infrastructure issue, retry     |

Decision rule: Fail fast on precondition failures. Degrade gracefully on runtime failures.

Preconditions are things that should be true before you start. Runtime failures are things that might go wrong during execution.
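A small sketch of that rule in Go: only Deployment and scanClusterForImage come from the earlier example; lookupCache is hypothetical and the errors import is assumed.

func ResolveTargets(image string) ([]Deployment, error) {
    // Precondition failure: an empty image reference is a caller bug.
    // Fail fast so the mistake surfaces immediately, at the source.
    if image == "" {
        return nil, errors.New("image reference must not be empty")
    }

    // Runtime failure: a cache miss is expected occasionally.
    // Degrade gracefully to the slower path that still works.
    if deps, err := lookupCache(image); err == nil {
        return deps, nil
    }
    return scanClusterForImage(image)
}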


The Anti-Patterns

I've seen these kill systems:

Silent Degradation

// Dangerous: no observability
if data, _ := cache.Get(); data != nil {
    return data
}
return fetchFromAPI()  // Who knows we're degraded?

If Tier 1 silently fails for a week, you'll only notice when someone asks why the system is slow. By then, you've been burning API quota and adding latency for thousands of requests.
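The fix is small: keep the same fallback order, but record and log the degradation. A sketch assuming the RecordTier helper above and Go's log/slog package:

// Safer: same fallback order, but the degradation is observable.
if data, err := cache.Get(); err == nil && data != nil {
    metrics.RecordTier("cache")
    return data
}
slog.Warn("cache unavailable, falling back to API")
metrics.RecordTier("api")
return fetchFromAPI()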

No Guaranteed Tier

// Dangerous: can fail completely
if cfg := cache.Get(); cfg != nil {
    return cfg
}
return api.FetchConfig()  // What if API is also down?

Every chain needs a final tier that cannot fail. A static default. A hardcoded fallback. Something that always returns a valid response, even if it's stale or incomplete.
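One way to close the chain is a default compiled into the binary. A sketch, with an assumed Config type and assumed signatures for cache.Get and api.FetchConfig:

// defaultConfig is the guaranteed tier: compiled into the binary,
// possibly stale, but always a valid response.
var defaultConfig = &Config{
    Replicas:       1,
    TimeoutSeconds: 30,
}

func LoadConfig() *Config {
    if cfg := cache.Get(); cfg != nil {
        metrics.RecordTier("cache")
        return cfg
    }
    if cfg, err := api.FetchConfig(); err == nil {
        metrics.RecordTier("api")
        return cfg
    }
    // Guaranteed tier: cannot fail, even if the answer is stale.
    metrics.RecordTier("default")
    return defaultConfig
}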

Expensive Default

# Wasteful: always does the expensive thing
- run: npm ci
- uses: actions/cache/save@v4

If you always run Tier 3 and then cache the result, but never restore from that cache on the next run, you've inverted the pattern. The point is to avoid the expensive path, not to pay for it on every build.


Implementation Checklist

Before shipping graceful degradation:

  1. Define all tiers before writing code
  2. Identify the guaranteed tier that always succeeds
  3. Instrument each tier with metrics and logs
  4. Alert on tier shifts (Tier 1 failure rate > 5%)
  5. Test fallback paths in CI, not just production (see the test sketch below)
  6. Document expected latencies per tier
  7. Set SLOs per tier (Tier 1: p99 < 10ms)
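For item 5, the fallback logic can be exercised in CI without real infrastructure by driving the tiers as injected functions. A minimal sketch (assumes the errors and testing imports; none of these names come from the workflow):

func TestChainFallsThroughToGuaranteedTier(t *testing.T) {
    var served []string

    tiers := []func() (string, error){
        func() (string, error) { served = append(served, "mount"); return "", errors.New("mount unavailable") },
        func() (string, error) { served = append(served, "api"); return "", errors.New("api unavailable") },
        func() (string, error) { served = append(served, "default"); return "static-default", nil },
    }

    var got string
    var err error
    for _, tier := range tiers {
        if got, err = tier(); err == nil {
            break
        }
    }

    if err != nil || got != "static-default" {
        t.Fatalf("expected the guaranteed tier to serve, got %q (err: %v)", got, err)
    }
    if len(served) != 3 {
        t.Fatalf("expected all three tiers to be attempted, got %v", served)
    }
}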

Further Reading

The full pattern documentation is in the Developer Guide.



The best systems don't avoid failure. They survive it.
