Retry Strategy

Transient failures are inevitable in distributed systems. API servers become temporarily unavailable. Rate limits kick in during bursts. Network partitions resolve themselves after seconds. A well-designed retry strategy handles these without human intervention.

Why Retry Matters

The first instinct when a workflow fails is to check what went wrong. But many failures fix themselves. A Kubernetes API server might return a 503 for a few seconds during a rolling update. A rate-limited external API recovers after the quota window resets.

Without retry logic, these transient issues become permanent failures. Someone has to notice, investigate, and manually re-trigger the workflow. This breaks the promise of automation. The system should handle routine failures on its own.

The flip side is that not everything should be retried. An RBAC permission-denied error won't fix itself. Invalid parameters won't become valid. A deleted resource won't reappear. Retrying these wastes time and can mask real problems.

Configuration

templates:
  - name: restart-deployments
    retryStrategy:
      limit: 3
      retryPolicy: Always
      backoff:
        duration: "5s"
        factor: 2
        maxDuration: "1m"

The backoff configuration implements exponential backoff: first retry after 5 seconds, second after 10 seconds, third after 20 seconds. This gives transient issues time to resolve while avoiding tight retry loops that can worsen the problem.
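The schedule above can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only, assuming each wait is `duration * factor^n`; the function name `backoff_delays` is hypothetical, and the sketch ignores `maxDuration` and the time each attempt takes to run.

```python
def backoff_delays(duration_s, factor, limit):
    """Return the wait (in seconds) before each retry.

    Assumes the n-th retry (counting from 0) waits duration_s * factor**n.
    """
    return [duration_s * factor ** attempt for attempt in range(limit)]


# Values taken from the configuration above: duration 5s, factor 2, limit 3.
delays = backoff_delays(duration_s=5, factor=2, limit=3)
print(delays)       # [5, 10, 20]
print(sum(delays))  # 35
```

The total of 35 seconds is waiting time only. Each attempt's own run time comes on top of it.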

Configuration options:

Field	Purpose	Example
`limit`	Maximum retry attempts	`3`
`retryPolicy`	When to retry	`Always`, `OnFailure`, `OnError`
`backoff.duration`	Initial wait time	`"5s"`
`backoff.factor`	Multiplier for each retry	`2`
`backoff.maxDuration`	Maximum total time to keep retrying	`"1m"`

Retry Policies

The retryPolicy field controls which failures trigger retries:

Always: Retry on any failure: script errors, container crashes, and timeouts. Use this when you can't predict failure modes and want maximum resilience.

OnFailure: Retry only when the container exits with a non-zero code. System errors (like pod eviction) don't trigger retries. Use this when you trust your script to handle transient issues internally.

OnError: Retry only on system errors, not script failures. Use this when script failures represent permanent problems that shouldn't be retried.
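The three policies can be summarized as a small lookup. This is an illustrative model of the descriptions above, not the workflow engine's implementation; the outcome labels `"failure"` (the container exits with a non-zero code) and `"error"` (a system error such as pod eviction) and the name `should_retry` are assumptions made for this sketch.

```python
# Which kinds of outcome each policy retries, per the descriptions above.
RETRY_ON = {
    "Always": {"failure", "error"},
    "OnFailure": {"failure"},  # non-zero container exit only
    "OnError": {"error"},      # system errors only
}


def should_retry(policy, outcome):
    """Return True if the given policy retries the given outcome."""
    return outcome in RETRY_ON[policy]


for policy in RETRY_ON:
    print(policy, should_retry(policy, "failure"), should_retry(policy, "error"))
```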

When to Retry

Failure Type	Retry?	Why
API rate limits	Yes	Backoff gives quota time to reset
Network timeouts	Yes	Transient by nature
5xx server errors	Yes	Usually temporary
Pod eviction	Yes	Cluster pressure is temporary
RBAC denied	No	Won't fix itself
Invalid parameters	No	Need human correction
Resource not found	Maybe	Depends on whether it might appear

The "maybe" category requires judgment. If your workflow expects a resource that another workflow creates, a brief retry period makes sense because the resource might appear shortly. But if the resource should already exist, failing fast is better.
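For steps that call an HTTP API, the table can be approximated by a status-code check inside the script. The sketch below is one possible mapping, not a rule from any particular API: it assumes rate limits surface as 429, permission errors as 403, invalid parameters as 400, and missing resources as 404. The function name `is_retryable` and the `resource_may_appear` flag are hypothetical.

```python
def is_retryable(status, resource_may_appear=False):
    """Decide whether an HTTP response status is worth retrying.

    Assumed mapping: 429 (rate limit) and 5xx are transient; 404 is the
    "maybe" case and depends on the caller; everything else (for example
    400 or 403) is treated as permanent.
    """
    if status == 429 or 500 <= status <= 599:
        return True
    if status == 404:
        return resource_may_appear
    return False


print(is_retryable(503))                            # True
print(is_retryable(403))                            # False
print(is_retryable(404, resource_may_appear=True))  # True
```

Network timeouts do not produce a status code, so a real script would handle them separately, typically by treating the timeout exception as retryable.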

Backoff Tuning

The backoff configuration balances responsiveness against load:

flowchart LR
    A[Fail] -->|5s wait| B[Retry 1]
    B -->|Fail, 10s wait| C[Retry 2]
    C -->|Fail, 20s wait| D[Retry 3]
    D -->|Fail| E[Permanent Failure]

    %% Ghostty Hardcore Theme
    style A fill:#f92572,color:#1b1d1e
    style B fill:#fd971e,color:#1b1d1e
    style C fill:#fd971e,color:#1b1d1e
    style D fill:#fd971e,color:#1b1d1e
    style E fill:#f92572,color:#1b1d1e

Aggressive backoff (short duration, low factor): Faster recovery from brief blips, but more load on failing systems. Use for internal APIs that can handle the traffic.

Conservative backoff (long duration, high factor): Slower recovery, but gentler on systems under stress. Use for external APIs with rate limits.
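To make the trade-off between the two profiles concrete, the sketch below compares when each retry would start, measured from the first failure. The specific numbers (1s with factor 1.5, and 30s with factor 3) are hypothetical examples, and the calculation ignores the run time of each attempt.

```python
from itertools import accumulate


def retry_times(duration_s, factor, limit):
    """Cumulative seconds after the first failure at which each retry starts.

    Assumes the n-th wait is duration_s * factor**n and ignores run time.
    """
    return list(accumulate(duration_s * factor ** n for n in range(limit)))


aggressive = retry_times(duration_s=1, factor=1.5, limit=3)
conservative = retry_times(duration_s=30, factor=3, limit=3)

print(aggressive)    # [1.0, 2.5, 4.75]
print(conservative)  # [30, 120, 390]
```

Under these assumed values, the aggressive profile sends all three retries within five seconds, while the conservative one spreads them over six and a half minutes.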

Retry Limits

Set limit based on your tolerance for delay. With a backoff of 5s, 10s, and 20s, three retries add 35 seconds of waiting on top of the run time of all four attempts, so the total can easily exceed a minute. If your workflow is time-sensitive, consider fewer retries with shorter backoff.

Per-Step Retries

Different steps can have different retry strategies:

templates:
  - name: main
    steps:
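      - - name: fetch-data
          template: fetch-data
          arguments:
            parameters:
              - name: url
                value: "{{workflow.parameters.url}}"  # assumes a workflow-level `url` parameter
      - - name: process-data
          template: process-data

  - name: fetch-data
    inputs:
      parameters:
        - name: url
    retryStrategy:
      limit: 5
      backoff:
        duration: "10s"
        factor: 2
    container:
      image: curlimages/curl
      # --fail makes curl exit non-zero on HTTP errors, so a failed request triggers a retry
      command: [curl, "--fail", "{{inputs.parameters.url}}"]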

  - name: process-data
    retryStrategy:
      limit: 2
      backoff:
        duration: "5s"
        factor: 1
    container:
      image: processor:latest
      command: [process, /data/input.json]

The fetch-data step allows more retries (up to five, with doubling delays) because network requests are prone to transient failures. The process-data step allows only two retries at a fixed 5-second interval because processing failures are usually permanent: the input is either valid or it isn't.

Combining with Timeouts

Retry strategies interact with timeouts. A step with both retry and timeout might:

1. Run the step
2. Hit the timeout
3. Retry
4. Hit the timeout again
5. Eventually fail permanently

Consider the total time budget:

spec:
  activeDeadlineSeconds: 300  # 5 minute workflow timeout
  templates:
    - name: process
      retryStrategy:
        limit: 3
        backoff:
          duration: "30s"
          factor: 2        # Retry delays: 30s, 60s, 120s = 210s total

With three retries and exponential backoff starting at 30 seconds, the retry delays alone consume 210 seconds. Add actual execution time and you might exceed the 300-second workflow timeout. Always calculate worst-case timing.
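The worst-case calculation can be sketched as follows. The per-attempt timeout of 60 seconds is a hypothetical value chosen for illustration, as is the function name `worst_case_seconds`; the sketch assumes every attempt runs until its timeout and that each backoff delay starts when the previous attempt ends.

```python
def worst_case_seconds(limit, duration_s, factor, attempt_timeout_s):
    """Upper bound on total time: every attempt times out, plus all delays."""
    attempts = limit + 1  # the initial run plus `limit` retries
    delays = sum(duration_s * factor ** n for n in range(limit))
    return attempts * attempt_timeout_s + delays


deadline_s = 300  # activeDeadlineSeconds from the example above
total = worst_case_seconds(limit=3, duration_s=30, factor=2, attempt_timeout_s=60)

print(total)               # 450
print(total > deadline_s)  # True
```

Under these assumptions the worst case is 450 seconds, so the workflow deadline would cut the retries short. Lowering the retry limit, shortening the backoff, or raising the deadline brings the budget back in line.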

Related topics:

Basic Structure - WorkflowTemplate anatomy
Init Containers - Multi-stage setup
Concurrency Control - Preventing parallel execution conflicts