Retry Strategies

Transient failures like network timeouts, temporary service unavailability, and rate limits are common in distributed systems. Retry strategies automatically recover from these failures without manual intervention. For the complete reference, see the official Trigger Retry docs.

Retry Configuration

Add retry behavior to any trigger:

triggers:
  - template:
      name: api-call
      http:
        url: https://api.example.com/webhook
        method: POST
    retryStrategy:
      steps: 5
      duration: 5s
      factor: 2
      jitter: 0.2

Retry parameters:

Field	Purpose	Example
`steps`	Maximum retry attempts	`5`
`duration`	Initial delay between retries	`5s`
`factor`	Multiplier for exponential backoff	`2`
`jitter`	Random variation (0-1) to prevent thundering herd	`0.2`

Exponential Backoff

With the configuration above, retry timing follows this pattern:

Attempt	Base Delay	With Jitter (±20%)
1	5s	4-6s
2	10s	8-12s
3	20s	16-24s
4	40s	32-48s
5	80s	64-96s

Total maximum wait: ~3 minutes before giving up (5 + 10 + 20 + 40 + 80 = 155s of base delay, up to 186s with full jitter).
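
The table can be reproduced with a short calculation. The sketch below is illustrative Python, not Argo Events code: `backoff_schedule` is a hypothetical helper, and it assumes the model used in the table, where the delay before retry n is duration × factor^(n-1) and jitter spreads each delay symmetrically. Check the official docs for how your version actually applies jitter.

```python
def backoff_schedule(steps, duration, factor, jitter=0.0):
    """Return (base, low, high) delays in seconds for each retry attempt.

    Assumes the model in the table above: the base delay grows by `factor`
    on every attempt, and jitter spreads it symmetrically by +/- jitter.
    """
    schedule = []
    for attempt in range(steps):
        base = duration * factor ** attempt
        schedule.append((base, base * (1 - jitter), base * (1 + jitter)))
    return schedule


schedule = backoff_schedule(steps=5, duration=5, factor=2, jitter=0.2)
for attempt, (base, low, high) in enumerate(schedule, start=1):
    print(f"attempt {attempt}: base {base:.0f}s, range {low:.0f}-{high:.0f}s")

# Sum the base delays and the upper jitter bounds to get the total wait.
total_base = sum(base for base, _, _ in schedule)
total_max = sum(high for _, _, high in schedule)
print(f"total: {total_base:.0f}s base, up to {total_max:.0f}s with jitter")
```

Changing `steps` or `factor` in the call shows how quickly the tail of the schedule dominates the total.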

Workflow Trigger Retry

For Argo Workflow triggers, configure retries at two levels, on the trigger itself and inside the submitted workflow:

triggers:
  - template:
      name: deploy
      argoWorkflow:
        operation: submit
        source:
          resource:
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            spec:
              templates:
                - name: main
                  # Workflow-level retry for task failures
                  retryStrategy:
                    limit: 3
                    backoff:
                      duration: "10s"
                      factor: 2
    # Trigger-level retry for submission failures
    retryStrategy:
      steps: 3
      duration: 5s
      factor: 2

Two retry scopes:

Trigger retry: Handles failures submitting the workflow (API errors, admission webhook failures)
Workflow retry: Handles failures inside the workflow (task errors, container crashes)

HTTP Trigger Retry

HTTP endpoints often return transient errors:

triggers:
  - template:
      name: notify-slack
      http:
        url: https://hooks.slack.com/services/XXX
        method: POST
        payload:
          - src:
              dependencyName: event
              dataKey: body.message
            dest: text
    retryStrategy:
      steps: 5
      duration: 2s
      factor: 2
      jitter: 0.3

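With these values, retries wait roughly 2s, 4s, 8s, 16s, and 32s (about a minute in total before jitter), which is enough to ride out short rate-limit windows and brief Slack outages.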

Kubernetes Resource Retry

K8s API calls can fail due to conflicts or temporary unavailability:

triggers:
  - template:
      name: create-configmap
      k8s:
        operation: create
        source:
          resource:
            apiVersion: v1
            kind: ConfigMap
            metadata:
              generateName: event-data-
            data:
              payload: ""
    retryStrategy:
      steps: 3
      duration: 1s
      factor: 2

Short initial delays work well for K8s API retries since failures are usually brief.

When Retries Exhaust

After all retry attempts fail, the event is dropped by default. To preserve failed events:

Dead Letter Queue: Route failed events to a separate topic for later processing
Error Workflows: Trigger an error-handling workflow on final failure
Alerting: Send notifications when retries exhaust

See Dead Letter Queues for failed event handling.
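
The pattern behind these options can be sketched in a few lines. This is illustrative Python, not how Argo Events implements it: `deliver_with_retry`, `send`, and `dead_letter` are hypothetical names, and the `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time


def deliver_with_retry(event, send, dead_letter, steps=3, duration=1.0,
                       factor=2, sleep=time.sleep):
    """Call `send(event)`, retrying up to `steps` times with exponential backoff.

    Returns True on success. If every attempt fails, passes the event and the
    last error to `dead_letter` and returns False.
    """
    delay = duration
    for retry in range(steps + 1):  # initial attempt plus `steps` retries
        try:
            send(event)
            return True
        except Exception as exc:  # a real handler would catch narrower errors
            last_error = exc
            if retry < steps:
                sleep(delay)
                delay *= factor
    # Retries exhausted: preserve the event instead of dropping it.
    dead_letter(event, last_error)
    return False


def always_fails(event):
    raise ConnectionError("service unavailable")


failed = []
ok = deliver_with_retry(
    {"id": "evt-1"},
    send=always_fails,
    dead_letter=lambda event, err: failed.append((event, str(err))),
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(ok, failed)
```

The `dead_letter` callback is where a queue publish, an error workflow submission, or an alert would go.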

Retry vs. Idempotency

Retries can cause duplicate processing. If a trigger succeeds but the acknowledgment fails, the event may be processed again. Your workflows must handle this:

# Idempotent workflow design
spec:
  templates:
    - name: deploy
      script:
        source: |
          # Check if already deployed
          CURRENT=$(kubectl get deployment my-app -o jsonpath='{.spec.template.spec.containers[0].image}')
          if [ "$CURRENT" = "{{inputs.parameters.image}}" ]; then
            echo "Already at target version, skipping"
            exit 0
          fi
          # Proceed with deployment
          kubectl set image deployment/my-app app={{inputs.parameters.image}}
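
The same idea applies outside deployment scripts: record which events have been handled and skip repeats. A minimal sketch in Python, assuming each event carries a unique `id`; `IdempotentHandler` is a hypothetical helper, and the in-memory set stands in for a durable store such as a database table.

```python
class IdempotentHandler:
    """Run `action` at most once per event ID, even if the event is redelivered."""

    def __init__(self, action):
        self._action = action
        self._seen = set()  # in-memory only; a real system needs durable storage

    def handle(self, event):
        event_id = event["id"]
        if event_id in self._seen:
            return False  # duplicate delivery: skip
        self._action(event)
        # Mark as done only after the action succeeds, so a failure can be retried.
        self._seen.add(event_id)
        return True


deployed = []
handler = IdempotentHandler(lambda event: deployed.append(event["image"]))

event = {"id": "evt-42", "image": "my-app:1.2.3"}
print(handler.handle(event))  # first delivery runs the action
print(handler.handle(event))  # redelivery after a lost acknowledgment is skipped
print(deployed)
```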

Retry Budgets

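Calculate your retry budget as the sum of all retry delays: duration × (factor^steps - 1) / (factor - 1), or steps × duration when factor is 1. For steps 5, duration 5s, and factor 2, that is 5 × 31 = 155s before jitter. Long retry chains can delay subsequent events, so keep total retry time under your acceptable latency threshold.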
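
One way to check a configuration against a latency threshold is to sum the per-retry delays and compare the result. The sketch below is illustrative Python: `retry_budget` is a hypothetical helper, the 120s threshold is an arbitrary example value, and the time spent inside each request is ignored.

```python
def retry_budget(steps, duration, factor, jitter=0.0):
    """Worst-case total delay in seconds across `steps` retries.

    Sums duration * factor**k for k in 0..steps-1 using the geometric-series
    closed form, then scales by (1 + jitter) for the upper bound.
    """
    if factor == 1:
        base_total = duration * steps
    else:
        base_total = duration * (factor ** steps - 1) / (factor - 1)
    return base_total * (1 + jitter)


# Hypothetical latency threshold for this example, in seconds.
MAX_ACCEPTABLE_DELAY = 120

# Retry settings taken from the trigger examples on this page.
configs = {
    "api-call": dict(steps=5, duration=5, factor=2, jitter=0.2),
    "notify-slack": dict(steps=5, duration=2, factor=2, jitter=0.3),
    "create-configmap": dict(steps=3, duration=1, factor=2),
}

for name, cfg in configs.items():
    budget = retry_budget(**cfg)
    status = "ok" if budget <= MAX_ACCEPTABLE_DELAY else "over budget"
    print(f"{name}: {budget:.1f}s -> {status}")
```

If a trigger comes out over budget, reducing `steps` usually helps more than reducing `duration`, because the last delays are the largest.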

Dead Letter Queues - Handle exhausted retries
Backpressure Handling - Prevent overload during retries
Workflow Retry Strategy - Workflow-level retries
Official Retry Docs - Complete reference