Hub and Spoke Operations Guide¶
Scaling Characteristics¶
| Aspect | Sequential | Hub and Spoke |
|---|---|---|
| Parallelism | None | Full |
| Wall-clock time | O(n) | Roughly the longest spoke, given n workers |
| Resource usage | One worker | n workers |
| Failure isolation | One failure stops all | Failures isolated to spokes |
| Debugging | Easy, linear flow | Harder, distributed system |
Failure Handling¶
Spoke failures don't kill the hub:
- name: spawn-spoke
  inputs:
    parameters:
      - name: repo
  continueOn:
    failed: true  # Hub continues if spoke fails
  resource:
    action: create
    manifest: |
      spec:
        retryStrategy:
          limit: 3
          backoff:
            duration: 30s
            factor: 2
The hub spawns all spokes. Each failed spoke retries independently, up to 3 times with exponential backoff (30s, 60s, 120s). The hub then summarizes successes and failures.
Summary Collection¶
Hub aggregates spoke results:
- name: summarize
  inputs:
    parameters:
      - name: spoke-results
  script:
    image: alpine:latest
    command: [sh]
    source: |
      #!/bin/sh
      # alpine:latest does not ship jq, so install it first
      apk add --no-cache jq >/dev/null
      RESULTS='{{inputs.parameters.spoke-results}}'
      TOTAL=$(echo "$RESULTS" | jq 'length')
      SUCCESS=$(echo "$RESULTS" | jq '[.[] | select(.status=="success")] | length')
      FAILED=$(echo "$RESULTS" | jq '[.[] | select(.status=="failed")] | length')
      echo "Total: $TOTAL"
      echo "Success: $SUCCESS"
      echo "Failed: $FAILED"
      # Exit non-zero so the hub reports the run as failed when any spoke failed
      if [ "$FAILED" -gt 0 ]; then
        echo "Failed repositories:"
        echo "$RESULTS" | jq -r '.[] | select(.status=="failed") | .repository'
        exit 1
      fi
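If the hub is written in Go rather than shell, the same aggregation might look like the sketch below. It assumes the result payload is a JSON array of objects with `status` and `repository` fields, as the jq filters imply; the `summarize` function and `summary` type are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spokeResult mirrors the fields the jq filters read.
type spokeResult struct {
	Repository string `json:"repository"`
	Status     string `json:"status"`
}

// summary holds the counts the hub reports.
type summary struct {
	Total       int
	Success     int
	Failed      int
	FailedRepos []string
}

// summarize parses the spoke results and counts successes and failures.
func summarize(raw []byte) (summary, error) {
	var results []spokeResult
	if err := json.Unmarshal(raw, &results); err != nil {
		return summary{}, fmt.Errorf("parse spoke results: %w", err)
	}
	s := summary{Total: len(results)}
	for _, r := range results {
		switch r.Status {
		case "success":
			s.Success++
		case "failed":
			s.Failed++
			s.FailedRepos = append(s.FailedRepos, r.Repository)
		}
	}
	return s, nil
}

func main() {
	raw := []byte(`[{"repository":"org/api","status":"success"},{"repository":"org/web","status":"failed"}]`)
	s, err := summarize(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("Total: %d\nSuccess: %d\nFailed: %d\n", s.Total, s.Success, s.Failed)
	for _, repo := range s.FailedRepos {
		fmt.Println("Failed repository:", repo)
	}
}
```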
Rate Limiting and Throttling¶
Prevent overwhelming downstream systems. Control spoke execution rate.
Max Parallel Constraint¶
Limit concurrent spokes:
# Argo Workflows: limit parallelism
- name: spawn-spoke
  inputs:
    parameters:
      - name: repo
  withParam: "{{workflow.parameters.repositories}}"
  parallelism: 10  # Maximum 10 spokes running concurrently
  resource:
    action: create
    manifest: |
      spec:
        workflowTemplateRef:
          name: spoke-worker
GitHub Actions matrix with max-parallel:
jobs:
  distribute:
    needs: discover  # Required so needs.discover.outputs resolves
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 5  # Only 5 repos processed at once
      matrix:
        repo: ${{ fromJson(needs.discover.outputs.repositories) }}
    steps:
      - run: ./process-repo.sh ${{ matrix.repo }}
Batch Processing with Waves¶
Split work into waves:
# Hub splits 100 repos into 10 waves of 10
- name: process-wave
  inputs:
    parameters:
      - name: wave-number
      - name: repositories
  parallelism: 10  # Set on the template; limits concurrent child steps
  steps:
    - - name: spawn-batch
        template: spawn-spoke
        arguments:
          parameters:
            - name: repo
              value: "{{item}}"
        withParam: "{{inputs.parameters.repositories}}"
    # Wait between waves
    - - name: wait
        template: sleep
        arguments:
          parameters:
            - name: duration
              value: "30s"
Wave execution spreads requests over time, which reduces the chance of hitting API rate limits.
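The splitting step itself is simple to sketch. The helper below is a hypothetical illustration of how a hub could cut its repository list into waves before handing each one to a `process-wave` run; the repository names are made up.

```go
package main

import "fmt"

// splitIntoWaves cuts repos into consecutive batches of at most size entries.
func splitIntoWaves(repos []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var waves [][]string
	for start := 0; start < len(repos); start += size {
		end := start + size
		if end > len(repos) {
			end = len(repos)
		}
		waves = append(waves, repos[start:end])
	}
	return waves
}

func main() {
	repos := make([]string, 25)
	for i := range repos {
		repos[i] = fmt.Sprintf("org/repo-%02d", i+1)
	}
	for n, wave := range splitIntoWaves(repos, 10) {
		fmt.Printf("wave %d: %d repos (%s .. %s)\n", n+1, len(wave), wave[0], wave[len(wave)-1])
	}
}
```

The last wave may be smaller than the others, so downstream steps should not assume a fixed batch size.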
Exponential Backoff for Rate Limits¶
Spoke detects rate limit, backs off:
import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"

	"github.com/google/go-github/v66/github" // major version is an assumption; match your go.mod
)

// executeWithBackoff retries processRepository while GitHub reports a rate limit.
func executeWithBackoff(ctx context.Context, repo string) error {
	maxRetries := 5
	baseDelay := 1 * time.Second
	for attempt := 0; attempt < maxRetries; attempt++ {
		err := processRepository(ctx, repo)
		if err == nil {
			return nil
		}
		// Non-rate-limit error, fail
		if !isRateLimitError(err) {
			return err
		}
		delay := baseDelay * time.Duration(1<<attempt) // 1s, 2s, 4s, 8s, 16s
		log.Printf("Rate limited, backing off for %v", delay)
		// Wait for the delay, but stop early if the context is cancelled
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("max retries exceeded for %s", repo)
}

func isRateLimitError(err error) bool {
	// go-github returns *github.RateLimitError when the primary rate limit is exhausted
	var rateErr *github.RateLimitError
	return errors.As(err, &rateErr)
}
Rate Limit Detection and Handling¶
Check rate limits before spawning spokes:
- name: check-rate-limit
  script:
    image: ghcr.io/cli/cli:latest
    command: [bash]
    source: |
      #!/bin/bash
      set -euo pipefail
      # Get GitHub API rate limit
      LIMIT=$(gh api rate_limit --jq '.rate.remaining')
      RESET=$(gh api rate_limit --jq '.rate.reset')
      if [ "$LIMIT" -lt 100 ]; then
        WAIT=$((RESET - $(date +%s)))
        # Skip the sleep if the reset time has already passed
        if [ "$WAIT" -gt 0 ]; then
          echo "Rate limit low ($LIMIT remaining). Waiting ${WAIT}s"
          sleep "$WAIT"
        fi
        # Re-read the limit so the final message is not stale
        LIMIT=$(gh api rate_limit --jq '.rate.remaining')
      fi
      echo "Rate limit OK: $LIMIT requests remaining"
Hub respects rate limits before distributing work.
When to Use This Pattern¶
Use when:
- Work can be parallelized
- Same operation across many targets (repos, deployments, files)
- Scaling matters more than simplicity
- Failures should be isolated
Don't use when:
- Tasks must run sequentially
- Coordination overhead exceeds work duration
- Debugging distributed systems is too complex
Real-World Scenarios¶
Scenario 1: File Distribution¶
Hub discovers 75 repositories. Spawns 75 spoke workflows in parallel. Each spoke creates a PR. Hub summarizes: 70 success, 5 failed (protected branches).
Time: about 2 minutes in parallel vs 2.5 hours sequentially (75 repositories at roughly 2 minutes each).
Scenario 2: Deployment Restart¶
Hub receives image push event. Looks up deployments using that image (via ConfigMap cache). Spawns spoke for each deployment. Each spoke restarts independently.
Isolation: One deployment failure doesn't block others.
Scenario 3: Multi-Cluster Operations¶
Hub coordinates operations across 10 Kubernetes clusters. Spawns spoke for each cluster. Spokes execute in parallel using cluster-specific credentials.
Scale: Add clusters without changing hub logic.
Monitoring Hub and Spoke¶
Track hub and spoke metrics separately:
# Prometheus metrics
# Hub duration
argo_workflow_duration_seconds{workflow_template="hub-orchestrator"}
# Spoke success rate
sum(argo_workflow_status{workflow_template="spoke-worker",phase="Succeeded"})
/
sum(argo_workflow_status{workflow_template="spoke-worker"})
# Active spokes
count(argo_workflow_status{workflow_template="spoke-worker",phase="Running"})
Alert when:
- Hub fails (critical: no spokes spawn)
- Spoke failure rate > 10% (degraded operation)
- Spokes stuck running (timeout issue)
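The first two alert conditions reduce to a small classification rule. The sketch below is one hypothetical way to encode them; `classify` and its return labels are assumptions, and the stuck-spoke condition is left out because it depends on a timeout the guide does not specify.

```go
package main

import "fmt"

// classify maps hub and spoke counts onto the alert conditions listed above.
func classify(hubFailed bool, spokesSucceeded, spokesTotal int) string {
	if hubFailed {
		return "critical" // no spokes spawn when the hub fails
	}
	if spokesTotal == 0 {
		return "ok"
	}
	failed := spokesTotal - spokesSucceeded
	// Integer form of failed/total > 10%, which avoids floating point
	if failed*10 > spokesTotal {
		return "degraded"
	}
	return "ok"
}

func main() {
	fmt.Println(classify(false, 99, 100)) // ok
	fmt.Println(classify(false, 85, 100)) // degraded
	fmt.Println(classify(true, 0, 0))     // critical
}
```

A failure rate of exactly 10% stays below the threshold, matching the "> 10%" wording.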
Related Patterns¶
- Separation of Concerns: Hub is orchestrator, spokes are executors
- Three-Stage Design: Discovery → Distribution → Summary
- Matrix Distribution: GitHub Actions equivalent
The hub spawned 100 spokes. 99 succeeded. 1 failed. The hub reported both. The system scaled. The failure was isolated. The operation completed in minutes, not hours.