Unique Identifiers

Make duplicates impossible by design.


The Pattern

# Generate deterministic ID from content
ID="resource-$(sha256sum content.txt | cut -c1-8)"
create_resource "$ID"

When IDs are derived deterministically from content or context, duplicate operations naturally target the same resource instead of creating new ones.

flowchart LR
    A[Content] --> B[Hash Function]
    B --> C[Deterministic ID]
    C --> D[Same Resource]

    A2[Same Content] --> B

    %% Ghostty Hardcore Theme
    style A fill:#65d9ef,color:#1b1d1e
    style A2 fill:#65d9ef,color:#1b1d1e
    style B fill:#9e6ffe,color:#1b1d1e
    style C fill:#fd971e,color:#1b1d1e
    style D fill:#a7e22e,color:#1b1d1e

Content-Addressable by Design

Same input always produces the same ID. Run twice with identical content, get identical IDs, target the same resource. Deduplication is automatic.
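As a quick sketch of that property, hashing the same file twice yields the same ID (filenames here are illustrative):

```shell
# Same content hashed twice produces the same deterministic ID
printf 'hello world\n' > content.txt
ID1="resource-$(sha256sum content.txt | cut -c1-8)"
ID2="resource-$(sha256sum content.txt | cut -c1-8)"
test "$ID1" = "$ID2" && echo "deduplicated: both runs target $ID1"
```

A second run against unchanged content computes the identical ID, so any create/update keyed on it lands on the existing resource.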


When to Use

Good Fit

  • Content-addressable storage (same content = same ID)
  • Batch operations where items need stable identifiers
  • Distributed systems where multiple workers might process the same item
  • Operations that should be deduplicated by content, not time

Poor Fit

  • Resources that need human-readable names
  • Operations where duplicate content should create separate resources
  • Simple cases where check-before-act suffices

Examples

Content-Based Branch Names

# Branch name derived from file content hash
CONTENT_HASH=$(sha256sum "$SOURCE_FILE" | cut -c1-8)
BRANCH="sync-${CONTENT_HASH}"

git checkout -B "$BRANCH"  # -B creates the branch, or resets it if it already exists

If the same content is synced multiple times, it always uses the same branch name.

Deterministic PR Titles

# PR identifier based on source and target
PR_KEY="${SOURCE_REPO}-to-${TARGET_REPO}-${FILE_PATH}"
PR_TITLE="Sync: $PR_KEY"

# Check using the deterministic title
if ! gh pr list --search "in:title $PR_KEY" --json number | jq -e 'length > 0' >/dev/null; then
  gh pr create --title "$PR_TITLE" --body "Automated sync"
fi

Run-Based Identifiers

# Combine workflow context for unique but deterministic ID
RUN_KEY="${GITHUB_REPOSITORY}-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}"

# Use for artifacts, logs, or state files
ARTIFACT_NAME="build-${RUN_KEY}"

Date-Based Identifiers

# Daily report - same day = same file
REPORT_DATE=$(date +%Y-%m-%d)
REPORT_FILE="reports/daily-${REPORT_DATE}.json"

# Running twice on same day overwrites same file (idempotent)
generate_report > "$REPORT_FILE"

Composite Keys

# Combine multiple attributes for uniqueness
DEPLOY_KEY="${ENVIRONMENT}-${SERVICE}-${VERSION}"
TAG="deploy-${DEPLOY_KEY}"

# Tagging is idempotent when tag name is deterministic
git tag -f "$TAG"

Hash-Based Strategies

SHA-256 for Content

# Full hash for maximum uniqueness
HASH=$(sha256sum file.txt | cut -c1-64)

# Short hash for readability (collision risk increases)
SHORT_HASH=$(sha256sum file.txt | cut -c1-8)

Hash Length Tradeoffs

Length     Bits   Collision likely at
4 chars    16     ~256 items
8 chars    32     ~65,000 items
16 chars   64     ~4 billion items

For small datasets, 8 chars is usually fine. For large-scale systems, use 16+.
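Those thresholds follow from the birthday bound: collisions become likely around 2^(bits/2) items, which a one-line loop can confirm:

```shell
# Birthday bound: expect a collision around the square root of the ID space
for bits in 16 32 64; do
  echo "$bits bits -> collision likely near $(( 1 << (bits / 2) )) items"
done
```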

MD5 for Speed (Non-Security)

# Faster but not cryptographically secure
HASH=$(md5sum file.txt | cut -c1-32)

Git Object Hash

# Hash content as git would
HASH=$(git hash-object file.txt)
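Under the hood, git's object ID is SHA-1 over a `blob <size>\0<content>` header plus the file's bytes, which is easy to verify by hand (a sketch; assumes `git` and `sha1sum` are on PATH):

```shell
# git's blob ID is SHA-1 over "blob <size>\0<content>"
printf 'abc' > file.txt                                    # 3 bytes, no newline
GIT_HASH=$(git hash-object file.txt)
MANUAL_HASH=$(printf 'blob 3\0abc' | sha1sum | cut -c1-40)
test "$GIT_HASH" = "$MANUAL_HASH" && echo "hashes agree"
```

Because the header includes the length, identical content always maps to the identical object ID, independent of filename or timestamp.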

Combined Hashes

# Hash multiple inputs together, separated so adjacent values
# can't run together ("ab" + "c" would otherwise equal "a" + "bc")
COMBINED_HASH=$(printf '%s\n' "$INPUT1" "$INPUT2" "$INPUT3" | sha256sum | cut -c1-16)

GitHub Actions Examples

Content-Addressed Cache Keys

- uses: actions/cache@v4
  with:
    path: node_modules
    # Key derived from lock file content
    key: npm-${{ hashFiles('package-lock.json') }}

GitHub's hashFiles Function

hashFiles() is purpose-built for deterministic cache keys. It handles glob patterns, sorts files consistently, and produces stable SHA-256 hashes across runs.

Deterministic Artifact Names

- name: Generate artifact name
  id: artifact
  run: |
    # Artifact name from commit + workflow
    NAME="build-${{ github.sha }}-${{ github.run_number }}"
    echo "name=$NAME" >> "$GITHUB_OUTPUT"

- uses: actions/upload-artifact@v4
  with:
    name: ${{ steps.artifact.outputs.name }}
    path: dist/

Matrix-Based Identifiers

jobs:
  build:
    strategy:
      matrix:
        os: [ubuntu, windows, macos]
        node: [18, 20]
    steps:
      - name: Set cache key
        run: |
          # Unique per matrix combination
          KEY="deps-${{ matrix.os }}-node${{ matrix.node }}-${{ hashFiles('**/package-lock.json') }}"
          echo "CACHE_KEY=$KEY" >> "$GITHUB_ENV"

Database Patterns

Natural Keys

-- Use business identifier instead of auto-increment
CREATE TABLE users (
  email VARCHAR(255) PRIMARY KEY,  -- Natural key
  name VARCHAR(255)
);

-- Insert is idempotent on email
INSERT INTO users (email, name)
VALUES ('user@example.com', 'User')
ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;

Idempotency Keys

-- Track processed requests by unique key
CREATE TABLE processed_requests (
  idempotency_key VARCHAR(64) PRIMARY KEY,
  result JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Check before processing
SELECT result FROM processed_requests WHERE idempotency_key = $1;

-- Or claim the key atomically; zero rows returned means a duplicate request
INSERT INTO processed_requests (idempotency_key, result)
VALUES ($1, $2)
ON CONFLICT (idempotency_key) DO NOTHING
RETURNING idempotency_key;

Edge Cases and Gotchas

Hash Collisions

Short hashes increase collision probability:

# 8 hex chars = 32 bits = collision likely around 65k items
SHORT_HASH=$(sha256sum file.txt | cut -c1-8)

Mitigation: Use longer hashes for large datasets, or accept rare collisions.
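Another hedge is to record the full hash next to each short ID and fail loudly on mismatch. A file-based sketch (the `.id-registry` path is hypothetical):

```shell
# Store the full hash per short ID; a mismatch means two contents share a prefix
mkdir -p .id-registry
FULL_HASH=$(printf 'example payload\n' | sha256sum | cut -c1-64)
SHORT_ID=$(printf '%s' "$FULL_HASH" | cut -c1-8)
REGISTRY_FILE=".id-registry/$SHORT_ID"

if [ -f "$REGISTRY_FILE" ] && [ "$(cat "$REGISTRY_FILE")" != "$FULL_HASH" ]; then
  echo "prefix collision detected on $SHORT_ID" >&2
  exit 1
fi
printf '%s' "$FULL_HASH" > "$REGISTRY_FILE"
```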

Unstable Inputs

Non-deterministic content produces different hashes:

# Bad: timestamp changes hash every run
echo "Generated at $(date)" > file.txt
HASH=$(sha256sum file.txt | cut -c1-8)

# Good: exclude timestamps from hashed content
echo "Static content" > file.txt
HASH=$(sha256sum file.txt | cut -c1-8)

Whitespace and Encoding

Different platforms might produce different hashes:

# Windows vs Unix line endings
printf 'line1\r\nline2\n' | sha256sum  # Different from
printf 'line1\nline2\n' | sha256sum

Mitigation: Normalize content before hashing.
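For line endings specifically, stripping carriage returns before hashing makes both platforms agree (a sketch; `norm_hash` is a name invented here):

```shell
# Hash after normalizing CRLF to LF so Windows and Unix content match
norm_hash() { tr -d '\r' < "$1" | sha256sum | cut -c1-8; }

printf 'line1\r\nline2\n' > windows.txt
printf 'line1\nline2\n'   > unix.txt
test "$(norm_hash windows.txt)" = "$(norm_hash unix.txt)" && echo "normalized hashes match"
```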

Hash Algorithm Changes

Changing algorithms invalidates all existing IDs:

# Migration nightmare if you switch from MD5 to SHA-256
OLD_ID=$(md5sum file.txt | cut -c1-32)
NEW_ID=$(sha256sum file.txt | cut -c1-64)  # Different!
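One way to soften such a migration is to embed the algorithm name in the ID itself, so old and new schemes can coexist during a transition (a sketch, not from the original):

```shell
# Prefix the ID with its algorithm; consumers can dispatch on the prefix
printf 'payload\n' > file.txt
ID="sha256:$(sha256sum file.txt | cut -c1-16)"

case "$ID" in
  sha256:*) echo "current-scheme ID: $ID" ;;
  md5:*)    echo "legacy-scheme ID: $ID" ;;
esac
```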

Anti-Patterns

Random Components in "Deterministic" IDs

# Bad: includes random element
ID="resource-$(date +%s)-$RANDOM"

# Good: fully deterministic
ID="resource-$(sha256sum content.txt | cut -c1-16)"

Hashing Volatile Data

# Bad: process ID changes every run
ID="job-$$-$(sha256sum file.txt | cut -c1-8)"

# Good: only hash stable content
ID="job-$(sha256sum file.txt | cut -c1-8)"

Ignoring Hash Prefix Collisions

# Risky with many items
ID=$(sha256sum file.txt | cut -c1-4)  # Only 16 bits!

# Safer
ID=$(sha256sum file.txt | cut -c1-16)  # 64 bits

Comparison with Other Patterns

Aspect                     Check-Before-Act   Upsert   Unique Identifiers
Requires existence check   Yes                No       No
Natural deduplication      No                 No       Yes
ID predictability          N/A                N/A      High
Complexity                 Low                Low      Medium

Summary

Unique identifiers make idempotency automatic: when the ID is a pure function of content or context, repeated operations converge on the same resource.

Key Takeaways

  1. Derive IDs from content - same input = same ID = same resource
  2. Use sufficient hash length - balance readability vs collision risk
  3. Keep inputs stable - don't hash timestamps or random data
  4. Consider the scope - workflow run, day, content version?
