Unique Identifiers¶
Make duplicates impossible by design.
The Pattern¶
# Generate deterministic ID from content
ID="resource-$(sha256sum content.txt | cut -c1-8)"
create_resource "$ID"
When IDs are derived deterministically from content or context, duplicate operations naturally target the same resource instead of creating new ones.
flowchart LR
A[Content] --> B[Hash Function]
B --> C[Deterministic ID]
C --> D[Same Resource]
A2[Same Content] --> B
%% Ghostty Hardcore Theme
style A fill:#65d9ef,color:#1b1d1e
style A2 fill:#65d9ef,color:#1b1d1e
style B fill:#9e6ffe,color:#1b1d1e
style C fill:#fd971e,color:#1b1d1e
style D fill:#a7e22e,color:#1b1d1e
Content-Addressable by Design
Same input always produces the same ID. Run twice with identical content, get identical IDs, target the same resource. Deduplication is automatic.
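The property is easy to check from a shell. The following is a minimal sketch, assuming GNU coreutils `sha256sum` is available; the `content_id` helper and the file names are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: derive a resource ID from a file's content.
content_id() {
  printf 'resource-%s\n' "$(sha256sum "$1" | cut -c1-8)"
}

workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/a.txt"
printf 'hello\n' > "$workdir/b.txt"   # same bytes, different file name
printf 'world\n' > "$workdir/c.txt"   # different bytes

ID_A=$(content_id "$workdir/a.txt")
ID_B=$(content_id "$workdir/b.txt")
ID_C=$(content_id "$workdir/c.txt")

# Identical content yields identical IDs; different content gets a different one here.
[ "$ID_A" = "$ID_B" ] && echo "a and b share an ID: $ID_A"
[ "$ID_A" != "$ID_C" ] && echo "c gets its own ID: $ID_C"
```

The file name plays no part in the ID, which is why two workers holding separate copies of the same content still agree on the resource.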
When to Use¶
Good Fit
- Content-addressable storage (same content = same ID)
- Batch operations where items need stable identifiers
- Distributed systems where multiple workers might process the same item
- Operations that should be deduplicated by content, not time
Poor Fit
- Resources that need human-readable names
- Operations where duplicate content should create separate resources
- Simple cases where check-before-act suffices
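For the distributed-worker case in the Good Fit list, a deterministic ID lets workers coordinate without talking to each other. A minimal sketch, assuming all workers share a filesystem on which `mkdir` is atomic (true for local POSIX filesystems, not guaranteed on every network mount); `CLAIM_DIR` and `claim_item` are hypothetical names.

```shell
# Hypothetical shared directory where workers record claimed items.
CLAIM_DIR=$(mktemp -d)

# Try to claim an item by its content-derived ID.
# Returns 0 for the first caller and non-zero for every later one.
claim_item() {
  item_id="item-$(sha256sum "$1" | cut -c1-16)"
  mkdir "$CLAIM_DIR/$item_id" 2>/dev/null
}

item=$(mktemp)
printf 'payload\n' > "$item"

# Two "workers" see the same content and race for the same ID.
if claim_item "$item"; then FIRST=claimed; else FIRST=skipped; fi
if claim_item "$item"; then SECOND=claimed; else SECOND=skipped; fi
echo "first worker: $FIRST, second worker: $SECOND"
```

Because both workers compute the same directory name, the second `mkdir` fails and the item is processed once.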
Examples¶
Content-Based Branch Names¶
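# Branch name derived from file content hash
CONTENT_HASH=$(sha256sum "$SOURCE_FILE" | cut -c1-8)
BRANCH="sync-${CONTENT_HASH}"
# Reuse the branch if an earlier run created it; plain `checkout -b` fails when it already exists
git checkout "$BRANCH" 2>/dev/null || git checkout -b "$BRANCH"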
If the same content is synced multiple times, it always uses the same branch name.
Deterministic PR Titles¶
# PR identifier based on source and target
PR_KEY="${SOURCE_REPO}-to-${TARGET_REPO}-${FILE_PATH}"
PR_TITLE="Sync: $PR_KEY"
# Check using the deterministic title; the search is fuzzy, so compare titles exactly
if ! gh pr list --search "in:title $PR_KEY" --json title |
    jq -e --arg title "$PR_TITLE" 'any(.[]; .title == $title)' >/dev/null; then
  gh pr create --title "$PR_TITLE" --body "Automated sync"
fi
Run-Based Identifiers¶
# Combine workflow context for unique but deterministic ID
# GITHUB_REPOSITORY is "owner/repo"; replace the slash, which artifact names do not allow
RUN_KEY="${GITHUB_REPOSITORY//\//-}-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}"
# Use for artifacts, logs, or state files
ARTIFACT_NAME="build-${RUN_KEY}"
Date-Based Identifiers¶
# Daily report - same day = same file
# Use UTC so the day boundary does not depend on the machine's time zone
REPORT_DATE=$(date -u +%Y-%m-%d)
REPORT_FILE="reports/daily-${REPORT_DATE}.json"
# Running twice on same day overwrites same file (idempotent)
generate_report > "$REPORT_FILE"
Composite Keys¶
# Combine multiple attributes for uniqueness
DEPLOY_KEY="${ENVIRONMENT}-${SERVICE}-${VERSION}"
TAG="deploy-${DEPLOY_KEY}"
# Tagging is idempotent when tag name is deterministic
git tag -f "$TAG"
Hash-Based Strategies¶
SHA-256 for Content¶
# Full hash for maximum uniqueness
HASH=$(sha256sum file.txt | cut -c1-64)
# Short hash for readability (collision risk increases)
SHORT_HASH=$(sha256sum file.txt | cut -c1-8)
Hash Length Tradeoffs
| Length | Bits | Collision likely at |
|---|---|---|
| 4 chars | 16 | ~256 items |
| 8 chars | 32 | ~65,000 items |
| 16 chars | 64 | ~4 billion items |
For datasets of up to a few thousand items, 8 chars is usually fine (roughly a 0.1% chance of any collision at 3,000 items). For large-scale systems, use 16+.
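The thresholds in the table follow from the birthday bound: with b bits, collisions become likely near 2^(b/2) items. A minimal sketch of the standard approximation p ≈ 1 − exp(−n(n−1) / 2^(b+1)), assuming hash prefixes behave like uniformly random values; `collision_probability` is a hypothetical helper built on `awk`.

```shell
# Approximate probability of at least one collision among N items
# when IDs are BITS-bit prefixes of a uniformly distributed hash.
collision_probability() {
  awk -v bits="$1" -v n="$2" 'BEGIN {
    p = 1 - exp(-(n * (n - 1)) / (2 * 2 ^ bits))
    printf "%.4f\n", p
  }'
}

P_SMALL=$(collision_probability 32 3000)    # 8 hex chars, small dataset
P_EDGE=$(collision_probability 32 65536)    # 8 hex chars, at the table's threshold
P_LONG=$(collision_probability 64 65536)    # 16 hex chars, same dataset
echo "$P_SMALL $P_EDGE $P_LONG"
```

At the table's 8-char threshold the estimate comes out a bit under 40%, so "likely" there means "no longer negligible" rather than "certain". Moving the same dataset to 16 chars pushes the estimate below the four printed decimals.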
MD5 for Speed (Non-Security)¶
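A minimal sketch of what this might look like, assuming `md5sum` from GNU coreutils is available. MD5 is broken for security purposes, so this is only reasonable where inputs are trusted and the hash serves purely as a quick fingerprint.

```shell
# MD5 as a non-security fingerprint (never use it for security decisions).
sample=$(mktemp)
printf 'example content\n' > "$sample"

MD5_ID=$(md5sum "$sample" | cut -c1-32)     # full 128-bit digest, 32 hex chars
MD5_SHORT=$(md5sum "$sample" | cut -c1-8)   # short prefix, same tradeoffs as the table above
echo "full: $MD5_ID short: $MD5_SHORT"
```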
Git Object Hash¶
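A minimal sketch, assuming `git` is installed. `git hash-object` prints the ID Git would assign to a file's content as a blob, and writes nothing to the object database unless `-w` is passed, so identical content gets an identical ID wherever the same object format is in use.

```shell
# Compute the Git blob ID for some content; nothing is written without -w.
blob=$(mktemp)
printf 'hello\n' > "$blob"

GIT_ID=$(git hash-object "$blob")                           # from a file
GIT_ID_STDIN=$(printf 'hello\n' | git hash-object --stdin)  # same bytes from stdin
echo "$GIT_ID"
```

Git hashes a short header together with the content, so this ID differs from what `sha1sum` prints for the same file.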
Combined Hashes¶
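# Hash multiple inputs together, NUL-terminating each one so that
# ("ab", "c") and ("a", "bc") do not produce the same hash
COMBINED_HASH=$(printf '%s\0' "$INPUT1" "$INPUT2" "$INPUT3" | sha256sum | cut -c1-16)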
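Plain concatenation can make different input tuples hash identically, because `ab` + `c` and `a` + `bc` produce the same byte stream. A minimal sketch of one way around this, terminating each input with a NUL byte, which cannot occur inside a shell variable; `combined_hash` is a hypothetical helper.

```shell
# Hash several inputs with an unambiguous terminator after each one.
combined_hash() {
  printf '%s\0' "$@" | sha256sum | cut -c1-16
}

# Naive concatenation: two different input pairs collide.
NAIVE_1=$(printf '%s' "ab" "c" | sha256sum | cut -c1-16)
NAIVE_2=$(printf '%s' "a" "bc" | sha256sum | cut -c1-16)

# Delimited: the same pairs stay distinct.
SAFE_1=$(combined_hash "ab" "c")
SAFE_2=$(combined_hash "a" "bc")

[ "$NAIVE_1" = "$NAIVE_2" ] && echo "naive concatenation collides"
[ "$SAFE_1" != "$SAFE_2" ] && echo "delimited inputs stay distinct"
```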
GitHub Actions Examples¶
Content-Addressed Cache Keys¶
- uses: actions/cache@v4
  with:
    path: node_modules
    # Key derived from lock file content
    key: npm-${{ hashFiles('package-lock.json') }}
GitHub's hashFiles Function
hashFiles() is purpose-built for deterministic cache keys. It handles glob patterns, sorts files consistently, and produces stable SHA-256 hashes across runs.
Deterministic Artifact Names¶
- name: Generate artifact name
  id: artifact
  run: |
    # Artifact name from commit + workflow
    NAME="build-${{ github.sha }}-${{ github.run_number }}"
    echo "name=$NAME" >> "$GITHUB_OUTPUT"
- uses: actions/upload-artifact@v4
  with:
    name: ${{ steps.artifact.outputs.name }}
    path: dist/
Matrix-Based Identifiers¶
jobs:
  build:
    # Each matrix entry selects the matching hosted runner
    runs-on: ${{ matrix.os }}-latest
    strategy:
      matrix:
        os: [ubuntu, windows, macos]
        node: [18, 20]
    steps:
      - name: Set cache key
        shell: bash
        run: |
          # Unique per matrix combination
          KEY="deps-${{ matrix.os }}-node${{ matrix.node }}-${{ hashFiles('**/package-lock.json') }}"
          echo "CACHE_KEY=$KEY" >> "$GITHUB_ENV"
Database Patterns¶
Natural Keys¶
-- Use business identifier instead of auto-increment
CREATE TABLE users (
email VARCHAR(255) PRIMARY KEY, -- Natural key
name VARCHAR(255)
);
-- Insert is idempotent on email
INSERT INTO users (email, name)
VALUES ('user@example.com', 'User')
ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;
Idempotency Keys¶
-- Track processed requests by unique key
CREATE TABLE processed_requests (
idempotency_key VARCHAR(64) PRIMARY KEY,
result JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Check before processing
SELECT result FROM processed_requests WHERE idempotency_key = $1;
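The table accepts any key of up to 64 characters. One option, in keeping with the rest of this page, is to derive the key from the request itself: a full SHA-256 hex digest is exactly 64 characters, so it fits the `VARCHAR(64)` column. A minimal sketch; the `idempotency_key` helper and the request fields are hypothetical.

```shell
# Derive an idempotency key from the parts of a request that define "the same request".
idempotency_key() {
  printf '%s\0' "$@" | sha256sum | cut -c1-64
}

KEY_1=$(idempotency_key "POST" "/payments" '{"amount":100,"to":"acct-42"}')
KEY_2=$(idempotency_key "POST" "/payments" '{"amount":100,"to":"acct-42"}')   # retry
KEY_3=$(idempotency_key "POST" "/payments" '{"amount":250,"to":"acct-42"}')   # new request

echo "retry reuses key: $([ "$KEY_1" = "$KEY_2" ] && echo yes || echo no)"
```

A content-derived key also merges two requests that are intentionally identical, so where repeats must stay separate a client-supplied key is the better source.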
Edge Cases and Gotchas¶
Hash Collisions¶
Short hashes increase collision probability:
# 8 hex chars = 32 bits = collision likely around 65k items
SHORT_HASH=$(sha256sum file.txt | cut -c1-8)
Mitigation: Use longer hashes for large datasets, or accept rare collisions where two distinct inputs landing on the same resource is tolerable.
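Where a silent collision is not acceptable but short IDs are still wanted, collisions can at least be detected. A minimal sketch, assuming a local directory can serve as a registry; `REGISTRY_DIR` and `register_id` are hypothetical names, and the check is not safe against concurrent writers.

```shell
# Hypothetical registry: one file per short ID, holding the full hash it stands for.
REGISTRY_DIR=$(mktemp -d)

# Print the short ID for a file, or fail if that ID already belongs to other content.
register_id() {
  full=$(sha256sum "$1" | cut -c1-64)
  short=$(printf '%s' "$full" | cut -c1-8)
  entry="$REGISTRY_DIR/$short"
  if [ -f "$entry" ] && [ "$(cat "$entry")" != "$full" ]; then
    echo "collision: $short already maps to different content" >&2
    return 1
  fi
  printf '%s\n' "$full" > "$entry"
  printf '%s\n' "$short"
}

doc=$(mktemp)
printf 'some content\n' > "$doc"
FIRST_ID=$(register_id "$doc")    # registers the ID
SECOND_ID=$(register_id "$doc")   # same content, same ID, no error
echo "$FIRST_ID $SECOND_ID"
```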
Unstable Inputs¶
Non-deterministic content produces different hashes:
# Bad: timestamp changes hash every run
echo "Generated at $(date)" > file.txt
HASH=$(sha256sum file.txt | cut -c1-8)
# Good: exclude timestamps from hashed content
echo "Static content" > file.txt
HASH=$(sha256sum file.txt | cut -c1-8)
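When the timestamp has to stay in the file, another option is to filter it out just before hashing. A minimal sketch, assuming the volatile line can be recognized by a fixed prefix (here `Generated at`); `stable_hash` is a hypothetical helper.

```shell
# Hash a file while ignoring lines that are known to change between runs.
stable_hash() {
  grep -v '^Generated at ' "$1" | sha256sum | cut -c1-8
}

report=$(mktemp)
printf 'Generated at %s\nStatic content\n' "2024-01-01T00:00:00Z" > "$report"
HASH_RUN_1=$(stable_hash "$report")

# A later run rewrites the file with a new timestamp but the same body.
printf 'Generated at %s\nStatic content\n' "2024-06-30T12:34:56Z" > "$report"
HASH_RUN_2=$(stable_hash "$report")

[ "$HASH_RUN_1" = "$HASH_RUN_2" ] && echo "hash unchanged: $HASH_RUN_1"
```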
Whitespace and Encoding¶
The same text can hash differently across platforms because of line endings or character encoding:
# Windows (CRLF) vs Unix (LF) line endings
printf 'line1\r\nline2\n' | sha256sum # Different from
printf 'line1\nline2\n' | sha256sum
Mitigation: Normalize content before hashing.
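For line endings, normalization can be as small as dropping carriage returns before hashing. A minimal sketch, assuming the inputs are text files where a carriage return never carries meaning; `normalized_hash` is a hypothetical helper, and encoding differences would need a separate step.

```shell
# Hash text with carriage returns removed so CRLF and LF files agree.
normalized_hash() {
  tr -d '\r' < "$1" | sha256sum | cut -c1-64
}

unix_file=$(mktemp)
dos_file=$(mktemp)
printf 'line1\nline2\n' > "$unix_file"
printf 'line1\r\nline2\r\n' > "$dos_file"

RAW_UNIX=$(sha256sum "$unix_file" | cut -c1-64)
RAW_DOS=$(sha256sum "$dos_file" | cut -c1-64)
NORM_UNIX=$(normalized_hash "$unix_file")
NORM_DOS=$(normalized_hash "$dos_file")

[ "$RAW_UNIX" != "$RAW_DOS" ] && echo "raw hashes differ"
[ "$NORM_UNIX" = "$NORM_DOS" ] && echo "normalized hashes match"
```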
Hash Algorithm Changes¶
Changing algorithms invalidates all existing IDs:
# Migration nightmare if you switch from MD5 to SHA-256
OLD_ID=$(md5sum file.txt | cut -c1-32)
NEW_ID=$(sha256sum file.txt | cut -c1-64) # Different!
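One way to soften such a migration, offered as a sketch and not as something this page prescribes, is to record the algorithm in the ID itself so that old and new IDs can coexist and be told apart. `make_id` is a hypothetical helper.

```shell
# Build an ID that names its own hash algorithm, e.g. "sha256:<hex>".
make_id() {
  algo=$1
  file=$2
  case "$algo" in
    md5)    digest=$(md5sum "$file" | cut -c1-32) ;;
    sha256) digest=$(sha256sum "$file" | cut -c1-64) ;;
    *)      echo "unsupported algorithm: $algo" >&2; return 1 ;;
  esac
  printf '%s:%s\n' "$algo" "$digest"
}

data=$(mktemp)
printf 'payload\n' > "$data"
OLD_STYLE=$(make_id md5 "$data")      # ID as an older system might have stored it
NEW_STYLE=$(make_id sha256 "$data")   # ID after switching algorithms
echo "$OLD_STYLE"
echo "$NEW_STYLE"
```

A lookup can then dispatch on the prefix and recompute with the matching algorithm, without having to guess from the digest length.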
Anti-Patterns¶
Random Components in "Deterministic" IDs¶
# Bad: includes random element
ID="resource-$(date +%s)-$RANDOM"
# Good: fully deterministic
ID="resource-$(sha256sum content.txt | cut -c1-16)"
Hashing Volatile Data¶
# Bad: process ID changes every run
ID="job-$$-$(sha256sum file.txt | cut -c1-8)"
# Good: only hash stable content
ID="job-$(sha256sum file.txt | cut -c1-8)"
Ignoring Hash Prefix Collisions¶
# Risky with many items
ID=$(sha256sum file.txt | cut -c1-4) # Only 16 bits!
# Safer
ID=$(sha256sum file.txt | cut -c1-16) # 64 bits
Comparison with Other Patterns¶
| Aspect | Check-Before-Act | Upsert | Unique Identifiers |
|---|---|---|---|
| Requires existence check | Yes | No | No |
| Natural deduplication | No | No | Yes |
| ID predictability | N/A | N/A | High |
| Complexity | Low | Low | Medium |
Summary¶
Unique identifiers make idempotency automatic.
Key Takeaways
- Derive IDs from content - same input = same ID = same resource
- Use sufficient hash length - balance readability vs collision risk
- Keep inputs stable - don't hash timestamps or random data
- Consider the scope - workflow run, day, content version?