What is Content Addressing?
In a content-addressable system, the name of an object is derived from its content, not chosen arbitrarily. This has powerful implications:
Same content → Same hash → Stored once
If two files (or file chunks) have identical content, they produce the same hash and are stored only once, regardless of their original names or locations.
Contrast with Traditional Storage
| Traditional (location-addressed) | Content-addressed |
|---|---|
| Files identified by path: /documents/report.pdf | Objects identified by hash: filemeta/4b8f1a... |
| Moving a file changes its identity | Moving a file doesn’t change its content hash |
| Duplicates stored separately | Duplicates automatically deduplicated |
| No integrity verification by default | Built-in corruption detection |
Object Addressing Scheme
All Cloudstic objects follow a flat namespace convention:
AGENTS.md:70-80
Examples:
- chunk/a3f5b8c9d1e2f4a6... - A file data chunk
- content/7d9e2c5f8b3a1d4e... - A content manifest
- filemeta/4b8f1a2c3d5e6f7a... - File metadata
- node/c9d2e5f7a8b1c3d4... - HAMT tree node
- snapshot/6e3a9f1c4b7d2e5a... - A backup snapshot
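The key layout in the examples above can be sketched in a few lines of Go. This is a minimal illustration, not the project's code: the function name `objectKey` is hypothetical, and it uses plain SHA-256 over raw bytes to show the "same content, same key" property.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// objectKey builds a "<type>/<hex hash>" key in a flat namespace.
// The name is hypothetical; only the key layout is taken from the examples.
func objectKey(objType string, data []byte) string {
	sum := sha256.Sum256(data)
	return objType + "/" + hex.EncodeToString(sum[:])
}

func main() {
	a := objectKey("content", []byte("hello"))
	b := objectKey("content", []byte("hello"))
	c := objectKey("content", []byte("hello!"))

	fmt.Println(a == b) // identical content yields the identical key
	fmt.Println(a == c) // a one-byte difference yields a different key
	fmt.Println(a)
}
```

Because the key depends only on the type prefix and the bytes, two writers that never coordinate still agree on where an object lives.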
Hash Functions by Object Type
Different object types use different hash functions for security and deduplication purposes.
Chunks: HMAC-SHA256 (when encrypted)
Chunks use keyed HMAC-SHA256 to prevent storage providers from confirming file contents by hashing known plaintext.
pkg/crypto/crypto.go:124-133 and internal/engine/chunker.go:168-169
When encryption is disabled, plain SHA-256 is used:
internal/engine/chunker.go:171
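A minimal sketch of this choice, using only Go's standard library. The function name `chunkKey` and the convention that an empty key means "encryption disabled" are assumptions made for illustration; the referenced source files define the real behavior.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkKey returns the storage key for a chunk. With a non-empty dedup key
// it uses HMAC-SHA256; with an empty key (assumed here to mean encryption
// is disabled) it falls back to plain SHA-256.
func chunkKey(data, dedupKey []byte) string {
	if len(dedupKey) == 0 {
		sum := sha256.Sum256(data)
		return "chunk/" + hex.EncodeToString(sum[:])
	}
	mac := hmac.New(sha256.New, dedupKey)
	mac.Write(data) // hash.Hash.Write never returns an error
	return "chunk/" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	data := []byte("example chunk bytes")
	fmt.Println(chunkKey(data, nil))                // plain SHA-256
	fmt.Println(chunkKey(data, []byte("secret-a"))) // keyed
	fmt.Println(chunkKey(data, []byte("secret-b"))) // different key, different address
}
```

The same bytes map to unrelated addresses under different keys, which is why a keyed hash still deduplicates within one repository but reveals nothing to someone who lacks the key.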
All Other Objects: SHA-256
Metadata objects (content/, filemeta/, node/, snapshot/) use plain SHA-256 of their canonical JSON representation:
internal/core/models.go:45-51
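The sketch below shows the general idea of hashing a deterministic JSON encoding. The `fileMeta` struct and its field names are hypothetical, and Go's `encoding/json` stands in for whatever canonicalization the project uses: it emits struct fields in declaration order and sorts map keys, so equal values always serialize to equal bytes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// fileMeta is a hypothetical metadata shape, not the project's real model.
type fileMeta struct {
	Name        string `json:"name"`
	Size        int64  `json:"size"`
	ContentHash string `json:"content_hash"`
}

// metaKey serializes v deterministically and hashes the bytes with SHA-256.
func metaKey(objType string, v any) (string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return objType + "/" + hex.EncodeToString(sum[:]), nil
}

func main() {
	m := fileMeta{Name: "report.pdf", Size: 1024, ContentHash: "7d9e2c5f"}
	k1, _ := metaKey("filemeta", m)
	k2, _ := metaKey("filemeta", m)
	m.Size = 2048
	k3, _ := metaKey("filemeta", m)

	fmt.Println(k1 == k2) // equal values hash to the same key
	fmt.Println(k1 == k3) // any field change produces a new key
}
```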
Deduplication at Two Levels
Cloudstic performs deduplication at both the chunk level and the content level.
Chunk-Level Deduplication
Before writing a chunk, the engine checks if it already exists:
internal/engine/chunker.go:166-186
This means:
- Identical 1MB regions across different files share a single stored chunk
- Within a single large file, any repeated 1MB segments are stored once
- Across snapshots, unchanged chunks are never re-uploaded
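The check-then-write pattern can be sketched against an in-memory store. Everything named here (`memStore`, `writeChunk`) is hypothetical and exists only to make the behavior observable; plain SHA-256 is used for brevity.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// memStore is a hypothetical in-memory stand-in for an object store.
type memStore struct {
	objects map[string][]byte
	writes  int // number of uploads actually performed
}

func newMemStore() *memStore { return &memStore{objects: map[string][]byte{}} }

func (s *memStore) Exists(key string) bool {
	_, ok := s.objects[key]
	return ok
}

func (s *memStore) Put(key string, data []byte) {
	s.objects[key] = append([]byte(nil), data...) // store a private copy
	s.writes++
}

// writeChunk uploads the chunk only if its key is absent and reports
// whether an upload happened.
func writeChunk(s *memStore, data []byte) (key string, uploaded bool) {
	sum := sha256.Sum256(data)
	key = "chunk/" + hex.EncodeToString(sum[:])
	if s.Exists(key) {
		return key, false
	}
	s.Put(key, data)
	return key, true
}

func main() {
	s := newMemStore()
	for _, c := range [][]byte{[]byte("AAAA"), []byte("BBBB"), []byte("AAAA")} {
		key, uploaded := writeChunk(s, c)
		fmt.Println(key[:14], uploaded)
	}
	fmt.Println("uploads:", s.writes) // the repeated chunk is not uploaded again
}
```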
Content-Level Deduplication
Before chunking a file, the engine checks if the entire file’s content already exists:
internal/engine/backup.go (BackupManager)
This means:
- A “new” file that’s identical to a previously backed-up file uploads zero bytes
- Only a new filemeta object is created (a few hundred bytes)
- The existing content and chunk objects are reused
When encryption is enabled, each tenant has a unique dedup key, so cross-tenant deduplication does not occur. This is by design for privacy.
docs/encryption.md:157-170
The Content Addressing Pipeline
Here’s how a file flows through the content-addressing pipeline during backup:
internal/engine/chunker.go:46-164
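As a rough illustration of the flow from file bytes to chunk, content, and filemeta objects, here is a simplified sketch. It rests on several assumptions: fixed-size chunking with a tiny chunk size, plain SHA-256 everywhere (the unencrypted case), hypothetical struct fields, and a dedup check that happens when each object is written rather than before chunking.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

const chunkSize = 4 // deliberately tiny so the example is easy to trace

// store is a hypothetical in-memory object store keyed by "<type>/<hash>".
type store map[string][]byte

// put writes data under its content-derived key unless it already exists.
func (s store) put(objType string, data []byte) string {
	sum := sha256.Sum256(data)
	key := objType + "/" + hex.EncodeToString(sum[:])
	if _, ok := s[key]; !ok {
		s[key] = append([]byte(nil), data...)
	}
	return key
}

// content and fileMeta are hypothetical shapes for the two metadata layers.
type content struct {
	Chunks []string `json:"chunks"`
}

type fileMeta struct {
	Name    string `json:"name"`
	Content string `json:"content"`
}

// backupFile chunks the data, stores a manifest of chunk keys, then stores
// file metadata that points at the manifest. It returns the filemeta key.
func backupFile(s store, name string, data []byte) (string, error) {
	var c content
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		c.Chunks = append(c.Chunks, s.put("chunk", data[off:end]))
	}
	manifest, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	meta, err := json.Marshal(fileMeta{Name: name, Content: s.put("content", manifest)})
	if err != nil {
		return "", err
	}
	return s.put("filemeta", meta), nil
}

func main() {
	s := store{}
	a, _ := backupFile(s, "a.txt", []byte("abcdabcdxy"))
	fmt.Println("objects after first file:", len(s))
	b, _ := backupFile(s, "copy-of-a.txt", []byte("abcdabcdxy"))
	fmt.Println("objects after identical copy:", len(s))
	fmt.Println("distinct filemeta keys:", a != b)
}
```

The first file produces two unique chunks (the repeated `abcd` is stored once), one manifest, and one filemeta object. The identical copy adds only a second filemeta object, because its name differs while its chunks and manifest already exist.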
Structural Sharing via Merkle Tree
The HAMT (Hash Array Mapped Trie) is a Merkle tree, where:
- Each node’s hash depends on its children’s hashes
- Only nodes along the path of a change need to be rewritten
- Unchanged subtrees are reused by reference
Merkle trees enable structural sharing: two snapshots that differ in only 10 files share 99%+ of their tree nodes.
Example: Modifying One File
- A new filemeta object is created (different content hash)
- The HAMT leaf containing that file is rewritten
- All parent nodes up to the root are rewritten (hashes change)
- All other nodes (A, C, D, F) are reused by reference
For a backup in which 100 files changed, the new metadata amounts to roughly:
- 100 new filemeta objects (~10KB each = 1MB)
- ~20 new HAMT nodes (changed leaves + ancestors = ~5KB)
- Total metadata overhead: ~1MB instead of re-writing the entire tree
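The path-copying behavior described above can be demonstrated with a much simpler structure than a HAMT. The sketch below swaps in a plain binary Merkle tree over a fixed list of leaves; all names are hypothetical. The point it illustrates is the same: replacing one leaf creates new nodes only along the path to the root.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// node is immutable once built; its hash covers the leaf value or the
// hashes of both children.
type node struct {
	left, right *node
	hash        string
}

func leaf(value string) *node {
	sum := sha256.Sum256([]byte("leaf:" + value))
	return &node{hash: hex.EncodeToString(sum[:])}
}

func branch(l, r *node) *node {
	sum := sha256.Sum256([]byte("node:" + l.hash + r.hash))
	return &node{left: l, right: r, hash: hex.EncodeToString(sum[:])}
}

// build creates a balanced tree; len(values) must be a power of two.
func build(values []string) *node {
	if len(values) == 1 {
		return leaf(values[0])
	}
	mid := len(values) / 2
	return branch(build(values[:mid]), build(values[mid:]))
}

// update returns a new root with leaf i replaced. Subtrees off the path
// are reused by pointer rather than copied.
func update(n *node, size, i int, value string) *node {
	if size == 1 {
		return leaf(value)
	}
	half := size / 2
	if i < half {
		return branch(update(n.left, half, i, value), n.right)
	}
	return branch(n.left, update(n.right, half, i-half, value))
}

func collect(n *node, seen map[string]bool) {
	if n == nil {
		return
	}
	seen[n.hash] = true
	collect(n.left, seen)
	collect(n.right, seen)
}

// newNodes counts hashes reachable from newRoot but not from oldRoot.
func newNodes(oldRoot, newRoot *node) int {
	old, cur := map[string]bool{}, map[string]bool{}
	collect(oldRoot, old)
	collect(newRoot, cur)
	count := 0
	for h := range cur {
		if !old[h] {
			count++
		}
	}
	return count
}

func main() {
	files := []string{"f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"}
	v1 := build(files)
	v2 := update(v1, len(files), 5, "f5-modified")
	fmt.Println("new nodes written:", newNodes(v1, v2)) // one leaf plus its three ancestors
}
```

With 8 leaves the tree has 15 nodes, and the second version writes only 4 of them. The number of rewritten nodes grows with tree depth, not with the total number of files.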
internal/hamt/hamt.go:183-303 and docs/spec.md:306-320
Content Addressing and Security
Preventing Plaintext Hash Leakage
If chunks were addressed by plain SHA-256, a malicious storage provider could:
- Hash known plaintext (e.g., “password123”)
- Check if chunk/<hash> exists in your backup
- Confirm you have that content without decrypting anything
Cloudstic prevents this by using HMAC-SHA256 keyed by a secret dedup key derived from your encryption key.
docs/encryption.md:150-155
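The difference can be made concrete with a small simulation. The function names and the literal key below are hypothetical; the sketch only models a provider who can list stored keys and hash guesses, but who does not hold the dedup key.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func plainKey(data []byte) string {
	sum := sha256.Sum256(data)
	return "chunk/" + hex.EncodeToString(sum[:])
}

func keyedKey(dedupKey, data []byte) string {
	mac := hmac.New(sha256.New, dedupKey)
	mac.Write(data)
	return "chunk/" + hex.EncodeToString(mac.Sum(nil))
}

// providerCanConfirm models the attack: hash a guessed plaintext with
// plain SHA-256 and look for that key among the stored object keys.
func providerCanConfirm(storedKeys map[string]bool, guess []byte) bool {
	return storedKeys[plainKey(guess)]
}

func main() {
	secret := []byte("password123")
	dedupKey := []byte("hypothetical-secret-dedup-key")

	unkeyedRepo := map[string]bool{plainKey(secret): true}
	keyedRepo := map[string]bool{keyedKey(dedupKey, secret): true}

	fmt.Println(providerCanConfirm(unkeyedRepo, secret)) // guess matches a stored key
	fmt.Println(providerCanConfirm(keyedRepo, secret))   // guess matches nothing
}
```

The same property explains why repositories with different dedup keys never share chunk addresses, even for identical data.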
Dedup Key Derivation
pkg/crypto/crypto.go:111-126 and docs/encryption.md:106-134
Integrity Verification
Content addressing provides built-in integrity checking:
- When you fetch chunk/a3f5b8c9..., you compute HMAC-SHA256(dedupKey, data)
- If the result doesn’t match a3f5b8c9..., the data is corrupted or tampered with
- The read fails immediately
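A read-side check along these lines might look like the following sketch. The names `verifyChunk` and `errCorrupt` are hypothetical; `hmac.Equal` from the standard library performs the comparison in constant time.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

var errCorrupt = errors.New("chunk hash mismatch: data corrupted or tampered with")

func chunkKey(dedupKey, data []byte) string {
	mac := hmac.New(sha256.New, dedupKey)
	mac.Write(data)
	return "chunk/" + hex.EncodeToString(mac.Sum(nil))
}

// verifyChunk recomputes HMAC-SHA256(dedupKey, data) and compares it with
// the hash embedded in the object key.
func verifyChunk(key string, dedupKey, data []byte) error {
	want, err := hex.DecodeString(strings.TrimPrefix(key, "chunk/"))
	if err != nil {
		return fmt.Errorf("malformed chunk key %q: %w", key, err)
	}
	mac := hmac.New(sha256.New, dedupKey)
	mac.Write(data)
	if !hmac.Equal(mac.Sum(nil), want) {
		return errCorrupt
	}
	return nil
}

func main() {
	dedupKey := []byte("hypothetical-dedup-key")
	data := []byte("chunk payload")
	key := chunkKey(dedupKey, data)

	fmt.Println(verifyChunk(key, dedupKey, data)) // intact data passes
	data[0] ^= 0xff                               // simulate a flipped byte
	fmt.Println(verifyChunk(key, dedupKey, data)) // corrupted data is rejected
}
```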
Small File Optimization
Very small files (< 4KB) are stored inline instead of chunked:
internal/core/models.go:26
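One way such a threshold could be applied is sketched below. The `content` struct, its field names, the 4096-byte reading of "4KB", the 1 MiB chunk size, and the choice to hold inline bytes in the manifest are all assumptions for illustration; the referenced model file defines the real layout.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const (
	inlineThreshold = 4 * 1024    // assumes "4KB" means 4096 bytes
	chunkSize       = 1024 * 1024 // assumes fixed 1 MiB chunks
)

// content is a hypothetical manifest that carries either inline bytes or
// a list of chunk keys, never both.
type content struct {
	Inline []byte
	Chunks []string
}

// buildContent keeps small payloads inline and chunks everything else.
func buildContent(data []byte) content {
	if len(data) < inlineThreshold {
		return content{Inline: data}
	}
	var c content
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[off:end])
		c.Chunks = append(c.Chunks, "chunk/"+hex.EncodeToString(sum[:]))
	}
	return c
}

func main() {
	small := buildContent(make([]byte, 100))
	large := buildContent(make([]byte, 3*1024*1024+1))
	fmt.Println(len(small.Inline), len(small.Chunks)) // inline, no chunk objects
	fmt.Println(len(large.Inline), len(large.Chunks)) // chunked, nothing inline
}
```

Inlining avoids a separate object write and lookup for payloads that are smaller than the overhead of tracking a chunk.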
Immutability Guarantees
Benefits:
- Concurrent backups are safe: Two backups can write the same chunk simultaneously; both succeed because the content is identical
- Crash safety: Partial writes create orphaned objects that don’t affect existing snapshots
- No delete-then-write races: Deduplication via Exists() check means chunks are never deleted during active backups
The index/ keys have fixed names rather than content hashes:
- index/latest - Points to the current snapshot
- index/snapshots - Catalog of all snapshots (self-healing)
- index/packs - Packfile catalog (rebuilt on load if corrupted)
docs/storage-model.md:1-105
Garbage Collection
Because objects are immutable and shared across snapshots, deletion requires mark-and-sweep garbage collection.
Prune Algorithm
docs/storage-model.md:64-78 and docs/spec.md:276-285
Prune requires an exclusive lock - no concurrent backups are allowed during garbage collection.
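The general shape of mark-and-sweep over a content-addressed object graph can be sketched as follows. This is a generic illustration, not the documented prune algorithm: the `refs` map is a hypothetical in-memory stand-in for decoding stored objects, the short keys are placeholders rather than real hashes, and locking is not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

// refs maps each object key to the keys it references.
type refs map[string][]string

// mark returns the set of keys reachable from the given roots.
func mark(graph refs, roots []string) map[string]bool {
	live := map[string]bool{}
	stack := append([]string(nil), roots...)
	for len(stack) > 0 {
		key := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if live[key] {
			continue // already visited through another parent
		}
		live[key] = true
		stack = append(stack, graph[key]...)
	}
	return live
}

// sweep deletes every object that was not marked and returns the deleted
// keys in sorted order.
func sweep(graph refs, live map[string]bool) []string {
	var deleted []string
	for key := range graph {
		if !live[key] {
			deleted = append(deleted, key)
		}
	}
	for _, key := range deleted {
		delete(graph, key)
	}
	sort.Strings(deleted)
	return deleted
}

func main() {
	graph := refs{
		"snapshot/s1": {"node/n1"},
		"snapshot/s2": {"node/n2"},
		"node/n1":     {"filemeta/f1"},
		"node/n2":     {"filemeta/f2"},
		"filemeta/f1": {"content/c1"},
		"filemeta/f2": {"content/c1"}, // shares content with f1
		"content/c1":  {"chunk/k1"},
		"chunk/k1":    nil,
		"chunk/orphan": nil, // left behind by an interrupted backup
	}
	// Keep only snapshot s2; everything reachable only from s1 is swept.
	deleted := sweep(graph, mark(graph, []string{"snapshot/s2"}))
	fmt.Println(deleted)
	fmt.Println(len(graph), "objects remain")
}
```

Shared objects such as `content/c1` survive because they are still reachable from the retained snapshot, which is why deletion cannot be done per snapshot without a reachability pass.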
Content Addressing Benefits Summary
| Benefit | How Content Addressing Enables It |
|---|---|
| Automatic deduplication | Same hash → stored once |
| Structural sharing | Merkle tree reuses unchanged subtrees by reference |
| Integrity verification | Hash mismatch = corruption detected |
| Immutability | Hash is derived from content → cannot change without invalidating refs |
| Crash safety | Append-only writes, orphaned objects are safe |
| Efficient snapshots | Only changed chunks/metadata are written |
| Multi-parent support | A file can have multiple parents without duplication |
| Efficient diff | Structural tree comparison (Merkle diff) |
Further Reading
- Snapshots - How content-addressed objects are assembled into point-in-time backups
- Encryption - How HMAC-keyed hashing prevents plaintext confirmation attacks
- Storage Model - How the object store manages content-addressed objects