Cloudstic is built on a content-addressable storage (CAS) model, where every object is identified by a cryptographic hash of its contents. This enables automatic deduplication, structural sharing across snapshots, and strong integrity guarantees.

What is Content Addressing?

In a content-addressable system, the name of an object is derived from its content, not chosen arbitrarily. This has powerful implications:
Same content → Same hash → Stored once: if two files (or file chunks) have identical content, they produce the same hash and are stored only once, regardless of their original names or locations.
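As a minimal sketch of this property (using plain SHA-256, the unencrypted addressing described later; the `ref` helper is hypothetical), identical bytes always map to the same address:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ref returns a content address for data: the hex-encoded SHA-256 of
// the bytes, prefixed with the object type.
func ref(data []byte) string {
	sum := sha256.Sum256(data)
	return "chunk/" + hex.EncodeToString(sum[:])
}

func main() {
	a := []byte("quarterly report, v3")
	b := []byte("quarterly report, v3") // same bytes, different "file"
	fmt.Println(ref(a) == ref(b))       // true: same content, same address
}
```

Because the name is a pure function of the bytes, a store keyed by these refs deduplicates automatically.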

Contrast with Traditional Storage

| Traditional (location-addressed) | Content-addressed |
| --- | --- |
| Files identified by path: /documents/report.pdf | Objects identified by hash: filemeta/4b8f1a... |
| Moving a file changes its identity | Moving a file doesn’t change its content hash |
| Duplicates stored separately | Duplicates automatically deduplicated |
| No integrity verification by default | Built-in corruption detection |

Object Addressing Scheme

All Cloudstic objects follow a flat namespace convention:
<type>/<hash>
Location: AGENTS.md:70-80
Examples:
  • chunk/a3f5b8c9d1e2f4a6... - A file data chunk
  • content/7d9e2c5f8b3a1d4e... - A content manifest
  • filemeta/4b8f1a2c3d5e6f7a... - File metadata
  • node/c9d2e5f7a8b1c3d4... - HAMT tree node
  • snapshot/6e3a9f1c4b7d2e5a... - A backup snapshot

Hash Functions by Object Type

Different object types use different hash functions for security and deduplication purposes:

Chunks: HMAC-SHA256 (when encrypted)

Chunks use keyed HMAC-SHA256 to prevent storage providers from confirming file contents by hashing known plaintext.
When encryption is enabled:
// Derive dedup key from encryption key
dedupKey := HKDF-SHA256(encryptionKey, info="cloudstic-dedup-mac-v1")

// Hash chunk data
chunkHash := HMAC-SHA256(dedupKey, chunkData)
chunkRef := "chunk/" + hex(chunkHash)
Location: pkg/crypto/crypto.go:124-133 and internal/engine/chunker.go:168-169
When encryption is disabled, plain SHA-256 is used:
chunkHash := SHA-256(chunkData)
chunkRef := "chunk/" + hex(chunkHash)
Location: internal/engine/chunker.go:171
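The keyed scheme can be sketched with the standard library (a simplified illustration; `dedupKey` here is a placeholder literal, whereas the real key is derived via HKDF from the encryption key):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkRef computes HMAC-SHA256 over the chunk bytes, keyed by the
// secret dedup key, and formats the result as a chunk reference.
func chunkRef(dedupKey, chunkData []byte) string {
	mac := hmac.New(sha256.New, dedupKey)
	mac.Write(chunkData)
	return "chunk/" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	keyA := []byte("tenant-a-dedup-key") // placeholder keys
	keyB := []byte("tenant-b-dedup-key")
	data := []byte("password123")

	// Same plaintext, different keys → different refs: a provider
	// without the key cannot reproduce the address from known plaintext.
	fmt.Println(chunkRef(keyA, data) != chunkRef(keyB, data)) // true
}
```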

All Other Objects: SHA-256

Metadata objects (content/, filemeta/, node/, snapshot/) use plain SHA-256 of their canonical JSON representation:
func ComputeJSONHash(obj interface{}) (string, []byte, error) {
    // Marshal to canonical JSON
    data, err := json.Marshal(obj)
    if err != nil {
        return "", nil, err
    }

    // Hash the JSON bytes
    hash := sha256.Sum256(data)

    return hex.EncodeToString(hash[:]), data, nil
}
Location: internal/core/models.go:45-51

Deduplication at Two Levels

Cloudstic performs deduplication at both the chunk level and the content level:

Chunk-Level Deduplication

Before writing a chunk, the engine checks if it already exists:
ref := "chunk/" + ComputeHMAC(dedupKey, chunkData)

exists, err := store.Exists(ctx, ref)
if exists {
    return ref, nil  // Skip upload, reuse existing chunk
}

store.Put(ctx, ref, chunkData)  // Upload new chunk
Location: internal/engine/chunker.go:166-186
This means:
  • Identical 1MB regions across different files share a single stored chunk
  • Within a single large file, any repeated 1MB segments are stored once
  • Across snapshots, unchanged chunks are never re-uploaded
Example: If you back up a 1GB database file and only 10MB changes, the next backup uploads only ~10 new chunks (10MB), not the entire 1GB.
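The Exists-then-Put flow above can be sketched against a hypothetical in-memory store (the real store is a cloud backend, and the real refs are keyed HMACs when encryption is on):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// memStore is a stand-in for the object store: a map keyed by ref.
type memStore struct{ objects map[string][]byte }

// putChunk uploads a chunk only if its content address is not already
// present, mirroring the engine's Exists check before Put.
func (s *memStore) putChunk(data []byte) (ref string, uploaded bool) {
	sum := sha256.Sum256(data)
	ref = "chunk/" + hex.EncodeToString(sum[:])
	if _, ok := s.objects[ref]; ok {
		return ref, false // already stored: skip upload, reuse
	}
	s.objects[ref] = data
	return ref, true
}

func main() {
	store := &memStore{objects: map[string][]byte{}}
	_, first := store.putChunk([]byte("1MB region"))
	_, second := store.putChunk([]byte("1MB region")) // identical chunk
	fmt.Println(first, second)                        // true false
}
```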

Content-Level Deduplication

Before chunking a file, the engine checks if the entire file’s content already exists:
contentHash := source.GetContentHash(file)  // e.g., Drive MD5
contentRef := "content/" + contentHash

exists, err := store.Exists(ctx, contentRef)
if exists {
    // Entire file already backed up - skip chunking/upload
    return contentRef, nil
}

// File is new - chunk and upload
chunks, size, hash := chunker.ProcessStream(file)
Location: internal/engine/backup.go (BackupManager)
This means:
  • A “new” file that’s identical to a previously backed-up file uploads zero bytes
  • Only a new filemeta object is created (a few hundred bytes)
  • The existing content and chunk objects are reused
Cross-Tenant Deduplication:
When encryption is enabled, each tenant has a unique dedup key, so cross-tenant deduplication does not occur. This is by design for privacy.
Location: docs/encryption.md:157-170

The Content Addressing Pipeline

Here’s how a file flows through the content-addressing pipeline during backup:
┌─────────────────────┐
│   File: report.pdf  │
│   Size: 3.5 MB      │
└──────────┬──────────┘


┌─────────────────────────────────────────────┐
│ FastCDC Chunker                             │
│ • Split at content-defined boundaries       │
│ • Min: 512KB, Avg: 1MB, Max: 8MB            │
└──────────┬──────────────────────────────────┘


  ┌────────────────────────────────────┐
  │ Chunk 1: 1.2 MB                    │
  │ Chunk 2: 1.5 MB                    │
  │ Chunk 3: 0.8 MB                    │
  └────────┬───────────────────────────┘


┌─────────────────────────────────────────────┐
│ Hash each chunk                             │
│ • If encrypted: HMAC-SHA256(dedupKey, data) │
│ • If not: SHA-256(data)                     │
└──────────┬──────────────────────────────────┘


  ┌────────────────────────────────────┐
  │ chunk/a3f5b8c9...                  │
  │ chunk/7d9e2c5f...                  │
  │ chunk/4b8f1a2c...                  │
  └────────┬───────────────────────────┘


┌─────────────────────────────────────────────┐
│ Check existence: store.Exists(ref)          │
│ • Exists → skip upload, reuse               │
│ • New → compress with zstd                  │
│        → encrypt with AES-256-GCM           │
│        → upload to store                    │
└──────────┬──────────────────────────────────┘


┌─────────────────────────────────────────────┐
│ Create Content object                       │
│ {                                           │
│   "type": "content",                        │
│   "size": 3500000,                          │
│   "chunks": [                               │
│     "chunk/a3f5b8c9...",                    │
│     "chunk/7d9e2c5f...",                    │
│     "chunk/4b8f1a2c..."                     │
│   ]                                         │
│ }                                           │
│                                             │
│ Hash: SHA-256(canonical_json)               │
│ Ref: content/e8f3d1a4...                    │
└──────────┬──────────────────────────────────┘


┌─────────────────────────────────────────────┐
│ Create FileMeta object                      │
│ {                                           │
│   "fileId": "1A2B3C4D",                     │
│   "name": "report.pdf",                     │
│   "type": "file",                           │
│   "content_hash": "e8f3d1a4...",            │
│   "size": 3500000,                          │
│   "mtime": 1710000000                       │
│ }                                           │
│                                             │
│ Hash: SHA-256(canonical_json)               │
│ Ref: filemeta/9c2e5f7a...                   │
└──────────┬──────────────────────────────────┘


┌─────────────────────────────────────────────┐
│ Insert into HAMT                            │
│ key: "1A2B3C4D"  (fileId)                   │
│ value: "filemeta/9c2e5f7a..."               │
└─────────────────────────────────────────────┘
Location: internal/engine/chunker.go:46-164

Structural Sharing via Merkle Tree

The HAMT (Hash Array Mapped Trie) is a Merkle tree, where:
  • Each node’s hash depends on its children’s hashes
  • Only nodes along the path of a change need to be rewritten
  • Unchanged subtrees are reused by reference
Merkle trees enable structural sharing: two snapshots that differ in only 10 files share 99%+ of their tree nodes.

Example: Modifying One File

Snapshot 1                    Snapshot 2
    root_old                      root_new
    /  |  \                       /  |  \
   A   B   C      ──modify─→    A   B'  C
      /|\                           /|\
     D E F                         D E' F
When a single file changes:
  1. A new filemeta object is created (different content hash)
  2. The HAMT leaf containing that file is rewritten
  3. All parent nodes up to the root are rewritten (hashes change)
  4. All other nodes (A, C, D, F) are reused by reference
For a 1 million file backup with 100 changed files:
  • 100 new filemeta objects (~10KB each = 1MB)
  • ~20 new HAMT nodes (changed leaves + ancestors = ~5KB)
  • Total metadata overhead: ~1MB instead of re-writing the entire tree
Location: internal/hamt/hamt.go:183-303 and docs/spec.md:306-320
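The Merkle property behind this sharing can be sketched as follows (a simplification: real HAMT nodes carry bitmaps and typed entries, not just a list of child refs):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// nodeRef derives a node's address from its children's addresses, so
// changing one child changes only the refs on the path to the root.
func nodeRef(childRefs ...string) string {
	h := sha256.New()
	for _, c := range childRefs {
		h.Write([]byte(c))
	}
	return "node/" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	d, e, e2 := "filemeta/d1...", "filemeta/e1...", "filemeta/e2..."
	b := nodeRef(d, e)       // snapshot 1: B = H(D, E)
	bPrime := nodeRef(d, e2) // snapshot 2: only E changed to E'
	fmt.Println(b != bPrime) // true: a parent's ref changes with its children
}
```

Node D itself is untouched, so both snapshots reference the identical stored object — that reuse is the structural sharing.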

Content Addressing and Security

Preventing Plaintext Hash Leakage

If chunks were addressed by plain SHA-256, a malicious storage provider could:
  1. Hash known plaintext (e.g., “password123”)
  2. Check if chunk/<hash> exists in your backup
  3. Confirm you have that content without decrypting anything
This is called a “confirmation-of-a-file” attack.
Cloudstic prevents this by using HMAC-SHA256 keyed by a secret dedup key derived from your encryption key.
Without the dedup key, the provider cannot reproduce your chunk references, even if they have the plaintext.
Location: docs/encryption.md:150-155

Dedup Key Derivation

master_key (256-bit random)

    HKDF-SHA256(info="cloudstic-backup-v1")

encryption_key (256-bit for AES-256-GCM)

    HKDF-SHA256(info="cloudstic-dedup-mac-v1")

dedup_key (256-bit for HMAC-SHA256)
Location: pkg/crypto/crypto.go:111-126 and docs/encryption.md:106-134

Integrity Verification

Content addressing provides built-in integrity checking:
  1. When you fetch chunk/a3f5b8c9..., you compute HMAC-SHA256(dedupKey, data)
  2. If the result doesn’t match a3f5b8c9..., the data is corrupted or tampered with
  3. The read fails immediately
No separate checksums or signatures are needed - the addressing scheme is the integrity check.

Small File Optimization

Very small files (< 4KB) are stored inline instead of chunked:
{
  "type": "content",
  "size": 2048,
  "data_inline_b64": "SGVsbG8gd29ybGQh..."
}
This avoids the overhead of creating a separate chunk object for tiny files.
Location: internal/core/models.go:26

Immutability Guarantees

All content-addressed objects are write-once, read-many. Once chunk/a3f5b8c9... is written, it never changes.
Benefits:
  • Concurrent backups are safe: Two backups can write the same chunk simultaneously - they’ll both succeed because the content is identical
  • Crash safety: Partial writes create orphaned objects that don’t affect existing snapshots
  • No delete-then-write races: Deduplication via Exists() check means chunks are never deleted during active backups
The only mutable objects are:
  • index/latest - Points to the current snapshot
  • index/snapshots - Catalog of all snapshots (self-healing)
  • index/packs - Packfile catalog (rebuilt on load if corrupted)
Location: docs/storage-model.md:1-105

Garbage Collection

Because objects are immutable and shared across snapshots, deletion requires mark-and-sweep garbage collection:

Prune Algorithm

1. Mark Phase
   • List all snapshot/* keys
   • For each snapshot:
     - Walk HAMT nodes (node/*)
     - Collect filemeta/* refs
     - Collect content/* refs
     - Collect chunk/* refs
   • Result: set of all reachable keys

2. Sweep Phase
   • List all keys under chunk/, content/, filemeta/, node/
   • Delete any key NOT in the reachable set

3. Repack Phase (if packfiles enabled)
   • Identify fragmented packs (>30% wasted space)
   • Extract live objects
   • Re-bundle into new packs
   • Delete old packs
Location: docs/storage-model.md:64-78 and docs/spec.md:276-285
Prune requires an exclusive lock - no concurrent backups are allowed during garbage collection.
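The mark and sweep phases reduce to reachability over the object graph. A minimal sketch, with a `refs` map standing in for walking HAMT nodes, filemeta, content, and chunk objects:

```go
package main

import "fmt"

// markAndSweep marks everything reachable from the snapshot roots,
// then reports every unreachable key for deletion.
func markAndSweep(roots []string, refs map[string][]string) (live map[string]bool, deleted []string) {
	live = map[string]bool{}
	var mark func(string)
	mark = func(key string) {
		if live[key] {
			return // already visited: shared subtrees are walked once
		}
		live[key] = true
		for _, child := range refs[key] {
			mark(child)
		}
	}
	for _, r := range roots {
		mark(r) // mark phase
	}
	for key := range refs {
		if !live[key] {
			deleted = append(deleted, key) // sweep phase
		}
	}
	return live, deleted
}

func main() {
	refs := map[string][]string{
		"snapshot/s1":  {"node/root"},
		"node/root":    {"filemeta/a"},
		"filemeta/a":   {"content/a"},
		"content/a":    {"chunk/a1"},
		"chunk/a1":     nil,
		"chunk/orphan": nil, // unreachable: swept
	}
	_, deleted := markAndSweep([]string{"snapshot/s1"}, refs)
	fmt.Println(deleted) // [chunk/orphan]
}
```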

Content Addressing Benefits Summary

| Benefit | How Content Addressing Enables It |
| --- | --- |
| Automatic deduplication | Same hash → stored once |
| Structural sharing | Merkle tree reuses unchanged subtrees by reference |
| Integrity verification | Hash mismatch = corruption detected |
| Immutability | Hash is derived from content → cannot change without invalidating refs |
| Crash safety | Append-only writes; orphaned objects are safe |
| Efficient snapshots | Only changed chunks/metadata are written |
| Multi-parent support | A file can have multiple parents without duplication |
| Efficient diff | Structural tree comparison (Merkle diff) |

Further Reading

  • Snapshots - How content-addressed objects are assembled into point-in-time backups
  • Encryption - How HMAC-keyed hashing prevents plaintext confirmation attacks
  • Storage Model - How the object store manages content-addressed objects