Snapshot Object
A snapshot is a JSON object stored at `snapshot/<sha256>`:
internal/core/models.go:78-89
Fields
| Field | Description |
|---|---|
| `version` | Snapshot format version (currently 1) |
| `created` | ISO 8601 timestamp |
| `root` | Reference to the HAMT root node (`node/<hash>`) |
| `seq` | Monotonically increasing sequence number |
| `source` | Origin of the backup (type, account, path) |
| `meta` | Free-form metadata (generator, tags, etc.) |
| `tags` | User-defined labels for retention policies |
| `change_token` | Opaque token for incremental sources (optional) |
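Putting the fields together, a snapshot object might look like the following. This is an illustration only - the hashes and values are invented, and the exact JSON key names should be checked against internal/core/models.go:78-89:

```json
{
  "version": 1,
  "created": "2024-05-01T12:00:00Z",
  "root": "node/3f2a...",
  "seq": 42,
  "source": { "type": "gdrive", "account": "user@example.com", "path": "/" },
  "meta": { "generator": "cloudstic" },
  "tags": ["daily"],
  "change_token": "opaque-token-from-source"
}
```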
docs/spec.md:174-199
Every snapshot is a complete checkpoint - you can restore any snapshot without accessing previous backups.
The HAMT Tree
The heart of a snapshot is the HAMT (Hash Array Mapped Trie) - a persistent Merkle tree structure that maps file IDs to their metadata references.
What is a HAMT?
A HAMT is a data structure that combines:
- Hash table performance (O(log₃₂ n) lookup)
- Persistent updates (old versions remain valid)
- Merkle tree properties (structural sharing via content addressing)
internal/hamt/hamt.go:1-602
Structure
Cloudstic uses a 32-way branching HAMT:
- Bits per level: 5
- Branching factor: 32
- Max leaf size: 32 entries
- Max depth: 6 levels
internal/hamt/hamt.go:14-18
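These parameters are related: 5 bits per level index one of 2⁵ = 32 children, and 6 levels consume 30 bits of the hashed key. A minimal sketch (constant names are illustrative; the real ones are in internal/hamt/hamt.go:14-18):

```go
package main

import "fmt"

// Constants mirroring the HAMT geometry described above.
const (
	bitsPerLevel    = 5
	branchingFactor = 1 << bitsPerLevel // 2^5 = 32 children per node
	maxDepth        = 6
)

func main() {
	// 6 levels of 5 bits each consume 30 bits of the hashed key,
	// addressing up to 32^6 (about one billion) leaf slots.
	fmt.Println(branchingFactor)         // 32
	fmt.Println(bitsPerLevel * maxDepth) // 30
}
```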
Node Types
Internal Node
- `bitmap` (uint32) - Bit vector indicating which child slots are populated (popcount compression)
- `children` - Array of `node/<hash>` references
internal/core/models.go:55-60 and docs/spec.md:141-153
The bitmap is a compact representation of which children exist. Bit i set means child slot i is populated. The children array contains only the populated slots (no null entries).
Leaf Node
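The popcount trick works like this: to find where slot i's child lives in the compact array, count the set bits in the bitmap below bit i. A sketch of this standard HAMT technique (the real logic lives in internal/hamt):

```go
package main

import (
	"fmt"
	"math/bits"
)

// hasChild reports whether slot is populated in this node.
func hasChild(bitmap uint32, slot uint) bool {
	return bitmap&(1<<slot) != 0
}

// childIndex returns the position of slot's child in the compact
// children array: the number of set bits strictly below slot.
func childIndex(bitmap uint32, slot uint) int {
	return bits.OnesCount32(bitmap & ((1 << slot) - 1))
}

func main() {
	// Slots 1, 3, and 7 are populated; the children array has 3 entries.
	var bitmap uint32 = 1<<1 | 1<<3 | 1<<7
	fmt.Println(hasChild(bitmap, 3))   // true
	fmt.Println(childIndex(bitmap, 3)) // 1: only slot 1 is below it
	fmt.Println(childIndex(bitmap, 7)) // 2: slots 1 and 3 are below it
}
```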
- `entries` - Array of (key, filemeta ref) pairs
- key: File ID from the source (Google Drive file ID, relative path, etc.)
- filemeta: Reference to the file’s metadata object
internal/core/models.go:62-66 and docs/spec.md:154-167
HAMT Operations
The HAMT supports purely functional operations:
internal/hamt/hamt.go:41-107
Structural Sharing
The HAMT is a Merkle tree: each node’s hash depends on its children’s hashes. When you modify one entry, only nodes along the path from that entry to the root need to be rewritten.
Example: Modifying One File
- 1 leaf node (`E'` contains the updated filemeta ref)
- 2 internal nodes (`B'` and `root_b` have different hashes due to child changes)
- Nodes A, C, D, F are reused by reference (same hashes)
- 1 new `filemeta` object (~300 bytes)
- 3 new `node` objects (~150 bytes each)
- Total: ~750 bytes for a metadata-only change in a 1M-file backup
docs/spec.md:306-320
For a 1 million file backup where 100 files change:
- 100 new filemeta objects (~30KB)
- ~20 new HAMT nodes (~5KB)
- Total metadata: ~35KB instead of re-writing the entire tree
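Path copying is what makes this cheap: an update allocates new nodes only along the root-to-leaf path, while every untouched subtree is shared by reference (by hash, in the real HAMT). A toy illustration, not the Cloudstic types:

```go
package main

import "fmt"

// A toy persistent n-ary node for demonstrating structural sharing.
type node struct {
	children []*node
	value    string
}

// set returns a NEW root; only nodes on the path to the updated leaf
// are re-allocated, and untouched subtrees are reused as-is.
func set(n *node, path []int, value string) *node {
	if len(path) == 0 {
		return &node{value: value}
	}
	copied := append([]*node(nil), n.children...)
	copied[path[0]] = set(n.children[path[0]], path[1:], value)
	return &node{children: copied}
}

func main() {
	leafA, leafB := &node{value: "a"}, &node{value: "b"}
	root := &node{children: []*node{leafA, leafB}}

	root2 := set(root, []int{1}, "b'")

	fmt.Println(root2.children[0] == leafA) // true: subtree shared
	fmt.Println(root2.children[1] == leafB) // false: path was rewritten
	fmt.Println(root.children[1].value)     // "b": old version still valid
}
```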
Path Key Computation
To distribute keys evenly across the HAMT, the file ID is hashed:
internal/hamt/hamt.go:144-146
Navigating the Tree
At each level, 5 bits of the path key determine which child to follow:
internal/hamt/hamt.go:148-162
Example:
- At root (level 0): follow child 14
- At level 1: follow child 31
- At level 2: follow child 11
- At level 3: reach leaf with matching entries
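The walk above can be sketched as masking out 5 bits per level. This sketch assumes low-order bits are consumed first and uses an illustrative key value; the real bit order and key derivation are in internal/hamt/hamt.go:148-162:

```go
package main

import "fmt"

// slotAt extracts the 5-bit child index for a given level from a
// 32-bit path key (assumption: lowest 5 bits correspond to level 0).
func slotAt(key uint32, level uint) uint {
	return uint(key>>(level*5)) & 0x1F
}

func main() {
	// A key constructed so the first three levels yield 14, 31, 11,
	// matching the worked example above. (Illustrative value only;
	// real keys come from hashing the file ID.)
	var key uint32 = 14 | 31<<5 | 11<<10
	for level := uint(0); level < 3; level++ {
		fmt.Println(slotAt(key, level))
	}
}
```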
TransactionalStore
During a backup, the HAMT tree is updated incrementally as files are processed. To avoid writing intermediate nodes that may be superseded, Cloudstic uses a TransactionalStore.
Workflow
internal/hamt/cache.go
Only nodes reachable from the final root are flushed to the object store. Intermediate nodes that were superseded during the backup are discarded.
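The idea can be sketched as a staging layer: writes are buffered in memory, and a final flush walks from the chosen root and persists only what it reaches. The types and method names below are illustrative stand-ins, not the real internal/hamt/cache.go API:

```go
package main

import "fmt"

// stagingStore buffers node writes during a backup and flushes only
// nodes reachable from the final root.
type stagingStore struct {
	staged  map[string][]byte   // hash -> encoded node
	links   map[string][]string // hash -> child hashes
	flushed map[string][]byte   // stand-in for the object store
}

func newStagingStore() *stagingStore {
	return &stagingStore{
		staged:  map[string][]byte{},
		links:   map[string][]string{},
		flushed: map[string][]byte{},
	}
}

func (s *stagingStore) Put(hash string, data []byte, children []string) {
	s.staged[hash] = data
	s.links[hash] = children
}

// Flush persists the subtree rooted at root; superseded intermediate
// nodes that are no longer reachable are simply dropped.
func (s *stagingStore) Flush(root string) {
	if data, ok := s.staged[root]; ok {
		s.flushed[root] = data
		for _, child := range s.links[root] {
			s.Flush(child)
		}
	}
}

func main() {
	s := newStagingStore()
	s.Put("leaf1", []byte("L1"), nil)
	s.Put("rootV1", []byte("R1"), []string{"leaf1"}) // superseded mid-backup
	s.Put("leaf2", []byte("L2"), nil)
	s.Put("rootV2", []byte("R2"), []string{"leaf1", "leaf2"})

	s.Flush("rootV2")
	_, oldRootWritten := s.flushed["rootV1"]
	fmt.Println(len(s.flushed), oldRootWritten) // rootV1 was discarded
}
```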
FileMeta Structure
Each HAMT entry points to a `filemeta/<hash>` object:
internal/core/models.go:29-43 and docs/spec.md:103-130
Fields
| Field | Description |
|---|---|
| `fileId` | Source-specific unique identifier (HAMT key) |
| `name` | File name |
| `type` | "file" or "folder" |
| `parents` | Array of `filemeta/<hash>` refs (NOT raw file IDs) |
| `content_hash` | SHA-256 of raw file content (keys the Content object) |
| `size` | File size in bytes (0 for folders) |
| `mtime` | Modification time (Unix timestamp) |
| `owner` | Owner email or username |
| `extra` | Source-specific metadata (MIME type, permissions, etc.) |
Multi-parent support: A file can have multiple parents (Google Drive allows this). The `parents` array contains references to parent `filemeta` objects, not raw file IDs.
Content Objects
The `content_hash` field in a FileMeta points to a `content/<hash>` object:
internal/core/models.go:20-27 and docs/spec.md:84-100
Chunk Objects
Each chunk reference points to a `chunk/<hash>` object containing raw file data:
- FastCDC boundaries: 512KB min, 1MB avg, 8MB max
- Deduplicated by HMAC-SHA256 (encrypted) or SHA-256 (unencrypted)
- Compressed with zstd
- Encrypted with AES-256-GCM (if encryption enabled)
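The per-chunk pipeline can be sketched with the Go standard library: derive the chunk's identity (HMAC-SHA256 when encrypted, plain SHA-256 otherwise), then seal it with AES-256-GCM. FastCDC boundary detection and zstd compression require third-party libraries and are elided here, and key handling is simplified to a placeholder:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// chunkID derives the dedup key for a chunk: keyed HMAC-SHA256 in the
// encrypted case, plain SHA-256 otherwise.
func chunkID(data, macKey []byte) string {
	if macKey != nil {
		m := hmac.New(sha256.New, macKey)
		m.Write(data)
		return fmt.Sprintf("chunk/%x", m.Sum(nil))
	}
	return fmt.Sprintf("chunk/%x", sha256.Sum256(data))
}

// sealChunk encrypts a (already compressed, in the real pipeline)
// chunk with AES-256-GCM, prepending the nonce for later decryption.
func sealChunk(data, aesKey []byte) ([]byte, error) {
	block, err := aes.NewCipher(aesKey) // 32-byte key -> AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, data, nil), nil
}

func main() {
	data := []byte("raw chunk bytes")
	key := make([]byte, 32) // placeholder key; real keys come from the repo config
	fmt.Println(chunkID(data, nil)[:6]) // "chunk/"
	sealed, err := sealChunk(data, key)
	fmt.Println(err == nil, len(sealed) > len(data)) // nonce + ciphertext + tag
}
```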
docs/spec.md:65-82
Complete Snapshot Hierarchy
Here’s the full object graph for a snapshot:
- 1 snapshot (~500 bytes)
- ~100,000 HAMT nodes (~500KB compressed)
- ~1,000,000 filemeta objects (~100MB compressed)
- ~50,000 content objects (~5MB)
- ~1,000,000 chunks (~1TB raw data, ~800GB compressed+encrypted)
Snapshot Discovery
Snapshots are discovered via two mechanisms:
1. Latest Pointer
internal/core/models.go:92-96
This mutable pointer provides fast access to the most recent snapshot.
2. Snapshot Catalog
internal/core/models.go:98-110 and docs/spec.md:223
The catalog contains lightweight summaries of all snapshots, avoiding the need to fetch each snapshot object individually during listing operations.
Change Tokens (Incremental Sources)
For sources that support incremental scans (e.g., Google Drive Changes API), the snapshot stores an opaque change token:
docs/spec.md:203-212
On the next backup:
- Read the previous snapshot’s `change_token`
- Pass it to the source’s `WalkChanges()` method
- Source returns only files changed since that token
- New snapshot stores the updated token
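The loop above can be sketched as follows. `Change`, `Source`, and the fake driver are hypothetical stand-ins for illustration; only the `WalkChanges` name and the token flow come from this document:

```go
package main

import "fmt"

// Change is a hypothetical record of one changed file.
type Change struct {
	FileID  string
	Removed bool
}

// Source is a hypothetical incremental-capable backup source.
type Source interface {
	// WalkChanges yields changes since token and returns the new token.
	WalkChanges(token string) ([]Change, string)
}

// fakeDrive pretends one file changed since the previous token.
type fakeDrive struct{}

func (fakeDrive) WalkChanges(token string) ([]Change, string) {
	return []Change{{FileID: "file-42"}}, token + "+1"
}

func main() {
	prevToken := "tok-a" // read from the previous snapshot's change_token
	changes, newToken := fakeDrive{}.WalkChanges(prevToken)
	for _, c := range changes {
		fmt.Println("re-scan:", c.FileID) // only changed files are visited
	}
	fmt.Println("store in new snapshot:", newToken)
}
```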
Snapshot Operations
Listing Snapshots
cmd/cloudstic/main.go (runList)
Inspecting a Snapshot
cmd/cloudstic/main.go (runLs)
Comparing Snapshots
internal/hamt/hamt.go:70-83 (Diff method)
The `Diff` operation leverages the HAMT structure for efficient comparison - it only traverses subtrees where hashes differ.
Restoring a Snapshot
internal/engine/restore.go
Deleting a Snapshot
Pass `--prune` to also run garbage collection and reclaim space from unreferenced objects.
Location: internal/engine/forget.go and docs/spec.md:276-279
Garbage Collection
Snapshots share objects (chunks, content, filemeta, nodes) via content addressing. Deleting a snapshot doesn’t immediately reclaim space - you need to run prune:
docs/storage-model.md:64-78 and docs/spec.md:280-285
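Prune behaves like mark-and-sweep over the object graph: mark everything reachable from the remaining snapshot roots, then sweep the rest. A toy sketch of that idea, not the actual prune implementation:

```go
package main

import "fmt"

// mark records every object reachable from root in live.
func mark(refs map[string][]string, root string, live map[string]bool) {
	if live[root] {
		return
	}
	live[root] = true
	for _, child := range refs[root] {
		mark(refs, child, live)
	}
}

func main() {
	// object -> outgoing references (toy graph)
	refs := map[string][]string{
		"snapshot/s2": {"node/r2"},
		"node/r2":     {"filemeta/f1"},
		"node/r1":     {"filemeta/f0"}, // only referenced by a deleted snapshot
	}

	live := map[string]bool{}
	mark(refs, "snapshot/s2", live) // s2 is the only remaining root

	for _, obj := range []string{"node/r1", "filemeta/f0"} {
		if !live[obj] {
			fmt.Println("sweep:", obj) // unreferenced objects are reclaimed
		}
	}
}
```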
Source Identity
The `source` field enables multi-source repositories:
internal/core/models.go:68-75 and docs/spec.md:192-196
Snapshots with different source identities are treated independently:
- Retention policies can target specific sources
- Incremental backups look for the previous snapshot with the same source identity
- Different Google accounts in the same repo won’t interfere
Further Reading
- Content Addressing - How objects are identified and deduplicated
- Architecture - How snapshots are created during backup
- Encryption - How snapshot data is encrypted at rest