A snapshot in Cloudstic is a complete, immutable, point-in-time representation of your backup. Unlike incremental backup systems that require replaying deltas, every Cloudstic snapshot is a full checkpoint that can be restored independently.

Snapshot Object

A snapshot is a JSON object stored at snapshot/<sha256>:
{
  "version": 1,
  "created": "2025-12-01T12:00:00Z",
  "root": "node/c9d2e5f7a8b1c3d4...",
  "seq": 42,
  "source": {
    "type": "gdrive",
    "account": "user@gmail.com",
    "path": "my-drive://"
  },
  "meta": {
    "generator": "cloudstic-cli"
  },
  "tags": ["daily", "important"],
  "change_token": "12345"
}
Location: internal/core/models.go:78-89

Fields

Field         Description
version       Snapshot format version (currently 1)
created       ISO 8601 timestamp
root          Reference to the HAMT root node (node/<hash>)
seq           Monotonically increasing sequence number
source        Origin of the backup (type, account, path)
meta          Free-form metadata (generator, tags, etc.)
tags          User-defined labels for retention policies
change_token  Opaque token for incremental sources (optional)
Location: docs/spec.md:174-199
Every snapshot is a complete checkpoint - you can restore any snapshot without accessing previous backups.

The HAMT Tree

The heart of a snapshot is the HAMT (Hash Array Mapped Trie) - a persistent Merkle tree structure that maps file IDs to their metadata references.

What is a HAMT?

A HAMT is a data structure that combines:
  • Hash table performance (O(log₃₂ n) lookup)
  • Persistent updates (old versions remain valid)
  • Merkle tree properties (structural sharing via content addressing)
Location: internal/hamt/hamt.go:1-602

Structure

Cloudstic uses a 32-way branching HAMT:
5 bits per level → 2⁵ = 32 children per node
Configuration:
  • Bits per level: 5
  • Branching factor: 32
  • Max leaf size: 32 entries
  • Max depth: 6 levels
Location: internal/hamt/hamt.go:14-18
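The shape parameters follow directly from the bits-per-level choice. A quick arithmetic check (the capacity figure is illustrative, not taken from the spec):

```go
package main

import "fmt"

// leafSlots returns the number of leaf positions in a trie with the
// given bits-per-level and depth: (2^bits)^depth.
func leafSlots(bitsPerLevel, maxDepth int) uint64 {
	n := uint64(1)
	for i := 0; i < maxDepth; i++ {
		n *= uint64(1) << bitsPerLevel
	}
	return n
}

func main() {
	slots := leafSlots(5, 6)
	fmt.Println("branching:", 1<<5)       // 32 children per internal node
	fmt.Println("leaf slots:", slots)     // 32^6 = 1073741824
	fmt.Println("max entries:", slots*32) // with 32-entry leaves: 34359738368
}
```

So six levels of 32-way branching, with 32-entry leaves, comfortably address billions of files.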

Node Types

Internal Node

{
  "type": "internal",
  "bitmap": 2348810305,
  "children": [
    "node/a1b2c3d4...",
    "node/e5f6g7h8..."
  ]
}
Fields:
  • bitmap (uint32) - Bit vector indicating which child slots are populated (popcount compression)
  • children - Array of node/<hash> references
Location: internal/core/models.go:55-60 and docs/spec.md:141-153
The bitmap is a compact representation of which children exist. Bit i set means child slot i is populated. The children array contains only the populated slots (no null entries).
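Locating a child in the compact array uses the standard popcount trick: the index of slot i is the number of set bits below bit i. A small illustrative sketch (not the project's actual code):

```go
package main

import (
	"fmt"
	"math/bits"
)

// childIndex maps a logical slot (0-31) to its position in the packed
// children array: count the populated slots that precede it.
func childIndex(bitmap uint32, slot uint) int {
	return bits.OnesCount32(bitmap & (1<<slot - 1))
}

func main() {
	// Hypothetical bitmap with slots 3, 14, and 31 populated.
	var bm uint32 = 1<<3 | 1<<14 | 1<<31
	fmt.Println(childIndex(bm, 3))  // 0: first populated slot
	fmt.Println(childIndex(bm, 14)) // 1: one populated slot (3) precedes it
	fmt.Println(childIndex(bm, 31)) // 2: slots 3 and 14 precede it
}
```

This is why the children array needs no null entries: the bitmap plus a popcount fully determines each child's position.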

Leaf Node

{
  "type": "leaf",
  "entries": [
    {
      "key": "1A2B3C4D",
      "filemeta": "filemeta/9c2e5f7a..."
    },
    {
      "key": "5E6F7G8H",
      "filemeta": "filemeta/3d4e5f6a..."
    }
  ]
}
Fields:
  • entries - Array of (key, filemeta ref) pairs
    • key: File ID from the source (Google Drive file ID, relative path, etc.)
    • filemeta: Reference to the file’s metadata object
Location: internal/core/models.go:62-66 and docs/spec.md:154-167

HAMT Operations

The HAMT supports purely functional operations:
type Tree struct {
    store ObjectStore
}

// Insert or update an entry, return new root
func (t *Tree) Insert(root, key, value string) (string, error)

// Look up a key, return its value
func (t *Tree) Lookup(root, key string) (string, error)

// Walk all entries
func (t *Tree) Walk(root string, fn func(key, value string) error) error

// Compare two trees
func (t *Tree) Diff(root1, root2 string, fn func(DiffEntry) error) error

// Collect all node refs (for GC)
func (t *Tree) NodeRefs(root string, fn func(ref string) error) error
Location: internal/hamt/hamt.go:41-107
All mutations are purely functional - Insert() returns a new root reference while the old root remains valid. This enables structural sharing between snapshots.

Structural Sharing

The HAMT is a Merkle tree: each node’s hash depends on its children’s hashes. When you modify one entry, only nodes along the path from that entry to the root need to be rewritten.

Example: Modifying One File

Before (snapshot 1)          After (snapshot 2)

      root_a                       root_b
      /  |  \                      /  |  \
     A   B   C                    A   B'  C
        /|\                          /|\
       D E F                        D E' F
                                       |
                                   (modified leaf)
What changed:
  • 1 leaf node (E' contains the updated filemeta ref)
  • 2 internal nodes (B' and root_b have different hashes due to child changes)
  • Nodes A, C, D, F are reused by reference (same hashes)
Storage cost:
  • 1 new filemeta object (~300 bytes)
  • 3 new node objects (~150 bytes each)
  • Total: ~750 bytes for a metadata-only change in a 1M-file backup
Location: docs/spec.md:306-320
For a 1 million file backup where 100 files change:
  • 100 new filemeta objects (~30KB)
  • ~20 new HAMT nodes (~5KB)
  • Total metadata: ~35KB instead of re-writing the entire tree
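The economics above can be reproduced with a toy content-addressed store: rebuild the snapshot-1/snapshot-2 figure and count how many new objects the second snapshot writes. Everything here is illustrative, not Cloudstic's API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// store is a toy content-addressed object store: key = hash of value.
var store = map[string]string{}

func put(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	key := hex.EncodeToString(sum[:4])
	store[key] = payload
	return key
}

func leaf(v string) string           { return put("leaf:" + v) }
func node(children ...string) string { return put("node:" + strings.Join(children, ",")) }

func main() {
	// Snapshot 1: root_a -> (A, B, C); B -> (D, E, F)
	d, e, f := leaf("D"), leaf("E"), leaf("F")
	a, c := leaf("A"), leaf("C")
	rootA := node(a, node(d, e, f), c)
	before := len(store)

	// Snapshot 2: modify E only. Unchanged nodes hash identically
	// and are deduplicated by the store.
	rootB := node(a, node(d, leaf("E'"), f), c)

	fmt.Println("new objects:", len(store)-before)                          // 3: E', B', root_b
	fmt.Println("old root still valid:", store[rootA] != "", rootA != rootB) // true true
}
```

Only the modified leaf and the two nodes on its path to the root are new; everything else is shared by reference, exactly as in the diagram.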

Path Key Computation

To distribute keys evenly across the HAMT, the file ID is hashed:
pathKey := SHA-256(fileId)  // 256-bit hash
Location: internal/hamt/hamt.go:144-146
At each level, 5 bits of the path key determine which child to follow:
Level 0: bits 0-4   → child index 0-31
Level 1: bits 5-9   → child index 0-31
Level 2: bits 10-14 → child index 0-31
...
Location: internal/hamt/hamt.go:148-162
Example:
pathKey = 0x3a7f5b...
Binary:   00111 01001 11111 ...
          └───┘ └───┘ └───┘
          Lvl 0 Lvl 1 Lvl 2
          (7)   (9)   (31)
  • At root (level 0): follow child 7
  • At level 1: follow child 9
  • At level 2: follow child 31
  • At deeper levels: continue until reaching a leaf with matching entries
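The index computation can be sketched directly: hash the file ID, then read consecutive 5-bit groups starting from the most significant bit. The helper name here is hypothetical:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// childSlot extracts bits [5*level, 5*level+5) of the path key,
// counting bit 0 as the most significant bit of the first byte.
func childSlot(pathKey []byte, level int) int {
	slot := 0
	for i := 0; i < 5; i++ {
		bit := level*5 + i
		slot = slot<<1 | int(pathKey[bit/8]>>(7-uint(bit)%8)&1)
	}
	return slot
}

func main() {
	sum := sha256.Sum256([]byte("1A2B3C4D")) // file ID from the example leaf
	for level := 0; level < 4; level++ {
		fmt.Printf("level %d -> child %d\n", level, childSlot(sum[:], level))
	}
}
```

Because SHA-256 output is uniform, the 5-bit groups spread keys evenly over the 32 child slots at every level.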

TransactionalStore

During a backup, the HAMT tree is updated incrementally as files are processed. To avoid writing intermediate nodes that may be superseded, Cloudstic uses a TransactionalStore.

Workflow

// 1. Create transactional store (in-memory buffer)
txStore := NewTransactionalStore(objectStore)
tree := NewTree(txStore)

// 2. Update tree (nodes buffered in memory)
for file := range files {
    root, err = tree.Insert(root, file.ID, file.MetaRef)
    if err != nil {
        return err
    }
}

// 3. Flush only reachable nodes (BFS from root)
txStore.Flush(root)

// 4. Write snapshot pointing to final root
snapshot := Snapshot{Root: root, ...}
Location: internal/hamt/cache.go
Only nodes reachable from the final root are flushed to the object store. Intermediate nodes that were superseded during the backup are discarded.
This prevents orphaned nodes from accumulating in the store.
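The buffer-then-flush idea can be sketched with a toy store whose objects list their child references explicitly (illustrative only; the real implementation is in internal/hamt/cache.go):

```go
package main

import "fmt"

// obj is a toy stored object that lists its child references explicitly.
type obj struct {
	data     string
	children []string
}

// txStore buffers writes in memory and persists only what the final
// root can reach; superseded intermediates are dropped.
type txStore struct {
	backing map[string]obj // the durable object store
	buf     map[string]obj // pending writes
}

func (s *txStore) put(key string, o obj) { s.buf[key] = o }

// flush walks breadth-first from root, persisting reachable buffered
// objects; anything left in the buffer afterwards is discarded.
func (s *txStore) flush(root string) {
	queue := []string{root}
	for len(queue) > 0 {
		key := queue[0]
		queue = queue[1:]
		o, ok := s.buf[key]
		if !ok {
			continue // already durable (or missing); subtree unchanged
		}
		s.backing[key] = o
		delete(s.buf, key)
		queue = append(queue, o.children...)
	}
	s.buf = map[string]obj{}
}

func main() {
	s := &txStore{backing: map[string]obj{}, buf: map[string]obj{}}
	s.put("node/old-root", obj{data: "superseded"})
	s.put("node/leaf", obj{data: "leaf"})
	s.put("node/root", obj{data: "final", children: []string{"node/leaf"}})
	s.flush("node/root")
	_, oldKept := s.backing["node/old-root"]
	fmt.Println(len(s.backing), oldKept) // 2 false
}
```

The superseded root from an earlier Insert never reaches the backing store, which is exactly how orphaned intermediate nodes are avoided.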

FileMeta Structure

Each HAMT entry points to a filemeta/<hash> object:
{
  "version": 1,
  "fileId": "1A2B3C4D",
  "name": "report.pdf",
  "type": "file",
  "parents": [
    "filemeta/a1b2c3d4...",
    "filemeta/e5f6g7h8..."
  ],
  "paths": [],
  "content_hash": "7d9e2c5f8b3a1d4e...",
  "size": 3500000,
  "mtime": 1710000000,
  "owner": "user@example.com",
  "extra": {
    "mimeType": "application/pdf"
  }
}
Location: internal/core/models.go:29-43 and docs/spec.md:103-130

Fields

Field         Description
fileId        Source-specific unique identifier (HAMT key)
name          File name
type          "file" or "folder"
parents       Array of filemeta/<hash> refs (NOT raw file IDs)
content_hash  SHA-256 of raw file content (keys the Content object)
size          File size in bytes (0 for folders)
mtime         Modification time (Unix timestamp)
owner         Owner email or username
extra         Source-specific metadata (MIME type, permissions, etc.)
Multi-parent support: A file can have multiple parents (Google Drive allows this). The parents array contains references to parent filemeta objects, not raw file IDs.

Content Objects

The content_hash field in a FileMeta points to a content/<hash> object:
{
  "type": "content",
  "size": 3500000,
  "chunks": [
    "chunk/a3f5b8c9...",
    "chunk/7d9e2c5f...",
    "chunk/4b8f1a2c..."
  ]
}
For very small files (< 4KB), data is inlined:
{
  "type": "content",
  "size": 2048,
  "data_inline_b64": "SGVsbG8gd29ybGQh..."
}
Location: internal/core/models.go:20-27 and docs/spec.md:84-100

Chunk Objects

Each chunk reference points to a chunk/<hash> object containing raw file data:
chunk/<hash>  →  [zstd-compressed bytes]
  • FastCDC boundaries: 512KB min, 1MB avg, 8MB max
  • Deduplicated by HMAC-SHA256 (encrypted) or SHA-256 (unencrypted)
  • Compressed with zstd
  • Encrypted with AES-256-GCM (if encryption enabled)
Location: docs/spec.md:65-82
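The dedup-key rule can be sketched with the standard library. Key handling here is illustrative; the actual derivation is defined in the spec:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkKey derives the dedup key for a chunk: plain SHA-256 when the
// repository is unencrypted, HMAC-SHA256 under a repository key when
// encrypted, so chunk keys can't be correlated with known plaintexts.
func chunkKey(data, repoKey []byte) string {
	if repoKey == nil {
		sum := sha256.Sum256(data)
		return "chunk/" + hex.EncodeToString(sum[:])
	}
	mac := hmac.New(sha256.New, repoKey)
	mac.Write(data)
	return "chunk/" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	data := []byte("hello world")
	fmt.Println(chunkKey(data, nil))              // unencrypted repo
	fmt.Println(chunkKey(data, []byte("repo-key"))) // encrypted repo
}
```

Identical chunk data always yields the same key within a repository, which is what makes deduplication work across files and snapshots.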

Complete Snapshot Hierarchy

Here’s the full object graph for a snapshot:
snapshot/6e3a9f1c...
    ↓ (root)
node/c9d2e5f7...  (HAMT root)
    ↓ (children)
node/a1b2c3d4...  (internal node)
    ↓ (children)
node/e5f6a7b8...  (leaf node)
    ↓ (entries)
filemeta/9c2e5f7a...  (file metadata)
    ↓ (content_hash)
content/7d9e2c5f...  (chunk manifest)
    ↓ (chunks)
chunk/a3f5b8c9...  (file data)
chunk/4b8f1a2c...
chunk/e8f3d1a4...
Object counts for a typical 1TB backup:
  • 1 snapshot (~500 bytes)
  • ~100,000 HAMT nodes (~500KB compressed)
  • ~1,000,000 filemeta objects (~100MB compressed)
  • ~50,000 content objects (~5MB)
  • ~1,000,000 chunks (~1TB raw data, ~800GB compressed+encrypted)

Snapshot Discovery

Snapshots are discovered via two mechanisms:

1. Latest Pointer

// index/latest
{
  "latest_snapshot": "snapshot/6e3a9f1c...",
  "seq": 42
}
Location: internal/core/models.go:92-96
This mutable pointer provides fast access to the most recent snapshot.

2. Snapshot Catalog

// index/snapshots
[
  {
    "ref": "snapshot/6e3a9f1c...",
    "seq": 42,
    "created": "2025-12-01T12:00:00Z",
    "root": "node/c9d2e5f7...",
    "source": { "type": "gdrive", ... },
    "tags": ["daily"]
  },
  ...
]
Location: internal/core/models.go:98-110 and docs/spec.md:223
The catalog contains lightweight summaries of all snapshots, avoiding the need to fetch each snapshot object individually during listing operations.
The catalog is self-healing: if corrupted or missing, it’s rebuilt by listing snapshot/* objects and fetching their metadata.
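Rebuilding is conceptually a list-and-sort. A minimal sketch, assuming a map from snapshot key to its summary (the real code also copies root, source, and tags):

```go
package main

import (
	"fmt"
	"sort"
)

// catalogEntry is the lightweight per-snapshot summary kept in index/snapshots.
type catalogEntry struct {
	Ref     string
	Seq     uint64
	Created string
}

// rebuildCatalog reconstructs the catalog from the snapshot objects
// themselves, ordered by sequence number.
func rebuildCatalog(snapshots map[string]catalogEntry) []catalogEntry {
	out := make([]catalogEntry, 0, len(snapshots))
	for _, e := range snapshots {
		out = append(out, e)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Seq < out[j].Seq })
	return out
}

func main() {
	cat := rebuildCatalog(map[string]catalogEntry{
		"snapshot/b": {Ref: "snapshot/b", Seq: 42, Created: "2025-12-01T12:00:00Z"},
		"snapshot/a": {Ref: "snapshot/a", Seq: 41, Created: "2025-11-30T12:00:00Z"},
	})
	fmt.Println(cat[0].Seq, cat[1].Seq) // 41 42
}
```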

Change Tokens (Incremental Sources)

For sources that support incremental scans (e.g., Google Drive Changes API), the snapshot stores an opaque change token:
{
  "change_token": "12345"
}
Location: docs/spec.md:203-212
On the next backup:
  1. Read the previous snapshot’s change_token
  2. Pass it to the source’s WalkChanges() method
  3. Source returns only files changed since that token
  4. New snapshot stores the updated token
This dramatically reduces scan time for large backups where few files change.

Snapshot Operations

Listing Snapshots

cloudstic list
Output:
Seq  Created              Source              Size    Files
42   2025-12-01 12:00:00  gdrive:user@gm...   1.2 TB  1.0M
41   2025-11-30 12:00:00  gdrive:user@gm...   1.2 TB  1.0M
40   2025-11-29 12:00:00  gdrive:user@gm...   1.1 TB  980K
Location: cmd/cloudstic/main.go (runList)

Inspecting a Snapshot

cloudstic ls --snapshot 42
cloudstic ls latest
Output:
Type    Path                         Size     Modified
folder  /                            -        -
folder  /Documents                   -        -
file    /Documents/report.pdf        3.5 MB   2025-12-01 11:30:00
file    /Documents/invoice.pdf       2.1 MB   2025-11-28 09:15:00
...
Location: cmd/cloudstic/main.go (runLs)

Comparing Snapshots

cloudstic diff 40 42
Output:
Added:    20,000 files (15.2 GB)
Modified: 5,000 files (3.8 GB)
Deleted:  2,000 files (1.2 GB)

Details:
+ /Documents/new_report.pdf (3.5 MB)
~ /Documents/invoice.pdf (2.1 MB → 2.3 MB)
- /Old/archived.zip (50 MB)
Location: internal/hamt/hamt.go:70-83 (Diff method)
The Diff operation leverages the HAMT structure for efficient comparison - it only traverses subtrees where hashes differ.
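The pruning idea can be shown with toy Merkle nodes: recurse only where refs differ. These structures are illustrative, not the Tree API:

```go
package main

import "fmt"

// tnode is a toy Merkle node: ref stands in for the content hash,
// children are keyed by slot so subtrees can be compared pairwise.
type tnode struct {
	ref      string
	children map[int]*tnode
	entries  map[string]string // leaf payload: file ID -> filemeta ref
}

// visited counts node pairs actually examined, to make pruning visible.
var visited int

// diff reports file IDs whose filemeta ref differs between two roots,
// skipping any subtree whose ref is identical on both sides.
func diff(a, b *tnode, report func(key string)) {
	if a != nil && b != nil && a.ref == b.ref {
		return // same hash => same subtree; prune
	}
	visited++
	seen := map[string]bool{}
	if a != nil {
		for k, v := range a.entries {
			seen[k] = true
			if b == nil || b.entries[k] != v {
				report(k)
			}
		}
	}
	if b != nil {
		for k := range b.entries {
			if !seen[k] {
				report(k)
			}
		}
	}
	slots := map[int]bool{}
	if a != nil {
		for s := range a.children {
			slots[s] = true
		}
	}
	if b != nil {
		for s := range b.children {
			slots[s] = true
		}
	}
	for s := range slots {
		var ca, cb *tnode
		if a != nil {
			ca = a.children[s]
		}
		if b != nil {
			cb = b.children[s]
		}
		if ca != nil || cb != nil {
			diff(ca, cb, report)
		}
	}
}

func main() {
	shared := &tnode{ref: "leaf-1", entries: map[string]string{"f1": "filemeta/m1"}}
	oldLeaf := &tnode{ref: "leaf-2", entries: map[string]string{"f2": "filemeta/m2"}}
	newLeaf := &tnode{ref: "leaf-2b", entries: map[string]string{"f2": "filemeta/m2b"}}
	r1 := &tnode{ref: "root-a", children: map[int]*tnode{0: shared, 1: oldLeaf}}
	r2 := &tnode{ref: "root-b", children: map[int]*tnode{0: shared, 1: newLeaf}}
	diff(r1, r2, func(k string) { fmt.Println("changed:", k) })
	fmt.Println("pairs visited:", visited) // 2: the shared subtree was pruned
}
```

For two large snapshots that differ in a handful of files, only the handful of differing paths is walked; the cost scales with the size of the change, not the size of the tree.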

Restoring a Snapshot

cloudstic restore --snapshot 42 --output backup.zip
cloudstic restore latest --output backup.zip
Restores the entire snapshot as a ZIP archive.
Location: internal/engine/restore.go

Deleting a Snapshot

cloudstic forget --snapshot 40
cloudstic forget --snapshot 40 --prune
Deletes the snapshot object. Use --prune to also run garbage collection and reclaim space from unreferenced objects.
Location: internal/engine/forget.go and docs/spec.md:276-279

Garbage Collection

Snapshots share objects (chunks, content, filemeta, nodes) via content addressing. Deleting a snapshot doesn’t immediately reclaim space - you need to run prune:
cloudstic prune
Algorithm:
1. Mark Phase
   • List all snapshot/* keys
   • For each snapshot:
     - Walk HAMT: collect all node/* refs
     - Collect all filemeta/* refs
     - Collect all content/* refs
     - Collect all chunk/* refs
   • Result: set of all reachable keys

2. Sweep Phase
   • List all keys under chunk/, content/, filemeta/, node/
   • Delete any key NOT in the reachable set

3. Repack Phase (if packfiles enabled)
   • Identify fragmented packs (>30% wasted space)
   • Extract live objects
   • Re-bundle into new packs
   • Delete old packs
Location: docs/storage-model.md:64-78 and docs/spec.md:280-285
Prune requires an exclusive lock - no concurrent backups are allowed. Schedule prune during maintenance windows.
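The mark and sweep phases can be sketched over a toy store whose values are the child references each object carries (a stand-in for parsing node/filemeta/content objects for their refs):

```go
package main

import (
	"fmt"
	"strings"
)

// repo is a toy object store: key -> child references.
type repo map[string][]string

// mark walks from every snapshot and returns the set of reachable keys.
func mark(r repo) map[string]bool {
	live := map[string]bool{}
	var queue []string
	for key := range r {
		if strings.HasPrefix(key, "snapshot/") {
			queue = append(queue, key)
		}
	}
	for len(queue) > 0 {
		key := queue[0]
		queue = queue[1:]
		if live[key] {
			continue
		}
		live[key] = true
		queue = append(queue, r[key]...)
	}
	return live
}

// sweep deletes every key the mark phase didn't reach and returns the count.
func sweep(r repo, live map[string]bool) int {
	removed := 0
	for key := range r {
		if !live[key] {
			delete(r, key)
			removed++
		}
	}
	return removed
}

func main() {
	r := repo{
		"snapshot/s1":  {"node/root"},
		"node/root":    {"filemeta/f1"},
		"filemeta/f1":  {"content/c1"},
		"content/c1":   {"chunk/k1"},
		"chunk/k1":     nil,
		"chunk/orphan": nil, // left behind by a forgotten snapshot
	}
	removed := sweep(r, mark(r))
	fmt.Println("removed:", removed, "remaining:", len(r)) // removed: 1 remaining: 5
}
```

The exclusive lock matters because a concurrent backup could write objects after the mark phase that the sweep phase would then wrongly delete.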

Source Identity

The source field enables multi-source repositories:
{
  "source": {
    "type": "gdrive",
    "account": "user@gmail.com",
    "path": "my-drive://"
  }
}
Location: internal/core/models.go:68-75 and docs/spec.md:192-196
Snapshots with different source identities are treated independently:
  • Retention policies can target specific sources
  • Incremental backups look for the previous snapshot with the same source identity
  • Different Google accounts in the same repo won’t interfere

Further Reading