Overview

Cloudstic uses a Merkle Hash Array Mapped Trie (HAMT) to map file IDs to their metadata references. This data structure provides:
  • Efficient lookups — O(log₃₂ n) depth due to 32-way branching
  • Structural sharing — unchanged subtrees are reused by reference between snapshots
  • Persistent immutability — all mutations return a new root while preserving the old one
  • Content-addressable nodes — nodes are identified by the hash of their content

Configuration

From internal/hamt/hamt.go:
const (
    bitsPerLevel = 5      // 5 bits per level → 32-way branching
    branching    = 32     // 2^bitsPerLevel
    maxDepth     = 6      // Maximum tree depth
    maxLeafSize  = 32     // Maximum entries per leaf node
)

Design Rationale

  • 5 bits per level provides 32-way branching, balancing tree depth with node size
  • Max depth of 6 supports 32^6 = 1,073,741,824 entries before hitting depth limit
  • Max leaf size of 32 keeps leaf nodes small enough to serialize efficiently

Node Types

Internal Node

Internal nodes use a bitmap to compactly represent which of the 32 child slots are populated:
{
  "type": "internal",
  "bitmap": 2348810305,
  "children": ["node/<sha256>", "node/<sha256>"]
}
// From internal/core/models.go
type HAMTNode struct {
    Type     ObjectType  `json:"type"`     // "internal" or "leaf"
    Bitmap   uint32      `json:"bitmap,omitempty"`
    Children []string    `json:"children,omitempty"` // ["node/<sha256>", ...]
    Entries  []LeafEntry `json:"entries,omitempty"`
}
Bitmap encoding:
  • Each bit position (0-31) corresponds to a child slot
  • A set bit indicates that slot is populated
  • The children array contains only the populated slots, in order
  • Child position is computed via popcount(bitmap & (bit - 1))
The bitmap compression reduces storage: a node with only 2 children stores 2 refs instead of a 32-element array.

Leaf Node

Leaf nodes store actual key-value entries:
{
  "type": "leaf",
  "entries": [
    { "key": "<fileId>", "filemeta": "filemeta/<sha256>" },
    { "key": "<fileId>", "filemeta": "filemeta/<sha256>" }
  ]
}
// From internal/core/models.go
type LeafEntry struct {
    Key      string `json:"key"`      // FileID
    FileMeta string `json:"filemeta"` // "filemeta/<sha256>"
}
Entries are kept sorted by key so that serialization, and therefore hashing, is deterministic: two leaves containing the same set of entries produce the same hash regardless of insertion order.

Path Key Computation

File IDs are hashed to produce a “path key” that determines the traversal path through the tree:
// From internal/hamt/hamt.go
func computePathKey(id string) string {
    return core.ComputeHash([]byte(id))
}

func indexForLevel(keyHex string, level int) (int, error) {
    // Parse first 8 hex chars (32 bits) of the hash
    val, err := strconv.ParseUint(keyHex[:8], 16, 32)
    if err != nil {
        return 0, err
    }
    // Extract 5 bits for this level
    shift := 32 - (level+1)*bitsPerLevel
    mask := uint64((1 << bitsPerLevel) - 1)
    return int((val >> shift) & mask), nil
}
Example traversal:
  1. Hash the file ID → a7f3c92e...
  2. Extract bits 27-31 for level 0 → index 20
  3. Extract bits 22-26 for level 1 → index 31
  4. Continue until reaching a leaf or empty slot

Operations

The Tree type exposes a functional API where all mutations return a new root:
// From internal/hamt/hamt.go
type Tree struct {
    store store.ObjectStore
}

func (t *Tree) Insert(root, key, value string) (string, error)
func (t *Tree) Lookup(root, key string) (string, error)
func (t *Tree) Delete(root, key string) (string, error)
func (t *Tree) Walk(root string, fn func(key, value string) error) error
func (t *Tree) Diff(root1, root2 string, fn func(DiffEntry) error) error
func (t *Tree) NodeRefs(root string, fn func(ref string) error) error

Insert Algorithm

  1. Look up the existing node at the current level
  2. If node is a leaf:
    • If entry exists, update it in place (copy-on-write)
    • If leaf has room (< 32 entries), insert the new entry in its sorted position
    • If leaf is full and we’re not at max depth, split into an internal node
  3. If node is internal:
    • Compute the child index for this level
    • Recursively insert into the appropriate child
    • Update the parent’s child reference
  4. Save the modified node and return its new reference
// From internal/hamt/hamt.go:insertIntoInternal
bit := uint32(1 << idx)
exists := node.Bitmap&bit != 0
childPos := popcount(node.Bitmap & (bit - 1))

if !exists {
    newNode.Bitmap |= bit  // Set the bit
    // Insert new child at computed position
} else {
    // Update existing child at position
}

Delete Algorithm

  1. Traverse to the leaf containing the key
  2. Remove the entry from the leaf
  3. If the leaf becomes empty, return an empty ref
  4. Propagate changes up the tree:
    • If a child becomes empty, remove its bit from the parent’s bitmap
    • If an internal node is left with only one child that is a leaf, collapse it by promoting the leaf

Structural Sharing

Only nodes along the path of a modified entry change:
     root_old          root_new
      /  |  \           /  |  \
     A   B   C         A   B'  C     ← only B' is new
         |                 |
        ...              (modified)
Nodes A and C are reused by reference. Only B and its ancestors up to the root are copied.

Diff Algorithm

The Diff operation performs a parallel traversal of two HAMT roots:
// From internal/hamt/hamt.go
type DiffEntry struct {
    Key      string
    OldValue string  // Empty for additions
    NewValue string  // Empty for deletions
}
Algorithm:
  1. If both nodes are leaves, compare entries directly
  2. If nodes are internal, iterate through all 32 buckets:
    • If bucket exists in only one tree, collect all entries as added/removed
    • If bucket exists in both, recursively diff the child nodes
  3. Yield a DiffEntry for each difference found
The diff operation is structural — if two subtrees have the same root hash, they are identical and can be skipped entirely.

TransactionalStore

During a backup, thousands of HAMT nodes are created as the tree is built incrementally. Many of these intermediate nodes become unreachable once the tree is fully constructed. The TransactionalStore buffers new nodes in memory and only flushes the reachable subset from the final root:
// From internal/hamt/ (conceptual interface)
type TransactionalStore interface {
    Put(key string, data []byte) error      // Buffer in memory
    Get(key string) ([]byte, error)         // Check buffer, then backing store
    Flush(root string) error                 // BFS from root, flush reachable nodes
}
Flush algorithm:
  1. Start a BFS from the root node
  2. For each node visited, check if it’s in the in-memory buffer
  3. If buffered, write it to the persistent store
  4. Add all child refs to the BFS queue
  5. Discard any buffered nodes that were never visited
This optimization avoids uploading superseded nodes that were created during tree construction but are no longer reachable.

Garbage Collection

The NodeRefs method walks the entire tree and yields every node reference:
// From internal/hamt/hamt.go
func (t *Tree) NodeRefs(root string, fn func(ref string) error) error {
    if root == "" {
        return nil
    }
    return t.nodeRefs(root, fn)
}

func (t *Tree) nodeRefs(ref string, fn func(string) error) error {
    if err := fn(ref); err != nil {
        return err
    }
    node, err := t.loadNode(ref)
    if err != nil {
        return err
    }
    if node.Type == core.ObjectTypeInternal {
        for _, childRef := range node.Children {
            if err := t.nodeRefs(childRef, fn); err != nil {
                return err
            }
        }
    }
    return nil
}
This is used during the mark phase of garbage collection to identify all reachable HAMT nodes from a snapshot root.

Performance Characteristics

Operation   Complexity    Notes
Insert      O(log₃₂ n)    Maximum 6 levels for 1B entries
Lookup      O(log₃₂ n)    Typically 2-3 network requests
Delete      O(log₃₂ n)    Includes potential collapse
Walk        O(n)          Visits every leaf entry once
Diff        O(m)          Where m = changed entries
With packfiles enabled, multiple HAMT nodes are bundled into 8MB packs. An LRU cache reduces the average lookup to 0-1 network requests after the initial pack fetch.