## Backup Flow

The backup operation is orchestrated by the `BackupManager` in `internal/engine/`. Here is the complete flow:
### 1. Initialization

As described in AGENTS.md and implemented in `internal/engine/`:

1. `BackupManager` acquires a shared lock
2. Load the previous snapshot (if any) for this source identity
3. Extract the previous HAMT root reference
Source identity is determined by `SourceInfo`:

- `type`: e.g. "gdrive", "local", "onedrive"
- `account`: Google account email, hostname, etc.
- `path`: root folder ID, filesystem path, etc.

Only snapshots matching all three fields are considered "previous" for incremental purposes.
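
A minimal sketch of what this identity check might look like; the field names are assumed from the description above, not copied from the actual source:

```go
// Hypothetical shape of SourceInfo; the real struct lives in internal/engine/.
type SourceInfo struct {
    Type    string // e.g. "gdrive", "local", "onedrive"
    Account string // Google account email, hostname, etc.
    Path    string // root folder ID, filesystem path, etc.
}

// sameIdentity reports whether a previous snapshot belongs to this source:
// all three fields must match.
func sameIdentity(a, b SourceInfo) bool {
    return a.Type == b.Type && a.Account == b.Account && a.Path == b.Path
}
```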
### 2. Source Scanning

**Full scan** (for `local`, `sftp`, `gdrive`, `onedrive`):

```go
source.Walk(func(entry SourceEntry) error {
    // Process each file/folder
    return nil
})
```

**Incremental scan** (for `gdrive-changes`, `onedrive-changes`):

```go
changeToken := previousSnapshot.ChangeToken
source.WalkChanges(changeToken, func(entry SourceEntry) error {
    // Process only changed files
    return nil
})
```
For each file encountered:

1. Look up the file ID in the old HAMT:

   ```go
   oldMetaRef, err := hamtTree.Lookup(oldRoot, entry.FileID)
   ```

2. Fast-check metadata (name, size, mtime, type, parents):
   - If identical and the source doesn't provide a content hash, carry the old hash forward
   - This avoids false-positive diffs for metadata-only changes

3. Determine the action (see the sketch after this list):
   - Unchanged: re-insert into the new HAMT by reference (structural sharing)
   - Changed or new: queue for upload
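
Pulling those three steps together, a condensed sketch of the per-file decision; `loadFileMeta`, `metadataEqual`, and `uploadQueue` are hypothetical names, not the engine's actual identifiers:

```go
// Inside the source.Walk / WalkChanges callback (illustrative only):
oldMetaRef, _ := hamtTree.Lookup(oldRoot, entry.FileID)
if oldMetaRef != "" {
    oldMeta := loadFileMeta(oldMetaRef) // fetch + unmarshal the old filemeta
    if metadataEqual(oldMeta, entry) {
        // Source gave no content hash: trust the metadata match and carry
        // the old hash forward rather than re-reading the file.
        if entry.ContentHash == "" {
            entry.ContentHash = oldMeta.ContentHash
        }
        if entry.ContentHash == oldMeta.ContentHash {
            // Unchanged: re-insert the existing ref (structural sharing).
            newRoot = hamtTree.Insert(newRoot, entry.FileID, oldMetaRef)
            return nil
        }
    }
}
uploadQueue = append(uploadQueue, entry) // changed or new: upload later
```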
### 3. Upload Phase

Changed and new files are processed by a pool of concurrent workers (default: 10):
```go
var (
    mu sync.Mutex      // guards newRoot: workers insert concurrently
    wg sync.WaitGroup
)
for _, queuedFile := range uploadQueue {
    wg.Add(1)
    go func(file SourceEntry) {
        defer wg.Done()
        // 1. Stream file content
        reader := source.Open(file.FileID)
        // 2. Content-defined chunking
        chunks := fastCDC(reader, minSize, avgSize, maxSize)
        // 3. For each chunk: hash, deduplicate, compress, store
        var chunkRefs []string
        for chunk := range chunks {
            // Compute hash (HMAC-SHA256 or SHA-256)
            hash := computeChunkHash(chunk.Data, dedupKey)
            chunkRefs = append(chunkRefs, "chunk/"+hash)
            // Deduplicate
            if store.Exists("chunk/" + hash) {
                continue // Skip, already stored
            }
            // Compress and write
            compressed := zstd.Compress(chunk.Data)
            store.Put("chunk/"+hash, compressed)
        }
        // 4. Create content object
        contentObj := Content{
            Type:   "content",
            Size:   file.Size,
            Chunks: chunkRefs,
        }
        contentHash := sha256(file.RawData) // hash over the full file content
        store.Put("content/"+contentHash, json(contentObj))
        // 5. Create filemeta object
        fileMeta := FileMeta{
            FileID:      file.FileID,
            Name:        file.Name,
            Type:        file.Type,
            ContentHash: contentHash,
            Size:        file.Size,
            Mtime:       file.Mtime,
            Parents:     parentRefs,
            // ...
        }
        metaRef := "filemeta/" + sha256(json(fileMeta))
        store.Put(metaRef, json(fileMeta))
        // 6. Insert into the new HAMT (serialized by the mutex)
        mu.Lock()
        newRoot = hamtTree.Insert(newRoot, file.FileID, metaRef)
        mu.Unlock()
    }(queuedFile)
}
wg.Wait()
```
#### FastCDC Chunking

From `internal/engine/chunker.go` and the spec:

| Parameter | Value   |
|-----------|---------|
| Min size  | 512 KiB |
| Avg size  | 1 MiB   |
| Max size  | 8 MiB   |
FastCDC uses a rolling hash to find content-defined boundaries:
- Compute a rolling hash of a 64-byte window
- When the hash matches a pattern (e.g. last 20 bits are zero), create a boundary
- Enforce min/max size constraints
- The final chunk may be smaller than the minimum
Content-defined chunking ensures that inserting bytes at the start of a file doesn’t invalidate all subsequent chunks — only the chunks containing modified data change.
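
To make the boundary rule concrete, here is a simplified illustration. The real chunker in `internal/engine/chunker.go` uses FastCDC's gear table and normalization; `gearTable` and the mask below are placeholders:

```go
// Simplified content-defined boundary search (illustrative only).
var gearTable [256]uint64 // filled once with fixed pseudo-random values

func findBoundary(data []byte, minSize, maxSize int) int {
    const mask = (1 << 20) - 1 // "last 20 bits zero" → ~1 MiB average chunk
    var h uint64
    for i := 0; i < len(data) && i < maxSize; i++ {
        h = (h << 1) + gearTable[data[i]] // rolling gear hash, ~64-byte window
        if i >= minSize && h&mask == 0 {
            return i + 1 // content-defined boundary
        }
    }
    if len(data) >= maxSize {
        return maxSize // enforce the max-size constraint
    }
    return len(data) // final chunk may be smaller than the minimum
}
```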
### 4. HAMT Flush

After all files are uploaded and inserted into the HAMT:

```go
// TransactionalStore buffers nodes in memory.
// Only the subset reachable from the final root is flushed.
txnStore.Flush(newRoot)
```
Flush algorithm (BFS from root):

1. Start a queue with the root node ref
2. For each node in the queue:
   - If it's in the in-memory buffer, write it to persistent storage
   - If it's an internal node, add all child refs to the queue
3. Discard any buffered nodes that were never visited

This avoids uploading intermediate, superseded nodes created during tree construction.
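
A sketch of that BFS, with assumed stand-ins for the TransactionalStore internals (`buffer` as the in-memory node map, `childRefs` extracting an internal node's children):

```go
// Illustrative flush: persist only nodes reachable from the final root.
func flush(root string, buffer map[string][]byte, persist func(ref string, data []byte)) {
    queue := []string{root}
    for len(queue) > 0 {
        ref := queue[0]
        queue = queue[1:]
        data, buffered := buffer[ref]
        if !buffered {
            continue // already persisted by an earlier snapshot: stop here
        }
        persist(ref, data)
        delete(buffer, ref)
        // Internal nodes contribute their children; leaves contribute nothing.
        queue = append(queue, childRefs(data)...)
    }
    // Whatever remains in buffer was never visited: superseded intermediate
    // nodes, which are simply discarded.
}
```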
### 5. Snapshot Commit

```go
// 1. Create snapshot object
snapshot := Snapshot{
    Version:     1,
    Created:     time.Now().Format(time.RFC3339),
    Root:        newRoot,
    Seq:         previousSeq + 1,
    Source:      sourceInfo,
    ChangeToken: newChangeToken, // For incremental sources
    // ...
}
snapshotRef := "snapshot/" + sha256(json(snapshot))
store.Put(snapshotRef, json(snapshot))

// 2. Update index/latest (the commit point)
index := Index{
    LatestSnapshot: snapshotRef,
    Seq:            snapshot.Seq,
}
store.Put("index/latest", json(index))

// 3. Update index/snapshots catalog
catalog = append(catalog, SnapshotSummary{ /* ... */ })
store.Put("index/snapshots", json(catalog))

// 4. Update index/packs catalog (if packfiles are enabled);
//    handled automatically by the PackStore layer
```
The commit point is the write to `index/latest`. Until that write completes, the previous snapshot remains the "latest" and the repository stays in a consistent state.
### 6. Lock Release
The backup lock is released, allowing other operations to proceed.
## Restore Flow

The restore operation is orchestrated by the `RestoreManager` in `internal/engine/`.
### 1. Snapshot Resolution

```go
// Resolve a snapshot by:
//   - explicit ref ("snapshot/<hash>")
//   - sequence number (42)
//   - the "latest" keyword
snapshot := resolveSnapshot(snapshotID)
root := snapshot.Root
```
### 2. HAMT Traversal

Walk the HAMT to collect all file metadata entries:

```go
var entries []FileMeta
hamtTree.Walk(root, func(key, metaRef string) error {
    // Load the filemeta object
    metaData := store.Get(metaRef)
    var meta FileMeta
    json.Unmarshal(metaData, &meta)
    entries = append(entries, meta)
    return nil
})
```
### 3. Topological Sort

Ensure parent directories are created before their children:

```go
// Build the parent dependency graph
graph := buildDependencyGraph(entries)
// Topological sort
sorted := topologicalSort(graph)
```

This handles Google Drive's multi-parent semantics, where a file can appear in multiple directories.
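
One way to realize `topologicalSort` is Kahn's algorithm. This sketch assumes entries are keyed by their filemeta ref, so each entry's `Parents` refs can be resolved in the same map:

```go
// Kahn's algorithm: emit an entry only after all of its parents are emitted.
func topologicalSort(entries map[string]FileMeta) []FileMeta {
    indegree := make(map[string]int, len(entries))
    children := map[string][]string{}
    for ref, meta := range entries {
        indegree[ref] += 0 // ensure every entry has a degree
        for _, parent := range meta.Parents {
            if _, ok := entries[parent]; ok {
                children[parent] = append(children[parent], ref)
                indegree[ref]++
            }
        }
    }
    var queue []string
    for ref, d := range indegree {
        if d == 0 {
            queue = append(queue, ref) // roots come first
        }
    }
    var sorted []FileMeta
    for len(queue) > 0 {
        ref := queue[0]
        queue = queue[1:]
        sorted = append(sorted, entries[ref])
        for _, child := range children[ref] {
            if indegree[child]--; indegree[child] == 0 {
                queue = append(queue, child)
            }
        }
    }
    return sorted
}
```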
### 4. Path Reconstruction

Walk the parent chain of each entry to reconstruct the full relative path:

```go
func buildPath(meta FileMeta) string {
    if len(meta.Parents) == 0 {
        return meta.Name // Root entry
    }
    // Load the first parent (an arbitrary choice if multi-parent)
    parentRef := meta.Parents[0]
    parentData := store.Get(parentRef)
    var parentMeta FileMeta
    json.Unmarshal(parentData, &parentMeta)
    return buildPath(parentMeta) + "/" + meta.Name
}
```
For files with multiple parents (Google Drive shared folders), only the first parent is used to construct the primary path. Other parents are ignored during restore.
### 5. ZIP Archive Creation

Write entries to a ZIP archive in topologically sorted order:

```go
zipWriter := zip.NewWriter(output)
for _, meta := range sortedEntries {
    path := buildPath(meta)
    if meta.Type == "folder" {
        // Directory entry: trailing slash, stored uncompressed
        header := &zip.FileHeader{
            Name:     path + "/",
            Method:   zip.Store,
            Modified: time.Unix(meta.Mtime, 0),
        }
        zipWriter.CreateHeader(header)
    } else {
        // File entry
        header := &zip.FileHeader{
            Name:     path,
            Method:   zip.Deflate,
            Modified: time.Unix(meta.Mtime, 0),
        }
        writer, _ := zipWriter.CreateHeader(header)
        // Load the content object
        contentData := store.Get("content/" + meta.ContentHash)
        var content Content
        json.Unmarshal(contentData, &content)
        // Stream chunks: fetch, decompress, append
        for _, chunkRef := range content.Chunks {
            compressedData := store.Get(chunkRef)
            rawData := zstd.Decompress(compressedData)
            writer.Write(rawData)
        }
    }
}
zipWriter.Close()
```
### 6. Output

The ZIP archive is:

- Written to stdout (CLI)
- Returned as a byte stream (web API)
- Saved to a file (with the `-o` flag)
## Concurrent Upload

The backup manager uses a worker pool (default: 10 concurrent workers) to parallelize file uploads:

```go
workerPool := make(chan struct{}, 10) // Semaphore
for _, file := range uploadQueue {
    workerPool <- struct{}{} // Acquire
    go func(f SourceEntry) {
        defer func() { <-workerPool }() // Release
        processFile(f)
    }(file)
}
```
## Chunk-Level Deduplication

Before writing a chunk:

```go
if store.Exists("chunk/" + hash) {
    continue // Skip, already stored
}
```
The `KeyCacheStore` layer caches existence checks in a local bbolt database to avoid redundant `Exists` calls:

- First check: network request → cache the result
- Subsequent checks: local cache hit (0 network requests)
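
A sketch of that layer, with the bbolt database reduced to an in-memory map and `Store` standing in for the object-store interface:

```go
type Store interface {
    Exists(key string) bool
}

// Illustrative existence cache wrapped around the remote store.
type KeyCacheStore struct {
    remote Store           // the underlying network-backed store
    cache  map[string]bool // stands in for the local bbolt database
}

func (s *KeyCacheStore) Exists(key string) bool {
    if s.cache[key] {
        return true // local hit: no network request
    }
    exists := s.remote.Exists(key) // first check goes to the network
    if exists {
        s.cache[key] = true // remember for all subsequent checks
    }
    return exists
}
```

Only positive results are cached in this sketch: a key's presence is stable once written, while its absence can change the moment another writer uploads the object.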
## Content-Level Deduplication

Before streaming a file, check whether its content object already exists:

```go
if store.Exists("content/" + file.ContentHash) {
    // Entire file upload skipped: only new filemeta
    // and HAMT nodes are created.
}
```
For sources that provide content hashes (Google Drive MD5, OneDrive SHA-1), this is a major optimization: unchanged content never has to be streamed or chunked at all.
## Packfile Bundling

Small objects (< 512 KB) are bundled into 8 MB packfiles:

- Reduces API calls from thousands to dozens
- An LRU cache (128 MB) keeps hot packs in memory
- Typical metadata read: 0-1 network requests after the initial pack fetch
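
In outline, the write path might route objects like this. The thresholds are the ones above; the struct fields and `flushPack` are hypothetical, not the actual PackStore API:

```go
// Illustrative PackStore write path: small objects accumulate in the
// current pack buffer, large objects bypass packing entirely.
const (
    smallObjectLimit = 512 * 1024      // objects under 512 KB are packed
    packTargetSize   = 8 * 1024 * 1024 // flush a pack at ~8 MB
)

type PackStore struct {
    backend     Store             // underlying object store
    pending     map[string][]byte // objects queued for the next packfile
    pendingSize int
}

func (p *PackStore) Put(key string, data []byte) {
    if len(data) >= smallObjectLimit {
        p.backend.Put(key, data) // large object: stored directly
        return
    }
    p.pending[key] = data // offset recorded in the pack's index
    p.pendingSize += len(data)
    if p.pendingSize >= packTargetSize {
        p.flushPack() // write one packfile and update index/packs
    }
}
```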
## Structural Sharing

Unchanged files reuse their filemeta refs, and unchanged subtrees reuse their HAMT node refs:

```
Snapshot 1: 10,000 files      → 10,000 filemeta + ~500 HAMT nodes
Snapshot 2: 10 files changed  → 10 new filemeta + ~5 new HAMT nodes

Incremental storage: ~15 objects, not 10,510
```
## Error Handling

### Transient Errors

Network errors during upload are retried with exponential backoff:

```go
retry.Do(
    func() error { return store.Put(key, data) },
    retry.Attempts(3),
    retry.Delay(time.Second),
    retry.DelayType(retry.BackOffDelay),
)
```
### Permanent Errors

Permanent errors (authentication failure, permission denied) abort the backup immediately:

```go
if isPermanentError(err) {
    return fmt.Errorf("backup failed: %w", err)
}
```
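
The classification itself can be a simple predicate over known error kinds. The sentinel errors below are illustrative stand-ins, not the engine's actual error values:

```go
var (
    ErrAuthFailed       = errors.New("authentication failed") // illustrative
    ErrPermissionDenied = errors.New("permission denied")     // illustrative
)

// Illustrative permanent-error check: failures that no retry can fix.
func isPermanentError(err error) bool {
    switch {
    case errors.Is(err, ErrAuthFailed), errors.Is(err, ErrPermissionDenied):
        return true
    default:
        return false // everything else is treated as transient
    }
}
```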
### Partial Upload Cleanup

If a backup is interrupted, orphaned objects remain in the store but are not reachable from any snapshot.

Running `prune` after an interrupted backup will:

- Mark all objects reachable from existing snapshots
- Sweep and delete any orphaned objects
- Repack fragmented packfiles

Orphaned objects are harmless: they consume storage but don't affect backup correctness. Prune reclaims this space.
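
Conceptually, the first two steps are a mark-and-sweep pass; `markReachable` and `listAll` are stand-in names:

```go
// Illustrative mark-and-sweep: keep everything any snapshot can reach.
reachable := map[string]bool{}
for _, snap := range allSnapshots {
    // snapshot → HAMT nodes → filemeta → content → chunks
    markReachable(snap, reachable)
}
for _, key := range listAll(store) {
    if !reachable[key] {
        store.Delete(key) // orphan: referenced by no snapshot
    }
}
```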
## Diff Operation

The diff command leverages the HAMT's structural diff:

```go
hamtTree.Diff(snapshot1.Root, snapshot2.Root, func(entry DiffEntry) error {
    switch {
    case entry.OldValue == "":
        fmt.Printf("+ %s\n", entry.Key) // Added
    case entry.NewValue == "":
        fmt.Printf("- %s\n", entry.Key) // Removed
    default:
        fmt.Printf("M %s\n", entry.Key) // Modified
    }
    return nil
})
```
The diff is structural: if two subtrees have the same root hash, they’re identical and the traversal skips them entirely. This makes diff extremely fast even for snapshots with millions of files.
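
The early exit at the heart of that traversal can be sketched as follows; node loading and leaf comparison are elided, and `childSlots`/`childAt` are hypothetical helpers:

```go
// Illustrative pruned traversal: subtrees with equal refs are skipped
// wholesale. Leaf comparison (which actually emits the DiffEntry values
// shown above) is omitted for brevity.
func diffNodes(oldRef, newRef string, emit func(DiffEntry) error) error {
    if oldRef == newRef {
        return nil // equal hashes → identical subtrees: skip entirely
    }
    for _, slot := range childSlots(oldRef, newRef) {
        // Absent children diff against the empty ref "".
        oldChild, newChild := childAt(oldRef, slot), childAt(newRef, slot)
        if err := diffNodes(oldChild, newChild, emit); err != nil {
            return err
        }
    }
    return nil
}
```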