casq_core 0.2.2

A minimal content-addressed file store using BLAKE3. (library)

casq_core

A content-addressed file store (CAS) library using BLAKE3 hashing.

Overview

casq_core is a Rust library that provides the core functionality for casq, a minimal, single-binary content-addressed storage system. It stores files and directories by their cryptographic hash, ensuring immutable, deduplicated storage with built-in garbage collection.

Think of it as a minimal git object store or restic backend, but generic and simple.

Features

  • Content-Addressed Storage - Files and directories stored by BLAKE3 hash
  • Automatic Deduplication - Identical content stored only once
  • Tree-Based Directories - Canonical ordering ensures stable hashes
  • Atomic Operations - Tempfile-based writes prevent corruption
  • Garbage Collection - Mark & sweep algorithm from reference roots
  • Corruption Detection - Hash verification on all reads
  • Named References - GC roots for preserving important snapshots
  • Full Round-Trip - Add → Store → GC → Materialize
  • Cross-Platform - Unix permissions preserved, Windows supported
  • Gitignore Support - Respects .gitignore during filesystem walks

Quick Start

use casq_core::{Algorithm, Store};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize a new store
    let store = Store::init("./my-store", Algorithm::Blake3)?;

    // Add a file or directory
    let hash = store.add_path(Path::new("./my-data"))?;

    // Create a named reference (GC root) so the snapshot survives GC
    store.refs().add("backup-2024", &hash)?;

    // Garbage collect unreferenced objects (false = not a dry run)
    let stats = store.gc(false)?;
    println!(
        "Deleted {} objects, freed {} bytes",
        stats.objects_deleted, stats.bytes_freed
    );

    // Materialize back to the filesystem
    store.materialize(&hash, Path::new("./restored"))?;

    Ok(())
}

Architecture

Storage Format

Objects are stored with a 16-byte header followed by the payload:

Offset  Size  Field
0x00    4     "CAFS" magic
0x04    1     version (u8)
0x05    1     type: 1=blob, 2=tree
0x06    1     algo: 1=blake3-256
0x07    1     reserved (must be 0)
0x08    8     payload_len (u64 LE)
0x10    ...   payload
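
As a rough illustration of the layout above, the following std-only sketch packs and parses the 16-byte header. The names `Header`, `encode`, and `parse` are hypothetical and are not taken from casq_core's API; the crate's own `object.rs` may validate more (or differently) than this does.

```rust
/// Hypothetical mirror of the 16-byte object header described above.
/// Not casq_core's actual type; field names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Header {
    pub version: u8,
    pub object_type: u8, // 1 = blob, 2 = tree
    pub algo: u8,        // 1 = blake3-256
    pub payload_len: u64,
}

pub const MAGIC: &[u8; 4] = b"CAFS";
pub const HEADER_LEN: usize = 16;

impl Header {
    /// Serializes the header into its fixed 16-byte form.
    pub fn encode(&self) -> [u8; HEADER_LEN] {
        let mut buf = [0u8; HEADER_LEN];
        buf[0..4].copy_from_slice(MAGIC);
        buf[4] = self.version;
        buf[5] = self.object_type;
        buf[6] = self.algo;
        buf[7] = 0; // reserved, must be 0
        buf[8..16].copy_from_slice(&self.payload_len.to_le_bytes());
        buf
    }

    /// Parses the first 16 bytes of an object file.
    /// Checks length, magic, and the reserved byte only.
    pub fn parse(bytes: &[u8]) -> Result<Header, String> {
        if bytes.len() < HEADER_LEN {
            return Err(format!("header too short: {} bytes", bytes.len()));
        }
        if bytes[0..4] != MAGIC[..] {
            return Err("bad magic".to_string());
        }
        if bytes[7] != 0 {
            return Err("reserved byte must be 0".to_string());
        }
        let mut len = [0u8; 8];
        len.copy_from_slice(&bytes[8..16]);
        Ok(Header {
            version: bytes[4],
            object_type: bytes[5],
            algo: bytes[6],
            payload_len: u64::from_le_bytes(len),
        })
    }
}
```

A reader would typically parse the header first, then read exactly `payload_len` bytes starting at offset 0x10.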

Directory Structure

$STORE_ROOT/
  config               # Store configuration (version, algorithm)
  objects/
    blake3-256/        # Algorithm-specific directory
      ab/              # First 2 hex chars (shard)
        abcd...ef      # Remaining 62 hex chars (object file)
  refs/                # Named references (GC roots)
    backup-name
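
The sharded layout above can be sketched as a small path-building function. This is a minimal sketch assuming a 64-character hex digest; `object_path` is a hypothetical helper, not a function exported by casq_core.

```rust
use std::path::{Path, PathBuf};

/// Builds `<root>/objects/<algo>/<first 2 hex chars>/<remaining 62 hex chars>`.
/// Hypothetical helper; casq_core's own path logic may differ in detail.
pub fn object_path(root: &Path, algo: &str, hex_hash: &str) -> Option<PathBuf> {
    // A 32-byte digest is 64 hex characters.
    if hex_hash.len() != 64 || !hex_hash.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // Safe to split at byte 2: the string is pure ASCII at this point.
    let (prefix, suffix) = hex_hash.split_at(2);
    Some(root.join("objects").join(algo).join(prefix).join(suffix))
}
```

Returning `None` for malformed input keeps the sketch free of a custom error type; the library itself reports such cases through its `InvalidHash` error.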

Object Types

  1. Blob - Raw file content (hash of payload only)
  2. Tree - Directory structure (sorted entries by name for canonical hashing)
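
To see why sorting matters for tree hashes, the sketch below orders entries by the raw UTF-8 bytes of their names, so the same directory contents give the same sequence regardless of the order in which the filesystem returned them. `Entry` is a hypothetical stand-in with fewer fields than the real `TreeEntry`, and the actual serialization of entries is not shown here.

```rust
/// Hypothetical stand-in for a tree entry; the real `TreeEntry` carries more data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub name: String,
    pub is_dir: bool,
}

/// Sorts entries by the raw UTF-8 bytes of their names (no locale, no case folding).
pub fn canonicalize(mut entries: Vec<Entry>) -> Vec<Entry> {
    entries.sort_by(|a, b| a.name.as_bytes().cmp(b.name.as_bytes()));
    entries
}
```

Bytewise ordering means uppercase ASCII sorts before lowercase, and non-ASCII names sort after all ASCII names, on every platform.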

Module Structure

casq_core/src/
├── lib.rs       - Public API and documentation
├── error.rs     - Error types with thiserror
├── hash.rs      - BLAKE3 hashing (32-byte digests)
├── object.rs    - Binary object encoding/decoding
├── store.rs     - Store management and object I/O
├── tree.rs      - Tree entry encoding with canonical sorting
├── walk.rs      - Filesystem traversal with gitignore
├── gc.rs        - Garbage collection (mark & sweep)
└── refs.rs      - Reference management

Building and Testing

# Build the library
cargo build --release -p casq_core

# Run all tests (68 unit tests + 1 doctest)
cargo test -p casq_core

# Run with output
cargo test -p casq_core -- --nocapture

# Check code quality
cargo clippy -p casq_core -- -D warnings

# Format code
cargo fmt -p casq_core

API Overview

Core Types

  • Store - Main store interface
  • Hash - 32-byte BLAKE3 hash wrapper
  • TreeEntry - File/directory entry in a tree
  • RefManager - Manages named references
  • GcStats - Garbage collection statistics

Main Operations

// Store initialization
let store = Store::init(path, Algorithm::Blake3)?;
let store = Store::open(path)?;

// Object storage
let hash = store.put_blob(reader)?;
let hash = store.put_tree(entries)?;
let hash = store.add_path(path)?;  // Recursively add file/dir

// Object retrieval
let data = store.get_blob(&hash)?;
let entries = store.get_tree(&hash)?;
store.cat_blob(&hash, writer)?;

// Materialization
store.materialize(&hash, dest_path)?;

// References
store.refs().add(name, &hash)?;
let hash = store.refs().get(name)?;
let all_refs = store.refs().list()?;
store.refs().remove(name)?;

// Garbage collection
let stats = store.gc(dry_run)?;

Design Principles

  1. Content-Addressed - Objects are immutable and identified by hash
  2. Canonical Hashing - Tree entries sorted by name for stable hashes
  3. Atomic Writes - Use tempfile for corruption-free operations
  4. Simple Format - Binary format with clear headers, human-inspectable paths
  5. Minimal Dependencies - Only essential crates (blake3, hex, tempfile, ignore, thiserror)
  6. No Network - Local-only, single-user design (MVP scope)
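
The atomic-write principle can be illustrated with the standard library alone. The crate itself lists the tempfile crate for this purpose; the sketch below swaps in a hand-rolled temporary name plus `std::fs::rename`, and `write_atomic` is a hypothetical function, not part of casq_core.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `data` to `dest` by first writing a temporary sibling file and then
/// renaming it into place, so readers never observe a half-written object.
pub fn write_atomic(dest: &Path, data: &[u8]) -> io::Result<()> {
    let file_name = dest.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name")
    })?;
    let dir = dest.parent().unwrap_or_else(|| Path::new(""));
    fs::create_dir_all(dir)?;

    // Keep the temporary file in the same directory so the rename
    // stays on one filesystem.
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(format!(".tmp-{}", std::process::id()));
    let tmp_path = dir.join(tmp_name);

    let result = (|| -> io::Result<()> {
        let mut f = File::create(&tmp_path)?;
        f.write_all(data)?;
        f.sync_all()?; // flush file contents to disk before publishing
        fs::rename(&tmp_path, dest)
    })();

    if result.is_err() {
        // Best-effort cleanup; the original error is what gets reported.
        let _ = fs::remove_file(&tmp_path);
    }
    result
}
```

Because object names are derived from content, writing the same object twice is harmless: the second rename replaces the file with identical bytes.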

Hashing Rules

  • Blob hash: hash = blake3(payload_bytes) (payload only, not header)
  • Tree hash: Hash of canonicalized entries (sorted by name, bytewise UTF-8)
  • Object path: objects/<algo>/<prefix>/<suffix> where prefix is first 2 hex chars

Garbage Collection

  • Refs are GC roots stored in refs/ directory
  • Mark phase traverses from all refs, recursively following tree entries
  • Sweep phase deletes objects not in the reachable set
  • Dry-run mode available for safe preview before deletion
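
The mark and sweep phases described above can be sketched over an in-memory object graph. This is a simplified model, not the code in `gc.rs`: `Objects`, `mark`, and `sweep` are hypothetical names, object ids are plain strings, and the traversal uses an explicit stack instead of recursion.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory model: object id -> ids of its children
/// (empty for blobs, entry hashes for trees).
pub type Objects = HashMap<String, Vec<String>>;

/// Mark phase: collects every id reachable from the given roots (refs).
pub fn mark(objects: &Objects, roots: &[String]) -> HashSet<String> {
    let mut reachable = HashSet::new();
    let mut stack: Vec<String> = roots.to_vec();
    while let Some(id) = stack.pop() {
        if !reachable.insert(id.clone()) {
            continue; // already visited
        }
        if let Some(children) = objects.get(&id) {
            stack.extend(children.iter().cloned());
        }
    }
    reachable
}

/// Sweep phase: returns the unreachable ids in sorted order and removes them
/// from `objects` unless `dry_run` is set.
pub fn sweep(objects: &mut Objects, reachable: &HashSet<String>, dry_run: bool) -> Vec<String> {
    let mut garbage: Vec<String> = objects
        .keys()
        .filter(|id| !reachable.contains(*id))
        .cloned()
        .collect();
    garbage.sort();
    if !dry_run {
        for id in &garbage {
            objects.remove(id);
        }
    }
    garbage
}
```

Running `sweep` with `dry_run` set to true reports what would be deleted without touching the store, which mirrors the preview behaviour of `store.gc(true)`.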

Test Coverage

✓ 68 unit tests passing
✓ 1 doctest passing
✓ 100% core functionality coverage
✓ Edge cases: corruption, empty files/dirs, large files, permissions
✓ Round-trip testing: add → store → materialize → verify

Test Categories

  • Hash operations - Encoding, decoding, validation
  • Object encoding - Headers, payload, corruption detection
  • Store operations - Init, open, blob/tree storage
  • Tree operations - Canonical ordering, nested structures
  • Filesystem walking - Files, directories, permissions, gitignore
  • References - CRUD operations, validation
  • Garbage collection - Mark, sweep, dry-run, tree reachability
  • Materialization - Blobs, trees, nested structures, permissions

Limitations (MVP Scope)

The following are intentionally not supported in the current MVP:

  • ❌ Network operations (remote stores)
  • ❌ Multi-user/concurrent access
  • ❌ Content chunking (deduplication works per whole object, not per chunk)
  • ❌ Compression
  • ❌ Encryption
  • ❌ Symbolic links
  • ❌ Special file types (devices, sockets, etc.)
  • ❌ Extended attributes or ACLs beyond basic POSIX permissions

Performance Characteristics

  • Hash algorithm: BLAKE3 (fast, cryptographically secure)
  • I/O: Streaming for large files (no full buffering)
  • Deduplication: Automatic via content addressing
  • GC: Mark & sweep with efficient hash set operations
  • Directory sharding: A two-hex-character prefix spreads objects across up to 256 subdirectories, avoiding very large single directories

Error Handling

All operations return Result<T, Error> with detailed error types:

  • IoError - File system operations
  • CorruptedObject - Hash mismatch or invalid format
  • InvalidHash - Malformed hash string
  • ObjectNotFound - Missing object in store
  • InvalidStore - Store not initialized or corrupted config
  • InvalidRef - Bad reference name or format
  • PathExists - Destination already exists (materialization)

Dependencies

blake3 = "1.5"      # BLAKE3 hashing
hex = "0.4"         # Hash hex encoding/decoding
tempfile = "3.0"    # Atomic object writes
ignore = "0.4"      # Filesystem walking with .gitignore support
thiserror = "2.0"   # Error handling

Contributing

This library is part of the casq project. When contributing:

  1. Ensure all tests pass: cargo test -p casq_core
  2. Maintain clippy cleanliness: cargo clippy -p casq_core -- -D warnings
  3. Format code: cargo fmt -p casq_core
  4. Add tests for new functionality
  5. Update documentation

License

See the workspace LICENSE file for details.

See Also

  • casq - CLI binary using this library (coming soon)
  • NOTES.md - Detailed design and specification
  • CLAUDE.md - Development guidelines for AI assistants