casq_core
A production-ready content-addressed file store (CAS) library with compression and chunking (v0.4.0).
Overview
casq_core is a Rust library that provides the core functionality for casq, a content-addressed storage system with modern efficiency features. It stores files and directories by their cryptographic hash, ensuring immutable, deduplicated storage with transparent compression, content-defined chunking, and built-in garbage collection.
Think of it as a minimal git object store or restic backend, but with compression and chunking for storage efficiency.
Features
- ✅ Content-Addressed Storage - Files and directories stored by BLAKE3 hash
- ✅ Transparent Compression - zstd, typically 3-5x storage reduction on text (applied to files ≥ 4KB)
- ✅ Content-Defined Chunking - FastCDC for incremental backups (files ≥ 1MB)
- ✅ Automatic Deduplication - Identical content stored only once (including chunk-level)
- ✅ Tree-Based Directories - Canonical ordering ensures stable hashes
- ✅ Atomic Operations - Tempfile-based writes prevent corruption
- ✅ Garbage Collection - Mark & sweep algorithm handles all object types
- ✅ Corruption Detection - Hash verification on all reads
- ✅ Named References - GC roots for preserving important snapshots
- ✅ Full Round-Trip - Add → Store → GC → Materialize
- ✅ Cross-Platform - Unix permissions preserved, Windows supported
- ✅ Gitignore Support - Respects .gitignore during filesystem walks
Quick Start
use ;
use Path;
// Initialize a new store
let store = init?;
// Add a file or directory
let hash = store.add_path?;
// Create a named reference (GC root)
store.refs.add?;
// Garbage collect unreferenced objects
let stats = store.gc?;
println!;
// Materialize back to filesystem
store.materialize?;
Architecture
Storage Format
Objects are stored with a 16-byte header followed by the payload.
Object format:
Offset  Size  Field
0x00    4     "CAFS" magic
0x04    1     version (u8) = 2
0x05    1     type: 1=blob, 2=tree, 3=chunk_list
0x06    1     algo: 1=blake3-256
0x07    1     compression: 0=none, 1=zstd
0x08    8     payload_len (u64 LE), the compressed size if compressed
0x10    ...   payload (possibly compressed)
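The header layout above can be sketched in plain Rust using only the standard library. The `Header` struct, its field names, and the `encode_header`/`decode_header` functions are illustrative assumptions; they are not taken from casq_core's `object.rs`, and only the byte layout comes from the table.

```rust
/// Size of the fixed object header in bytes (4 + 1 + 1 + 1 + 1 + 8).
const HEADER_LEN: usize = 16;
const MAGIC: &[u8; 4] = b"CAFS";

/// Parsed header fields. Struct and field names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Header {
    pub version: u8,
    pub object_type: u8,  // 1=blob, 2=tree, 3=chunk_list
    pub algo: u8,         // 1=blake3-256
    pub compression: u8,  // 0=none, 1=zstd
    pub payload_len: u64, // on-disk payload size (compressed size if compressed)
}

/// Serialize the header into its 16-byte on-disk form.
pub fn encode_header(h: &Header) -> [u8; HEADER_LEN] {
    let mut buf = [0u8; HEADER_LEN];
    buf[0..4].copy_from_slice(MAGIC);
    buf[4] = h.version;
    buf[5] = h.object_type;
    buf[6] = h.algo;
    buf[7] = h.compression;
    buf[8..16].copy_from_slice(&h.payload_len.to_le_bytes());
    buf
}

/// Parse a header from the first 16 bytes of an object file.
pub fn decode_header(bytes: &[u8]) -> Result<Header, String> {
    if bytes.len() < HEADER_LEN {
        return Err(format!("header too short: {} bytes", bytes.len()));
    }
    if bytes[0..4] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let mut len = [0u8; 8];
    len.copy_from_slice(&bytes[8..16]);
    Ok(Header {
        version: bytes[4],
        object_type: bytes[5],
        algo: bytes[6],
        compression: bytes[7],
        payload_len: u64::from_le_bytes(len),
    })
}
```

A real decoder would presumably also reject unknown version, type, algo, and compression values; this sketch only checks the length and the magic bytes.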
Directory Structure
$STORE_ROOT/
  config                  # Store configuration (version, algorithm)
  objects/
    blake3-256/           # Algorithm-specific directory
      ab/                 # First 2 hex chars (shard)
        abcd...ef         # Remaining 62 hex chars (object file)
  refs/                   # Named references (GC roots)
    backup-name
Object Types
- Blob - Raw file content (automatically compressed if ≥ 4KB, hash of uncompressed payload)
- Tree - Directory structure (sorted entries by name for canonical hashing)
- ChunkList - Large file metadata (files ≥ 1MB split into variable-size chunks using FastCDC)
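As a rough illustration of how the size thresholds above might select an object representation, here is a minimal sketch. The `Strategy` enum, the `choose_strategy` function, and the constant names are hypothetical, and reading "4KB" and "1MB" as binary units (4096 and 1,048,576 bytes) is an assumption.

```rust
/// Assumed thresholds: the text says "4KB" and "1MB"; binary units are a guess.
const COMPRESSION_THRESHOLD: u64 = 4 * 1024;
const CHUNKING_THRESHOLD: u64 = 1024 * 1024;

/// How a file of a given size would be stored (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Strategy {
    /// Small file: single uncompressed blob.
    RawBlob,
    /// Medium file: single zstd-compressed blob.
    CompressedBlob,
    /// Large file: split into chunks referenced by a ChunkList.
    ChunkList,
}

/// Pick a storage strategy from the file length alone.
pub fn choose_strategy(file_len: u64) -> Strategy {
    if file_len >= CHUNKING_THRESHOLD {
        Strategy::ChunkList
    } else if file_len >= COMPRESSION_THRESHOLD {
        Strategy::CompressedBlob
    } else {
        Strategy::RawBlob
    }
}
```

Whichever strategy is chosen, the object's hash is computed over the original bytes, so callers see the same identifier either way.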
Module Structure
casq_core/src/
├── lib.rs - Public API and documentation
├── error.rs - Error types with thiserror
├── hash.rs - BLAKE3 hashing (32-byte digests)
├── object.rs - Binary object encoding/decoding
├── chunking.rs - Content-defined chunking with FastCDC (v0.4.0+)
├── store.rs - Store management with compression/chunking
├── tree.rs - Tree entry encoding with canonical sorting
├── walk.rs - Filesystem traversal with gitignore
├── gc.rs - Garbage collection (mark & sweep, handles all object types)
├── refs.rs - Reference management
└── journal.rs - Operation journal
Building and Testing
# Build the library
# Run all tests (121 unit tests + 23 property tests + 1 doctest)
# Run only property tests
# Run with output
# Check code quality
# Format code
API Overview
Core Types
- Store - Main store interface
- Hash - 32-byte BLAKE3 hash wrapper
- TreeEntry - File/directory entry in a tree
- RefManager - Manages named references
- GcStats - Garbage collection statistics
Main Operations
// Store initialization
let store = init?;
let store = open?;
// Object storage
let hash = store.put_blob?;
let hash = store.put_tree?;
let hash = store.add_path?; // Recursively add file/dir
// Object retrieval
let data = store.get_blob?;
let entries = store.get_tree?;
store.cat_blob?;
// Materialization
store.materialize?;
// References
store.refs.add?;
let hash = store.refs.get?;
let all_refs = store.refs.list?;
store.refs.remove?;
// Garbage collection
let stats = store.gc?;
Design Principles
- Content-Addressed - Objects are immutable and identified by hash
- Canonical Hashing - Tree entries sorted by name for stable hashes
- Transparent Optimization - Compression and chunking automatic, invisible to API consumers
- Atomic Writes - Use tempfile for corruption-free operations
- Simple Format - Binary format with clear headers, human-inspectable paths
- Efficient Storage - 3-5x compression typical, incremental backups via chunking
- Local-Only - Single-user design, no network features
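The atomic-write principle can be illustrated with the standard library alone: write to a temporary file in the destination directory, flush it, then rename it into place. This is a sketch of the general technique, not casq_core's code (which the text says uses the tempfile crate); `write_atomically` and the temporary file naming are assumptions.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dest` so that readers never observe a partial file.
/// The temp file lives in the same directory so the final rename stays on
/// one filesystem. The ".tmp-<pid>" name is a hypothetical convention and
/// assumes a single writer per process.
pub fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let dir = dest.parent().unwrap_or_else(|| Path::new("."));
    fs::create_dir_all(dir)?;

    let tmp = dir.join(format!(".tmp-{}", std::process::id()));
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush data to disk before publishing the file
    drop(file);

    // The rename replaces any existing file at `dest` in one step.
    fs::rename(&tmp, dest)
}
```

If the process dies before the rename, only the temporary file is left behind and the destination path is untouched.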
Hashing Rules
- Blob hash: hash = blake3(uncompressed_payload_bytes) (payload only, not header)
- ChunkList hash: hash = blake3(original_file_bytes) (not the ChunkList metadata)
- Tree hash: Hash of canonicalized entries (sorted by name, bytewise UTF-8)
- Object path: objects/<algo>/<prefix>/<suffix> where prefix is first 2 hex chars
- Important: Hashes are stable regardless of compression/chunking
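The object path rule can be sketched as a small function that splits a 64-character hex digest into a 2-character shard directory and a 62-character file name. `object_path` is a hypothetical helper written for this example; it is not part of the documented API.

```rust
use std::path::PathBuf;

/// Build the relative path objects/<algo>/<prefix>/<suffix> for a hex digest.
/// Returns None unless `hex_hash` is exactly 64 hex characters, which is
/// the length of a 32-byte digest.
pub fn object_path(algo: &str, hex_hash: &str) -> Option<PathBuf> {
    let valid = hex_hash.len() == 64 && hex_hash.bytes().all(|b| b.is_ascii_hexdigit());
    if !valid {
        return None;
    }
    // First 2 hex chars select the shard directory; the rest name the file.
    let (prefix, suffix) = hex_hash.split_at(2);
    Some(PathBuf::from("objects").join(algo).join(prefix).join(suffix))
}
```

With a 2-character prefix there are at most 256 shard directories, which keeps any single directory from accumulating every object in the store.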
Garbage Collection
- Refs are GC roots stored in refs/ directory
- Mark phase traverses from all refs, recursively following tree entries
- Sweep phase deletes objects not in the reachable set
- Dry-run mode available for safe preview before deletion
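The mark and sweep phases described above can be sketched over a toy in-memory object graph. Everything here is an assumption made for illustration: object ids are plain strings, the `Objects` map stands in for the on-disk store, and `mark`/`sweep` are hypothetical names rather than casq_core functions.

```rust
use std::collections::{HashMap, HashSet};

/// Toy store: object id -> ids of the objects it references
/// (tree entries, or chunks of a chunk list).
pub type Objects = HashMap<String, Vec<String>>;

/// Mark phase: collect every id reachable from the roots (the refs).
pub fn mark(objects: &Objects, roots: &[String]) -> HashSet<String> {
    let mut reachable = HashSet::new();
    let mut stack: Vec<String> = roots.to_vec();
    while let Some(id) = stack.pop() {
        if !reachable.insert(id.clone()) {
            continue; // already visited
        }
        if let Some(children) = objects.get(&id) {
            stack.extend(children.iter().cloned());
        }
    }
    reachable
}

/// Sweep phase: return the unreachable ids in sorted order and, unless
/// `dry_run` is set, delete them from the store.
pub fn sweep(objects: &mut Objects, reachable: &HashSet<String>, dry_run: bool) -> Vec<String> {
    let mut garbage: Vec<String> = objects
        .keys()
        .filter(|id| !reachable.contains(*id))
        .cloned()
        .collect();
    garbage.sort();
    if !dry_run {
        for id in &garbage {
            objects.remove(id);
        }
    }
    garbage
}
```

Running `sweep` with `dry_run` set reports what would be deleted without touching the store, and a second real sweep after the first finds nothing left to delete.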
Test Coverage
✓ 121 unit tests passing (100% pass rate)
✓ 23 property tests (generative invariant verification)
✓ 1 doctest passing
✓ 100% core functionality coverage
✓ Edge cases: corruption, empty files/dirs, large files, permissions
✓ Round-trip testing: add → store → materialize → verify
✓ Compression/chunking: thresholds, boundaries, deduplication
Test Categories
Unit Tests:
- Hash operations - Encoding, decoding, validation
- Object encoding - Headers, payload, compression types
- Chunking - FastCDC boundaries, deterministic chunking, small files
- Store operations - Init, open, blob/tree/chunklist storage, compression
- Tree operations - Canonical ordering, nested structures
- Filesystem walking - Files, directories, permissions, gitignore
- References - CRUD operations, validation
- Garbage collection - Mark, sweep, dry-run, all object types including chunks
- Materialization - Blobs, trees, chunked files, nested structures, permissions
Property Tests:
- Hash determinism - Hashing same data always produces same result
- Serialization round-trips - All binary formats (headers, chunks, trees)
- Compression identity - Compress/decompress preserves data
- Chunking invariants - Size bounds, determinism, total size preservation
- Tree canonicalization - Order-independent hashing
- GC correctness - Preserves referenced, deletes unreferenced, idempotent
- Ref validation - Valid names accepted
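The tree canonicalization property (order-independent hashing) can be demonstrated with a short sketch. To stay dependency-free it uses the standard library's `DefaultHasher` as a stand-in for BLAKE3, which is emphatically not what casq_core uses. The `Entry` type, its fields, and the byte encoding fed to the hasher are likewise assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Simplified tree entry: a name plus an opaque child identifier.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub name: String,
    pub child_id: u64,
}

/// Sort entries by the raw UTF-8 bytes of their names.
pub fn canonicalize(mut entries: Vec<Entry>) -> Vec<Entry> {
    entries.sort_by(|a, b| a.name.as_bytes().cmp(b.name.as_bytes()));
    entries
}

/// Digest of the canonical form. DefaultHasher is a stand-in for BLAKE3.
pub fn tree_digest(entries: Vec<Entry>) -> u64 {
    let mut hasher = DefaultHasher::new();
    for e in canonicalize(entries) {
        hasher.write_u64(e.name.len() as u64); // length prefix avoids ambiguity
        hasher.write(e.name.as_bytes());
        hasher.write_u64(e.child_id);
    }
    hasher.finish()
}
```

Because the entries are sorted before hashing, the order in which a filesystem walk happens to return them does not affect the resulting digest. Bytewise ordering also means uppercase ASCII names sort before lowercase ones.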
Limitations (By Design)
The following are intentionally not supported:
- ❌ Network operations (remote stores)
- ❌ Multi-user/concurrent access
- ❌ Encryption (planned for future)
- ❌ Parallel operations (single-threaded)
- ❌ Symbolic links
- ❌ Special file types (devices, sockets, etc.)
- ❌ Extended attributes or ACLs beyond basic POSIX permissions
Note: Compression and chunking are now supported in v0.4.0+
Performance Characteristics
- Hash algorithm: BLAKE3 (fast, cryptographically secure)
- Compression: Zstd level 3 (~500 MB/s compression/decompression)
- Chunking: FastCDC (~1 GB/s processing)
- I/O: Streaming for large files (no full buffering)
- Deduplication: Automatic via content addressing (including chunk-level)
- GC: Mark & sweep with efficient hash set operations (handles all object types)
- Directory sharding: First 2 hex chars prevent filesystem bottlenecks
Storage Efficiency:
- Compression: 3-5x reduction for text files, 2-3x for mixed data
- Chunking: Change 1 byte in 1GB file → store only ~512KB (changed chunk)
- Cross-file deduplication: Shared content stored only once
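To give a feel for content-defined chunking, here is a deliberately simplified gear-hash chunker. It is not FastCDC and does not reflect casq_core's parameters: the size bounds, the mask, the `gear` mixing function, and `chunk_lengths` are all assumptions chosen to keep the example small.

```rust
/// Illustrative parameters only; real chunk sizes are far larger.
const MIN_CHUNK: usize = 64;
const MAX_CHUNK: usize = 1024;
const MASK: u64 = (1 << 8) - 1;

/// Map a byte to a pseudo-random 64-bit value (splitmix64-style mixer).
fn gear(byte: u8) -> u64 {
    let mut x = (byte as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Split `data` into variable-size chunks and return their lengths.
/// A boundary is cut where the rolling hash matches the mask, so cut
/// points depend on nearby content rather than on absolute offsets.
pub fn chunk_lengths(data: &[u8]) -> Vec<usize> {
    let mut lengths = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear(b));
        let len = i + 1 - start;
        if (len >= MIN_CHUNK && (hash & MASK) == 0) || len >= MAX_CHUNK {
            lengths.push(len);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        lengths.push(data.len() - start); // trailing partial chunk
    }
    lengths
}
```

Because boundaries follow the content, an edit typically disturbs only the chunks near it. Chunks further along tend to realign with their old boundaries and hash to the same values, which is what lets a store skip re-writing them.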
Error Handling
All operations return Result<T, Error> with detailed error types:
- IoError - File system operations
- CorruptedObject - Hash mismatch or invalid format
- InvalidHash - Malformed hash string
- ObjectNotFound - Missing object in store
- InvalidStore - Store not initialized or corrupted config
- InvalidRef - Bad reference name or format
- PathExists - Destination already exists (materialization)
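For readers unfamiliar with this error style, the variants listed above could be modeled roughly as follows. Only the variant names come from the list. The payload types, the message wording, and the hand-written trait impls are assumptions, and the library itself derives its errors with thiserror rather than writing them out like this.

```rust
use std::fmt;
use std::io;
use std::path::PathBuf;

/// Hypothetical shape of the error enum; payloads are guesses.
#[derive(Debug)]
pub enum Error {
    IoError(io::Error),
    CorruptedObject { expected: String, actual: String },
    InvalidHash(String),
    ObjectNotFound(String),
    InvalidStore(String),
    InvalidRef(String),
    PathExists(PathBuf),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::IoError(e) => write!(f, "I/O error: {e}"),
            Error::CorruptedObject { expected, actual } => {
                write!(f, "corrupted object: expected {expected}, got {actual}")
            }
            Error::InvalidHash(s) => write!(f, "invalid hash: {s}"),
            Error::ObjectNotFound(h) => write!(f, "object not found: {h}"),
            Error::InvalidStore(msg) => write!(f, "invalid store: {msg}"),
            Error::InvalidRef(name) => write!(f, "invalid ref: {name}"),
            Error::PathExists(p) => write!(f, "path exists: {}", p.display()),
        }
    }
}

impl std::error::Error for Error {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Error::IoError(e) => Some(e),
            _ => None,
        }
    }
}

/// Lets `?` convert std I/O errors automatically.
impl From<io::Error> for Error {
    fn from(e: io::Error) -> Self {
        Error::IoError(e)
    }
}
```

Callers can then match on the variant to distinguish, for example, a missing object from a corrupted one, instead of parsing error strings.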
Dependencies
= "1.5" # BLAKE3 hashing
= "0.4" # Hash hex encoding/decoding
= "3.0" # Atomic object writes
= "0.4" # Filesystem walking with .gitignore support
= "2.0" # Error handling
= "0.13" # Transparent compression (v0.4.0+)
= "3.1" # Content-defined chunking (v0.4.0+)
Contributing
This library is part of the casq project. When contributing:
- Ensure all tests pass: cargo test -p casq_core
- Maintain clippy cleanliness: cargo clippy -p casq_core -- -D warnings
- Format code: cargo fmt -p casq_core
- Add tests for new functionality
- Update documentation
License
Apache-2.0
See Also
- casq - CLI binary using this library
- NOTES.md - Detailed design and specification
- CLAUDE.md - Development guidelines for AI assistants