casq_core 0.12.0

A minimal content-addressed file store using BLAKE3. (library)
Documentation
# casq_core

A production-ready content-addressed file store (CAS) library with compression and chunking (v0.4.0).

## Overview

`casq_core` is a Rust library that provides the core functionality for **casq**, a content-addressed storage system with modern efficiency features. It stores files and directories by their cryptographic hash, ensuring immutable, deduplicated storage with transparent compression, content-defined chunking, and built-in garbage collection.

Think of it as a minimal git object store or restic backend, but with compression and chunking for storage efficiency.

## Features

- **Content-Addressed Storage** - Files and directories stored by BLAKE3 hash
-**Transparent Compression** - 3-5x storage reduction with zstd (files ≥ 4KB)
-**Content-Defined Chunking** - FastCDC v2020 for incremental backups (files ≥ 1MB, 60-80% chunk reuse after edits)
-**Automatic Deduplication** - Identical content stored only once (including chunk-level)
-**Tree-Based Directories** - Canonical ordering ensures stable hashes
-**Atomic Operations** - Tempfile-based writes prevent corruption
-**Garbage Collection** - Mark & sweep algorithm handles all object types
-**Corruption Detection** - Hash verification on all reads
-**Named References** - GC roots for preserving important snapshots
-**Full Round-Trip** - Add → Store → GC → Materialize
-**Cross-Platform** - Unix permissions preserved, Windows supported
-**Gitignore Support** - Respects `.gitignore` during filesystem walks

## Quick Start

```rust
use casq_core::{Store, Algorithm};
use std::path::Path;

// Initialize a new store
let store = Store::init("./my-store", Algorithm::Blake3)?;

// Add a file or directory
let hash = store.add_path(Path::new("./my-data"))?;

// Create a named reference (GC root)
store.refs().add("backup-2024", &hash)?;

// Garbage collect unreferenced objects
let stats = store.gc(false)?;
println!("Deleted {} objects, freed {} bytes",
         stats.objects_deleted, stats.bytes_freed);

// Materialize back to filesystem
store.materialize(&hash, Path::new("./restored"))?;
```

## Architecture

### Storage Format

Objects are stored with a 16-byte header followed by the payload.

**Object format:**
```
0x00  4   "CAFS" magic
0x04  1   version (u8) = 2
0x05  1   type: 1=blob, 2=tree, 3=chunk_list
0x06  1   algo: 1=blake3-256
0x07  1   compression: 0=none, 1=zstd
0x08  8   payload_len (u64 LE) - compressed size if compressed
0x10  ... payload (possibly compressed)
```

### Directory Structure

```
$STORE_ROOT/
  config               # Store configuration (version, algorithm)
  objects/
    blake3-256/        # Algorithm-specific directory
      ab/              # First 2 hex chars (shard)
        abcd...ef      # Remaining 62 hex chars (object file)
  refs/                # Named references (GC roots)
    backup-name
```

### Object Types

1. **Blob** - Raw file content (automatically compressed if ≥ 4KB, hash of uncompressed payload)
2. **Tree** - Directory structure (sorted entries by name for canonical hashing)
3. **ChunkList** - Large file metadata (files ≥ 1MB split into variable-size chunks using FastCDC v2020)

### Module Structure

```
casq_core/src/
├── lib.rs       - Public API and documentation
├── error.rs     - Error types with thiserror
├── hash.rs      - BLAKE3 hashing (32-byte digests)
├── object.rs    - Binary object encoding/decoding
├── chunking.rs  - Content-defined chunking with FastCDC v2020 (v0.4.0+)
├── store.rs     - Store management with compression/chunking
├── tree.rs      - Tree entry encoding with canonical sorting
├── walk.rs      - Filesystem traversal with gitignore
├── gc.rs        - Garbage collection (mark & sweep, handles all object types)
├── refs.rs      - Reference management
└── journal.rs   - Operation journal
```

## Building and Testing

```bash
# Build the library
cargo build --release -p casq_core

# Run all tests (121 unit tests + 23 property tests + 1 doctest)
cargo test -p casq_core

# Run only property tests
cargo test -p casq_core prop

# Run with output
cargo test -p casq_core -- --nocapture

# Check code quality
cargo clippy -p casq_core -- -D warnings

# Format code
cargo fmt -p casq_core
```

## API Overview

### Core Types

- `Store` - Main store interface
- `Hash` - 32-byte BLAKE3 hash wrapper
- `TreeEntry` - File/directory entry in a tree
- `RefManager` - Manages named references
- `GcStats` - Garbage collection statistics

### Main Operations

```rust
// Store initialization
let store = Store::init(path, Algorithm::Blake3)?;
let store = Store::open(path)?;

// Object storage
let hash = store.put_blob(reader)?;
let hash = store.put_tree(entries)?;
let hash = store.add_path(path)?;  // Recursively add file/dir

// Object retrieval
let data = store.get_blob(&hash)?;
let entries = store.get_tree(&hash)?;
store.cat_blob(&hash, writer)?;

// Materialization
store.materialize(&hash, dest_path)?;

// References
store.refs().add(name, &hash)?;
let hash = store.refs().get(name)?;
let all_refs = store.refs().list()?;
store.refs().remove(name)?;

// Garbage collection
let stats = store.gc(dry_run)?;
```

## Design Principles

1. **Content-Addressed** - Objects are immutable and identified by hash
2. **Canonical Hashing** - Tree entries sorted by name for stable hashes
3. **Transparent Optimization** - Compression and chunking automatic, invisible to API consumers
4. **Atomic Writes** - Use tempfile for corruption-free operations
5. **Simple Format** - Binary format with clear headers, human-inspectable paths
6. **Efficient Storage** - 3-5x compression typical, incremental backups via chunking
7. **Local-Only** - Single-user design, no network features

## Hashing Rules

- **Blob hash**: `hash = blake3(uncompressed_payload_bytes)` (payload only, not header)
- **ChunkList hash**: `hash = blake3(original_file_bytes)` (not the ChunkList metadata)
- **Tree hash**: Hash of canonicalized entries (sorted by name, bytewise UTF-8)
- **Object path**: `objects/<algo>/<prefix>/<suffix>` where prefix is first 2 hex chars
- **Important**: Hashes are stable regardless of compression/chunking

## Garbage Collection

- **Refs** are GC roots stored in `refs/` directory
- **Mark phase** traverses from all refs, recursively following tree entries
- **Sweep phase** deletes objects not in the reachable set
- **Dry-run mode** available for safe preview before deletion

## Test Coverage

```
✓ 121 unit tests passing (100% pass rate)
✓ 23 property tests (generative invariant verification)
✓ 1 doctest passing
✓ 100% core functionality coverage
✓ Edge cases: corruption, empty files/dirs, large files, permissions
✓ Round-trip testing: add → store → materialize → verify
✓ Compression/chunking: thresholds, boundaries, deduplication
```

### Test Categories

**Unit Tests:**
- **Hash operations** - Encoding, decoding, validation
- **Object encoding** - Headers, payload, compression types
- **Chunking** - FastCDC boundaries, deterministic chunking, small files
- **Store operations** - Init, open, blob/tree/chunklist storage, compression
- **Tree operations** - Canonical ordering, nested structures
- **Filesystem walking** - Files, directories, permissions, gitignore
- **References** - CRUD operations, validation
- **Garbage collection** - Mark, sweep, dry-run, all object types including chunks
- **Materialization** - Blobs, trees, chunked files, nested structures, permissions

**Property Tests:**
- **Hash determinism** - Hashing same data always produces same result
- **Serialization round-trips** - All binary formats (headers, chunks, trees)
- **Compression identity** - Compress/decompress preserves data
- **Chunking invariants** - Size bounds, determinism, total size preservation
- **Tree canonicalization** - Order-independent hashing
- **GC correctness** - Preserves referenced, deletes unreferenced, idempotent
- **Ref validation** - Valid names accepted

## Limitations (By Design)

The following are intentionally **not supported**:

- ❌ Network operations (remote stores)
- ❌ Multi-user/concurrent access
- ❌ Encryption (planned for future)
- ❌ Parallel operations (single-threaded)
- ❌ Symbolic links
- ❌ Special file types (devices, sockets, etc.)
- ❌ Extended attributes or ACLs beyond basic POSIX permissions

**Note**: Compression and chunking are now supported in v0.4.0+

## Performance Characteristics

- **Hash algorithm**: BLAKE3 (fast, cryptographically secure)
- **Compression**: Zstd level 3 (~500 MB/s compression/decompression)
- **Chunking**: FastCDC (~1 GB/s processing)
- **I/O**: Streaming for large files (no full buffering)
- **Deduplication**: Automatic via content addressing (including chunk-level)
- **GC**: Mark & sweep with efficient hash set operations (handles all object types)
- **Directory sharding**: First 2 hex chars prevent filesystem bottlenecks

**Storage Efficiency:**
- Compression: 3-5x reduction for text files, 2-3x for mixed data
- Chunking: Change 1 byte in 1GB file → store only ~512KB (changed chunk)
- Cross-file deduplication: Shared content stored only once

## Error Handling

All operations return `Result<T, Error>` with detailed error types:

- `IoError` - File system operations
- `CorruptedObject` - Hash mismatch or invalid format
- `InvalidHash` - Malformed hash string
- `ObjectNotFound` - Missing object in store
- `InvalidStore` - Store not initialized or corrupted config
- `InvalidRef` - Bad reference name or format
- `PathExists` - Destination already exists (materialization)

## Dependencies

```toml
blake3 = "1.5"      # BLAKE3 hashing
hex = "0.4"         # Hash hex encoding/decoding
tempfile = "3.0"    # Atomic object writes
ignore = "0.4"      # Filesystem walking with .gitignore support
thiserror = "2.0"   # Error handling
zstd = "0.13"       # Transparent compression (v0.4.0+)
fastcdc = "3.1"     # Content-defined chunking (v0.4.0+)
```

## Contributing

This library is part of the **casq** project. When contributing:

1. Ensure all tests pass: `cargo test -p casq_core`
2. Maintain clippy cleanliness: `cargo clippy -p casq_core -- -D warnings`
3. Format code: `cargo fmt -p casq_core`
4. Add tests for new functionality
5. Update documentation

## License

> Apache-2.0

## See Also

- [**casq**]https://crates.io/crates/casq - CLI binary using this library
- **NOTES.md** - Detailed design and specification
- **CLAUDE.md** - Development guidelines for AI assistants