casq_core 0.2.2

A minimal content-addressed file store using BLAKE3. (library)
Documentation
# casq_core

A content-addressed file store (CAS) library using BLAKE3 hashing.

## Overview

`casq_core` is a Rust library that provides the core functionality for **casq**, a minimal, single-binary content-addressed storage system. It stores files and directories by their cryptographic hash, ensuring immutable, deduplicated storage with built-in garbage collection.

Think of it as a minimal git object store or restic backend, but generic and simple.

## Features

- **Content-Addressed Storage** - Files and directories stored by BLAKE3 hash
-**Automatic Deduplication** - Identical content stored only once
-**Tree-Based Directories** - Canonical ordering ensures stable hashes
-**Atomic Operations** - Tempfile-based writes prevent corruption
-**Garbage Collection** - Mark & sweep algorithm from reference roots
-**Corruption Detection** - Hash verification on all reads
-**Named References** - GC roots for preserving important snapshots
-**Full Round-Trip** - Add → Store → GC → Materialize
-**Cross-Platform** - Unix permissions preserved, Windows supported
-**Gitignore Support** - Respects `.gitignore` during filesystem walks

## Quick Start

```rust
use casq_core::{Store, Algorithm};
use std::path::Path;

// Initialize a new store
let store = Store::init("./my-store", Algorithm::Blake3)?;

// Add a file or directory
let hash = store.add_path(Path::new("./my-data"))?;

// Create a named reference (GC root)
store.refs().add("backup-2024", &hash)?;

// Garbage collect unreferenced objects
let stats = store.gc(false)?;
println!("Deleted {} objects, freed {} bytes",
         stats.objects_deleted, stats.bytes_freed);

// Materialize back to filesystem
store.materialize(&hash, Path::new("./restored"))?;
```

## Architecture

### Storage Format

Objects are stored with a 16-byte header followed by the payload:

```
0x00  4   "CAFS" magic
0x04  1   version (u8)
0x05  1   type: 1=blob, 2=tree
0x06  1   algo: 1=blake3-256
0x07  1   reserved (must be 0)
0x08  8   payload_len (u64 LE)
0x10  ... payload
```

### Directory Structure

```
$STORE_ROOT/
  config               # Store configuration (version, algorithm)
  objects/
    blake3-256/        # Algorithm-specific directory
      ab/              # First 2 hex chars (shard)
        abcd...ef      # Remaining 62 hex chars (object file)
  refs/                # Named references (GC roots)
    backup-name
```

### Object Types

1. **Blob** - Raw file content (hash of payload only)
2. **Tree** - Directory structure (sorted entries by name for canonical hashing)

### Module Structure

```
casq_core/src/
├── lib.rs       - Public API and documentation
├── error.rs     - Error types with thiserror
├── hash.rs      - BLAKE3 hashing (32-byte digests)
├── object.rs    - Binary object encoding/decoding
├── store.rs     - Store management and object I/O
├── tree.rs      - Tree entry encoding with canonical sorting
├── walk.rs      - Filesystem traversal with gitignore
├── gc.rs        - Garbage collection (mark & sweep)
└── refs.rs      - Reference management
```

## Building and Testing

```bash
# Build the library
cargo build --release -p casq_core

# Run all tests (68 unit tests + 1 doctest)
cargo test -p casq_core

# Run with output
cargo test -p casq_core -- --nocapture

# Check code quality
cargo clippy -p casq_core -- -D warnings

# Format code
cargo fmt -p casq_core
```

## API Overview

### Core Types

- `Store` - Main store interface
- `Hash` - 32-byte BLAKE3 hash wrapper
- `TreeEntry` - File/directory entry in a tree
- `RefManager` - Manages named references
- `GcStats` - Garbage collection statistics

### Main Operations

```rust
// Store initialization
let store = Store::init(path, Algorithm::Blake3)?;
let store = Store::open(path)?;

// Object storage
let hash = store.put_blob(reader)?;
let hash = store.put_tree(entries)?;
let hash = store.add_path(path)?;  // Recursively add file/dir

// Object retrieval
let data = store.get_blob(&hash)?;
let entries = store.get_tree(&hash)?;
store.cat_blob(&hash, writer)?;

// Materialization
store.materialize(&hash, dest_path)?;

// References
store.refs().add(name, &hash)?;
let hash = store.refs().get(name)?;
let all_refs = store.refs().list()?;
store.refs().remove(name)?;

// Garbage collection
let stats = store.gc(dry_run)?;
```

## Design Principles

1. **Content-Addressed** - Objects are immutable and identified by hash
2. **Canonical Hashing** - Tree entries sorted by name for stable hashes
3. **Atomic Writes** - Use tempfile for corruption-free operations
4. **Simple Format** - Binary format with clear headers, human-inspectable paths
5. **Minimal Dependencies** - Only essential crates (blake3, hex, tempfile, ignore, thiserror)
6. **No Network** - Local-only, single-user design (MVP scope)

## Hashing Rules

- **Blob hash**: `hash = blake3(payload_bytes)` (payload only, not header)
- **Tree hash**: Hash of canonicalized entries (sorted by name, bytewise UTF-8)
- **Object path**: `objects/<algo>/<prefix>/<suffix>` where prefix is first 2 hex chars

## Garbage Collection

- **Refs** are GC roots stored in `refs/` directory
- **Mark phase** traverses from all refs, recursively following tree entries
- **Sweep phase** deletes objects not in the reachable set
- **Dry-run mode** available for safe preview before deletion

## Test Coverage

```
✓ 68 unit tests passing
✓ 1 doctest passing
✓ 100% core functionality coverage
✓ Edge cases: corruption, empty files/dirs, large files, permissions
✓ Round-trip testing: add → store → materialize → verify
```

### Test Categories

- **Hash operations** - Encoding, decoding, validation
- **Object encoding** - Headers, payload, corruption detection
- **Store operations** - Init, open, blob/tree storage
- **Tree operations** - Canonical ordering, nested structures
- **Filesystem walking** - Files, directories, permissions, gitignore
- **References** - CRUD operations, validation
- **Garbage collection** - Mark, sweep, dry-run, tree reachability
- **Materialization** - Blobs, trees, nested structures, permissions

## Limitations (MVP Scope)

The following are intentionally **not supported** in the current MVP:

- ❌ Network operations (remote stores)
- ❌ Multi-user/concurrent access
- ❌ Content chunking or deduplication
- ❌ Compression
- ❌ Encryption
- ❌ Symbolic links
- ❌ Special file types (devices, sockets, etc.)
- ❌ Extended attributes or ACLs beyond basic POSIX permissions

## Performance Characteristics

- **Hash algorithm**: BLAKE3 (fast, cryptographically secure)
- **I/O**: Streaming for large files (no full buffering)
- **Deduplication**: Automatic via content addressing
- **GC**: Mark & sweep with efficient hash set operations
- **Directory sharding**: First 2 hex chars prevent filesystem bottlenecks

## Error Handling

All operations return `Result<T, Error>` with detailed error types:

- `IoError` - File system operations
- `CorruptedObject` - Hash mismatch or invalid format
- `InvalidHash` - Malformed hash string
- `ObjectNotFound` - Missing object in store
- `InvalidStore` - Store not initialized or corrupted config
- `InvalidRef` - Bad reference name or format
- `PathExists` - Destination already exists (materialization)

## Dependencies

```toml
blake3 = "1.5"      # BLAKE3 hashing
hex = "0.4"         # Hash hex encoding/decoding
tempfile = "3.0"    # Atomic object writes
ignore = "0.4"      # Filesystem walking with .gitignore support
thiserror = "2.0"   # Error handling
```

## Contributing

This library is part of the **casq** project. When contributing:

1. Ensure all tests pass: `cargo test -p casq_core`
2. Maintain clippy cleanliness: `cargo clippy -p casq_core -- -D warnings`
3. Format code: `cargo fmt -p casq_core`
4. Add tests for new functionality
5. Update documentation

## License

See the workspace LICENSE file for details.

## See Also

- **casq** - CLI binary using this library (coming soon)
- **NOTES.md** - Detailed design and specification
- **CLAUDE.md** - Development guidelines for AI assistants