# hexz-core
Core engine for high-performance data streaming with compression and deduplication.
## Overview
hexz-core is the heart of the Hexz system—a seekable, deduplicated compression engine that enables random access to compressed data without decompressing entire archives. It provides block-level compression, content-defined chunking for deduplication, and pluggable storage backends for local files, HTTP, and S3.
This crate contains no UI code; all user interfaces (CLI, Python bindings, FUSE) are in separate crates.
## Architecture

```text
hexz-core/
├── algo/            # Compression, dedup, encryption algorithms
│   ├── compression/ # LZ4, Zstandard
│   ├── encryption/  # AES-256-GCM
│   ├── dedup/       # FastCDC, content-defined chunking
│   └── hash/        # CRC32, BLAKE3
├── cache/           # LRU cache with prefetching
│   └── lru.rs       # Block and index page caching
├── format/          # File format handling
│   ├── header.rs    # Snapshot metadata (512 bytes)
│   ├── index.rs     # Hierarchical index structures
│   └── block.rs     # Compressed block format
├── store/           # Storage backends (local, HTTP, S3)
│   ├── local/       # FileBackend, MmapBackend
│   ├── http/        # Remote streaming over HTTP
│   └── s3/          # AWS S3/compatible object storage
├── api/             # Public API surface
│   └── file.rs      # Main entry point: File
└── ops/             # High-level operations
    └── pack/        # Create snapshots from raw data
```
## Quick Example

### Reading a Local Snapshot

```rust
use std::sync::Arc;

// Import paths are inferred from the module layout above; adjust them to
// the crate's actual re-exports.
use hexz_core::algo::compression::Lz4Compressor;
use hexz_core::api::file::File;
use hexz_core::store::local::FileBackend;
```
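With those imports in place, opening a snapshot and reading a byte range might look roughly like the sketch below; `FileBackend::open`, `File::open`, `Lz4Compressor::default`, and `read_at` are illustrative names, not confirmed API.

```rust
// Sketch only: constructor and method names are assumptions.
let backend = Arc::new(FileBackend::open("snapshot.hexz")?);
let compressor = Arc::new(Lz4Compressor::default());
let file = File::open(backend, compressor)?;

// Random access: read 4 KiB starting at offset 1 MiB without
// decompressing the rest of the archive.
let bytes = file.read_at(1024 * 1024, 4096)?;
```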
### Streaming from HTTP

```rust
use std::sync::Arc;

// As above, import paths are inferred from the module layout.
use hexz_core::algo::compression::Lz4Compressor;
use hexz_core::api::file::File;
use hexz_core::store::http::HttpBackend;
```
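A streaming read over HTTP might then look like the following sketch; `HttpBackend::new` and the other names are again assumptions, not confirmed API.

```rust
// Sketch only: names are assumptions. Blocks are fetched on demand over
// HTTP and cached, so only the ranges you read are transferred.
let backend = Arc::new(HttpBackend::new("https://example.com/snapshot.hexz")?);
let file = File::open(backend, Arc::new(Lz4Compressor::default()))?;
let bytes = file.read_at(0, 64 * 1024)?;
```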
## Key Features
- Random Access: Read any byte range without decompressing the entire archive
- Block-Level Compression: LZ4 (~2GB/s) or Zstandard (~500MB/s) with independent blocks
- Content-Defined Deduplication: FastCDC chunking automatically eliminates duplicate blocks
- Remote Streaming: Stream from HTTP/S3 with intelligent block prefetching
- Encryption: Optional AES-256-GCM block-level encryption
- Thin Snapshots: Parent references for incremental backups
- Thread-Safe: `File` is `Send + Sync` with concurrent read support
- Low Latency: ~1 ms cold cache, ~0.08 ms warm cache random access
- Pluggable Backends: Uniform API for local files, memory-mapped files, HTTP, and S3
## File Format
Hexz snapshots consist of:
- Header (512 bytes): Metadata, compression algorithm, encryption info, parent path
- Data Blocks: Variable-size compressed blocks (typically 64KB-256KB)
- Index Pages: Hierarchical B-tree-like index for fast lookups
- Master Index: Points to root index page (location stored in header)
Each block is independently compressed and checksummed (CRC32), enabling:
- Parallel decompression
- Random access to individual blocks
- Block-level integrity verification
See the file format specification for the full details.
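As a rough orientation, the layout described above maps onto something like the Rust sketch below; the field names, widths, and ordering are illustrative assumptions, and the file format specification remains the authoritative source.

```rust
/// Illustrative sketch of the 512-byte header described above.
/// Field names and sizes are assumptions, not the actual on-disk format.
#[repr(C)]
struct SnapshotHeader {
    magic: [u8; 8],            // format identifier
    version: u32,              // format version
    compression: u8,           // e.g. LZ4 or Zstandard
    encryption: u8,            // e.g. none or AES-256-GCM
    block_size: u32,           // blocks are typically 64 KB-256 KB
    master_index_offset: u64,  // location of the root index page
    master_index_len: u64,     // size of the root index page
    parent_path: [u8; 256],    // parent reference for thin snapshots
    // ... reserved/padding up to 512 bytes
}
```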
## Performance Characteristics
| Metric | Value |
|---|---|
| Compression (LZ4) | ~2 GB/s |
| Compression (Zstd) | ~500 MB/s |
| Random Access (cold) | ~1 ms |
| Random Access (warm) | ~0.08 ms |
| Sequential Read | ~2-3 GB/s (NVMe + LZ4) |
| Memory Usage | <150 MB (configurable) |
| Deduplication | Up to 40% storage savings |
## Storage Backends
All backends implement the StorageBackend trait:
- FileBackend: Standard file I/O
- MmapBackend: Memory-mapped files (zero-copy reads)
- HttpBackend: Remote streaming via HTTP/HTTPS
- S3Backend: AWS S3 and S3-compatible object storage
Higher layers (API, cache, decompression) don't know where data comes from—all backends provide the same interface.
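As a minimal sketch of that uniform interface, the trait plausibly has roughly this shape; the method names and signatures here are assumptions, not the crate's actual definition.

```rust
use std::io;

// Illustrative sketch only; the real StorageBackend trait's methods and
// signatures may differ.
pub trait StorageBackend: Send + Sync {
    /// Fill `buf` with bytes starting at `offset`, returning the count read.
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize>;

    /// Total size in bytes of the underlying snapshot object.
    fn len(&self) -> io::Result<u64>;
}
```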
## Remote Streaming: When to Use It
Remote streaming (HTTP/S3) allows training without downloading datasets first. However, it's not always the right choice.
### ✅ Remote Streaming Works Well When:

- Dataset fits in cache (most important!)
  - Example: 50 GB dataset, 64 GB cache → 95%+ hit rate after epoch 1
  - First epoch is slow (cold cache), subsequent epochs are fast
- High bandwidth available (10+ Gbps datacenter links)
- Prototyping/experimentation (one-off runs, convenience matters)
- Storage is expensive (cloud VMs with limited/costly local disk)
Example performance:

```text
Dataset: 50 GB, Cache: 64 GB, Bandwidth: 10 Gbps
Epoch 1:     ~2 hours (streaming from S3, cold cache)
Epoch 2-100: ~10 min each (95%+ cache hits, rarely hits network)
Total:       ~18.5 hours
```
### ❌ Remote Streaming Fails When:

- Dataset >> cache size (cache thrashing, slow every epoch)
  - Example: 2 TB dataset, 64 GB cache → <5% hit rate, constant streaming
  - Every epoch becomes a full network transfer (hours per epoch)
- Limited bandwidth (< 1 Gbps residential/cloud egress)
- Repeated training (the one-time download cost amortizes over many runs, so downloading wins)
- Production workloads (need predictable, fast performance)
Example performance (BAD):

```text
Dataset: 2 TB, Cache: 64 GB (3% fits), Bandwidth: 1 Gbps
Every epoch: ~4.5 hours of data transfer
100 epochs:  450 hours = 18.75 days 😱

vs. Download once (4.5 h) + train (17 h) = 21.5 hours total
```
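The ~4.5 hours per epoch is plain bandwidth arithmetic; a quick sanity check with the example's numbers:

```rust
// 2 TB over a 1 Gbps link: bytes * 8 bits / link speed = seconds per pass.
let dataset_bytes = 2.0e12_f64;
let link_bits_per_sec = 1.0e9_f64;
let hours_per_epoch = dataset_bytes * 8.0 / link_bits_per_sec / 3600.0;
assert!((hours_per_epoch - 4.44).abs() < 0.1); // ~4.5 hours per full pass
```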
### Rule of Thumb
- Dataset < cache size: Remote streaming is viable
- Dataset < 2x cache: Marginal, depends on bandwidth and use case
- Dataset > 2x cache: Download to local NVMe/SSD first
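The same heuristic written out as code; the 2x threshold below is the rule of thumb above, not a measured constant.

```rust
/// Rough guidance only; mirrors the rule of thumb above.
enum StreamingAdvice {
    Stream,        // dataset fits in cache
    Marginal,      // depends on bandwidth and use case
    DownloadFirst, // cache thrashing; copy to local NVMe/SSD instead
}

fn remote_streaming_advice(dataset_bytes: u64, cache_bytes: u64) -> StreamingAdvice {
    if dataset_bytes < cache_bytes {
        StreamingAdvice::Stream
    } else if dataset_bytes < 2 * cache_bytes {
        StreamingAdvice::Marginal
    } else {
        StreamingAdvice::DownloadFirst
    }
}
```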
### Alternatives for Large Datasets

- Download to local storage (fastest for repeated training)
- Subset/curriculum training (stream only what you need), e.g. train on 10% subsets, a different subset each phase
- Hierarchical caching (future: RAM + SSD + remote): L1: 64 GB RAM (hot) → L2: 1 TB SSD (warm) → L3: S3 (cold)
Remote streaming is a convenience feature, not a replacement for local storage at TB scale. hexz's core value is compression + dedup + fast random access, with remote streaming as a nice-to-have for datasets that fit in cache.
## Compression & Encryption
Pluggable algorithms via traits:
### Compression
- LZ4: Fast compression (~2GB/s), good for real-time workloads
- Zstandard: Better ratios (~500MB/s), configurable compression levels
### Encryption (optional)
- AES-256-GCM: Authenticated encryption with key derivation (PBKDF2)
- Each block encrypted independently
- Metadata encrypted separately
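Both compression and encryption are wired in through traits. As a sketch of what that means in practice, a compressor trait plausibly has roughly this shape; the name and signatures are assumptions, and the concrete LZ4 and Zstandard implementations live under algo/compression/.

```rust
// Illustrative sketch only; the real trait name and signatures may differ.
pub trait Compressor: Send + Sync {
    /// Compress a single block. Blocks are independent, so calls can run
    /// in parallel across threads.
    fn compress(&self, input: &[u8]) -> Vec<u8>;

    /// Decompress a single block produced by `compress`.
    fn decompress(&self, input: &[u8]) -> std::io::Result<Vec<u8>>;
}
```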
## Thread Safety

`File` is `Send + Sync` and can be safely wrapped in `Arc` for multi-threaded access. In the sketch below, `read_at` is an illustrative method name and the snapshot is assumed to have been opened as in the Quick Example:

```rust
use std::sync::Arc;
use std::thread;

// `snapshot` is a File opened as in the Quick Example above.
let snapshot = Arc::new(snapshot);

let handles: Vec<_> = (0..4)
    .map(|i| {
        let snapshot = Arc::clone(&snapshot);
        thread::spawn(move || snapshot.read_at((i as u64) * 64 * 1024, 4096))
    })
    .collect();

for handle in handles {
    handle.join().unwrap();
}
```
## Development
All development commands use the project Makefile. From the repository root:
### Building

```bash
# Build entire workspace (includes hexz-core)
# Build in debug mode for faster compilation
# Build with specific features
```

### Testing

```bash
# Run all tests (Rust + Python)
# Run only Rust tests
# Run tests with filter
# Or use cargo directly for this crate
```

### Linting & Formatting

```bash
# Format all code
# Check formatting + clippy
# Run clippy with strict lints
```

### Benchmarks

```bash
# Run all benchmarks
# Run specific benchmark
# Compare against archived baseline
```

See `make help` for all available commands.
## Cargo Features

- `default`: `["compression-zstd", "encryption", "s3"]`
- `compression-zstd`: Zstandard compression support
- `encryption`: AES-256-GCM encryption
- `s3`: S3 storage backend
## See Also
- User Documentation - Tutorials, how-to guides, explanations
- API Documentation - Full API reference on docs.rs
- CLI Tool - Command-line interface for creating snapshots
- Python Bindings - PyTorch integration for ML workflows
- Project README - Main project overview