# chunkrs
[crates.io](https://crates.io/crates/chunkrs) · [docs.rs](https://docs.rs/chunkrs) · [MIT license](LICENSE) · [Rust 1.85+](https://blog.rust-lang.org/2024/02/28/Rust-1.85.0.html) · [Safety Dance](https://github.com/rust-secure-code/safety-dance/)
> **Deterministic, streaming Content-Defined Chunking (CDC) for Rust**
`chunkrs` provides byte-stream chunking for delta synchronization, deduplication, and content-addressable storage. It prioritizes **correctness, determinism, and composability** over clever parallelism tricks.
**Core principle**: *CDC is inherently serial—parallelize at the application level, not within the stream.*
## Features
- **Streaming-first**: Processes multi-GB files with constant memory (no full-file buffering)
- **Deterministic-by-design**: Identical bytes always produce identical chunk hashes, regardless of batching or execution timing
- **Zero-allocation hot path**: Thread-local buffer pools eliminate allocator contention under load
- **FastCDC algorithm**: Gear hash rolling boundary detection with configurable min/avg/max sizes
- **BLAKE3 identity**: Cryptographic chunk hashing (optional, incremental)
- **Runtime-agnostic async**: Works with Tokio, async-std, or any `futures-io` runtime
- **Strictly safe**: `#![forbid(unsafe_code)]`
## When to Use chunkrs
| Use case | Fit |
|---|---|
| Delta sync (rsync-style) | ✅ Perfect fit |
| Backup tools | ✅ Ideal for single-stream chunking |
| Deduplication (CAS) | ✅ Use with your own index |
| NVMe Gen4/5 saturation | ✅ 3–5 GB/s per core |
| Distributed dedup | ✅ Stateless, easy to distribute |
| Any other CDC use case | ✅ Likely fits |
## Architecture
chunkrs processes **one logical byte stream at a time** with strictly serial CDC state:
```text
┌───────────────┐     ┌──────────────┐     ┌──────────────────┐
│  Input Byte   │     │ I/O Batching │     │ Serial CDC State │
│  Stream       │────▶│ (8KB buffers │────▶│ Machine          │
│ (any io::Read │     │ for syscall  │     │ (FastCDC rolling │
│ or AsyncRead) │     │ efficiency)  │     │ hash)            │
└───────────────┘     └──────────────┘     └─────────┬────────┘
                                                     │
        ┌────────────────────────────────────────────┘
        ▼
  ┌─────────────┐     ┌───────────────────┐
  │    Chunk    │     │ Chunk {           │
  │    Stream   │────▶│   data: Bytes,    │
  │             │     │   offset: u64,    │
  └─────────────┘     │   hash: ChunkHash │
                      │ }                 │
                      └───────────────────┘
```
## Quick Start
```toml
[dependencies]
chunkrs = "0.1"
```
```rust
use std::fs::File;

use chunkrs::{ChunkConfig, Chunker};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.bin")?;
    let chunker = Chunker::new(ChunkConfig::default());

    for chunk in chunker.chunk(file) {
        let chunk = chunk?;
        println!(
            "offset: {:?}, len: {}, hash: {:?}",
            chunk.offset,
            chunk.len(),
            chunk.hash
        );
    }
    Ok(())
}
```
**What's in the Chunk Stream:**
Each element is a `Chunk` containing:
- **`data`**: `Bytes` — the chunk payload (a zero-copy reference when possible), ready for downstream use such as writing to disk
- **`offset`**: `Option<u64>` — byte position in the original stream
- **`hash`**: `Option<ChunkHash>` — BLAKE3 hash for content identity (if enabled)
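Put together, `Chunk` plausibly looks like the sketch below. This is an illustrative shape inferred from the fields above, not the crate's literal definition; the `ChunkHash` newtype here stands in for the crate's 32-byte digest type.

```rust
use bytes::Bytes;

/// Stand-in for the crate's 32-byte BLAKE3 digest type (see Core Types).
pub struct ChunkHash(pub [u8; 32]);

/// Illustrative shape of `Chunk`, inferred from the fields documented above.
pub struct Chunk {
    pub data: Bytes,             // chunk payload, zero-copy when possible
    pub offset: Option<u64>,     // byte position in the original stream
    pub hash: Option<ChunkHash>, // BLAKE3 identity, when hashing is enabled
}
```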
## API Overview
### Core Types
| Type | Description |
|---|---|
| `Chunker` | Stateful CDC engine (maintains rolling hash across batches) |
| `Chunk` | Content-addressed block with `Bytes` payload and optional BLAKE3 hash |
| `ChunkHash` | 32-byte BLAKE3 hash identifying chunk content |
| `ChunkConfig` | Min/avg/max chunk sizes and hash configuration |
| `ChunkIter` | Iterator over chunks (sync) |
| `ChunkError` | Error type for chunking operations |
### Synchronous Usage
```rust
use chunkrs::{ChunkConfig, Chunker};

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // From a file
    let file = std::fs::File::open("data.bin")?;
    let chunker = Chunker::new(ChunkConfig::default());
    for chunk in chunker.chunk(file) {
        let chunk = chunk?;
        // chunk.data:   Bytes             - the chunk payload
        // chunk.offset: Option<u64>       - position in the original stream
        // chunk.hash:   Option<ChunkHash> - BLAKE3 hash (if enabled)
    }

    // From memory (a fresh chunker: CDC state is per-stream)
    let data: Vec<u8> = vec![0u8; 1024 * 1024];
    let chunks: Vec<_> = Chunker::new(ChunkConfig::default()).chunk_bytes(data);
    Ok(())
}
```
### Asynchronous Usage
Runtime-agnostic via `futures-io`:
```rust
use futures_util::StreamExt;
use chunkrs::{ChunkConfig, ChunkError};
async fn process<R: futures_io::AsyncRead + Unpin>(reader: R) -> Result<(), ChunkError> {
    let mut stream = chunkrs::chunk_async(reader, ChunkConfig::default());
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        // process chunk.data / chunk.hash here
    }
    Ok(())
}
```
**Tokio compatibility:**
```rust
use tokio::fs::File;
use tokio_util::compat::TokioAsyncReadCompatExt;

async fn from_tokio() -> std::io::Result<()> {
    let file = File::open("data.bin").await?;
    let _stream = chunkrs::chunk_async(file.compat(), chunkrs::ChunkConfig::default());
    Ok(())
}
```
## Configuration
### Chunk Sizes
Choose based on your deduplication granularity needs:
```rust
use chunkrs::ChunkConfig;

fn configs() -> Result<(), Box<dyn std::error::Error>> {
    // Small files / high dedup (8 KiB average)
    let small = ChunkConfig::new(2 * 1024, 8 * 1024, 32 * 1024)?;

    // Default (16 KiB average) - good general purpose
    let default = ChunkConfig::default();

    // Large files / high throughput (256 KiB average)
    let large = ChunkConfig::new(64 * 1024, 256 * 1024, 1024 * 1024)?;
    Ok(())
}
```
### Hash Configuration
```rust
use chunkrs::{ChunkConfig, HashConfig};
// With BLAKE3 (default)
let with_hash = ChunkConfig::default();
// Boundary detection only (faster, no content identity)
let no_hash = ChunkConfig::default().with_hash_config(HashConfig::disabled());
```
## Performance
**Throughput targets on modern hardware:**
| Medium | Target throughput | Bottleneck |
|---|---|---|
| NVMe Gen4 | ~3–5 GB/s | CPU (hashing) |
| NVMe Gen5 | ~3–5 GB/s | CDC algorithm |
| SATA SSD | ~500 MB/s | Storage |
| 10 Gbps LAN | ~1.2 GB/s | Network |
| HDD | ~200 MB/s | Seek latency |
**Memory usage:**
- Constant: `O(batch_size)`, typically 4–16 MB per stream
- Thread-local cache: ~64 MB per thread (reused across streams)
**To saturate NVMe Gen5:**
Process multiple files concurrently (application-level parallelism). Do not attempt to parallelize within a single file—this destroys deduplication ratios.
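A minimal sketch of that pattern, assuming the `Chunker` API from the Quick Start (a real tool would bound concurrency with a thread pool rather than spawning one thread per file):

```rust
use std::{fs::File, path::PathBuf, thread};

use chunkrs::{ChunkConfig, Chunker};

// One chunker per file: each stream keeps its own serial CDC state,
// so per-file parallelism never perturbs chunk boundaries.
fn chunk_files(paths: Vec<PathBuf>) {
    let workers: Vec<_> = paths
        .into_iter()
        .map(|path| {
            thread::spawn(move || {
                let file = File::open(&path).expect("open failed");
                let chunker = Chunker::new(ChunkConfig::default());
                for chunk in chunker.chunk(file) {
                    let chunk = chunk.expect("chunking failed");
                    // hand chunk.hash / chunk.data to your index or store here
                    let _ = chunk;
                }
            })
        })
        .collect();
    for worker in workers {
        worker.join().expect("worker thread panicked");
    }
}
```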
## Determinism Guarantees
chunkrs guarantees **content-addressable identity**:
- **Strong guarantee**: Identical byte streams produce identical `ChunkHash` (BLAKE3) values
- **Boundary stability**: For identical inputs and configurations, chunk boundaries are deterministic across different batch sizes or execution timings
- **Serial consistency**: Rolling hash state is strictly maintained across batch boundaries
**What this means:**
You can re-chunk a file on Tuesday with different I/O batch sizes and get bit-identical chunks to Monday's run. This is essential for delta sync correctness.
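A quick way to check this property in your own integration tests, sketched with the `chunk_bytes` API shown earlier (assumes `chunk_bytes` returns the chunk list as in the sync example and that `ChunkHash` is comparable):

```rust
use chunkrs::{ChunkConfig, Chunker};

// Two independent runs over the same bytes must agree on every chunk hash.
fn check_determinism(data: Vec<u8>) {
    let run_a: Vec<_> = Chunker::new(ChunkConfig::default()).chunk_bytes(data.clone());
    let run_b: Vec<_> = Chunker::new(ChunkConfig::default()).chunk_bytes(data);
    let hashes_a: Vec<_> = run_a.iter().map(|c| &c.hash).collect();
    let hashes_b: Vec<_> = run_b.iter().map(|c| &c.hash).collect();
    assert_eq!(hashes_a, hashes_b);
}
```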
## Safety & Correctness
- **No unsafe code**: `#![forbid(unsafe_code)]`
- **Comprehensive testing**: Unit tests, doc tests, and property-based tests ensure:
- Determinism invariants
- Batch equivalence (chunking a buffer whole vs. in arbitrary batches yields identical results)
- No panics on edge cases (empty files, single byte, max-size boundaries)
## Algorithm
**Boundary Detection**: [FastCDC](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia) (Gear rolling hash)
- Byte-at-a-time rolling hash driven by a 256-entry lookup table
- Dual-mask normalization (a strict mask below the average size, a loose mask above it)
- Configurable min/avg/max constraints
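For intuition, here is a self-contained sketch of FastCDC-style boundary search. It is illustrative only, not chunkrs internals: the masks are the ATC '16 paper's example values for an 8 KiB average, and the Gear table can be any fixed pseudorandom 256-entry table.

```rust
// Example masks from the FastCDC paper (8 KiB average, illustrative).
const MASK_S: u64 = 0x0000_d9f0_0353_0000; // strict mask, used below the average size
const MASK_L: u64 = 0x0000_d900_0353_0000; // loose mask, used above the average size

/// Any fixed pseudorandom 256-entry table works; this one uses splitmix64.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut x: u64 = 0;
    for entry in table.iter_mut() {
        x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = x;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Returns the length of the next chunk starting at `data[0]`.
fn next_boundary(data: &[u8], min: usize, avg: usize, max: usize, gear: &[u64; 256]) -> usize {
    let end = data.len().min(max);
    if end <= min {
        return end; // tail shorter than the minimum: emit as one chunk
    }
    let mut hash: u64 = 0;
    for (i, &byte) in data[..end].iter().enumerate() {
        // Gear rolling hash: shift left, add one table entry per input byte.
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        if i < min {
            continue; // never cut before the minimum size
        }
        // Dual-mask normalization: strict below the average size, loose above.
        let mask = if i < avg { MASK_S } else { MASK_L };
        if hash & mask == 0 {
            return i + 1;
        }
    }
    end // forced cut at the maximum size (or end of input)
}
```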
**Chunk Identity**: BLAKE3 (when enabled)
- Incremental hashing for streaming
- 32-byte cryptographic digests
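Using the `blake3` crate directly, the incremental pattern looks like this (a sketch of the technique, not chunkrs code): each slice of a chunk is fed to the hasher as it arrives, so the full chunk never needs to be buffered for hashing.

```rust
// Incremental BLAKE3: update per slice, finalize once the boundary is known.
fn chunk_identity(parts: &[&[u8]]) -> [u8; 32] {
    let mut hasher = blake3::Hasher::new();
    for part in parts {
        hasher.update(part);
    }
    *hasher.finalize().as_bytes()
}
```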
## Cargo Features
| Feature | Description | Default |
|---|---|---|
| `hash-blake3` | BLAKE3 chunk hashing | ✅ |
| `async-io` | Async `Stream` support via `futures-io` | ❌ |
```toml
# Default: sync + hashing
[dependencies]
chunkrs = "0.1"
# Minimal: sync only, no hashing
[dependencies]
chunkrs = { version = "0.1", default-features = false }
# Full featured: sync + async + hashing
[dependencies]
chunkrs = { version = "0.1", features = ["async-io"] }
```
## Key Architectural Decisions
1. **Application provides the byte stream**: The library accepts any `std::io::Read` or `futures_io::AsyncRead`. Whether the bytes come from a file, network socket, or in-memory buffer is entirely the application's concern. The library focuses solely on the CDC transformation.
2. **Batching for I/O efficiency**: Internally reads data in ~8KB buffers to balance syscall overhead with cache-friendly processing, while maintaining CDC state across buffer boundaries for deterministic results.
3. **Application-level concurrency**: Parallelize by running multiple `chunkrs` instances on different streams. The library stays out of your thread pool.
4. **Allocation discipline**: No global buffer pools. Thread-local caches prevent allocator lock contention when processing thousands of small streams.
## Design Philosophy
**Why not parallel CDC within files?**
Some implementations split large files into "superblocks" and process them in parallel. We explicitly reject this because:
1. **Deduplication destruction**: Byte insertions at file start cause cascade re-chunking across all superblocks
2. **Complexity**: Boundary alignment between superblocks requires either non-deterministic overlap or complex state synchronization
3. **Unnecessary**: Modern CPUs handle 3–5 GB/s per core. Parallelize at the file level instead.
**Allocation discipline:**
Global buffer pools (`lazy_static!` pools) cause cache line bouncing and atomic contention under high concurrency. chunkrs uses **thread-local caches**—zero synchronization, maximum locality.
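The underlying pattern is the standard `thread_local!` scratch buffer, sketched below (the idea, not chunkrs's actual pool):

```rust
use std::cell::RefCell;

thread_local! {
    // One scratch buffer per thread, reused across streams on that thread.
    // No atomics, no locks, no cross-core cache line traffic.
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

/// Run `f` with a zeroed, thread-local scratch buffer of length `len`.
fn with_scratch<R>(len: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    SCRATCH.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.resize(len, 0);
        f(buf.as_mut_slice())
    })
}
```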
## Acknowledgments
This crate implements the FastCDC algorithm described in:
> Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu,
> **"FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication"**,
> in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, USA, June 22–24, 2016, pp. 101–114.
> [Paper Link](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia)

> Wen Xia, Xiangyu Zou, Yukun Zhou, Hong Jiang, Chuanyi Liu, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang,
> **"The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems"**,
> IEEE Transactions on Parallel and Distributed Systems (TPDS), 2020.
This crate is inspired by the original [fastcdc](https://crates.io/crates/fastcdc) crate but focuses on a modernized API with streaming-first design, strict determinism, and allocation-conscious internals.
## License
MIT License — see [LICENSE](LICENSE)
## Contributing
Issues and pull requests welcome at [https://github.com/elemeng/chunkrs](https://github.com/elemeng/chunkrs)