# chunkrs
Deterministic, streaming Content-Defined Chunking (CDC) for Rust
chunkrs provides byte-stream chunking for delta synchronization, deduplication, and content-addressable storage. It prioritizes correctness, determinism, and composability over clever parallelism tricks.
Core principle: CDC is inherently serial—parallelize at the application level, not within the stream.
## Features
- Streaming-first: Processes multi-GB files with constant memory (no full-file buffering)
- Deterministic-by-design: Identical bytes always produce identical chunk hashes, regardless of batching or execution timing
- Zero-allocation hot path: Thread-local buffer pools eliminate allocator contention under load
- FastCDC algorithm: Gear hash rolling boundary detection with configurable min/avg/max sizes
- BLAKE3 identity: Cryptographic chunk hashing (optional, incremental)
- Runtime-agnostic async: Works with Tokio, async-std, or any `futures-io` runtime
- Strictly safe: `#![forbid(unsafe_code)]`
## When to Use chunkrs
| Scenario | Recommendation |
|---|---|
| Delta sync (rsync-style) | ✅ Perfect fit |
| Backup tools | ✅ Ideal for single-stream chunking |
| Deduplication (CAS) | ✅ Use with your own index |
| NVMe Gen4/5 saturation | ✅ 3–5 GB/s per core; saturate Gen5 with file-level parallelism |
| Distributed dedup | ✅ Stateless, easy to distribute |
| Intra-file parallel chunking | ❌ Rejected by design; parallelize across files instead |
## Architecture
chunkrs processes one logical byte stream at a time with strictly serial CDC state:
```text
┌───────────────┐     ┌──────────────┐     ┌──────────────────┐
│  Input Byte   │     │ I/O Batching │     │ Serial CDC State │
│    Stream     │────▶│ (8 KB buffers│────▶│     Machine      │
│ (any io::Read │     │  for syscall │     │ (FastCDC rolling │
│  or AsyncRead)│     │  efficiency) │     │      hash)       │
└───────────────┘     └──────────────┘     └────────┬─────────┘
                                                    │
                             ┌──────────────────────┘
                             ▼
                      ┌─────────────┐      ┌───────────────────┐
                      │    Chunk    │      │ Chunk {           │
                      │   Stream    │─────▶│   data: Bytes,    │
                      │             │      │   offset: u64,    │
                      └─────────────┘      │   hash: ChunkHash │
                                           │ }                 │
                                           └───────────────────┘
```
## Quick Start

```toml
[dependencies]
chunkrs = "0.1"
```
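A minimal end-to-end sketch. `Chunker::new` and `chunk` follow the type table in the API overview below; the path, the `main` scaffolding, and the assumption that the iterator yields `Result<Chunk, ChunkError>` are illustrative:

```rust
use std::fs::File;
use chunkrs::{ChunkConfig, Chunker};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large-input.bin")?; // any std::io::Read works
    let chunker = Chunker::new(ChunkConfig::default());
    for chunk in chunker.chunk(file) {
        let chunk = chunk?; // item type assumed: Result<Chunk, ChunkError>
        println!("{} bytes at offset {:?}", chunk.data.len(), chunk.offset);
    }
    Ok(())
}
```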
What's in the Chunk Stream:

Each element is a `Chunk` containing:

- `data: Bytes` — the actual chunk payload (zero-copy reference when possible) for subsequent use (e.g., writing to disk)
- `offset: Option<u64>` — byte position in the original stream
- `hash: Option<ChunkHash>` — BLAKE3 hash for content identity (if enabled)
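These fields are all a deduplicating store needs. As a sketch, a minimal content-addressable index is just a map keyed by hash; this assumes `ChunkHash` implements `Eq + Hash` and that `chunk_bytes` returns `Vec<Chunk>`:

```rust
use std::collections::HashMap;
use bytes::Bytes;
use chunkrs::{ChunkConfig, Chunker, ChunkHash};

let data: Vec<u8> = vec![42; 1 << 20];
let mut store: HashMap<ChunkHash, Bytes> = HashMap::new();
for chunk in Chunker::new(ChunkConfig::default()).chunk_bytes(&data) {
    if let Some(hash) = chunk.hash {
        // Identical content maps to the same key, so it is stored only once.
        store.entry(hash).or_insert(chunk.data);
    }
}
```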
## API Overview

### Core Types

| Type | Description |
|---|---|
| `Chunker` | Stateful CDC engine (maintains rolling hash across batches) |
| `Chunk` | Content-addressed block with `Bytes` payload and optional BLAKE3 hash |
| `ChunkHash` | 32-byte BLAKE3 hash identifying chunk content |
| `ChunkConfig` | Min/avg/max chunk sizes and hash configuration |
| `ChunkIter` | Iterator over chunks (sync) |
| `ChunkError` | Error type for chunking operations |
### Synchronous Usage

```rust
use std::fs::File;
use chunkrs::{Chunk, ChunkConfig, Chunker};

// From file
let file = File::open("data.bin")?;
let chunker = Chunker::new(ChunkConfig::default());
for chunk in chunker.chunk(file) {
    let chunk = chunk?; // item type assumed: Result<Chunk, ChunkError>
    // use chunk.data / chunk.offset / chunk.hash here
}

// From memory
let data: Vec<u8> = vec![0u8; 4 << 20];
let chunks: Vec<Chunk> = Chunker::new(ChunkConfig::default()).chunk_bytes(&data);
```
### Asynchronous Usage

Runtime-agnostic via `futures-io`:

```rust
use futures::StreamExt;
use chunkrs::{chunk_async, ChunkConfig, ChunkError};

// chunk_async(reader, config) -> Stream of chunk results (signature assumed)
async fn process(reader: impl futures_io::AsyncRead + Unpin) -> Result<(), ChunkError> {
    let mut stream = chunk_async(reader, ChunkConfig::default());
    while let Some(chunk) = stream.next().await {
        let _chunk = chunk?;
    }
    Ok(())
}
```

Tokio compatibility:

```rust
use tokio::fs::File;
use tokio_util::compat::TokioAsyncReadCompatExt;

let file = File::open("data.bin").await?;
let stream = chunk_async(file.compat(), ChunkConfig::default());
```
## Configuration

### Chunk Sizes

Choose based on your deduplication granularity needs:

```rust
use chunkrs::ChunkConfig;

// The (min, avg, max) argument order is assumed; only the averages below
// are documented, and the min/max values are illustrative.

// Small files / high dedup (8 KiB average)
let small = ChunkConfig::new(2 * 1024, 8 * 1024, 64 * 1024)?;

// Default (16 KiB average) - good general purpose
let default = ChunkConfig::default();

// Large files / high throughput (256 KiB average)
let large = ChunkConfig::new(64 * 1024, 256 * 1024, 1024 * 1024)?;
```
### Hash Configuration

```rust
use chunkrs::ChunkConfig;

// With BLAKE3 (default)
let with_hash = ChunkConfig::default();

// Boundary detection only (faster, no content identity).
// The argument to with_hash_config is not shown in this README; an
// Option-style "disable hashing" setting is assumed here.
let no_hash = ChunkConfig::default().with_hash_config(None);
```
## Performance
Throughput targets on modern hardware:
| Storage | Single-core CDC | Bottleneck |
|---|---|---|
| NVMe Gen4 | ~3–5 GB/s | CPU (hashing) |
| NVMe Gen5 | ~3–5 GB/s | CDC algorithm |
| SATA SSD | ~500 MB/s | Storage |
| 10 Gbps LAN | ~1.2 GB/s | Network |
| HDD | ~200 MB/s | Seek latency |
Memory usage:

- Constant: `O(batch_size)`, typically 4–16 MB per stream
- Thread-local cache: ~64 MB per thread (reusable)
To saturate NVMe Gen5: Process multiple files concurrently (application-level parallelism). Do not attempt to parallelize within a single file—this destroys deduplication ratios.
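What that looks like in practice: one independent `Chunker` per stream, one worker per file. The thread-per-file layout below is an illustrative sketch; a rayon pool or Tokio tasks work the same way.

```rust
use std::{fs::File, thread};
use chunkrs::{ChunkConfig, Chunker};

let paths = ["a.bin", "b.bin", "c.bin"];
let handles: Vec<_> = paths
    .iter()
    .map(|&path| {
        thread::spawn(move || {
            // Each stream gets its own serial CDC state; nothing is shared,
            // so parallelism never affects boundaries or dedup ratios.
            let file = File::open(path).expect("open input");
            let chunker = Chunker::new(ChunkConfig::default());
            for chunk in chunker.chunk(file) {
                let _chunk = chunk; // hand off to your index/writer here
            }
        })
    })
    .collect();
for handle in handles {
    handle.join().expect("chunking thread panicked");
}
```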
## Determinism Guarantees
chunkrs guarantees content-addressable identity:
- Strong guarantee: Identical byte streams produce identical `ChunkHash` (BLAKE3) values
- Boundary stability: For identical inputs and configurations, chunk boundaries are deterministic across different batch sizes or execution timings
- Serial consistency: Rolling hash state is strictly maintained across batch boundaries
What this means: You can re-chunk a file on Tuesday with different I/O batch sizes and get bit-identical chunks to Monday's run. This is essential for delta sync correctness.
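The guarantee reduces to a checkable property: two independent runs over the same bytes agree chunk-for-chunk. A sketch, assuming `chunk_bytes` returns `Vec<Chunk>` and the `offset`/`hash` fields implement `PartialEq`:

```rust
use chunkrs::{ChunkConfig, Chunker};

let data = vec![7u8; 8 * 1024 * 1024];
let run_a = Chunker::new(ChunkConfig::default()).chunk_bytes(&data);
let run_b = Chunker::new(ChunkConfig::default()).chunk_bytes(&data);

// Same boundaries, same identities: bit-identical chunking across runs.
assert_eq!(run_a.len(), run_b.len());
for (a, b) in run_a.iter().zip(&run_b) {
    assert_eq!(a.offset, b.offset);
    assert_eq!(a.hash, b.hash);
}
```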
## Safety & Correctness

- No unsafe code: `#![forbid(unsafe_code)]`
- Comprehensive testing: Unit tests, doc tests, and property-based tests ensure:
  - Determinism invariants
  - Batch equivalence (chunking a buffer whole vs. feeding it in batches yields the same results)
  - No panics on edge cases (empty files, single byte, max-size boundaries)
## Algorithm

Boundary Detection: FastCDC (Gear rolling hash; sketched below)
- Byte-by-byte polynomial rolling hash via lookup table
- Dual-mask normalization (small/large chunk detection)
- Configurable min/avg/max constraints
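For intuition, the sketch below shows the shape of dual-mask Gear boundary detection. It is illustrative only (not chunkrs's internal code), and the mask constants are placeholders; real masks are derived from the configured average size.

```rust
/// Find the next cut-point in `data`, FastCDC-style. `gear` is a fixed
/// table of 256 random u64 values.
fn next_boundary(data: &[u8], min: usize, avg: usize, max: usize, gear: &[u64; 256]) -> usize {
    const MASK_S: u64 = 0x0000_d930_0353_0000; // more one-bits: harder match before avg
    const MASK_L: u64 = 0x0000_d900_0003_0000; // fewer one-bits: easier match after avg
    let end = data.len().min(max);
    let mut hash: u64 = 0;
    // No cut-point may occur before `min`, so hashing starts there.
    for i in min.min(end)..end {
        hash = (hash << 1).wrapping_add(gear[data[i] as usize]);
        let mask = if i < avg { MASK_S } else { MASK_L };
        if hash & mask == 0 {
            return i + 1; // content-defined boundary
        }
    }
    end // forced cut at max (or end of input)
}
```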
Chunk Identity: BLAKE3 (when enabled)
- Incremental hashing for streaming
- 32-byte cryptographic digests
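Incremental hashing means a chunk's digest can be folded in as batches arrive, with no need to buffer the whole chunk first. Using the public `blake3` crate API directly:

```rust
// Feed a chunk's bytes in as many batches as I/O delivers them.
let mut hasher = blake3::Hasher::new();
hasher.update(b"first batch of the chunk");
hasher.update(b"rest of the chunk");
let digest: [u8; 32] = *hasher.finalize().as_bytes(); // the chunk's identity
```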
## Cargo Features

| Feature | Description | Default |
|---|---|---|
| `hash-blake3` | BLAKE3 chunk hashing | ✅ |
| `async-io` | Async `Stream` support via `futures-io` | ❌ |
```toml
# Default: sync + hashing
[dependencies]
chunkrs = "0.1"

# Minimal: sync only, no hashing
[dependencies]
chunkrs = { version = "0.1", default-features = false }

# Full featured: sync + async + hashing
[dependencies]
chunkrs = { version = "0.1", features = ["async-io"] }
```
## Key Architectural Decisions

1. Application provides the byte stream: The library accepts any `std::io::Read` or `futures_io::AsyncRead`. Whether the bytes come from a file, network socket, or in-memory buffer is entirely the application's concern; the library focuses solely on the CDC transformation (see the sketch after this list).
2. Batching for I/O efficiency: Internally reads data in ~8 KB buffers to balance syscall overhead with cache-friendly processing, while maintaining CDC state across buffer boundaries for deterministic results.
3. Application-level concurrency: Parallelize by running multiple `chunkrs` instances on different streams. The library stays out of your thread pool.
4. Allocation discipline: No global buffer pools. Thread-local caches prevent allocator lock contention when processing thousands of small streams.
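The sketch referenced in item 1: an in-memory reader is chunked exactly like a file, with only the `Chunker::new`/`chunk` shapes from the API overview assumed:

```rust
use std::io::Cursor;
use chunkrs::{ChunkConfig, Chunker};

// Any std::io::Read is a valid source; the CDC transformation is identical.
let reader = Cursor::new(vec![0u8; 1 << 20]);
let chunker = Chunker::new(ChunkConfig::default());
for chunk in chunker.chunk(reader) {
    let _chunk = chunk; // same chunks as reading these bytes from disk
}
```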
## Design Philosophy
Why not parallel CDC within files?
Some implementations split large files into "superblocks" and process them in parallel. We explicitly reject this because:
- Deduplication destruction: Byte insertions at file start cause cascade re-chunking across all superblocks
- Complexity: Boundary alignment between superblocks requires either non-deterministic overlap or complex state synchronization
- Unnecessary: Modern CPUs handle 3–5 GB/s per core. Parallelize at the file level instead.
Allocation discipline:
Global buffer pools (`lazy_static!`-style pools) cause cache-line bouncing and atomic contention under high concurrency. chunkrs uses thread-local caches: zero synchronization, maximum locality.
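The pattern looks roughly like this (an illustrative sketch of the technique, not chunkrs's actual internals):

```rust
use std::cell::RefCell;

thread_local! {
    // One scratch buffer per thread: reused across streams, never locked,
    // and never bounced between cores.
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn with_scratch<R>(len: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    SCRATCH.with(|cell| {
        let mut buf = cell.borrow_mut();
        buf.resize(len, 0);
        f(&mut buf[..len])
    })
}
```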
## Acknowledgments

This crate implements the FastCDC algorithm described in:

Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, and Qing Liu, "FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication," in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, USA, June 22–24, 2016, pp. 101–114.

Wen Xia, Xiangyu Zou, Yukun Zhou, Hong Jiang, Chuanyi Liu, Dan Feng, Yu Hua, Yuchong Hu, and Yucheng Zhang, "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems," IEEE Transactions on Parallel and Distributed Systems (TPDS), 2020.
This crate is inspired by the original fastcdc crate but focuses on a modernized API with streaming-first design, strict determinism, and allocation-conscious internals.
## License
MIT License — see LICENSE
## Contributing
Issues and pull requests welcome at https://github.com/elemeng/chunkrs