
chunkrs

Crates.io Documentation License: MIT Rust Version Unsafe Forbidden

Deterministic, streaming Content-Defined Chunking (CDC) for Rust

chunkrs provides byte-stream chunking for delta synchronization, deduplication, and content-addressable storage. It prioritizes correctness, determinism, and composability over clever parallelism tricks.

Core principle: CDC is inherently serial—parallelize at the application level, not within the stream.

Features

  • Streaming-first: Processes multi-GB files with constant memory (no full-file buffering)
  • Deterministic-by-design: Identical bytes always produce identical chunk hashes, regardless of batching or execution timing
  • Zero-allocation hot path: Thread-local buffer pools eliminate allocator contention under load
  • FastCDC algorithm: Gear hash rolling boundary detection with configurable min/avg/max sizes
  • BLAKE3 identity: Cryptographic chunk hashing (optional, incremental)
  • Runtime-agnostic async: Works with Tokio, async-std, or any futures-io runtime
  • Strictly safe: #![forbid(unsafe_code)]

When to Use chunkrs

| Scenario | Recommendation |
| --- | --- |
| Delta sync (rsync-style) | ✅ Perfect fit |
| Backup tools | ✅ Ideal for single-stream chunking |
| Deduplication (CAS) | ✅ Use with your own index |
| NVMe Gen4/5 saturation | ✅ 3–5 GB/s per core |
| Distributed dedup | ✅ Stateless, easy to distribute |
| Any other CDC use case | ✅ Likely fits |

Architecture

chunkrs processes one logical byte stream at a time with strictly serial CDC state:

┌───────────────┐     ┌──────────────┐     ┌──────────────────┐
│ Input Byte    │     │ I/O Batching │     │ Serial CDC State │
│ Stream        │────▶│ (8KB buffers │────▶│ Machine          │
│ (any io::Read │     │  for syscall │     │ (FastCDC rolling │
│  or AsyncRead)│     │  efficiency) │     │  hash)           │
└───────────────┘     └──────────────┘     └──────────────────┘

    ┌─────────────┐      ┌───────────────────┐
    │ Chunk       │      │ Chunk {           │
──▶ │ Stream      │────▶ │   data: Bytes,    │
    │             │      │   offset: u64,    │
    │             │      │   hash: ChunkHash │
    └─────────────┘      │ }                 │
                         └───────────────────┘

Quick Start

[dependencies]
chunkrs = "0.1"

use std::fs::File;
use chunkrs::{Chunker, ChunkConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.bin")?;
    let chunker = Chunker::new(ChunkConfig::default());

    for chunk in chunker.chunk(file) {
        let chunk = chunk?;
        println!("offset: {:?}, len: {}, hash: {:?}", 
            chunk.offset, chunk.len(), chunk.hash);
    }
    
    Ok(())
}

What's in the Chunk Stream:

Each element is a Chunk containing:

  • data: Bytes — the chunk payload (a zero-copy reference when possible), ready for subsequent use (e.g., writing to disk)
  • offset: Option<u64> — byte position in the original stream
  • hash: Option<ChunkHash> — BLAKE3 hash for content identity (if enabled)
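Because each chunk carries a content hash, a deduplicating store reduces to a map keyed by that hash plus a manifest of (offset, hash) pairs. A minimal, self-contained sketch of that pattern — note that toy_hash (std's DefaultHasher) is only a stand-in for the crate's BLAKE3 ChunkHash, and plain tuples stand in for Chunk:

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for a 32-byte BLAKE3 ChunkHash (illustrative only).
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

fn main() {
    // (offset, payload) tuples standing in for Chunk values.
    let chunks: Vec<(u64, Vec<u8>)> = vec![
        (0, b"hello".to_vec()),
        (5, b"world".to_vec()),
        (10, b"hello".to_vec()), // same content reappears at a new offset
    ];

    let mut store: HashMap<u64, Vec<u8>> = HashMap::new();
    let mut manifest: Vec<(u64, u64)> = Vec::new(); // (offset, chunk id)

    for (offset, data) in chunks {
        let id = toy_hash(&data);
        store.entry(id).or_insert(data); // payload stored once per unique hash
        manifest.push((offset, id));     // manifest records every occurrence
    }

    assert_eq!(store.len(), 2);    // duplicate payload deduplicated
    assert_eq!(manifest.len(), 3); // but the stream is still fully reconstructible
    println!("unique chunks stored: {}", store.len());
}
```

The manifest preserves stream order, so the original bytes can be reassembled by looking each id back up in the store.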

API Overview

Core Types

| Type | Description |
| --- | --- |
| Chunker | Stateful CDC engine (maintains rolling hash across batches) |
| Chunk | Content-addressed block with Bytes payload and optional BLAKE3 hash |
| ChunkHash | 32-byte BLAKE3 hash identifying chunk content |
| ChunkConfig | Min/avg/max chunk sizes and hash configuration |
| ChunkIter | Iterator over chunks (sync) |
| ChunkError | Error type for chunking operations |

Synchronous Usage

use chunkrs::{Chunker, ChunkConfig};

// From file
let file = std::fs::File::open("data.bin")?;
let chunker = Chunker::new(ChunkConfig::default());
for chunk in chunker.chunk(file) {
    let chunk = chunk?;
    // chunk.data: Bytes - the chunk payload
    // chunk.offset: Option<u64> - position in original stream
    // chunk.hash: Option<ChunkHash> - BLAKE3 hash (if enabled)
}

// From memory
let data: Vec<u8> = vec![0u8; 1024 * 1024];
let chunks: Vec<_> = chunker.chunk_bytes(data);

Asynchronous Usage

Runtime-agnostic via futures-io:

use futures_util::StreamExt;
use chunkrs::{ChunkConfig, ChunkError};

async fn process<R: futures_io::AsyncRead + Unpin>(reader: R) -> Result<(), ChunkError> {
    let mut stream = chunkrs::chunk_async(reader, ChunkConfig::default());
    
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        // Process
    }
    Ok(())
}

Tokio compatibility:

use tokio::fs::File;
use tokio_util::compat::TokioAsyncReadCompatExt;

let file = File::open("data.bin").await?;
let stream = chunkrs::chunk_async(file.compat(), ChunkConfig::default());

Configuration

Chunk Sizes

Choose based on your deduplication granularity needs:

use chunkrs::ChunkConfig;

// Small files / high dedup (8 KiB average)
let small = ChunkConfig::new(2 * 1024, 8 * 1024, 32 * 1024)?;

// Default (16 KiB average) - good general purpose
let default = ChunkConfig::default();

// Large files / high throughput (256 KiB average)  
let large = ChunkConfig::new(64 * 1024, 256 * 1024, 1024 * 1024)?;
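As a rule of thumb for FastCDC-style chunkers, a power-of-two average size corresponds to a boundary mask of log2(avg) bits, and the expected chunk count is roughly the stream length divided by the average. A short sketch of that arithmetic (mask_bits and expected_chunks are illustrative helpers, not chunkrs APIs):

```rust
// Assumption: the average chunk size is a power of two, as in the FastCDC paper.
fn mask_bits(avg: u64) -> u32 {
    assert!(avg.is_power_of_two(), "average chunk size should be a power of two");
    avg.trailing_zeros()
}

// Ceiling division: a rough estimate, since actual sizes vary around the average.
fn expected_chunks(stream_len: u64, avg: u64) -> u64 {
    (stream_len + avg - 1) / avg
}

fn main() {
    for avg in [8 * 1024u64, 16 * 1024, 256 * 1024] {
        println!(
            "avg {:>7} B -> {} mask bits, ~{} chunks per GiB",
            avg,
            mask_bits(avg),
            expected_chunks(1 << 30, avg)
        );
    }
}
```

Smaller averages mean more chunks (finer dedup granularity) but a larger index; larger averages trade dedup ratio for throughput and index size.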

Hash Configuration

use chunkrs::{ChunkConfig, HashConfig};

// With BLAKE3 (default)
let with_hash = ChunkConfig::default();

// Boundary detection only (faster, no content identity)
let no_hash = ChunkConfig::default().with_hash_config(HashConfig::disabled());

Performance

Throughput targets on modern hardware:

| Storage | Single-core CDC | Bottleneck |
| --- | --- | --- |
| NVMe Gen4 | ~3–5 GB/s | CPU (hashing) |
| NVMe Gen5 | ~3–5 GB/s | CDC algorithm |
| SATA SSD | ~500 MB/s | Storage |
| 10 Gbps LAN | ~1.2 GB/s | Network |
| HDD | ~200 MB/s | Seek latency |

Memory usage:

  • Constant: O(batch_size), typically 4–16 MB per stream
  • Thread-local cache: ~64 MB per thread (reusable)

To saturate NVMe Gen5: Process multiple files concurrently (application-level parallelism). Do not attempt to parallelize within a single file—this destroys deduplication ratios.

Determinism Guarantees

chunkrs guarantees content-addressable identity:

  • Strong guarantee: Identical byte streams produce identical ChunkHash (BLAKE3) values
  • Boundary stability: For identical inputs and configurations, chunk boundaries are deterministic across different batch sizes or execution timings
  • Serial consistency: Rolling hash state is strictly maintained across batch boundaries

What this means: You can re-chunk a file on Tuesday with different I/O batch sizes and get bit-identical chunks to Monday's run. This is essential for delta sync correctness.
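The batch-invariance property can be illustrated with a toy stateful rolling hash whose state survives across push() calls. This is a self-contained sketch of the invariant, not chunkrs internals, and the constants are arbitrary toy values:

```rust
const MASK: u64 = (1 << 6) - 1; // toy ~64-byte average
const MIN: usize = 16;
const MAX: usize = 256;

struct ToyChunker {
    table: [u64; 256],
    hash: u64,
    len: usize,       // bytes accumulated in the current chunk
    pos: usize,       // absolute stream position
    cuts: Vec<usize>, // emitted boundary positions
}

impl ToyChunker {
    fn new() -> Self {
        // Deterministic per-byte table from a simple mixer (illustrative only).
        let mut table = [0u64; 256];
        let mut x: u64 = 0x9E37_79B9_7F4A_7C15;
        for e in table.iter_mut() {
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            *e = x;
        }
        Self { table, hash: 0, len: 0, pos: 0, cuts: Vec::new() }
    }

    // State carries across calls, so batching cannot move boundaries.
    fn push(&mut self, batch: &[u8]) {
        for &b in batch {
            self.hash = (self.hash << 1).wrapping_add(self.table[b as usize]);
            self.len += 1;
            self.pos += 1;
            if (self.len >= MIN && self.hash & MASK == 0) || self.len >= MAX {
                self.cuts.push(self.pos);
                self.hash = 0;
                self.len = 0;
            }
        }
    }
}

fn cuts_with_batch(data: &[u8], batch: usize) -> Vec<usize> {
    let mut c = ToyChunker::new();
    for part in data.chunks(batch) {
        c.push(part);
    }
    c.cuts
}

fn main() {
    // Pseudo-random but deterministic input.
    let data: Vec<u8> = (0u32..100_000)
        .map(|i| (i.wrapping_mul(2654435761) >> 24) as u8)
        .collect();
    let a = cuts_with_batch(&data, 7);         // tiny batches
    let b = cuts_with_batch(&data, 64 * 1024); // large batches
    assert_eq!(a, b); // identical boundaries regardless of batching
    println!("boundaries found: {} (batch-invariant)", a.len());
}
```

Because the rolling-hash state is never reset at batch edges, the cut decisions depend only on the byte stream itself, which is exactly the serial-consistency guarantee above.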

Safety & Correctness

  • No unsafe code: #![forbid(unsafe_code)]
  • Comprehensive testing: Unit tests, doc tests, and property-based tests ensure:
    • Determinism invariants
    • Batch equivalence (chunking the whole input at once vs. in batches yields identical results)
    • No panics on edge cases (empty files, single byte, max-size boundaries)

Algorithm

Boundary Detection: FastCDC (Gear rolling hash)

  • Byte-at-a-time rolling hash driven by a 256-entry lookup table
  • Dual-mask normalization (small/large chunk detection)
  • Configurable min/avg/max constraints
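The dual-mask rule above can be sketched as a pure decision function: below the average size a stricter mask (more bits) suppresses cut points, above it a looser mask (fewer bits) encourages them. The constants here are illustrative, not chunkrs defaults:

```rust
// Sketch of FastCDC's "normalized chunking" boundary decision (illustrative).
const AVG: usize = 8 * 1024;
const MIN: usize = 2 * 1024;
const MAX: usize = 32 * 1024;
const MASK_S: u64 = (1 << 15) - 1; // strict: more bits than avg's 13
const MASK_L: u64 = (1 << 11) - 1; // loose: fewer bits than avg's 13

fn is_boundary(hash: u64, len: usize) -> bool {
    if len < MIN {
        return false; // never cut below the minimum size
    }
    if len >= MAX {
        return true; // always cut at the maximum size
    }
    // Strict mask below the average, loose mask above it.
    let mask = if len < AVG { MASK_S } else { MASK_L };
    hash & mask == 0
}

fn main() {
    // The same hash value can fail the strict test yet pass the loose one:
    let h = 0x800; // bit 11 set: rejected by MASK_S, accepted by MASK_L
    assert!(!is_boundary(h, 4 * 1024)); // below average: strict mask, no cut
    assert!(is_boundary(h, 16 * 1024)); // above average: loose mask, cut
    assert!(is_boundary(0xFFFF, MAX)); // max size always forces a cut
    println!("dual-mask rule behaves as expected");
}
```

This skew pulls the chunk-size distribution toward the configured average while still respecting the hard min/max bounds.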

Chunk Identity: BLAKE3 (when enabled)

  • Incremental hashing for streaming
  • 32-byte cryptographic digests

Cargo Features

| Feature | Description | Default |
| --- | --- | --- |
| hash-blake3 | BLAKE3 chunk hashing | ✓ |
| async-io | Async Stream support via futures-io | – |

# Default: sync + hashing
[dependencies]
chunkrs = "0.1"

# Minimal: sync only, no hashing
[dependencies]
chunkrs = { version = "0.1", default-features = false }

# Full featured: sync + async + hashing
[dependencies]
chunkrs = { version = "0.1", features = ["async-io"] }

Key Architectural Decisions

  1. Application provides the byte stream: The library accepts any std::io::Read or futures_io::AsyncRead. Whether the bytes come from a file, network socket, or in-memory buffer is entirely the application's concern. The library focuses solely on the CDC transformation.

  2. Batching for I/O efficiency: Internally reads data in ~8KB buffers to balance syscall overhead with cache-friendly processing, while maintaining CDC state across buffer boundaries for deterministic results.

  3. Application-level concurrency: Parallelize by running multiple chunkrs instances on different streams. The library stays out of your thread pool.

  4. Allocation discipline: No global buffer pools. Thread-local caches prevent allocator lock contention when processing thousands of small streams.

Design Philosophy

Why not parallel CDC within files?

Some implementations split large files into "superblocks" and process them in parallel. We explicitly reject this because:

  1. Deduplication destruction: Byte insertions at file start cause cascade re-chunking across all superblocks
  2. Complexity: Boundary alignment between superblocks requires either non-deterministic overlap or complex state synchronization
  3. Unnecessary: Modern CPUs handle 3–5 GB/s per core. Parallelize at the file level instead.

Allocation discipline:

Global buffer pools (lazy_static! pools) cause cache line bouncing and atomic contention under high concurrency. chunkrs uses thread-local caches—zero synchronization, maximum locality.

Acknowledgments

This crate implements the FastCDC algorithm described in:

Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu,
"FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication",
in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, USA, June 22–24, 2016, pp. 101–114.

Wen Xia, Xiangyu Zou, Yukun Zhou, Hong Jiang, Chuanyi Liu, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang,
"The Design of Fast Content-Defined Chunking for Data Deduplication based Storage Systems",
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2020.

This crate is inspired by the original fastcdc crate but focuses on a modernized API with streaming-first design, strict determinism, and allocation-conscious internals.

License

MIT License — see LICENSE

Contributing

Issues and pull requests welcome at https://github.com/elemeng/chunkrs