
chunkrs

Crates.io Documentation License: MIT Rust Version Unsafe Forbidden

Deterministic, streaming Content-Defined Chunking (CDC) for Rust

chunkrs provides byte-stream chunking for delta synchronization, deduplication, and content-addressable storage. It prioritizes correctness, determinism, and composability over clever parallelism tricks.

Core principle: CDC is inherently serial—parallelize at the application level, not within the stream.

Features

  • Streaming-first: Processes multi-GB files with constant memory (no full-file buffering)
  • Deterministic-by-design: Identical bytes always produce identical chunk hashes, regardless of batching or execution timing
  • Zero-allocation hot path: Thread-local buffer pools eliminate allocator contention under load
  • FastCDC algorithm: Gear hash rolling boundary detection with configurable min/avg/max sizes
  • BLAKE3 identity: Cryptographic chunk hashing (optional, incremental)
  • Runtime-agnostic async: Works with Tokio, async-std, or any futures-io runtime
  • Strictly safe: #![forbid(unsafe_code)]

When to Use chunkrs

| Scenario | Recommendation |
| --- | --- |
| Delta sync (rsync-style) | ✅ Perfect fit |
| Backup tools | ✅ Ideal for single-stream chunking |
| Deduplication (CAS) | ✅ Use with your own index |
| NVMe Gen4/5 saturation | ✅ 3–5 GB/s per core |
| Distributed dedup | ✅ Stateless, easy to distribute |
| Any other CDC use case | ✅ Likely fits |

Architecture

chunkrs processes one logical byte stream at a time with strictly serial CDC state:

┌───────────────┐     ┌──────────────┐     ┌──────────────────┐
│ Input Byte    │     │ I/O Batching │     │ Serial CDC State │
│ Stream        │────▶│ (8KB buffers │────▶│ Machine          │
│ (any io::Read │     │  for syscall │     │ (FastCDC rolling │
│  or AsyncRead)│     │  efficiency) │     │  hash)           │
└───────────────┘     └──────────────┘     └──────────────────┘

    ┌─────────────┐      ┌───────────────────┐
    │ Chunk       │      │ Chunk {           │
──▶ │ Stream      │────▶ │   data: Bytes,    │
    │             │      │   offset: u64,    │
    │             │      │   hash: ChunkHash │
    └─────────────┘      │ }                 │
                         └───────────────────┘

Quick Start

[dependencies]
chunkrs = "0.1"

use std::fs::File;
use chunkrs::{Chunker, ChunkConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.bin")?;
    let chunker = Chunker::new(ChunkConfig::default());

    for chunk in chunker.chunk(file) {
        let chunk = chunk?;
        println!("offset: {:?}, len: {}, hash: {:?}", 
            chunk.offset, chunk.len(), chunk.hash);
    }
    
    Ok(())
}

What's in the Chunk Stream:

Each element is a Chunk containing:

  • data: Bytes — the chunk payload (a zero-copy reference when possible), ready for subsequent use (e.g., writing to disk)
  • offset: Option<u64> — byte position in the original stream
  • hash: Option<ChunkHash> — BLAKE3 hash for content identity (if enabled)
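Because each chunk carries a content hash, a deduplicating store reduces to a map keyed by that hash plus a manifest of (offset, hash) pairs. A minimal, self-contained sketch of that pattern — note that toy_hash (std's DefaultHasher) is only a stand-in for the crate's BLAKE3 ChunkHash, and plain tuples stand in for Chunk:

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for a 32-byte BLAKE3 ChunkHash (illustrative only).
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

fn main() {
    // (offset, payload) tuples standing in for Chunk values.
    let chunks: Vec<(u64, Vec<u8>)> = vec![
        (0, b"hello".to_vec()),
        (5, b"world".to_vec()),
        (10, b"hello".to_vec()), // same content reappears at a new offset
    ];

    let mut store: HashMap<u64, Vec<u8>> = HashMap::new();
    let mut manifest: Vec<(u64, u64)> = Vec::new(); // (offset, chunk id)

    for (offset, data) in chunks {
        let id = toy_hash(&data);
        store.entry(id).or_insert(data); // payload stored once per unique hash
        manifest.push((offset, id));     // manifest records every occurrence
    }

    assert_eq!(store.len(), 2);    // duplicate payload deduplicated
    assert_eq!(manifest.len(), 3); // but the stream is still fully reconstructible
    println!("unique chunks stored: {}", store.len());
}
```

The manifest preserves stream order, so the original bytes can be reassembled by looking each id back up in the store.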

API Overview

Core Types

| Type | Description |
| --- | --- |
| Chunker | Stateful CDC engine (maintains rolling hash across batches) |
| Chunk | Content-addressed block with Bytes payload and optional BLAKE3 hash |
| ChunkHash | 32-byte BLAKE3 hash identifying chunk content |
| ChunkConfig | Min/avg/max chunk sizes and hash configuration |
| ChunkIter | Iterator over chunks (sync) |
| ChunkError | Error type for chunking operations |

Synchronous Usage

use chunkrs::{Chunker, ChunkConfig};

// From file
let file = std::fs::File::open("data.bin")?;
let chunker = Chunker::new(ChunkConfig::default());
for chunk in chunker.chunk(file) {
    let chunk = chunk?;
    // chunk.data: Bytes - the chunk payload
    // chunk.offset: Option<u64> - position in original stream
    // chunk.hash: Option<ChunkHash> - BLAKE3 hash (if enabled)
}

// From memory
let data: Vec<u8> = vec![0u8; 1024 * 1024];
let chunks: Vec<_> = chunker.chunk_bytes(data);

Asynchronous Usage

Runtime-agnostic via futures-io:

use futures_util::StreamExt;
use chunkrs::{ChunkConfig, ChunkError};

async fn process<R: futures_io::AsyncRead + Unpin>(reader: R) -> Result<(), ChunkError> {
    let mut stream = chunkrs::chunk_async(reader, ChunkConfig::default());
    
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        // Process
    }
    Ok(())
}

Tokio compatibility:

use tokio::fs::File;
use tokio_util::compat::TokioAsyncReadCompatExt;

let file = File::open("data.bin").await?;
let stream = chunkrs::chunk_async(file.compat(), ChunkConfig::default());

Configuration

Chunk Sizes

Choose based on your deduplication granularity needs:

use chunkrs::ChunkConfig;

// Small files / high dedup (8 KiB average)
let small = ChunkConfig::new(2 * 1024, 8 * 1024, 32 * 1024)?;

// Default (16 KiB average) - good general purpose
let default = ChunkConfig::default();

// Large files / high throughput (256 KiB average)  
let large = ChunkConfig::new(64 * 1024, 256 * 1024, 1024 * 1024)?;
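As a rule of thumb for FastCDC-style chunkers, a power-of-two average size corresponds to a boundary mask of log2(avg) bits, and the expected chunk count is roughly the stream length divided by the average. A short sketch of that arithmetic (mask_bits and expected_chunks are illustrative helpers, not chunkrs APIs):

```rust
// Assumption: the average chunk size is a power of two, as in the FastCDC paper.
fn mask_bits(avg: u64) -> u32 {
    assert!(avg.is_power_of_two(), "average chunk size should be a power of two");
    avg.trailing_zeros()
}

// Ceiling division: a rough estimate, since actual sizes vary around the average.
fn expected_chunks(stream_len: u64, avg: u64) -> u64 {
    (stream_len + avg - 1) / avg
}

fn main() {
    for avg in [8 * 1024u64, 16 * 1024, 256 * 1024] {
        println!(
            "avg {:>7} B -> {} mask bits, ~{} chunks per GiB",
            avg,
            mask_bits(avg),
            expected_chunks(1 << 30, avg)
        );
    }
}
```

Smaller averages mean more chunks (finer dedup granularity) but a larger index; larger averages trade dedup ratio for throughput and index size.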

Hash Configuration

use chunkrs::{ChunkConfig, HashConfig};

// With BLAKE3 (default)
let with_hash = ChunkConfig::default();

// Boundary detection only (faster, no content identity)
let no_hash = ChunkConfig::default().with_hash_config(HashConfig::disabled());

Performance

Throughput targets on modern hardware:

| Storage | Single-core CDC | Bottleneck |
| --- | --- | --- |
| NVMe Gen4 | ~3–5 GB/s | CPU (hashing) |
| NVMe Gen5 | ~3–5 GB/s | CDC algorithm |
| SATA SSD | ~500 MB/s | Storage |
| 10 Gbps LAN | ~1.2 GB/s | Network |
| HDD | ~200 MB/s | Seek latency |

Memory usage:

  • Constant: O(batch_size), typically 4–16 MB per stream
  • Thread-local cache: ~64 MB per thread (reusable)

To saturate NVMe Gen5: Process multiple files concurrently (application-level parallelism). Do not attempt to parallelize within a single file—this destroys deduplication ratios.

Determinism Guarantees

chunkrs guarantees content-addressable identity:

  • Strong guarantee: Identical byte streams produce identical ChunkHash (BLAKE3) values
  • Boundary stability: For identical inputs and configurations, chunk boundaries are deterministic across different batch sizes or execution timings
  • Serial consistency: Rolling hash state is strictly maintained across batch boundaries

What this means: You can re-chunk a file on Tuesday with different I/O batch sizes and get bit-identical chunks to Monday's run. This is essential for delta sync correctness.
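The batch-invariance property can be illustrated with a toy stateful rolling hash whose state survives across push() calls. This is a self-contained sketch of the invariant, not chunkrs internals, and the constants are arbitrary toy values:

```rust
const MASK: u64 = (1 << 6) - 1; // toy ~64-byte average
const MIN: usize = 16;
const MAX: usize = 256;

struct ToyChunker {
    table: [u64; 256],
    hash: u64,
    len: usize,       // bytes accumulated in the current chunk
    pos: usize,       // absolute stream position
    cuts: Vec<usize>, // emitted boundary positions
}

impl ToyChunker {
    fn new() -> Self {
        // Deterministic per-byte table from a simple mixer (illustrative only).
        let mut table = [0u64; 256];
        let mut x: u64 = 0x9E37_79B9_7F4A_7C15;
        for e in table.iter_mut() {
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            *e = x;
        }
        Self { table, hash: 0, len: 0, pos: 0, cuts: Vec::new() }
    }

    // State carries across calls, so batching cannot move boundaries.
    fn push(&mut self, batch: &[u8]) {
        for &b in batch {
            self.hash = (self.hash << 1).wrapping_add(self.table[b as usize]);
            self.len += 1;
            self.pos += 1;
            if (self.len >= MIN && self.hash & MASK == 0) || self.len >= MAX {
                self.cuts.push(self.pos);
                self.hash = 0;
                self.len = 0;
            }
        }
    }
}

fn cuts_with_batch(data: &[u8], batch: usize) -> Vec<usize> {
    let mut c = ToyChunker::new();
    for part in data.chunks(batch) {
        c.push(part);
    }
    c.cuts
}

fn main() {
    // Pseudo-random but deterministic input.
    let data: Vec<u8> = (0u32..100_000)
        .map(|i| (i.wrapping_mul(2654435761) >> 24) as u8)
        .collect();
    let a = cuts_with_batch(&data, 7);         // tiny batches
    let b = cuts_with_batch(&data, 64 * 1024); // large batches
    assert_eq!(a, b); // identical boundaries regardless of batching
    println!("boundaries found: {} (batch-invariant)", a.len());
}
```

Because the rolling-hash state is never reset at batch edges, the cut decisions depend only on the byte stream itself, which is exactly the serial-consistency guarantee above.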

Safety & Correctness

  • No unsafe code: #![forbid(unsafe_code)]
  • Comprehensive testing: Unit tests, doc tests, and property-based tests ensure:
    • Determinism invariants
    • Batch equivalence (chunking the whole input at once vs. in batches yields identical results)
    • No panics on edge cases (empty files, single byte, max-size boundaries)

Algorithm

Boundary Detection: FastCDC (Gear rolling hash)

  • Byte-at-a-time rolling hash driven by a 256-entry lookup table
  • Dual-mask normalization (small/large chunk detection)
  • Configurable min/avg/max constraints
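The dual-mask rule above can be sketched as a pure decision function: below the average size a stricter mask (more bits) suppresses cut points, above it a looser mask (fewer bits) encourages them. The constants here are illustrative, not chunkrs defaults:

```rust
// Sketch of FastCDC's "normalized chunking" boundary decision (illustrative).
const AVG: usize = 8 * 1024;
const MIN: usize = 2 * 1024;
const MAX: usize = 32 * 1024;
const MASK_S: u64 = (1 << 15) - 1; // strict: more bits than avg's 13
const MASK_L: u64 = (1 << 11) - 1; // loose: fewer bits than avg's 13

fn is_boundary(hash: u64, len: usize) -> bool {
    if len < MIN {
        return false; // never cut below the minimum size
    }
    if len >= MAX {
        return true; // always cut at the maximum size
    }
    // Strict mask below the average, loose mask above it.
    let mask = if len < AVG { MASK_S } else { MASK_L };
    hash & mask == 0
}

fn main() {
    // The same hash value can fail the strict test yet pass the loose one:
    let h = 0x800; // bit 11 set: rejected by MASK_S, accepted by MASK_L
    assert!(!is_boundary(h, 4 * 1024)); // below average: strict mask, no cut
    assert!(is_boundary(h, 16 * 1024)); // above average: loose mask, cut
    assert!(is_boundary(0xFFFF, MAX)); // max size always forces a cut
    println!("dual-mask rule behaves as expected");
}
```

This skew pulls the chunk-size distribution toward the configured average while still respecting the hard min/max bounds.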

Chunk Identity: BLAKE3 (when enabled)

  • Incremental hashing for streaming
  • 32-byte cryptographic digests

Cargo Features

| Feature | Description | Default |
| --- | --- | --- |
| hash-blake3 | BLAKE3 chunk hashing | ✓ |
| async-io | Async Stream support via futures-io | – |

# Default: sync + hashing
[dependencies]
chunkrs = "0.1"

# Minimal: sync only, no hashing
[dependencies]
chunkrs = { version = "0.1", default-features = false }

# Full featured: sync + async + hashing
[dependencies]
chunkrs = { version = "0.1", features = ["async-io"] }

Key Architectural Decisions

  1. Application provides the byte stream: The library accepts any std::io::Read or futures_io::AsyncRead. Whether the bytes come from a file, network socket, or in-memory buffer is entirely the application's concern. The library focuses solely on the CDC transformation.

  2. Batching for I/O efficiency: Internally reads data in ~8KB buffers to balance syscall overhead with cache-friendly processing, while maintaining CDC state across buffer boundaries for deterministic results.

  3. Application-level concurrency: Parallelize by running multiple chunkrs instances on different streams. The library stays out of your thread pool.

  4. Allocation discipline: No global buffer pools. Thread-local caches prevent allocator lock contention when processing thousands of small streams.

Design Philosophy

Why not parallel CDC within files?

Some implementations split large files into "superblocks" and process them in parallel. We explicitly reject this because:

  1. Deduplication destruction: Byte insertions at file start cause cascade re-chunking across all superblocks
  2. Complexity: Boundary alignment between superblocks requires either non-deterministic overlap or complex state synchronization
  3. Unnecessary: Modern CPUs handle 3–5 GB/s per core. Parallelize at the file level instead.

Allocation discipline:

Global buffer pools (lazy_static! pools) cause cache line bouncing and atomic contention under high concurrency. chunkrs uses thread-local caches—zero synchronization, maximum locality.

Acknowledgments

This crate implements the FastCDC algorithm described in:

Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu,
"FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication",
in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, USA, June 22–24, 2016, pp. 101–114.

Wen Xia, Xiangyu Zou, Yukun Zhou, Hong Jiang, Chuanyi Liu, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang,
"The Design of Fast Content-Defined Chunking for Data Deduplication based Storage Systems",
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2020.

This crate is inspired by the original fastcdc crate but focuses on a modernized API with streaming-first design, strict determinism, and allocation-conscious internals.

License

MIT License — see LICENSE

Contributing

Issues and pull requests welcome at https://github.com/elemeng/chunkrs