Streaming duplicate detection without loading entire files into memory.
This module implements content-defined chunking over arbitrary io::Read
sources so that very large media files can be fingerprinted without mapping
the full file into RAM. The approach:
- A StreamChunker wraps any io::Read and emits content-defined chunk hashes via Iterator<Item = ChunkDigest>. Data flows through a fixed-size internal buffer (BUF_SIZE bytes), so memory use is bounded regardless of file size.
- StreamFingerprint aggregates chunk digests into a compact file-level fingerprint that survives byte-level insertions and deletions (unlike a whole-file BLAKE3 hash).
- StreamDedupIndex stores fingerprints and answers “is this stream a near-duplicate of something already indexed?” via chunk-level Jaccard similarity.
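To make the bounded-memory claim concrete, here is a minimal sketch of content-defined chunking over an io::Read. It is not this module's implementation: SketchChunker, chunk_all, the gear-style rolling hash, the FNV-1a chunk digest, and every constant value (including the value given to BUF_SIZE) are assumptions chosen for illustration, and the iterator yields io::Result items rather than ChunkDigest.

```rust
use std::io::{self, Read};

// All values below are hypothetical; the crate's real settings may differ.
const BUF_SIZE: usize = 8 * 1024; // fixed read buffer, independent of file size
const MIN_CHUNK: usize = 256; // no boundary is accepted before this many bytes
const MAX_CHUNK: usize = 4096; // a boundary is forced at this length
const BOUNDARY_MASK: u64 = (1 << 10) - 1; // boundary when the low 10 bits are zero

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Pseudo-random 64-bit value per byte (splitmix64 finalizer), standing in
/// for a precomputed gear table.
fn gear(byte: u8) -> u64 {
    let mut z = (byte as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical streaming chunker: yields (digest, length) for each chunk.
struct SketchChunker<R: Read> {
    reader: R,
    buf: Vec<u8>, // allocated once with BUF_SIZE bytes
    pos: usize,   // next unread byte in `buf`
    len: usize,   // number of valid bytes in `buf`
    done: bool,
}

impl<R: Read> SketchChunker<R> {
    fn new(reader: R) -> Self {
        Self { reader, buf: vec![0; BUF_SIZE], pos: 0, len: 0, done: false }
    }
}

impl<R: Read> Iterator for SketchChunker<R> {
    type Item = io::Result<(u64, usize)>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let mut rolling: u64 = 0; // boundary hash, restarted for every chunk
        let mut digest: u64 = FNV_OFFSET; // FNV-1a over the chunk's bytes
        let mut chunk_len = 0usize;
        loop {
            if self.pos == self.len {
                // Buffer exhausted: refill it. A chunk may span several refills.
                match self.reader.read(&mut self.buf[..]) {
                    Ok(0) => {
                        self.done = true;
                        // Emit the trailing partial chunk, if any.
                        return (chunk_len > 0).then(|| Ok((digest, chunk_len)));
                    }
                    Ok(n) => {
                        self.pos = 0;
                        self.len = n;
                    }
                    Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                    Err(e) => {
                        self.done = true;
                        return Some(Err(e));
                    }
                }
            }
            let b = self.buf[self.pos];
            self.pos += 1;
            chunk_len += 1;
            digest = (digest ^ b as u64).wrapping_mul(FNV_PRIME);
            // Each byte's contribution is shifted out after 64 steps, so the
            // boundary test depends only on the most recent 64 bytes.
            rolling = (rolling << 1).wrapping_add(gear(b));
            let natural = chunk_len >= MIN_CHUNK && (rolling & BOUNDARY_MASK) == 0;
            if natural || chunk_len >= MAX_CHUNK {
                return Some(Ok((digest, chunk_len)));
            }
        }
    }
}

/// Convenience helper: collect every chunk, stopping at the first I/O error.
fn chunk_all<R: Read>(reader: R) -> io::Result<Vec<(u64, usize)>> {
    SketchChunker::new(reader).collect()
}
```

Because boundaries are decided byte by byte from the rolling hash, they do not depend on how the underlying reader splits its reads. Only the BUF_SIZE buffer and a few words of hash state are held at any time.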
§Rationale
The existing crate::rolling_hash module provides content-defined chunking
over in-memory byte slices. This module extends the deduplication pipeline
with a streaming interface that satisfies the TODO item:
“Optimize rolling_hash.rs for streaming duplicate detection without
loading entire files”.
§Example
use oximedia_dedup::stream_dedup::{StreamChunkerConfig, StreamDedupIndex};
use std::io::Cursor;
let config = StreamChunkerConfig::default();
let mut index = StreamDedupIndex::new(config.clone());
let data = vec![42u8; 32_768];
let fp = index.ingest("file-a", Cursor::new(data.clone())).expect("ingest ok");
assert!(fp.chunk_count() > 0);
// A second identical stream should be detected as a duplicate.
let fp2 = index.ingest("file-b", Cursor::new(data)).expect("ingest ok");
let sim = index.jaccard_similarity(&fp, &fp2);
assert!((sim - 1.0).abs() < 1e-9);

Structs§
- ChunkDigest: The hash of a single content-defined chunk.
- StreamChunker: An iterator over content-defined ChunkDigests read from an io::Read.
- StreamChunkerConfig: Configuration for the streaming content-defined chunker.
- StreamDedupIndex: Index of stream fingerprints for near-duplicate detection.
- StreamFingerprint: A file-level fingerprint derived from its content-defined chunk hashes.
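
As a rough illustration of the chunk-level Jaccard similarity used for near-duplicate detection, the sketch below compares two fingerprints represented as plain slices of 64-bit chunk digests. The function name jaccard, the u64 digest type, the set (rather than multiset) semantics, and the convention that two empty fingerprints are identical are all assumptions; the actual StreamFingerprint and StreamDedupIndex types may differ on each point.

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over the distinct chunk digests of
/// two streams. Hypothetical helper, not part of the crate's API.
fn jaccard(a: &[u64], b: &[u64]) -> f64 {
    let set_a: HashSet<u64> = a.iter().copied().collect();
    let set_b: HashSet<u64> = b.iter().copied().collect();
    if set_a.is_empty() && set_b.is_empty() {
        // Assumed convention: two empty streams count as identical.
        return 1.0;
    }
    let intersection = set_a.intersection(&set_b).count();
    let union = set_a.len() + set_b.len() - intersection;
    intersection as f64 / union as f64
}
```

Identical streams score 1.0, streams with no chunk in common score 0.0, and a stream that shares two of its four distinct chunks with another four-chunk stream scores 2/6. A local edit typically changes only the chunks around it, so the score stays high where a whole-file hash would differ completely.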