Module stream_dedup

Streaming duplicate detection without loading entire files into memory.

This module implements content-defined chunking over arbitrary io::Read sources so that very large media files can be fingerprinted without loading the full file into RAM. The approach has three parts:

  1. A StreamChunker wraps any io::Read and emits content-defined chunk hashes via Iterator<Item = ChunkDigest>. Data flows through a fixed-size internal buffer (BUF_SIZE bytes), so memory use is bounded regardless of file size.
  2. StreamFingerprint aggregates chunk digests into a compact file-level fingerprint that stays largely stable under byte-level insertions and deletions: an edit changes only the chunks around it, whereas a whole-file BLAKE3 hash changes completely.
  3. StreamDedupIndex stores fingerprints and answers “is this stream a near-duplicate of something already indexed?” via chunk-level Jaccard similarity.
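
The first step above can be illustrated with a minimal, standard-library-only sketch of content-defined chunking over an io::Read. This is not the crate's implementation: the function name chunk_digests, the constants, the shift-and-add rolling hash, and the use of DefaultHasher in place of a cryptographic chunk hash are all assumptions made for illustration, and the sketch collects digests into a Vec rather than exposing an Iterator.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

// Hypothetical parameters for this sketch; the crate's real values may differ.
const BUF_SIZE: usize = 4096; // fixed read buffer, independent of input size
const MIN_CHUNK: usize = 64; // never cut a chunk before this many bytes
const MAX_CHUNK: usize = 1024; // always cut a chunk at this many bytes
const BOUNDARY_BITS: u32 = 7; // on random data, about one boundary per 2^7 bytes

/// Streams `src` through a fixed buffer and returns one 64-bit digest per chunk.
pub fn chunk_digests<R: Read>(mut src: R) -> io::Result<Vec<u64>> {
    let mut buf = [0u8; BUF_SIZE];
    let mut digests = Vec::new();
    let mut hasher = DefaultHasher::new(); // stand-in for a real content hash
    let mut rolling: u32 = 0;
    let mut len = 0usize;

    loop {
        let n = match src.read(&mut buf) {
            Ok(0) => break, // end of stream
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        for &byte in &buf[..n] {
            hasher.write_u8(byte);
            len += 1;
            // Shift-and-add rolling hash: a byte's contribution is shifted
            // out of the 32-bit state after 32 further bytes.
            let mixed = (byte as u32 + 1).wrapping_mul(0x9E37_79B1);
            rolling = (rolling << 1).wrapping_add(mixed);
            // Cut where the top bits are zero, so boundaries depend on
            // content rather than on absolute offsets or read sizes.
            let at_boundary = len >= MIN_CHUNK && (rolling >> (32 - BOUNDARY_BITS)) == 0;
            if at_boundary || len >= MAX_CHUNK {
                digests.push(hasher.finish());
                hasher = DefaultHasher::new();
                rolling = 0;
                len = 0;
            }
        }
    }
    if len > 0 {
        digests.push(hasher.finish()); // trailing partial chunk
    }
    Ok(digests)
}
```

Because the chunker state is carried across read calls, the digests do not depend on how the reader splits its data. The only memory that grows with input size in this sketch is the digest list itself, at eight bytes per chunk.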

§Rationale

The existing crate::rolling_hash module provides content-defined chunking over in-memory byte slices. This module extends the deduplication pipeline with a streaming interface that satisfies the TODO item: “Optimize rolling_hash.rs for streaming duplicate detection without loading entire files”.

§Example

```rust
use oximedia_dedup::stream_dedup::{StreamChunkerConfig, StreamDedupIndex};
use std::io::Cursor;

let config = StreamChunkerConfig::default();
let mut index = StreamDedupIndex::new(config.clone());

let data = vec![42u8; 32_768];
let fp = index.ingest("file-a", Cursor::new(data.clone())).expect("ingest ok");
assert!(fp.chunk_count() > 0);

// A second identical stream should be detected as a duplicate.
let fp2 = index.ingest("file-b", Cursor::new(data)).expect("ingest ok");
let sim = index.jaccard_similarity(&fp, &fp2);
assert!((sim - 1.0).abs() < 1e-9);
```
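
The similarity value checked in the example is a chunk-level Jaccard score. As a rough sketch of what such a score computes, the function below treats each fingerprint as a set of 64-bit chunk digests and returns |A ∩ B| / |A ∪ B|. The name jaccard, the u64 digest type, the set (rather than multiset) semantics, and the choice to score two empty fingerprints as 1.0 are assumptions for illustration, not the crate's documented behavior.

```rust
use std::collections::HashSet;

/// Jaccard similarity of two chunk-digest lists, treated as sets.
/// Hypothetical helper; returns a value in [0.0, 1.0].
pub fn jaccard(a: &[u64], b: &[u64]) -> f64 {
    let set_a: HashSet<u64> = a.iter().copied().collect();
    let set_b: HashSet<u64> = b.iter().copied().collect();
    if set_a.is_empty() && set_b.is_empty() {
        return 1.0; // assumed convention: two empty streams are identical
    }
    let shared = set_a.intersection(&set_b).count();
    let total = set_a.len() + set_b.len() - shared; // size of the union
    shared as f64 / total as f64
}
```

Under these assumptions, two streams with identical chunk sets score 1.0, streams with no chunks in common score 0.0, and a local edit that replaces a few chunks lowers the score only in proportion to the chunks affected.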

§Structs

ChunkDigest
The hash of a single content-defined chunk.
StreamChunker
An iterator over content-defined ChunkDigests read from an io::Read.
StreamChunkerConfig
Configuration for the streaming content-defined chunker.
StreamDedupIndex
Index of stream fingerprints for near-duplicate detection.
StreamFingerprint
A file-level fingerprint derived from its content-defined chunk hashes.