Crate oximedia_dedup

Media deduplication and duplicate detection for OxiMedia.

oximedia-dedup provides comprehensive duplicate detection and media deduplication for the OxiMedia multimedia framework. This includes:

Cryptographic hashing: BLAKE3-based exact duplicate detection
Visual similarity: Perceptual hashing, SSIM, histogram, and feature matching
Audio fingerprinting: Audio fingerprint comparison and waveform similarity
Metadata matching: Fuzzy metadata comparison for near-duplicates
Storage optimization: Fast SQLite-based indexing for large libraries
Reporting: Comprehensive duplicate reports with similarity scoring

§Modules

hash: Cryptographic and content-based hashing
visual: Visual similarity detection
audio: Audio fingerprint comparison
metadata: Metadata-based deduplication
database: SQLite-based indexing and lookup
report: Duplicate detection reports

§Example

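use oximedia_dedup::{DedupConfig, DetectionStrategy, DuplicateDetector};

// `.await` and `?` need an enclosing async function that returns a `Result`.
// This wrapper and its boxed error type are illustrative, not part of the crate.
async fn find_all_duplicates() -> Result<(), Box<dyn std::error::Error>> {
    let config = DedupConfig::default();
    let mut detector = DuplicateDetector::new(config).await?;

    // Add files to the index
    detector.add_file("/path/to/video1.mp4").await?;
    detector.add_file("/path/to/video2.mp4").await?;

    // Find duplicates using every detection method
    let duplicates = detector.find_duplicates(DetectionStrategy::All).await?;
    Ok(())
}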

§Strategy Selection Guide

Strategy	Speed	Precision	Use case
`ExactHash`	Very fast	Perfect (no false positives)	Bit-for-bit identical files — best first pass for any library
`Fast`	Fast	High	Quick scan: hash + perceptual + metadata; good default for large libraries
`PerceptualHash`	Fast	Good	Visually identical images/frames that were re-encoded or lightly cropped
`Histogram`	Fast	Moderate	Colour-similar frames regardless of spatial layout
`AudioFingerprint`	Moderate	High	Same audio in different codecs or with minor edits
`Metadata`	Fast	Low–moderate	Likely duplicates with same duration/resolution (combine with visual pass)
`Ssim`	Slow	Very high	Near-identical video frames where pHash gives too many false positives
`FeatureMatch`	Slow	High	Cropped, rotated, or partially occluded duplicates
`VisualAll`	Slow	Very high	Combined visual pipeline: pHash + Histogram + SSIM + FeatureMatch
`All`	Very slow	Maximum	Full pipeline — all methods; use only for final authoritative scan

Recommended workflow:

1. Run ExactHash first to catch perfect duplicates cheaply.
2. Run Fast for a broad near-duplicate sweep.
3. Run VisualAll or All for a precision clean-up pass on the remainder.
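
The reasoning behind step 1 can be illustrated with a minimal, standalone sketch that does not use this crate's API. It groups byte-identical items by a content hash and returns the remainder that still needs the slower passes. Everything here is an assumption made for illustration: `content_hash` and `exact_pass` are hypothetical names, and the standard library's `DefaultHasher` stands in for BLAKE3 so that the sketch needs no dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash (assumption: the crate itself uses BLAKE3).
/// `DefaultHasher` is not cryptographic; it only keeps the sketch dependency-free.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical first pass: returns (groups of indices whose bytes are identical,
/// indices with no exact twin that still need a near-duplicate pass).
fn exact_pass(items: &[Vec<u8>]) -> (Vec<Vec<usize>>, Vec<usize>) {
    // Bucket items by hash: one cheap pass over the data.
    let mut buckets: HashMap<u64, Vec<usize>> = HashMap::new();
    for (i, bytes) in items.iter().enumerate() {
        buckets.entry(content_hash(bytes)).or_default().push(i);
    }

    let mut groups = Vec::new();
    let mut remainder = Vec::new();
    for (_, bucket) in buckets {
        // A non-cryptographic hash can collide, so confirm by comparing bytes.
        let mut confirmed: Vec<Vec<usize>> = Vec::new();
        for i in bucket {
            if let Some(pos) = confirmed.iter().position(|g| items[g[0]] == items[i]) {
                confirmed[pos].push(i);
            } else {
                confirmed.push(vec![i]);
            }
        }
        for group in confirmed {
            if group.len() > 1 {
                groups.push(group);
            } else {
                remainder.push(group[0]);
            }
        }
    }
    // HashMap iteration order is unspecified, so sort for stable output.
    groups.sort();
    remainder.sort();
    (groups, remainder)
}
```

Only the indices in the remainder need to go on to the perceptual and visual passes, which is why running the exact pass first is cheap insurance on any library.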

§Detection Method Trade-offs

Method	Accuracy	CPU cost	Memory	False-positive risk	Notes
BLAKE3 hash	100%	Very low	O(1)	None	Misses re-encoded or edited copies
dHash (8×8)	High	Very low	O(1)	Low	Robust to resize; sensitive to crops
pHash (DCT)	High	Low	O(1)	Low–medium	Better than dHash for brightness shifts
wHash (wavelet)	High	Low	O(1)	Low	Most robust to combined transforms
SSIM	Very high	High	O(WH)	Very low	Pixel-accurate; slow for large images
Histogram	Moderate	Low	O(256)	Medium	Colour match only; ignores structure
FeatureMatch	High	Very high	O(N×D)	Low	Works on crops/rotations; expensive
AudioFingerprint	High	Moderate	O(T)	Low	Spectral-peak based; codec-agnostic
Metadata	Low–moderate	Very low	O(1)	High	Use only as a pre-filter
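
To make the perceptual-hash rows more concrete, here is a minimal sketch of the textbook difference hash (dHash) together with the Hamming distance used to compare two hashes. It is not taken from this crate: the function names, the 9×8 grayscale thumbnail input, and the `max_distance` threshold are assumptions, and the crate's own dHash may differ in bit order or preprocessing.

```rust
/// Textbook dHash over a grayscale thumbnail of 8 rows by 9 columns.
/// Each bit records whether a pixel is brighter than its right-hand
/// neighbour, which yields 8 x 8 = 64 bits.
fn dhash(thumb: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in thumb {
        for x in 0..8 {
            hash = (hash << 1) | u64::from(row[x] > row[x + 1]);
        }
    }
    hash
}

/// Number of bit positions in which two hashes differ.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Hypothetical decision rule: treat two frames as near-duplicates when
/// their hashes differ in at most `max_distance` bits.
fn is_near_duplicate(a: u64, b: u64, max_distance: u32) -> bool {
    hamming(a, b) <= max_distance
}
```

Because only the ordering of neighbouring pixels matters, a uniform brightness offset that does not clip leaves the hash unchanged, whereas a crop shifts which pixels are compared and can flip many bits at once.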

Bloom-filter pre-screening (DedupConfig::bloom_prescreen = true) reduces the number of pairwise comparisons by rejecting definitely-unique items before the expensive perceptual-hash phase. Recommended for libraries with more than 10,000 files.
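
The idea behind the pre-screen can be sketched with a small standalone Bloom filter. This is an illustration under stated assumptions, not the crate's `bloom_prescreen` module: the `Bloom` type, `seeded_hash`, `prescreen`, and the choice of keys (for example a coarse fingerprint per file) are all hypothetical. The property being relied on is that a Bloom filter has no false negatives, so a key it reports as absent has definitely not been seen before.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash `key` together with a small seed to derive independent-looking hashes.
fn seeded_hash(key: &[u8], seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    key.hash(&mut hasher);
    hasher.finish()
}

/// Minimal Bloom filter (hypothetical type, for illustration only).
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl Bloom {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        let words = (num_bits + 63) / 64;
        Bloom { bits: vec![0; words], num_bits: (words * 64) as u64, num_hashes }
    }

    /// Bit positions for `key`, derived by double hashing: h1 + i * h2.
    fn positions(&self, key: &[u8]) -> Vec<usize> {
        let h1 = seeded_hash(key, 0);
        let h2 = seeded_hash(key, 1) | 1; // force an odd step
        (0..self.num_hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits) as usize)
            .collect()
    }

    fn insert(&mut self, key: &[u8]) {
        for p in self.positions(key) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    /// `false` means the key was definitely never inserted;
    /// `true` means it may have been (false positives are possible).
    fn might_contain(&self, key: &[u8]) -> bool {
        self.positions(key)
            .into_iter()
            .all(|p| (self.bits[p / 64] & (1u64 << (p % 64))) != 0)
    }
}

/// Indices of items whose key may repeat an earlier key. Every other item is
/// definitely new at the time it is seen and can skip comparison against
/// earlier items.
fn prescreen(keys: &[&[u8]], filter: &mut Bloom) -> Vec<usize> {
    let mut flagged = Vec::new();
    for (i, key) in keys.iter().enumerate() {
        if filter.might_contain(key) {
            flagged.push(i);
        }
        filter.insert(key);
    }
    flagged
}
```

A false positive only costs an unnecessary comparison later, so the filter can be sized for memory rather than for a very low error rate.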

LSH acceleration (DedupConfig::use_lsh = true, default) replaces O(n²) pairwise perceptual-hash comparison with sub-quadratic approximate nearest-neighbour lookup via BitLshIndex. Adjust lsh_num_tables (more tables → better recall, more memory) and lsh_bits_per_table (fewer bits → more candidates → better recall at higher CPU cost).
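
The effect of the two tuning parameters is easier to see in a toy index. The sketch below is a generic bit-sampling LSH over 64-bit hashes and makes no claim about how BitLshIndex is implemented: the `BitLsh` type and its methods are hypothetical, and each table reads a consecutive slice of bits to keep the example deterministic, where a production index would more likely sample bit positions at random.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy bit-sampling LSH index over 64-bit hashes (hypothetical type).
struct BitLsh {
    bits_per_table: u32,
    tables: Vec<HashMap<u64, Vec<usize>>>,
}

impl BitLsh {
    fn new(num_tables: usize, bits_per_table: u32) -> Self {
        assert!(bits_per_table >= 1 && bits_per_table <= 64);
        BitLsh { bits_per_table, tables: vec![HashMap::new(); num_tables] }
    }

    /// Bucket key for table `t`: `bits_per_table` bits read from the hash,
    /// starting at bit `t * bits_per_table` and wrapping around at bit 64.
    fn key(&self, t: usize, hash: u64) -> u64 {
        let start = (t as u32 * self.bits_per_table) % 64;
        let mask = if self.bits_per_table == 64 {
            u64::MAX
        } else {
            (1u64 << self.bits_per_table) - 1
        };
        hash.rotate_right(start) & mask
    }

    /// Store `id` in one bucket per table.
    fn insert(&mut self, id: usize, hash: u64) {
        for t in 0..self.tables.len() {
            let k = self.key(t, hash);
            self.tables[t].entry(k).or_default().push(id);
        }
    }

    /// Ids sharing a bucket with `hash` in at least one table. Only these
    /// candidates need an exact Hamming-distance check afterwards.
    fn candidates(&self, hash: u64) -> BTreeSet<usize> {
        let mut out = BTreeSet::new();
        for (t, table) in self.tables.iter().enumerate() {
            if let Some(ids) = table.get(&self.key(t, hash)) {
                out.extend(ids.iter().copied());
            }
        }
        out
    }
}
```

With 4 tables of 16 bits, a query that differs from a stored hash in one bit still matches in the three untouched tables, while a query with one flipped bit in every slice is missed entirely. Adding tables gives a near match more chances to collide; using fewer bits per table makes each bucket coarser, so more items collide and more candidates must be verified.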

Re-exports§

pub use merge_strategy::AppliedAction;
pub use merge_strategy::MergeExecutor;
pub use merge_strategy::MergeReport;
pub use report::DuplicateGroup;
pub use report::DuplicateReport;
pub use report::SimilarityScore;

Modules§

audio: Audio fingerprinting and similarity detection for deduplication.
audio_fingerprint: Audio fingerprinting for deduplication.
bloom_filter: Near-duplicate detection using a Bloom filter.
bloom_prescreen: Bloom filter pre-screening for deduplication pipelines.
chromagram: Chromagram-based audio feature extraction for music deduplication.
cluster: Duplicate clustering: similarity groups, cluster merging, representative selection.
content_id: Content ID and fingerprinting for media assets.
content_signature: Content-signature types for robust media identification.
cross_format: Cross-format duplicate detection: same content in different containers/codecs.
dedup_cache: LRU cache for deduplication hash lookups.
dedup_index: Persistent-style deduplication index.
dedup_policy: Policy types for controlling deduplication behaviour.
dedup_queue: Deduplication work queue with priority scheduling.
dedup_report: Deduplication reporting: statistics, summaries, and formatted reports.
dedup_report_detailed: Detailed deduplication reporting with disk-space savings, confidence scores, and action recommendations.
dedup_report_ext: Extended deduplication reporting and statistics.
dedup_stats: Extended deduplication statistics: space savings, group statistics, action recommendations.
exact_match: Exact file and content matching for deduplication.
frame_hash: Frame-level hash types for fast perceptual deduplication.
fuzzy_match: Fuzzy / approximate matching for media deduplication.
hash: Cryptographic and content-based hashing for deduplication.
hash_store: Persistent hash store for deduplication lookups.
hierarchical: Hierarchical deduplication: fast pass (hash) -> medium pass (perceptual) -> slow pass (SSIM).
incremental: Incremental deduplication: only scan new or modified files.
lsh_index: Locality-Sensitive Hashing (LSH) index for approximate nearest-neighbour deduplication of high-dimensional media feature vectors.
merge_strategy: Merge strategies for resolving duplicate file groups.
metadata: Metadata-based deduplication and fuzzy matching.
minhash: MinHash-based approximate similarity estimation.
near_duplicate: Near-duplicate detection using locality-sensitive hashing (LSH).
near_duplicate_cluster: Near-duplicate clustering with union-find, threshold-based merging, cluster representative selection, and cluster statistics.
network_dedup: Network-aware deduplication for distributed media libraries.
parallel_indexer: Parallel bulk file indexing for large media libraries.
perceptual_hash: Perceptual hashing for image/video deduplication.
persistent_cache: Cross-session persistent cache for decoded thumbnails and media fingerprints.
phash: Perceptual hashing (pHash) and near-duplicate detection for video frames.
progress: Progress reporting callbacks for long-running deduplication operations.
report: Duplicate detection reports and recommendations.
rolling_hash: Rolling hash for content-defined chunking in media deduplication.
segment_dedup: Segment-level deduplication for media streams.
signature_store: In-memory signature storage with lookup and expiration.
similarity_index: Similarity index: fast lookup structures for near-duplicate candidate retrieval.
space_savings: Disk space savings estimation for duplicate file groups.
stream_dedup: Streaming duplicate detection without loading entire files into memory.
video_dedup: Video-level deduplication.
video_dedup_pipeline: Full video deduplication pipeline.
video_segment_dedup: Video segment deduplication using perceptual hashing and temporal windowing.
visual: Visual similarity detection for image and video deduplication.

Structs§

DedupConfig: Configuration for deduplication.
DedupStats: Deduplication statistics.

Enums§

DedupError: Deduplication error type.
DetectionStrategy: Detection strategy for finding duplicates.

Type Aliases§

DedupResult: Deduplication result type.