Media deduplication and duplicate detection for OxiMedia.
oximedia-dedup provides comprehensive duplicate detection and media deduplication
for the OxiMedia multimedia framework. This includes:
- Cryptographic hashing: BLAKE3-based exact duplicate detection
- Visual similarity: Perceptual hashing, SSIM, histogram, and feature matching
- Audio fingerprinting: Audio fingerprint comparison and waveform similarity
- Metadata matching: Fuzzy metadata comparison for near-duplicates
- Storage optimization: Fast SQLite-based indexing for large libraries
- Reporting: Comprehensive duplicate reports with similarity scoring
§Modules
- hash: Cryptographic and content-based hashing
- visual: Visual similarity detection
- audio: Audio fingerprint comparison
- metadata: Metadata-based deduplication
- database: SQLite-based indexing and lookup
- report: Duplicate detection reports
§Example
use oximedia_dedup::{DuplicateDetector, DetectionStrategy, DedupConfig};
let config = DedupConfig::default();
let mut detector = DuplicateDetector::new(config).await?;
// Add files to the index
detector.add_file("/path/to/video1.mp4").await?;
detector.add_file("/path/to/video2.mp4").await?;
// Find duplicates
let duplicates = detector.find_duplicates(DetectionStrategy::All).await?;
§Strategy Selection Guide
| Strategy | Speed | Precision | Use case |
|---|---|---|---|
| ExactHash | Very fast | Perfect (no false positives) | Bit-for-bit identical files; best first pass for any library |
| Fast | Fast | High | Quick scan: hash + perceptual + metadata; good default for large libraries |
| PerceptualHash | Fast | Good | Visually identical images/frames that were re-encoded or lightly cropped |
| Histogram | Fast | Moderate | Color-similar frames regardless of spatial layout |
| AudioFingerprint | Moderate | High | Same audio in different codecs or with minor edits |
| Metadata | Fast | Low–moderate | Likely duplicates with same duration/resolution (combine with visual pass) |
| Ssim | Slow | Very high | Near-identical video frames where pHash gives too many false positives |
| FeatureMatch | Slow | High | Cropped, rotated, or partially occluded duplicates |
| VisualAll | Slow | Very high | Combined visual pipeline: pHash + Histogram + SSIM + FeatureMatch |
| All | Very slow | Maximum | Full pipeline (all methods); use only for a final authoritative scan |
Recommended workflow:
- Run ExactHash first to catch perfect duplicates cheaply.
- Run Fast for a broad near-duplicate sweep.
- Run VisualAll or All for a precision clean-up pass on the remainder.
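The first step of the workflow above can be pictured as a cheap grouping pass that shrinks the input for the slower passes. The following is a minimal, standalone sketch and not the crate's implementation: `exact_pass` is a hypothetical helper, and the `u64` digest is only a stand-in for a real BLAKE3 hash of the file contents.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: groups items by content digest.
/// Returns (exact-duplicate groups, one survivor per distinct digest).
/// The `u64` digest is a stand-in for a BLAKE3 hash of the file bytes.
pub fn exact_pass(items: &[(&str, u64)]) -> (Vec<Vec<String>>, Vec<String>) {
    // BTreeMap keeps the output order deterministic (sorted by digest).
    let mut by_digest: BTreeMap<u64, Vec<String>> = BTreeMap::new();
    for (name, digest) in items {
        by_digest.entry(*digest).or_default().push((*name).to_string());
    }

    let mut groups = Vec::new();
    let mut survivors = Vec::new();
    for names in by_digest.into_values() {
        // One representative per digest moves on to the slower passes.
        survivors.push(names[0].clone());
        if names.len() > 1 {
            groups.push(names);
        }
    }
    (groups, survivors)
}
```

Only the survivors would then be handed to a near-duplicate pass, so each group of bit-identical files costs a single perceptual comparison target instead of one per copy.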
§Detection Method Trade-offs
| Method | Accuracy | CPU cost | Memory | False-positive risk | Notes |
|---|---|---|---|---|---|
| BLAKE3 hash | 100% | Very low | O(1) | None | Misses re-encoded or edited copies |
| dHash (8×8) | High | Very low | O(1) | Low | Robust to resize; sensitive to crops |
| pHash (DCT) | High | Low | O(1) | Low–medium | Better than dHash for brightness shifts |
| wHash (wavelet) | High | Low | O(1) | Low | Most robust to combined transforms |
| SSIM | Very high | High | O(WH) | Very low | Pixel-accurate; slow for large images |
| Histogram | Moderate | Low | O(256) | Medium | Colour match only; ignores structure |
| FeatureMatch | High | Very high | O(N×D) | Low | Works on crops/rotations; expensive |
| AudioFingerprint | High | Moderate | O(T) | Low | Spectral-peak based; codec-agnostic |
| Metadata | Low–moderate | Very low | O(1) | High | Use only as a pre-filter |
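To make the O(1) memory and "very low" CPU entries for the perceptual hashes concrete, here is a small sketch of a difference hash and its comparison. It is an illustration under stated assumptions, not the crate's code: it assumes the frame has already been converted to grayscale and downscaled to 9 columns by 8 rows (which yields the 64 comparisons behind an 8×8 dHash), and the names `dhash_9x8`, `hamming`, and `is_near_duplicate` as well as the threshold parameter are hypothetical.

```rust
/// Difference hash over a grayscale image already downscaled to 9 columns x 8 rows.
/// Each of the 64 bits records whether a pixel is brighter than its right-hand neighbour.
pub fn dhash_9x8(gray: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in gray {
        for x in 0..8 {
            hash = (hash << 1) | u64::from(row[x] > row[x + 1]);
        }
    }
    hash
}

/// Number of differing bits between two 64-bit hashes.
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Treats two hashes as near-duplicates when at most `max_distance` bits differ.
/// The threshold is an illustrative parameter, not a value taken from the crate.
pub fn is_near_duplicate(a: u64, b: u64, max_distance: u32) -> bool {
    hamming(a, b) <= max_distance
}
```

Because the hash depends only on the sign of neighbouring differences, a uniform brightness offset that does not clip leaves it unchanged, while a crop shifts which pixels are compared and can change many bits.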
Bloom-filter pre-screening (DedupConfig::bloom_prescreen = true) reduces
the number of pairwise comparisons by rejecting definitely-unique items before
the expensive perceptual-hash phase. Recommended for libraries with more than 10,000 files.
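The idea behind Bloom-filter pre-screening can be sketched in a few lines. This is a generic illustration and not the crate's bloom_filter module: the `Bloom` type, the `prescreen` function, the use of the standard library's `DefaultHasher`, and the notion of a per-item coarse `u64` key are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter using double hashing (illustrative only).
pub struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

fn hash_pair<T: Hash + ?Sized>(item: &T) -> (u64, u64) {
    let mut h1 = DefaultHasher::new();
    item.hash(&mut h1);
    let mut h2 = DefaultHasher::new();
    0xA5u8.hash(&mut h2); // salt so the second hash differs from the first
    item.hash(&mut h2);
    (h1.finish(), h2.finish() | 1)
}

impl Bloom {
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes: num_hashes.max(1) }
    }

    fn bit(&self, a: u64, b: u64, i: u32) -> u64 {
        a.wrapping_add(u64::from(i).wrapping_mul(b)) % self.num_bits
    }

    pub fn insert<T: Hash + ?Sized>(&mut self, item: &T) {
        let (a, b) = hash_pair(item);
        for i in 0..self.num_hashes {
            let bit = self.bit(a, b, i);
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// `false` means "definitely never inserted"; `true` means "possibly inserted".
    pub fn might_contain<T: Hash + ?Sized>(&self, item: &T) -> bool {
        let (a, b) = hash_pair(item);
        (0..self.num_hashes).all(|i| {
            let bit = self.bit(a, b, i);
            (self.bits[(bit / 64) as usize] & (1u64 << (bit % 64))) != 0
        })
    }
}

/// Returns the indices of items whose coarse key possibly occurs more than once.
/// Items outside the result are definitely unique and can skip pairwise comparison.
pub fn prescreen(keys: &[u64], num_bits: u64) -> Vec<usize> {
    let mut seen = Bloom::new(num_bits, 4);
    let mut repeated = Bloom::new(num_bits, 4);
    for key in keys {
        if seen.might_contain(key) {
            repeated.insert(key);
        } else {
            seen.insert(key);
        }
    }
    (0..keys.len()).filter(|&i| repeated.might_contain(&keys[i])).collect()
}
```

A Bloom filter never produces false negatives, so every genuinely repeated key survives the pre-screen; the price is an occasional false positive, which only costs one unnecessary comparison later.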
LSH acceleration (DedupConfig::use_lsh = true, default) replaces O(n²)
pairwise perceptual-hash comparison with sub-quadratic approximate nearest-
neighbour lookup via BitLshIndex. Adjust lsh_num_tables (more tables →
better recall, more memory) and lsh_bits_per_table (fewer bits → more
candidates → better recall at higher CPU cost).
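The effect of the two tuning knobs is easier to see in a toy bit-sampling LSH over 64-bit hashes. The sketch below is a generic illustration of the technique, not the internals of BitLshIndex: the `BitSamplingLsh` type, its constructor arguments, and the SplitMix64 generator used to pick bit positions are assumptions for the example.

```rust
use std::collections::{HashMap, HashSet};

/// SplitMix64 step, used here only to choose sampled bit positions deterministically.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Toy bit-sampling LSH: each table buckets hashes by a subset of their bits.
pub struct BitSamplingLsh {
    masks: Vec<Vec<u8>>,
    tables: Vec<HashMap<u64, Vec<usize>>>,
}

impl BitSamplingLsh {
    pub fn new(num_tables: usize, bits_per_table: usize, seed: u64) -> Self {
        let mut state = seed;
        let mut masks = Vec::with_capacity(num_tables);
        for _ in 0..num_tables {
            // Positions are sampled with replacement, which is fine for a sketch.
            let mut positions = Vec::with_capacity(bits_per_table);
            for _ in 0..bits_per_table {
                positions.push((splitmix64(&mut state) % 64) as u8);
            }
            masks.push(positions);
        }
        Self { masks, tables: vec![HashMap::new(); num_tables] }
    }

    /// Bucket key: the sampled bits of `hash`, packed into an integer.
    fn key(positions: &[u8], hash: u64) -> u64 {
        positions.iter().fold(0u64, |k, &p| (k << 1) | ((hash >> p) & 1))
    }

    pub fn insert(&mut self, id: usize, hash: u64) {
        for (mask, table) in self.masks.iter().zip(self.tables.iter_mut()) {
            table.entry(Self::key(mask, hash)).or_default().push(id);
        }
    }

    /// Ids sharing a bucket with `hash` in at least one table.
    pub fn candidates(&self, hash: u64) -> HashSet<usize> {
        let mut out = HashSet::new();
        for (mask, table) in self.masks.iter().zip(&self.tables) {
            if let Some(ids) = table.get(&Self::key(mask, hash)) {
                out.extend(ids.iter().copied());
            }
        }
        out
    }
}
```

Two hashes collide in a table only if they agree on every sampled bit, so fewer bits per table widen each bucket, and each extra table gives a near-duplicate another chance to collide at the cost of one more hash map. Only the returned candidates need an exact Hamming-distance check.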
Re-exports§
pub use merge_strategy::AppliedAction;
pub use merge_strategy::MergeExecutor;
pub use merge_strategy::MergeReport;
pub use report::DuplicateGroup;
pub use report::DuplicateReport;
pub use report::SimilarityScore;
Modules§
- audio: Audio fingerprinting and similarity detection for deduplication.
- bloom_filter: Near-duplicate detection using a Bloom filter.
- cluster: Duplicate clustering: similarity groups, cluster merging, representative selection.
- content_id: Content ID and fingerprinting for media assets.
- content_signature: Content-signature types for robust media identification.
- cross_format: Cross-format duplicate detection: same content in different containers/codecs.
- dedup_cache: LRU cache for deduplication hash lookups.
- dedup_index: Persistent-style deduplication index.
- dedup_policy: Policy types for controlling deduplication behaviour.
- dedup_report: Deduplication reporting: statistics, summaries, and formatted reports.
- dedup_report_ext: Extended deduplication reporting and statistics.
- dedup_stats: Extended deduplication statistics: space savings, group statistics, action recommendations.
- frame_hash: Frame-level hash types for fast perceptual deduplication.
- fuzzy_match: Fuzzy / approximate matching for media deduplication.
- hash: Cryptographic and content-based hashing for deduplication.
- hash_store: Persistent hash store for deduplication lookups.
- incremental: Incremental deduplication: only scan new or modified files.
- lsh_index: Locality-Sensitive Hashing (LSH) index for approximate nearest-neighbour deduplication of high-dimensional media feature vectors.
- merge_strategy: Merge strategies for resolving duplicate file groups.
- metadata: Metadata-based deduplication and fuzzy matching.
- near_duplicate: Near-duplicate detection using locality-sensitive hashing (LSH).
- perceptual_hash: Perceptual hashing for image/video deduplication.
- phash: Perceptual hashing (pHash) and near-duplicate detection for video frames.
- progress: Progress reporting callbacks for long-running deduplication operations.
- report: Duplicate detection reports and recommendations.
- rolling_hash: Rolling hash for content-defined chunking in media deduplication.
- segment_dedup: Segment-level deduplication for media streams.
- similarity_index: Similarity index: fast lookup structures for near-duplicate candidate retrieval.
- video_dedup: Video-level deduplication.
- video_segment_dedup: Video segment deduplication using perceptual hashing and temporal windowing.
- visual: Visual similarity detection for image and video deduplication.
Structs§
- DedupConfig: Configuration for deduplication.
- DedupStats: Deduplication statistics.
Enums§
- DedupError: Deduplication error type.
- DetectionStrategy: Detection strategy for finding duplicates.
Type Aliases§
- DedupResult: Deduplication result type.