Crate oximedia_dedup

Media deduplication and duplicate detection for OxiMedia.

oximedia-dedup provides duplicate detection and deduplication for the OxiMedia multimedia framework. It covers:

  • Cryptographic hashing: BLAKE3-based exact duplicate detection
  • Visual similarity: Perceptual hashing, SSIM, histogram, and feature matching
  • Audio fingerprinting: Audio fingerprint comparison and waveform similarity
  • Metadata matching: Fuzzy metadata comparison for near-duplicates
  • Storage optimization: Fast SQLite-based indexing for large libraries
  • Reporting: Comprehensive duplicate reports with similarity scoring

§Modules

  • hash: Cryptographic and content-based hashing
  • visual: Visual similarity detection
  • audio: Audio fingerprint comparison
  • metadata: Metadata-based deduplication
  • database: SQLite-based indexing and lookup
  • report: Duplicate detection reports

§Example

// Note: `.await` and `?` require an enclosing async function that returns a `Result`.
use oximedia_dedup::{DuplicateDetector, DetectionStrategy, DedupConfig};

let config = DedupConfig::default();
let mut detector = DuplicateDetector::new(config).await?;

// Add files to the index
detector.add_file("/path/to/video1.mp4").await?;
detector.add_file("/path/to/video2.mp4").await?;

// Find duplicates
let duplicates = detector.find_duplicates(DetectionStrategy::All).await?;

§Strategy Selection Guide

| Strategy | Speed | Precision | Use case |
|---|---|---|---|
| ExactHash | Very fast | Perfect (no false positives) | Bit-for-bit identical files; best first pass for any library |
| Fast | Fast | High | Quick scan: hash + perceptual + metadata; good default for large libraries |
| PerceptualHash | Fast | Good | Visually identical images/frames that were re-encoded or lightly cropped |
| Histogram | Fast | Moderate | Colour-similar frames regardless of spatial layout |
| AudioFingerprint | Moderate | High | Same audio in different codecs or with minor edits |
| Metadata | Fast | Low–moderate | Likely duplicates with same duration/resolution (combine with visual pass) |
| Ssim | Slow | Very high | Near-identical video frames where pHash gives too many false positives |
| FeatureMatch | Slow | High | Cropped, rotated, or partially occluded duplicates |
| VisualAll | Slow | Very high | Combined visual pipeline: pHash + Histogram + SSIM + FeatureMatch |
| All | Very slow | Maximum | Full pipeline (all methods); use only for a final authoritative scan |

Recommended workflow:

  1. Run ExactHash first to catch perfect duplicates cheaply.
  2. Run Fast for a broad near-duplicate sweep.
  3. Run VisualAll or All for a precision clean-up pass on the remainder.
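
The first step of this workflow can be sketched without the crate. The snippet below is a minimal illustration, not the crate's implementation: `exact_pass` is a hypothetical helper, and it groups inputs by their raw bytes with a standard `HashMap` (which hashes the key and then confirms equality) as a stand-in for comparing BLAKE3 digests. It returns the exact-duplicate groups plus the remainder that would still go through the slower near-duplicate passes.

```rust
use std::collections::HashMap;

/// Hypothetical first pass: split inputs into exact-duplicate groups and a
/// remainder. Grouping on the bytes themselves stands in for comparing
/// BLAKE3 digests, which is what the crate documentation describes.
fn exact_pass<'a>(files: &[(&'a str, &'a [u8])]) -> (Vec<Vec<&'a str>>, Vec<&'a str>) {
    // Remember the order in which distinct contents were first seen so the
    // output order is stable.
    let mut first_seen: Vec<&'a [u8]> = Vec::new();
    let mut by_content: HashMap<&'a [u8], Vec<&'a str>> = HashMap::new();

    for &(name, bytes) in files {
        let names = by_content.entry(bytes).or_default();
        if names.is_empty() {
            first_seen.push(bytes);
        }
        names.push(name);
    }

    let mut groups = Vec::new();
    let mut remainder = Vec::new();
    for key in first_seen {
        let names = by_content.remove(&key).unwrap_or_default();
        if names.len() > 1 {
            groups.push(names); // two or more identical files
        } else {
            remainder.extend(names); // unique so far: needs a near-duplicate pass
        }
    }
    (groups, remainder)
}
```

Only the remainder needs to be handed to the broader passes, which is why running the cheap exact pass first tends to pay off.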

§Detection Method Trade-offs

| Method | Accuracy | CPU cost | Memory | False-positive risk | Notes |
|---|---|---|---|---|---|
| BLAKE3 hash | 100% | Very low | O(1) | None | Misses re-encoded or edited copies |
| dHash (8×8) | High | Very low | O(1) | Low | Robust to resize; sensitive to crops |
| pHash (DCT) | High | Low | O(1) | Low–medium | Better than dHash for brightness shifts |
| wHash (wavelet) | High | Low | O(1) | Low | Most robust to combined transforms |
| SSIM | Very high | High | O(WH) | Very low | Pixel-accurate; slow for large images |
| Histogram | Moderate | Low | O(256) | Medium | Colour match only; ignores structure |
| FeatureMatch | High | Very high | O(N×D) | Low | Works on crops/rotations; expensive |
| AudioFingerprint | High | Moderate | O(T) | Low | Spectral-peak based; codec-agnostic |
| Metadata | Low–moderate | Very low | O(1) | High | Use only as a pre-filter |
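
To make the dHash row concrete, here is a minimal sketch of a 64-bit difference hash and the Hamming-distance comparison typically used with it. It assumes the image has already been reduced to a 9×8 grayscale grid (a common formulation that yields 8×8 comparison bits); the function names and the threshold parameter are illustrative and are not taken from the crate.

```rust
/// Difference hash over a 9x8 grayscale grid stored row-major (72 values).
/// Each bit records whether a pixel is brighter than its right-hand neighbour.
fn dhash(gray: &[u8; 72]) -> u64 {
    let mut bits = 0u64;
    for row in 0..8 {
        for col in 0..8 {
            let left = gray[row * 9 + col];
            let right = gray[row * 9 + col + 1];
            bits = (bits << 1) | u64::from(left > right);
        }
    }
    bits
}

/// Number of differing bits between two hashes.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Treat two hashes as near-duplicates when they differ in at most
/// `max_distance` bits (the threshold is a tuning choice).
fn is_near_duplicate(a: u64, b: u64, max_distance: u32) -> bool {
    hamming(a, b) <= max_distance
}
```

Because only the ordering of neighbouring pixels is stored, a uniform brightness increase that does not clip leaves the hash unchanged, while a crop shifts the grid and changes many bits at once.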

Bloom-filter pre-screening (DedupConfig::bloom_prescreen = true) reduces the number of pairwise comparisons by rejecting definitely-unique items before the expensive perceptual-hash phase. It is recommended for libraries with more than 10,000 files.
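
The idea behind the pre-screen can be sketched with a small Bloom filter: a lookup answers either "definitely not seen" or "possibly seen", so items in the first category can skip pairwise comparison. The `Bloom` type below is hypothetical and uses the standard library's `DefaultHasher`; how the crate's bloom_filter module is actually built is not described here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical minimal Bloom filter over 64-bit keys (for example, a
/// content digest or perceptual hash).
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl Bloom {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        Bloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Bit position for the i-th hash function, derived by hashing (key, i).
    fn index(&self, key: u64, i: u64) -> u64 {
        let mut h = DefaultHasher::new();
        (key, i).hash(&mut h);
        h.finish() % self.num_bits
    }

    /// Records `key` and reports whether it was possibly seen before.
    /// `false` means definitely new; `true` may be a false positive.
    fn check_and_insert(&mut self, key: u64) -> bool {
        let mut maybe_seen = true;
        for i in 0..self.num_hashes {
            let bit = self.index(key, i);
            let word = (bit / 64) as usize;
            let mask = 1u64 << (bit % 64);
            if (self.bits[word] & mask) == 0 {
                maybe_seen = false;
            }
            self.bits[word] |= mask;
        }
        maybe_seen
    }
}
```

A `false` result is always trustworthy, which is what makes the filter safe as a rejection step; a `true` result only means the item should proceed to the real comparison.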

LSH acceleration (DedupConfig::use_lsh = true, default) replaces O(n²) pairwise perceptual-hash comparison with sub-quadratic approximate nearest-neighbour lookup via BitLshIndex. Adjust lsh_num_tables (more tables → better recall, more memory) and lsh_bits_per_table (fewer bits → more candidates → better recall at higher CPU cost).
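
The trade-off between tables and bits per table is easier to see in a small bit-sampling LSH sketch. The `SketchLsh` type below is hypothetical and is not the crate's BitLshIndex: each table keys a 64-bit hash on a fixed subset of its bits, a query collects every entry that shares a bucket in at least one table, and the candidates are then verified by Hamming distance. The bit-selection rule (a stride of 7 modulo 64) is an arbitrary deterministic choice made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical bit-sampling LSH index over 64-bit perceptual hashes.
struct SketchLsh {
    masks: Vec<u64>,                       // sampled bit positions, one mask per table
    tables: Vec<HashMap<u64, Vec<usize>>>, // bucket key -> ids of stored hashes
    hashes: Vec<u64>,                      // stored hashes, indexed by id
}

impl SketchLsh {
    fn new(num_tables: usize, bits_per_table: usize) -> Self {
        assert!(bits_per_table >= 1 && bits_per_table <= 64);
        let masks = (0..num_tables)
            .map(|t| {
                (0..bits_per_table).fold(0u64, |mask, j| {
                    // Stride 7 is coprime to 64, so bits within a table are distinct.
                    mask | (1u64 << (((t * bits_per_table + j) * 7) % 64))
                })
            })
            .collect();
        SketchLsh { masks, tables: vec![HashMap::new(); num_tables], hashes: Vec::new() }
    }

    /// Stores a hash and returns its id.
    fn insert(&mut self, hash: u64) -> usize {
        let id = self.hashes.len();
        self.hashes.push(hash);
        for (mask, table) in self.masks.iter().zip(self.tables.iter_mut()) {
            table.entry(hash & mask).or_default().push(id);
        }
        id
    }

    /// Returns sorted ids of stored hashes within `max_distance` bits of
    /// `hash` that share a bucket with it in at least one table.
    fn query(&self, hash: u64, max_distance: u32) -> Vec<usize> {
        let mut candidates: Vec<usize> = self
            .masks
            .iter()
            .zip(&self.tables)
            .filter_map(|(mask, table)| table.get(&(hash & mask)))
            .flatten()
            .copied()
            .collect();
        candidates.sort_unstable();
        candidates.dedup();
        // Verify candidates exactly; LSH only narrows the search.
        candidates.retain(|&id| (self.hashes[id] ^ hash).count_ones() <= max_distance);
        candidates
    }
}
```

In this sketch, adding tables gives a near match more chances to share a bucket, while sampling fewer bits per table makes buckets coarser, so more candidates reach the verification step.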

§Re-exports

pub use merge_strategy::AppliedAction;
pub use merge_strategy::MergeExecutor;
pub use merge_strategy::MergeReport;
pub use report::DuplicateGroup;
pub use report::DuplicateReport;
pub use report::SimilarityScore;

§Modules

  • audio: Audio fingerprinting and similarity detection for deduplication.
  • bloom_filter: Near-duplicate detection using a Bloom filter.
  • cluster: Duplicate clustering: similarity groups, cluster merging, representative selection.
  • content_id: Content ID and fingerprinting for media assets.
  • content_signature: Content-signature types for robust media identification.
  • cross_format: Cross-format duplicate detection: same content in different containers/codecs.
  • dedup_cache: LRU cache for deduplication hash lookups.
  • dedup_index: Persistent-style deduplication index.
  • dedup_policy: Policy types for controlling deduplication behaviour.
  • dedup_report: Deduplication reporting: statistics, summaries, and formatted reports.
  • dedup_report_ext: Extended deduplication reporting and statistics.
  • dedup_stats: Extended deduplication statistics: space savings, group statistics, action recommendations.
  • frame_hash: Frame-level hash types for fast perceptual deduplication.
  • fuzzy_match: Fuzzy / approximate matching for media deduplication.
  • hash: Cryptographic and content-based hashing for deduplication.
  • hash_store: Persistent hash store for deduplication lookups.
  • incremental: Incremental deduplication: only scan new or modified files.
  • lsh_index: Locality-Sensitive Hashing (LSH) index for approximate nearest-neighbour deduplication of high-dimensional media feature vectors.
  • merge_strategy: Merge strategies for resolving duplicate file groups.
  • metadata: Metadata-based deduplication and fuzzy matching.
  • near_duplicate: Near-duplicate detection using locality-sensitive hashing (LSH).
  • perceptual_hash: Perceptual hashing for image/video deduplication.
  • phash: Perceptual hashing (pHash) and near-duplicate detection for video frames.
  • progress: Progress reporting callbacks for long-running deduplication operations.
  • report: Duplicate detection reports and recommendations.
  • rolling_hash: Rolling hash for content-defined chunking in media deduplication.
  • segment_dedup: Segment-level deduplication for media streams.
  • similarity_index: Similarity index: fast lookup structures for near-duplicate candidate retrieval.
  • video_dedup: Video-level deduplication.
  • video_segment_dedup: Video segment deduplication using perceptual hashing and temporal windowing.
  • visual: Visual similarity detection for image and video deduplication.

§Structs

  • DedupConfig: Configuration for deduplication.
  • DedupStats: Deduplication statistics.

§Enums

  • DedupError: Deduplication error type.
  • DetectionStrategy: Detection strategy for finding duplicates.

§Type Aliases

  • DedupResult: Deduplication result type.