Crate oximedia_dedup

Media deduplication and duplicate detection for OxiMedia.

oximedia-dedup provides comprehensive duplicate detection and media deduplication for the OxiMedia multimedia framework. This includes:

  • Cryptographic hashing: BLAKE3-based exact duplicate detection
  • Visual similarity: Perceptual hashing, SSIM, histogram, and feature matching
  • Audio fingerprinting: Audio fingerprint comparison and waveform similarity
  • Metadata matching: Fuzzy metadata comparison for near-duplicates
  • Storage optimization: Fast SQLite-based indexing for large libraries
  • Reporting: Comprehensive duplicate reports with similarity scoring

§Modules

  • hash: Cryptographic and content-based hashing
  • visual: Visual similarity detection
  • audio: Audio fingerprint comparison
  • metadata: Metadata-based deduplication
  • database: SQLite-based indexing and lookup
  • report: Duplicate detection reports

§Example

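use oximedia_dedup::{DuplicateDetector, DetectionStrategy, DedupConfig};

// The calls below use `.await?`, so they must run inside an async function
// whose return type is a `Result` compatible with the crate's `DedupError`.
let config = DedupConfig::default();
let mut detector = DuplicateDetector::new(config).await?;

// Add files to the index
detector.add_file("/path/to/video1.mp4").await?;
detector.add_file("/path/to/video2.mp4").await?;

// Find duplicates using every detection method (the slowest strategy)
let duplicates = detector.find_duplicates(DetectionStrategy::All).await?;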

§Strategy Selection Guide

| Strategy | Speed | Precision | Use case |
|---|---|---|---|
| ExactHash | Very fast | Perfect (no false positives) | Bit-for-bit identical files; best first pass for any library |
| Fast | Fast | High | Quick scan: hash + perceptual + metadata; good default for large libraries |
| PerceptualHash | Fast | Good | Visually identical images/frames that were re-encoded or lightly cropped |
| Histogram | Fast | Moderate | Color-similar frames regardless of spatial layout |
| AudioFingerprint | Moderate | High | Same audio in different codecs or with minor edits |
| Metadata | Fast | Low–moderate | Likely duplicates with same duration/resolution (combine with visual pass) |
| Ssim | Slow | Very high | Near-identical video frames where pHash gives too many false positives |
| FeatureMatch | Slow | High | Cropped, rotated, or partially occluded duplicates |
| VisualAll | Slow | Very high | Combined visual pipeline: pHash + Histogram + SSIM + FeatureMatch |
| All | Very slow | Maximum | Full pipeline with all methods; use only for a final authoritative scan |

Recommended workflow:

  1. Run ExactHash first to catch perfect duplicates cheaply.
  2. Run Fast for a broad near-duplicate sweep.
  3. Run VisualAll or All for a precision clean-up pass on the remainder.
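
The point of the staged workflow is that each pass shrinks the input of the next, more expensive one. The following is a minimal, std-only sketch of the first stage, not the crate's implementation: `exact_pass` is a hypothetical helper that groups in-memory byte slices by exact content and returns the files that still need a near-duplicate pass. A real pipeline would key on a BLAKE3 digest instead of holding file contents in memory.

```rust
use std::collections::BTreeMap;

/// Hypothetical first pass: split files into exact-duplicate groups and
/// the remainder that has no byte-identical twin.
/// Returns `(duplicate_groups, remainder)`.
fn exact_pass<'a>(files: &[(&'a str, &'a [u8])]) -> (Vec<Vec<&'a str>>, Vec<&'a str>) {
    // Key by content. This stands in for keying by a cryptographic digest.
    let mut by_content: BTreeMap<&'a [u8], Vec<&'a str>> = BTreeMap::new();
    for &(path, bytes) in files {
        by_content.entry(bytes).or_default().push(path);
    }

    let mut groups = Vec::new();
    let mut remainder = Vec::new();
    for (_, paths) in by_content {
        if paths.len() > 1 {
            // Two or more paths share identical bytes: an exact-duplicate group.
            groups.push(paths);
        } else {
            // Unique so far: hand over to the slower near-duplicate passes.
            remainder.extend(paths);
        }
    }
    (groups, remainder)
}
```

Only the `remainder` list would be fed to the perceptual and visual passes, which is where the cost saving comes from.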

§Detection Method Trade-offs

| Method | Accuracy | CPU cost | Memory | False-positive risk | Notes |
|---|---|---|---|---|---|
| BLAKE3 hash | 100% | Very low | O(1) | None | Misses re-encoded or edited copies |
| dHash (8×8) | High | Very low | O(1) | Low | Robust to resize; sensitive to crops |
| pHash (DCT) | High | Low | O(1) | Low–medium | Better than dHash for brightness shifts |
| wHash (wavelet) | High | Low | O(1) | Low | Most robust to combined transforms |
| SSIM | Very high | High | O(WH) | Very low | Pixel-accurate; slow for large images |
| Histogram | Moderate | Low | O(256) | Medium | Colour match only; ignores structure |
| FeatureMatch | High | Very high | O(N×D) | Low | Works on crops/rotations; expensive |
| AudioFingerprint | High | Moderate | O(T) | Low | Spectral-peak based; codec-agnostic |
| Metadata | Low–moderate | Very low | O(1) | High | Use only as a pre-filter |
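
To illustrate why the hash-based visual methods are cheap (constant memory, a handful of integer operations), here is a minimal sketch of a difference hash and its comparison. It assumes the image has already been reduced to a 9-wide by 8-tall grayscale grid, and the function names `dhash` and `hamming` are illustrative; the crate's own hashing API may look different.

```rust
/// Difference hash over a pre-scaled 9x8 grayscale grid.
/// Each of the 8 rows yields 8 bits: 1 if a pixel is brighter than its
/// right-hand neighbour, 0 otherwise, for 64 bits in total.
fn dhash(gray: &[[u8; 9]; 8]) -> u64 {
    let mut bits = 0u64;
    for row in gray {
        for x in 0..8 {
            bits = (bits << 1) | u64::from(row[x] > row[x + 1]);
        }
    }
    bits
}

/// Number of differing bits between two 64-bit hashes.
/// A small distance suggests visually similar images.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Because only the relative order of neighbouring pixels matters, a uniform brightness shift that does not clip leaves the hash unchanged. Two images are typically treated as near-duplicates when the Hamming distance falls below a chosen threshold.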

Bloom-filter pre-screening (DedupConfig::bloom_prescreen = true) reduces the number of pairwise comparisons by rejecting definitely-unique items before the expensive perceptual-hash phase. Recommended for libraries with more than 10,000 files.
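
The property that makes pre-screening safe is that a Bloom filter never produces false negatives: if it reports a key as absent, that key was definitely never inserted. The sketch below shows the idea with std only. The `Bloom` type, its parameters, and the use of `DefaultHasher` with double hashing are assumptions for illustration and are not taken from the crate's `bloom_prescreen` module.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative Bloom filter (hypothetical type, not the crate's).
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl Bloom {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        Bloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Two base hashes; the k probe positions are derived as a + i * b.
    fn hash_pair(key: &[u8]) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        h2.write_u8(0xA5); // salt so the second hash differs from the first
        key.hash(&mut h2);
        (h1.finish(), h2.finish() | 1) // force an odd, non-zero step
    }

    fn insert(&mut self, key: &[u8]) {
        let (a, b) = Self::hash_pair(key);
        for i in 0..self.num_hashes {
            let pos = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(pos / 64) as usize] |= 1u64 << (pos % 64);
        }
    }

    /// `false` means "definitely never inserted"; `true` means "possibly inserted".
    fn may_contain(&self, key: &[u8]) -> bool {
        let (a, b) = Self::hash_pair(key);
        (0..self.num_hashes).all(|i| {
            let pos = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            (self.bits[(pos / 64) as usize] & (1u64 << (pos % 64))) != 0
        })
    }
}
```

In a pre-screening pass, an item whose key the filter reports as absent can skip the pairwise comparison stage, while a "possibly present" answer only means the item must be checked properly.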

LSH acceleration (DedupConfig::use_lsh = true, default) replaces O(n²) pairwise perceptual-hash comparison with sub-quadratic approximate nearest-neighbour lookup via BitLshIndex. Adjust lsh_num_tables (more tables → better recall, more memory) and lsh_bits_per_table (fewer bits → more candidates → better recall at higher CPU cost).
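
The effect of the two tuning parameters is easier to see in a small bit-sampling LSH sketch for 64-bit hashes. This is an assumption-laden illustration, not the crate's BitLshIndex: the `SketchLsh` type is hypothetical, and it picks bit positions deterministically, whereas a real index would normally choose them at random.

```rust
use std::collections::{BTreeSet, HashMap};

/// Illustrative bit-sampling LSH over 64-bit hashes (hypothetical type).
struct SketchLsh {
    /// For each table, the bit positions that form its bucket key.
    positions: Vec<Vec<u32>>,
    /// For each table, bucket key -> ids of the items stored in that bucket.
    tables: Vec<HashMap<u64, Vec<usize>>>,
}

impl SketchLsh {
    fn new(num_tables: usize, bits_per_table: usize) -> Self {
        assert!(bits_per_table >= 1 && bits_per_table <= 64);
        // Deterministic stand-in for random sampling: stride 7 is coprime
        // to 64, so up to 64 consecutive indices map to distinct bits.
        let positions: Vec<Vec<u32>> = (0..num_tables)
            .map(|t| {
                (0..bits_per_table)
                    .map(|i| (((t * bits_per_table + i) * 7) % 64) as u32)
                    .collect::<Vec<u32>>()
            })
            .collect();
        SketchLsh { positions, tables: vec![HashMap::new(); num_tables] }
    }

    /// Bucket key: the sampled bits of `hash`, packed into an integer.
    fn key(positions: &[u32], hash: u64) -> u64 {
        positions.iter().fold(0u64, |acc, &p| (acc << 1) | ((hash >> p) & 1))
    }

    fn insert(&mut self, id: usize, hash: u64) {
        for (pos, table) in self.positions.iter().zip(self.tables.iter_mut()) {
            table.entry(Self::key(pos, hash)).or_default().push(id);
        }
    }

    /// Ids sharing a bucket with `hash` in at least one table.
    fn candidates(&self, hash: u64) -> BTreeSet<usize> {
        let mut out = BTreeSet::new();
        for (pos, table) in self.positions.iter().zip(&self.tables) {
            if let Some(ids) = table.get(&Self::key(pos, hash)) {
                out.extend(ids.iter().copied());
            }
        }
        out
    }
}
```

An item becomes a candidate if it matches the query on all sampled bits of at least one table. More tables give more chances to match (better recall, more memory), and fewer bits per table make each bucket coarser (more candidates to verify). Candidates still need an exact Hamming-distance check afterwards.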

Re-exports§

pub use merge_strategy::AppliedAction;
pub use merge_strategy::MergeExecutor;
pub use merge_strategy::MergeReport;
pub use report::DuplicateGroup;
pub use report::DuplicateReport;
pub use report::SimilarityScore;

Modules§

audio
Audio fingerprinting and similarity detection for deduplication.
audio_fingerprint
Audio fingerprinting for deduplication.
bloom_filter
Near-duplicate detection using a Bloom filter.
bloom_prescreen
Bloom filter pre-screening for deduplication pipelines.
chromagram
Chromagram-based audio feature extraction for music deduplication.
cluster
Duplicate clustering: similarity groups, cluster merging, representative selection.
content_id
Content ID and fingerprinting for media assets.
content_signature
Content-signature types for robust media identification.
cross_format
Cross-format duplicate detection: same content in different containers/codecs.
dedup_cache
LRU cache for deduplication hash lookups.
dedup_index
Persistent-style deduplication index.
dedup_policy
Policy types for controlling deduplication behaviour.
dedup_queue
Deduplication work queue with priority scheduling.
dedup_report
Deduplication reporting: statistics, summaries, and formatted reports.
dedup_report_detailed
Detailed deduplication reporting with disk-space savings, confidence scores, and action recommendations.
dedup_report_ext
Extended deduplication reporting and statistics.
dedup_stats
Extended deduplication statistics: space savings, group statistics, action recommendations.
exact_match
Exact file and content matching for deduplication.
frame_hash
Frame-level hash types for fast perceptual deduplication.
fuzzy_match
Fuzzy / approximate matching for media deduplication.
hash
Cryptographic and content-based hashing for deduplication.
hash_store
Persistent hash store for deduplication lookups.
hierarchical
Hierarchical deduplication: fast pass (hash) -> medium pass (perceptual) -> slow pass (SSIM).
incremental
Incremental deduplication: only scan new or modified files.
lsh_index
Locality-Sensitive Hashing (LSH) index for approximate nearest-neighbour deduplication of high-dimensional media feature vectors.
merge_strategy
Merge strategies for resolving duplicate file groups.
metadata
Metadata-based deduplication and fuzzy matching.
minhash
MinHash-based approximate similarity estimation.
near_duplicate
Near-duplicate detection using locality-sensitive hashing (LSH).
near_duplicate_cluster
Near-duplicate clustering with union-find, threshold-based merging, cluster representative selection, and cluster statistics.
network_dedup
Network-aware deduplication for distributed media libraries.
parallel_indexer
Parallel bulk file indexing for large media libraries.
perceptual_hash
Perceptual hashing for image/video deduplication.
persistent_cache
Cross-session persistent cache for decoded thumbnails and media fingerprints.
phash
Perceptual hashing (pHash) and near-duplicate detection for video frames.
progress
Progress reporting callbacks for long-running deduplication operations.
report
Duplicate detection reports and recommendations.
rolling_hash
Rolling hash for content-defined chunking in media deduplication.
segment_dedup
Segment-level deduplication for media streams.
signature_store
In-memory signature storage with lookup and expiration.
similarity_index
Similarity index: fast lookup structures for near-duplicate candidate retrieval.
space_savings
Disk space savings estimation for duplicate file groups.
stream_dedup
Streaming duplicate detection without loading entire files into memory.
video_dedup
Video-level deduplication.
video_dedup_pipeline
Full video deduplication pipeline.
video_segment_dedup
Video segment deduplication using perceptual hashing and temporal windowing.
visual
Visual similarity detection for image and video deduplication.

Structs§

DedupConfig
Configuration for deduplication.
DedupStats
Deduplication statistics.

Enums§

DedupError
Deduplication error type.
DetectionStrategy
Detection strategy for finding duplicates.

Type Aliases§

DedupResult
Deduplication result type.