Crate vid_dup_finder_lib
vid_dup_finder_lib is a library for creating perceptual hashes of video files and using those hashes to find duplicates.
How it works
The library generates hashes from the following information:
- Video duration
- The discrete cosine transform of ten frames from the first 30 seconds (the “spatial” component)
- The change in magnitude of each frequency component of ten frames from the first 30 seconds (the “temporal” component)
Therefore, for a given tolerance, this library will find videos which “look the same”, “have the same movement” and “have the same length”.
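As a rough illustration of the spatial and temporal components listed above, here is a minimal sketch using only the standard library. It is not this library's implementation: the 8x8 frame size, the naive DCT-II, the median threshold and the names dct2, spatial_bits and temporal_bits are all assumptions made for the example.

```rust
use std::f64::consts::PI;

/// Side length of the toy grayscale frames used in this sketch (an assumption).
const N: usize = 8;
type Frame = [[f64; N]; N];

/// Naive, unnormalized 2D DCT-II of one frame.
fn dct2(frame: &Frame) -> Frame {
    let mut out = [[0.0; N]; N];
    for u in 0..N {
        for v in 0..N {
            let mut sum = 0.0;
            for x in 0..N {
                for y in 0..N {
                    sum += frame[x][y]
                        * ((PI * (2.0 * x as f64 + 1.0) * u as f64) / (2.0 * N as f64)).cos()
                        * ((PI * (2.0 * y as f64 + 1.0) * v as f64) / (2.0 * N as f64)).cos();
                }
            }
            out[u][v] = sum;
        }
    }
    out
}

/// "Spatial" bits: one bit per frequency component, set when the component
/// is greater than the median component of the same frame.
fn spatial_bits(frame: &Frame) -> Vec<bool> {
    let coeffs: Vec<f64> = dct2(frame).iter().flatten().copied().collect();
    let mut sorted = coeffs.clone();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = sorted[sorted.len() / 2];
    coeffs.iter().map(|&c| c > median).collect()
}

/// "Temporal" bits: for each pair of consecutive frames, one bit per frequency
/// component, set when that component's magnitude increased.
fn temporal_bits(frames: &[Frame]) -> Vec<bool> {
    let dcts: Vec<Frame> = frames.iter().map(dct2).collect();
    dcts.windows(2)
        .flat_map(|w| {
            let (a, b) = (w[0], w[1]);
            (0..N * N).map(move |i| b[i / N][i % N].abs() > a[i / N][i % N].abs())
        })
        .collect()
}
```

In this toy version a sequence of identical frames yields temporal bits that are all unset, which matches the intuition that the temporal component captures movement rather than appearance.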
High Level API
First, provide the paths to a set of video files and turn them into hashes. Then, use one of the duplicate detection functions to discover which videos are duplicates of each other.
use vid_dup_finder_lib::VideoHash;
use vid_dup_finder_lib::NormalizedTolerance;

// Paths to some vids to search for duplicates.
// Let's assume the first two videos are duplicates and the third is unrelated.
let vids = [dup_vid_path_1, dup_vid_path_2, other_vid_path];
let hashes = vids.iter().map(VideoHash::from_path).map(Result::unwrap);

// Perform the search. dup_groups will detect the two duplicate videos.
let tolerance = NormalizedTolerance::default();
let dup_groups = vid_dup_finder_lib::search(hashes, tolerance);

// Expect exactly one group, and that group contains the two duplicates.
assert_eq!(dup_groups.len(), 1);
assert_eq!(dup_groups[0].len(), 2);
The following search functions are available:
- To find all duplicate videos within a set: search
- To find all duplicate videos using a set of reference videos: search_with_references
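To make the difference between the two search modes concrete, here is a minimal standard-library sketch of grouping by a normalized distance. It illustrates the idea under stated assumptions and is not this crate's algorithm: hashes are plain u64 values, the distance is the Hamming distance divided by 64, and toy_search and toy_search_with_references are hypothetical names.

```rust
/// Normalized distance between two toy 64-bit hashes, in the range 0.0..=1.0.
fn distance(a: u64, b: u64) -> f64 {
    f64::from((a ^ b).count_ones()) / 64.0
}

/// Search within one set: each hash joins the first group whose first member
/// is within `tolerance`, otherwise it starts a new group. Only groups with
/// at least two members are returned, as indices into `hashes`.
fn toy_search(hashes: &[u64], tolerance: f64) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    for (i, &h) in hashes.iter().enumerate() {
        let found = groups
            .iter()
            .position(|g| distance(hashes[g[0]], h) <= tolerance);
        match found {
            Some(idx) => groups[idx].push(i),
            None => groups.push(vec![i]),
        }
    }
    groups.retain(|g| g.len() > 1);
    groups
}

/// Search with references: for each reference hash, collect the indices of
/// `new_hashes` within `tolerance`. References with no match are omitted.
fn toy_search_with_references(
    refs: &[u64],
    new_hashes: &[u64],
    tolerance: f64,
) -> Vec<(usize, Vec<usize>)> {
    refs.iter()
        .enumerate()
        .map(|(r, &ref_hash)| {
            let dups: Vec<usize> = new_hashes
                .iter()
                .enumerate()
                .filter(|&(_, &h)| distance(ref_hash, h) <= tolerance)
                .map(|(i, _)| i)
                .collect();
            (r, dups)
        })
        .filter(|(_, dups)| !dups.is_empty())
        .collect()
}
```

The first function compares members of a single set against each other, while the second only ever compares new hashes against the references, so two new videos that match each other but no reference are not reported.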
To generate the hashes, this library must decode the first 30 seconds of each video it processes. If there are a lot of videos, this takes a very long time. There is a companion crate called video_hash_filesystem_cache which stores hashes on disk between searches, reducing the amount of time spent loading videos.
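The caching idea can be sketched without relying on the companion crate's actual API, which is not shown here. The example below is a hypothetical in-memory cache keyed by path and modification time; HashCache, get_or_compute and the u64 hash type are assumptions made for illustration, and persisting the entries to disk is left out.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Hypothetical cache: remembers one hash per path, together with the
/// modification time the file had when the hash was computed.
#[derive(Default)]
struct HashCache {
    entries: HashMap<PathBuf, (SystemTime, u64)>,
}

impl HashCache {
    /// Return the cached hash if `mtime` is unchanged. Otherwise call
    /// `compute` (standing in for the slow video-decoding step) and
    /// store its result for next time.
    fn get_or_compute(
        &mut self,
        path: &Path,
        mtime: SystemTime,
        compute: impl FnOnce() -> u64,
    ) -> u64 {
        if let Some(&(cached_mtime, hash)) = self.entries.get(path) {
            if cached_mtime == mtime {
                return hash;
            }
        }
        let hash = compute();
        self.entries.insert(path.to_path_buf(), (mtime, hash));
        hash
    }
}
```

Keying on the modification time means an edited file is rehashed, while untouched files skip decoding entirely on later searches.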
The library is specifically designed to find near-duplicate videos (i.e. ones that have not been significantly edited).
Many transformations are capable of defeating it, such as rotation/flipping, watermarking, or time-offsetting.
However, if the transformations are minor (a faint watermark, a small crop, etc.) then this library should still detect duplicate videos.
Because the aim of this library is to find near-duplicates, the temporal and spatial hashes are generated from only the first 30 seconds of video content, which saves time.
As a consequence, the library is completely incapable of detecting whether one video is a portion of another. For example, you cannot use it to find a scene that was copied out of an entire movie.
This library will not defeat “classic” methods of hiding duplicates, such as horizontal mirroring, changing playback speed, or embedding video content in the corner of a static frame.
todo: The concepts in this library could be extended to detect the subset videos described above.
Because this library only checks the first 30 seconds of each video, two videos that are the same length and share the first 30 seconds of video content will be falsely reported as a match. This may occur for TV shows which contain opening credits.
This crate calls FFmpeg from the command line. You must make the ffmpeg and ffprobe executables available on the command line, for example:
- Debian-based systems:
# apt-get install ffmpeg
- Yum-based systems:
# yum install ffmpeg
- Download the correct installer from https://ffmpeg.org/download.html
- Run the installer and install ffmpeg to any directory
- Add the directory into the PATH environment variable
Unfortunately, this requirement exists for technical reasons (no documented, memory-leak-free bindings to FFmpeg or GStreamer exist) and licensing reasons (statically linking to FFmpeg may introduce additional transitive licensing requirements on end users of this library).
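Since the tools are invoked as external commands, it can help to check for them before hashing anything. The following is a minimal sketch using only std::process::Command. The helper names is_on_path and missing_tools are hypothetical and not part of this crate, and the check assumes that running a tool with -version succeeds when it is installed.

```rust
use std::process::{Command, Stdio};

/// Returns true if `program -version` can be spawned and exits successfully.
/// Output is discarded; only the exit status matters.
fn is_on_path(program: &str) -> bool {
    Command::new(program)
        .arg("-version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Names of the required tools that could not be run from the command line.
fn missing_tools() -> Vec<&'static str> {
    ["ffmpeg", "ffprobe"]
        .iter()
        .copied()
        .filter(|tool| !is_on_path(tool))
        .collect()
}
```

An application could call missing_tools() at startup and print installation instructions if the returned list is not empty.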
A group of duplicate videos detected by crate::search or crate::search_with_references.
The distance between two VideoHashes, in the range 0..=1
Tolerance to be applied when searching for duplicates.
The distance between two VideoHash objects.
A hash of a video file, used for video duplicate detection. The hash contains information about the first 30 seconds of a video, and also the duration. Searches will use these data to determine similarity.
Error type for the various reasons why a VideoHash could not be created from a video file.
Search for duplicates within the given hashes, within the given tolerance. Returns groups for all the matching videos. Each group may have multiple entries if multiple videos are duplicates of each other.
Given a set of ‘reference’ videos, search new_hashes for all duplicate videos. Returns a set of groups, one group for each reference video that was matched.