Crate vid_dup_finder_lib

§Overview

vid_dup_finder_lib is a library for finding near-duplicate video files. A near-duplicate video is a file that closely resembles another but may have differences such as format, resolution, quality, or framerate.

However, the library will not match video files that have been rotated or flipped, sped up or slowed down, or embedded in the corner of another video.

§High Level API

First, provide the paths to a set of video files and turn them into hashes. Then, use one of the duplicate detection functions to discover which videos are duplicates of each other.

use std::path::Path;

use vid_dup_finder_lib::VideoHash;
use vid_dup_finder_lib::MatchGroup;
use vid_dup_finder_lib::CreationOptions;
use vid_dup_finder_lib::ffmpeg_builder::VideoHashBuilder;


// Paths (PathBuf values) to some videos to search for duplicates.
// Let's assume the first two videos are duplicates of each other and the third is unrelated.
let vids = [&dup_vid_path_1, &dup_vid_path_2, &other_vid_path];
let builder = VideoHashBuilder::default();
let hashes = vids.iter().map(|vid| builder.hash(vid.to_path_buf()).unwrap());

// Choose a tolerance between 0.0 and 1.0 for searching. Higher values allow
// matches between videos that differ more. The default search tolerance of
// 0.3 is a good starting point, but you can lower it if there are too many
// false positives in your results.
let tolerance = vid_dup_finder_lib::DEFAULT_SEARCH_TOLERANCE;

// Perform the search...
let dup_groups: Vec<MatchGroup> = vid_dup_finder_lib::search(hashes, tolerance);

// dup_groups will contain a single MatchGroup...
assert_eq!(dup_groups.len(), 1);
assert_eq!(dup_groups[0].len(), 2);

// ...with only the duplicated videos inside.
let dup_files: Vec<&Path> = dup_groups[0].duplicates().collect();
assert!(dup_files.contains(&dup_vid_path_1.as_path()));
assert!(dup_files.contains(&dup_vid_path_2.as_path()));
assert!(!dup_files.contains(&other_vid_path.as_path()));

§Prerequisites

This crate calls FFmpeg from the command line. You must make the ffmpeg and ffprobe binaries available on the command line, for example:

  • Debian-based systems: # apt-get install ffmpeg
  • Yum-based systems: # yum install ffmpeg
  • Windows:
    1. Download the correct installer from https://ffmpeg.org/download.html
    2. Run the installer and install ffmpeg to any directory
    3. Add that directory to the PATH environment variable

Unfortunately this requirement exists for technical reasons (there are no documented, memory-leak-free bindings to ffmpeg) and licensing reasons (statically linking to FFmpeg may introduce additional transitive licensing requirements on end users of this library).
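
If you want to fail fast when the binaries are missing, a check along the lines of the sketch below works; it simply shells out to ffmpeg -version and ffprobe -version. The helper is illustrative and not part of this crate's API.

use std::process::{Command, Stdio};

// Hypothetical helper, not part of vid_dup_finder_lib: returns true if both
// `ffmpeg` and `ffprobe` can be spawned and report a version.
fn ffmpeg_available() -> bool {
    ["ffmpeg", "ffprobe"].iter().all(|bin| {
        Command::new(bin)
            .arg("-version")
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .status()
            .map(|status| status.success())
            .unwrap_or(false)
    })
}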

§How it works

To generate a hash from a video file, this library reads N (e.g. 64) frames from the first few seconds of the video (or, if the file is shorter, N frames evenly spaced across the entire video). It then resizes each frame down to NxN pixels, forming an NxNxN matrix. The three-dimensional discrete cosine transform of this matrix is calculated, resulting in another NxNxN matrix. Due to the "energy compaction" property of the DCT, the MxMxM "sub matrix" (e.g. 5x5x5) at one corner of the NxNxN matrix contains the majority of the information about the content of the video. A hash is built where each bit is the sign (positive or negative) of the corresponding bin in that MxMxM cube. The length of the video is also included in the hash, as this can be used to speed up searching.
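
The sketch below (not the crate's internal code) illustrates the hashing idea described above: a naive separable 3D DCT over an NxNxN cube of grayscale frames, keeping only the signs of the MxMxM low-frequency corner. Frame decoding, resizing, and the stored video duration are omitted, and a real implementation would use a fast DCT rather than this naive version.

// Illustrative only: `frames[t][y][x]` holds N frames of NxN grayscale pixels.

fn dct_1d(input: &[f64]) -> Vec<f64> {
    // Naive DCT-II along one axis (a real implementation would use a fast transform).
    let n = input.len() as f64;
    (0..input.len())
        .map(|k| {
            input
                .iter()
                .enumerate()
                .map(|(i, &x)| {
                    x * (std::f64::consts::PI / n * (i as f64 + 0.5) * k as f64).cos()
                })
                .sum::<f64>()
        })
        .collect()
}

fn sign_bits_of_dct_corner(frames: &[Vec<Vec<f64>>], m: usize) -> Vec<bool> {
    let n = frames.len();
    let mut cube = frames.to_vec();

    // Separable 3D DCT: transform along x, then y, then t.
    for plane in cube.iter_mut() {
        for row in plane.iter_mut() {
            let transformed = dct_1d(row);
            *row = transformed;
        }
    }
    for t in 0..n {
        for x in 0..n {
            let col: Vec<f64> = (0..n).map(|y| cube[t][y][x]).collect();
            for (y, v) in dct_1d(&col).into_iter().enumerate() {
                cube[t][y][x] = v;
            }
        }
    }
    for y in 0..n {
        for x in 0..n {
            let line: Vec<f64> = (0..n).map(|t| cube[t][y][x]).collect();
            for (t, v) in dct_1d(&line).into_iter().enumerate() {
                cube[t][y][x] = v;
            }
        }
    }

    // Keep only the sign of each bin in the MxMxM low-frequency corner.
    let mut bits = Vec::with_capacity(m * m * m);
    for t in 0..m {
        for y in 0..m {
            for x in 0..m {
                bits.push(cube[t][y][x] >= 0.0);
            }
        }
    }
    bits
}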

You can then use the library to search with these hashes. Searches return groups of videos that have a similar length and whose hashes differ by less than a chosen threshold.
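
As a rough illustration of the comparison step (the crate's actual distance calculation and grouping logic may differ), two bit-vector hashes can be compared by their normalized Hamming distance and treated as a match when that distance is within the tolerance:

// Illustrative only: compare two equal-length bit vectors by the fraction of
// differing bits, and treat them as a match if that fraction is within the tolerance.
fn within_tolerance(a: &[bool], b: &[bool], tolerance: f64) -> bool {
    assert_eq!(a.len(), b.len());
    let differing = a.iter().zip(b).filter(|(x, y)| x != y).count();
    differing as f64 / a.len() as f64 <= tolerance
}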

§Search functions

The following search functions are available (see the sketch below for a reference-based example):

  • search — search for duplicates within a single set of hashes, within a given tolerance.
  • search_with_references — search a set of new hashes for videos that are duplicates of videos in a set of reference hashes.
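
As a rough sketch of a reference-based search, something like the following could be used to check newly added videos against an existing collection (using the same imports as the example above). The argument order of search_with_references shown here is an assumption; consult the function's documentation for the exact signature.

// Hedged sketch: the (ref_hashes, new_hashes, tolerance) argument order is assumed.
let ref_hashes: Vec<VideoHash> = Vec::new(); // hashes of videos already in a collection
let new_hashes: Vec<VideoHash> = Vec::new(); // hashes of newly acquired videos
let tolerance = vid_dup_finder_lib::DEFAULT_SEARCH_TOLERANCE;
let groups = vid_dup_finder_lib::search_with_references(ref_hashes, new_hashes, tolerance);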

§Caching

To generate the hashes, this library must decode the first 20 seconds of each video it processes; if there are a lot of videos this takes a very long time. There is a companion crate called video_hash_filesystem_cache which stores hashes in a cache on disk between searches, reducing the amount of time spent loading videos.
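
The underlying idea is simply to avoid rehashing files whose hashes are already known. The sketch below shows the shape of that logic with an in-memory map; it is not video_hash_filesystem_cache's API (that crate persists the cache to disk between searches, which this sketch does not):

use std::collections::HashMap;
use std::path::{Path, PathBuf};
use vid_dup_finder_lib::ffmpeg_builder::VideoHashBuilder;
use vid_dup_finder_lib::VideoHash;

// Hypothetical helper, not part of either crate: hash a video only if its
// path is not already present in the cache.
fn hash_with_cache<'a>(
    cache: &'a mut HashMap<PathBuf, VideoHash>,
    path: &Path,
) -> &'a VideoHash {
    cache.entry(path.to_path_buf()).or_insert_with(|| {
        VideoHashBuilder::default()
            .hash(path.to_path_buf())
            .unwrap()
    })
}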

§Limitations

The library is specifically designed to find near-duplicate videos (i.e. ones that have not been significantly edited). Many transformations are capable of defeating it, such as rotation/flipping, watermarking, or time-offsetting.
However, if the transformations are minor (a faint watermark, a small crop, etc.) then this library should still detect duplicate videos.

Because the aim of this library is to find near-duplicates, the hashes are generated from the first 20 seconds of video content to save time.

This library will not defeat "classic" methods of hiding duplicates, such as horizontal mirroring, changing the playback speed, or embedding video content in the corner of a static frame.

§False Positives

Because this library only checks the first few seconds of each video, two videos that are the same length and share the same opening seconds of content will be reported as duplicates even if the rest of their content differs. This can occur, for example, with episodes of a TV show that share opening credits.

Modules§

  • A factory for video hashes, using the ffmpeg backend. (This is the preferred backend, as it is more reliable than the gstreamer backend.)

Structs§

  • Options for how videos will be processed when generating hashes. Can be used to ensure that starting credits are skipped.
  • A group of duplicate videos detected by crate::search or crate::search_with_references.
  • A hash of a video file, used for video duplicate detection. The hash contains information about the first 30 seconds of a video, and also its duration. Searches use this data to determine similarity.

Enums§

  • Algorithms to detect black bars around the edges of video frames
  • An error that prevented a video hash from being created.

Constants§

  • The default tolerance when performing searches. A value of 0.0 means videos will only be paired if their hashes are identical; a value of 1.0 means a video hash will match any other. It is recommended to start with a high value (e.g. 0.35) and lower it if there are too many false positives.
  • The default time at the start of the video to generate hashes from. Lower values speed up the hashing process because less video data needs to be extracted. Higher values produce slightly more reliable hashes.
  • The default time to skip forward before extracting video frames. Used to skip past title credits and/or overlays at the beginning of videos. Higher values extend hashing time (because seeking to this point in the video must be done accurately); lower values risk not skipping far enough to avoid title credits etc.

Functions§

  • Search for duplicates within the given hashes, within the given tolerance. Returns one group for each set of matching videos; a group may contain more than two entries if several videos are all duplicates of each other.
  • Search new_hashes for all videos that are duplicates of videos in ref_hashes. Returns a set of groups, one group for each reference video that was matched.