Function fclones::group_files

pub fn group_files(
    config: &GroupConfig,
    log: &Log
) -> Result<Vec<FileGroup<Path>>, Error>

Groups identical files together by 128-bit hash of their contents. Depending on filtering settings, can find unique, duplicate, over- or under-replicated files.

Input

The input set of files or paths to scan should be given in the config.paths property. When config.recursive is set to true, the search descends into subdirectories recursively (default is false).

Output

Returns a vector of groups of absolute paths. Each group of files has a common hash and length. Groups are sorted descending by file size.
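To illustrate the shape of the result, here is a minimal sketch of consuming such a vector. `Group` and its fields are hypothetical stand-ins for the crate's `FileGroup<Path>` type, used here only so the snippet is self-contained:

```rust
use std::path::PathBuf;

// Hypothetical stand-in for fclones::FileGroup<Path>; the field names
// are illustrative, not the crate's actual API.
pub struct Group {
    pub file_len: u64,       // common length of every file in the group
    pub files: Vec<PathBuf>, // absolute paths of the identical files
}

// Bytes reclaimable by keeping one copy per group.
pub fn reclaimable_bytes(groups: &[Group]) -> u64 {
    groups
        .iter()
        .map(|g| g.file_len * g.files.len().saturating_sub(1) as u64)
        .sum()
}

fn main() {
    let groups = vec![
        Group { file_len: 4096, files: vec![PathBuf::from("/c"), PathBuf::from("/d")] },
        Group { file_len: 1024, files: vec![PathBuf::from("/a"), PathBuf::from("/b")] },
    ];
    // Groups arrive sorted descending by file size.
    assert!(groups[0].file_len >= groups[1].file_len);
    println!("reclaimable: {} bytes", reclaimable_bytes(&groups));
}
```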

Errors

An error is returned immediately if the configuration is invalid. I/O errors during processing are logged as warnings, and unreadable files are skipped. Any panics are likely the result of a bug and should be reported.

Performance characteristics

The worst-case running time is roughly proportional to the time required to open and read all files. Depending on the number of matching files and the parameters of the query, that time can be lower, because some files can be skipped in some stages of processing. The expected memory utilisation is roughly proportional to the number of files and the lengths of their paths.

Threading

This function blocks the caller’s thread until all files are processed. To speed up processing, it spawns multiple threads internally. Some processing is performed on the default Rayon thread pool, so this function must not be called from that pool, or a deadlock may occur. The parallelism level is set automatically based on the type of storage and can be overridden in the configuration.
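One way to satisfy this constraint is to run the blocking call on a dedicated OS thread and join on it. In the sketch below, `blocking_scan` is a hypothetical stand-in for the blocking, internally parallel call:

```rust
use std::thread;

// Hypothetical stand-in for a blocking call like fclones::group_files;
// returns a dummy group count for illustration.
fn blocking_scan() -> usize {
    42
}

fn main() {
    // Don't call the blocking function from inside a Rayon worker: it
    // schedules work on the same pool and can deadlock. Hand it to a
    // dedicated OS thread and join on the result instead.
    let handle = thread::spawn(blocking_scan);
    let groups = handle.join().expect("scan thread panicked");
    println!("found {groups} groups");
}
```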

Algorithm

Files are grouped in multiple stages and filtered after each stage. Files that turn out to be unique at some point are skipped from further stages. Stages are ordered by increasing I/O cost. On rotational drives, an attempt is made to sort files by physical data location before each grouping stage to reduce disk seek times.

  1. Create the list of files to process by walking the directory tree if recursive mode is selected.
  2. Get the length and identifier of each file.
  3. Group files by length.
  4. In each group, remove duplicate files with the same identifier.
  5. Group files by hash of the prefix.
  6. Group files by hash of the suffix.
  7. Group files by hash of their full contents.
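The staged group-and-filter pipeline above can be sketched with in-memory byte buffers standing in for files. This is only an illustration of the filtering idea; the real implementation reads from disk and groups by a 128-bit content hash:

```rust
use std::collections::HashMap;
use std::hash::Hash;

// One grouping stage: bucket "files" (in-memory byte buffers here) by a
// key, then drop singleton buckets. Unique files leave the pipeline, so
// later, more expensive stages see fewer files.
fn group_by<K: Hash + Eq>(
    files: Vec<Vec<u8>>,
    key: impl Fn(&[u8]) -> K,
) -> Vec<Vec<Vec<u8>>> {
    let mut map: HashMap<K, Vec<Vec<u8>>> = HashMap::new();
    for f in files {
        map.entry(key(f.as_slice())).or_default().push(f);
    }
    map.into_values().filter(|g| g.len() > 1).collect()
}

fn main() {
    let files = vec![b"hello".to_vec(), b"help!".to_vec(), b"hi".to_vec()];
    // Stage 3: group by length -- "hi" is unique and drops out.
    let by_len = group_by(files, |f| f.len());
    assert_eq!(by_len.len(), 1);
    // Stages 5-7: within each group, regroup by content (a hash of it,
    // in the real tool); "hello" and "help!" differ, so nothing survives.
    let dupes: Vec<_> = by_len
        .into_iter()
        .flat_map(|g| group_by(g, |f| f.to_vec()))
        .collect();
    assert!(dupes.is_empty());
}
```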

Example

use fclones::log::Log;
use fclones::config::GroupConfig;
use fclones::{group_files, write_report};
use std::path::PathBuf;

let log = Log::new();
let mut config = GroupConfig::default();
config.paths = vec![PathBuf::from("/path/to/a/dir")];

let groups = group_files(&config, &log).unwrap();
println!("Found {} groups: ", groups.len());

// print standard fclones report to stdout:
write_report(&config, &log, &groups).unwrap();