# YADF — Yet Another Dupes Finder
It's fast on my machine.
You should probably use fclones.
## Installation
### Prebuilt Packages
Executable binaries for some platforms are available in the releases section.
### Building from source
- Install the Rust toolchain
- Run:

```
cargo install yadf
```
## Usage
yadf defaults:
- searches the current working directory `$PWD`
- output format is the same as the "standard" fdupes: newline-separated groups
- descends automatically into subdirectories
- includes every file in the search (even empty files)
### Filtering
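A few illustrative invocations built from the flags in the help output below; the paths, sizes, and patterns are made-up examples, not defaults:

```
yadf ~/Documents ~/Downloads    # search two directories
yadf --min 100K --max 25M       # only files between 100 KB and 25 MB
yadf --pattern '*.jpg'          # only files whose name matches a glob
yadf --regex '\.jpe?g$'         # only files whose name matches a regex
```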
### Formatting
Look up the help for the list of output formats: `yadf -h`.
```
yadf 0.13.1
Yet Another Dupes Finder

USAGE:
    yadf [FLAGS] [OPTIONS] [paths]...

FLAGS:
    -H, --hard-links    Treat hard links to same file as duplicates
    -h, --help          Prints help information
    -n, --no-empty      Excludes empty files
    -q, --quiet         Pass many times for less log output
    -V, --version       Prints version information
    -v, --verbose       Pass many times for more log output

OPTIONS:
    -a, --algorithm <algorithm>    Hashing algorithm [default: AHash] [possible values: AHash,
                                   Highway, MetroHash, SeaHash, XxHash]
    -f, --format <format>          Output format [default: Fdupes] [possible values: Csv, Fdupes,
                                   Json, JsonPretty, LdJson, Machine]
        --max <size>               Maximum file size
    -d, --depth <depth>            Maximum recursion depth
        --min <size>               Minimum file size
    -p, --pattern <glob>           Check files with a name matching a glob pattern, see:
                                   https://docs.rs/globset/0.4.6/globset/index.html#syntax
    -R, --regex <regex>            Check files with a name matching a Perl-style regex, see:
                                   https://docs.rs/regex/1.4.2/regex/index.html#syntax
        --rfactor <rfactor>        Replication factor [under|equal|over]:n

ARGS:
    <paths>...    Directories to search

For sizes, K/M/G/T[B|iB] suffixes can be used (case-insensitive).
```
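For instance, assuming the format names are accepted as listed above (my own illustration, not from the help text):

```
yadf --format json . > dupes.json   # machine-readable JSON groups
yadf -f csv .                       # comma-separated values
```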
## Notes on the algorithm
Most¹ dupe finders follow a three-step algorithm:
- group files by their size
- group files by their first few bytes
- group files by their entire content
yadf skips the first step and performs only steps 2 and 3, preferring hashing over byte comparison. In my tests, the first step actually slowed the program down on an SSD.
yadf makes heavy use of the standard library's BTreeMap; its cache-aware implementation avoids excessive cache misses. yadf uses the parallel walker provided by ignore (disabling its ignore features) and rayon's parallel iterators to run each of these two steps in parallel.
¹: some need a different algorithm to support different features or different performance trade-offs
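The two hashing steps can be sketched in plain Rust. This is an illustrative, in-memory version, not yadf's actual code: it uses the standard library's `DefaultHasher` where yadf uses AHash and friends, and it hashes byte slices instead of reading files from disk, sequentially rather than in parallel.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::Hasher;

/// Hash a byte slice with the standard library's default hasher.
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Two-step duplicate grouping over in-memory "files":
/// 1. bucket files by the hash of their first few bytes,
/// 2. within each bucket of two or more, bucket again by full-content hash.
fn find_dupes<'a>(files: &[(&'a str, &'a [u8])]) -> Vec<Vec<&'a str>> {
    const PREFIX: usize = 4096;

    // Step 1: group by prefix hash.
    let mut by_prefix: BTreeMap<u64, Vec<(&'a str, &'a [u8])>> = BTreeMap::new();
    for &(path, bytes) in files {
        let prefix = &bytes[..bytes.len().min(PREFIX)];
        by_prefix.entry(hash_bytes(prefix)).or_default().push((path, bytes));
    }

    // Step 2: within each candidate bucket, group by full-content hash,
    // keeping only groups with at least two files.
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    for bucket in by_prefix.values() {
        if bucket.len() < 2 {
            continue; // a lone prefix hash cannot be a duplicate
        }
        let mut by_full: BTreeMap<u64, Vec<&'a str>> = BTreeMap::new();
        for &(path, bytes) in bucket {
            by_full.entry(hash_bytes(bytes)).or_default().push(path);
        }
        groups.extend(by_full.into_values().filter(|group| group.len() > 1));
    }
    groups
}

fn main() {
    let files: Vec<(&str, &[u8])> = vec![
        ("a.txt", b"hello world" as &[u8]),
        ("b.txt", b"hello world"),
        ("c.txt", b"something else"),
    ];
    let groups = find_dupes(&files);
    assert_eq!(groups, vec![vec!["a.txt", "b.txt"]]);
    println!("duplicate groups: {:?}", groups);
}
```

The real tool additionally hashes only the first block of each file in step 2 and re-reads whole files in step 3; collapsing both steps onto in-memory slices keeps the sketch short while preserving the bucketing structure.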
## Design goals
I set out to build a high-performing artefact by assembling libraries that do the actual work; nothing here is custom-made, it's all "off-the-shelf" software.
## Benchmarks
The performance of yadf is heavily tied to the hardware, specifically to NVMe SSDs. I recommend fclones, as it has more hardware heuristics and, in general, more features; yadf's performance on HDDs is terrible.
My home directory contains upwards of 700k paths and 39 GB of data, and is probably a pathological case of file duplication with all the node_modules, python virtual environments, rust target, etc. Arguably, the most important measure here is the mean time when the filesystem cache is cold.
| Program (warm filesystem cache) | Version | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|---|
| fclones | 0.29.3 | 7.435 ± 1.609 | 4.622 | 9.317 | 2.35 ± 0.70 |
| jdupes | 1.14.0 | 16.787 ± 0.208 | 16.484 | 17.178 | 5.32 ± 1.08 |
| ddh | 0.13 | 12.703 ± 1.547 | 10.814 | 14.793 | 4.02 ± 0.95 |
| dupe-krill | 1.4.7 | 15.555 ± 1.633 | 12.486 | 16.959 | 4.93 ± 1.12 |
| fddf | 1.7.0 | 18.441 ± 1.947 | 15.097 | 22.389 | 5.84 ± 1.33 |
| yadf | 1.1.0 | 3.157 ± 0.638 | 2.362 | 4.175 | 1.00 |
| Program (cold filesystem cache) | Version | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|---|
| fclones | 0.29.3 | 68.950 ± 3.694 | 63.165 | 73.534 | 2.30 ± 0.36 |
| jdupes | 1.14.0 | 303.907 ± 11.578 | 277.618 | 314.226 | 10.16 ± 1.53 |
| yadf | 1.1.0 | 52.481 ± 1.125 | 50.412 | 54.265 | 1.75 ± 0.26 |
I test fewer programs here because the cold-cache runs take several hours.
The script used to benchmark can be read here.
Extract from `neofetch` and `hwinfo --disk`:
- OS: Ubuntu 20.04.1 LTS x86_64
- Host: XPS 15 9570
- Kernel: 5.4.0-42-generic
- CPU: Intel i9-8950HK (12) @ 4.800GHz
- Memory: 4217MiB / 31755MiB
- Disk:
- model: "SK hynix Disk"
- driver: "nvme"