YADF — Yet Another Dupes Finder
It's fast on my machine.
Installation
Prebuilt Packages
Executable binaries for some platforms are available in the releases section.
Building from source
- Install the Rust toolchain
- Run
cargo install yadf
Usage
yadf defaults:
- search current working directory $PWD
- output format is the same as the "standard" fdupes, newline separated groups
- descends automatically into subdirectories
- search includes every file (including empty files)
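With those defaults, a bare invocation scans the working directory, and explicit paths can be passed as arguments (the paths below are only illustrative):

```sh
yadf                       # scan $PWD, print newline separated groups of duplicates
yadf ~/Documents ~/Videos  # scan several directories at once
```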
Filtering
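Files can be filtered on size or on name with the options documented in the help below; the values here are only illustrative:

```sh
yadf --min 100M               # ignore files smaller than 100 MB
yadf --max 4KiB               # ignore files larger than 4 KiB
yadf --pattern '*.jpg'        # only check files whose name matches the glob
yadf --regex '^[0-9]+\.mp4$'  # only check files whose name matches the regex
```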
Formatting
Look up the help, yadf -h, for the list of output formats.
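For instance, a machine-readable report could be produced with something like:

```sh
yadf --format Json ~/Documents        # JSON document on stdout
yadf -f Csv ~/Documents > dupes.csv   # CSV, redirected to a file
```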
```
yadf 0.13.1
Yet Another Dupes Finder

USAGE:
    yadf [FLAGS] [OPTIONS] [paths]...

FLAGS:
    -H, --hard-links    Treat hard links to same file as duplicates
    -h, --help          Prints help information
    -n, --no-empty      Excludes empty files
    -q, --quiet         Pass many times for less log output
    -V, --version       Prints version information
    -v, --verbose       Pass many times for more log output

OPTIONS:
    -a, --algorithm <algorithm>    Hashing algorithm [default: AHash]
                                   [possible values: AHash, Highway, MetroHash, SeaHash, XxHash]
    -f, --format <format>          Output format [default: Fdupes]
                                   [possible values: Csv, Fdupes, Json, JsonPretty, LdJson, Machine]
        --max <size>               Maximum file size
    -d, --depth <depth>            Maximum recursion depth
        --min <size>               Minimum file size
    -p, --pattern <glob>           Check files with a name matching a glob pattern, see:
                                   https://docs.rs/globset/0.4.6/globset/index.html#syntax
    -R, --regex <regex>            Check files with a name matching a Perl-style regex, see:
                                   https://docs.rs/regex/1.4.2/regex/index.html#syntax
        --rfactor <rfactor>        Replication factor [under|equal|over]:n

ARGS:
    <paths>...    Directories to search

For sizes, K/M/G/T[B|iB] suffixes can be used (case-insensitive).
```
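The --rfactor option filters groups by how many times identical content occurs; reading its [under|equal|over]:n syntax literally, usage would look like:

```sh
yadf --rfactor over:1   # content present more than once, i.e. regular duplicates
yadf --rfactor under:2  # content present less than twice, i.e. unique files
```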
Notes on the algorithm
Most¹ dupe finders follow a 3-step algorithm:
- group files by their size
- group files by their first few bytes
- group files by their entire content
yadf skips the first step and only does steps 2 and 3, preferring hashing over byte-for-byte comparison. In my tests, having the first step on an SSD actually slowed the program down.
yadf makes heavy use of the standard library's BTreeMap; its cache-aware implementation avoids too many cache misses. yadf uses the parallel walker provided by ignore (disabling its ignore features) and rayon's parallel iterators to do each of these two steps in parallel.
¹: some need a different algorithm to support different features or different performance trade-offs
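To make those two steps concrete, here is a minimal, sequential sketch of the same idea built only on the standard library. It is not yadf's actual code: the real program uses one of the hashers listed above instead of DefaultHasher, walks directories with ignore, and parallelizes both steps with rayon; the 4 KiB prefix length is also just an arbitrary stand-in.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Hash only the first few bytes of a file (step 2).
fn hash_prefix(path: &Path) -> io::Result<u64> {
    let mut buf = Vec::new();
    File::open(path)?.take(4096).read_to_end(&mut buf)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&buf);
    Ok(hasher.finish())
}

/// Hash the entire contents of a file (step 3).
fn hash_contents(path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut file = File::open(path)?;
    let mut buf = [0u8; 8192];
    loop {
        let read = file.read(&mut buf)?;
        if read == 0 {
            break;
        }
        hasher.write(&buf[..read]);
    }
    Ok(hasher.finish())
}

/// Group paths into sets of probable duplicates, skipping the group-by-size step.
fn find_dupes(files: Vec<PathBuf>) -> Vec<Vec<PathBuf>> {
    // Step 2: group by the hash of the first few bytes.
    let mut by_prefix: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new();
    for path in files {
        if let Ok(hash) = hash_prefix(&path) {
            by_prefix.entry(hash).or_default().push(path);
        }
    }
    // Step 3: for groups with more than one candidate, group by the hash of the whole content.
    let mut by_contents: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new();
    for path in by_prefix.into_values().filter(|g| g.len() > 1).flatten() {
        if let Ok(hash) = hash_contents(&path) {
            by_contents.entry(hash).or_default().push(path);
        }
    }
    by_contents.into_values().filter(|g| g.len() > 1).collect()
}

fn main() {
    let files = std::env::args().skip(1).map(PathBuf::from).collect();
    // Print newline separated groups, like the fdupes format.
    for group in find_dupes(files) {
        for path in &group {
            println!("{}", path.display());
        }
        println!();
    }
}
```

Because the comparison is hash based rather than byte-for-byte, two files land in the same group whenever their full-content hashes agree, which is the trade-off mentioned above.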
Design goals
I set out to build a high-performing tool by assembling libraries that do the actual work; nothing here is custom made, it's all "off-the-shelf" software.
Benchmarks
The performance of yadf is heavily tied to the hardware, specifically the NVMe SSD. I recommend fclones, as it has more hardware heuristics and, in general, more features.
My home directory contains upwards of 700k paths and 39 GB of data, and is probably a pathological case of file duplication with all the node_modules, Python virtual environments, Rust target directories, etc. Arguably, the most important column here is the mean time when the filesystem cache is cold.
| Program | Version | Warm Mean time (s) | Cold Mean time (s) |
|---|---|---|---|
| yadf | 0.13.1 | 2.812 | 21.554 |
| fclones | 0.8.0 | 4.111 | 19.452 |
| jdupes | 1.14.0 | 11.815 | 129.132 |
| ddh | 0.11.3 | 10.424 | 27.241 |
| fddf | 1.7.0 | 5.595 | 32.661 |
| rmlint | 2.9.0 | 17.516 | 67.580 |
| dupe-krill | 1.4.4 | 8.791 | 127.860 |
fdupes is excluded from this benchmark because it's really slow.
The script used to benchmark can be read here.
Extract from neofetch and hwinfo --disk:
- OS: Ubuntu 20.04.1 LTS x86_64
- Host: XPS 15 9570
- Kernel: 5.4.0-42-generic
- CPU: Intel i9-8950HK (12) @ 4.800GHz
- Memory: 4217MiB / 31755MiB
- Disk:
- model: "SK hynix Disk"
- driver: "nvme"