compare-dir
A command line tool to compare two directories and show the differences. It can also find changed, corrupted, and duplicated files.
Installation
Prerequisites: Install Rust if it's not installed yet.
Then the following command installs compare-dir
from crates.io:
cargo install compare-dir
See Releases for the change history.
Usages
compare-dir supports the following features.
Use the --compare (-c for short) option
to select one.
| --compare | Meaning |
|---|---|
| auto (default) | Same as check if single argument, hash if two arguments. |
| size | Compare directories. Files are compared only by file sizes. |
| hash | Compare directories. File contents are compared by their hashes. |
| rehash | Same as hash, but recompute hashes without using the data in the hash cache. |
| full | Compare directories. File contents are compared byte-by-byte. |
| check | Find changed or corrupted files. |
| update | Same as check, but also update the hash cache. |
| dup | Find duplicated files. |
Compare Directories
When you back up important data, keep in mind that the backup may not complete correctly, or that the backup copy may later become corrupted. Both happen more often than you might imagine.
The following example compares two directories. Files are compared first by modified time and size. If the sizes are the same, file contents are also compared, to verify that backup copies are not corrupted. Please see compare files for more details.
compare-dir <dir1> <dir2>
Compare Files
When comparing files, a byte-by-byte comparison is faster if you compare them only once, but comparing hashes is faster if you compare them multiple times, because hashes are saved in the hash cache.
Please see the --compare option to change how files are compared.
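As a rough illustration of the order described above (sizes first, contents only when the sizes match), here is a minimal shell sketch built on the standard wc and cmp tools. It does not call compare-dir, the function name classify_pair is hypothetical, and compare-dir's actual implementation may differ.

```shell
# Hypothetical helper: classify two files by size first, then by content.
# Uses the standard wc and cmp tools, not compare-dir itself.
classify_pair() {
    # The arithmetic expansion strips any padding that wc adds to the count.
    size1=$(( $(wc -c < "$1") ))
    size2=$(( $(wc -c < "$2") ))
    if [ "$size1" -gt "$size2" ]; then
        echo "first is larger"
    elif [ "$size1" -lt "$size2" ]; then
        echo "second is larger"
    elif cmp -s "$1" "$2"; then
        # Same size: only now is a byte-by-byte comparison needed.
        echo "same size and contents"
    else
        echo "same size, contents differ"
    fi
}
```

Checking sizes first avoids reading file contents at all when the sizes already differ.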
Output Formats
When comparing two directories,
the output is human-readable by default.
The --out (or -o) option can change the output format.
| --out | Alias | Meaning |
|---|---|---|
| default | d | Human-readable output. |
| symbol | s | Symbolized output, easier for programs to read. See symbols. |
For backward compatibility, --symbol (or -s) is also supported and is equivalent to --out symbol.
Symbols
The symbolized format (--out symbol) output is as follows:
| Position | Character | Meaning |
|---|---|---|
| 1st | = | In both directories. |
| | > | Only in dir1. |
| | < | Only in dir2. |
| 2nd | = | Modified times are the same. |
| | > | dir1 is newer. |
| | < | dir2 is newer. |
| | | Modified times are not comparable. |
| 3rd | = | Same file sizes and contents. |
| | ! | Same file sizes but contents differ. |
| | > | dir1 is larger. |
| | < | dir2 is larger. |
| | | Sizes are not comparable. |
For example:
=>= dir/path
means that dir/path in dir1 is newer than the file in dir2,
but they have the same file sizes and contents.
The following bash example lists the paths of files that have the same size but different contents. Such files often indicate copy failures or corruption.
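compare-dir -o symbol <dir1> <dir2> | grep '^..!' | cut -c5-
If you prefer sed over cut:
compare-dir -o symbol <dir1> <dir2> | sed -n 's/^..!.//p'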
To do this in PowerShell:
(compare-dir -o symbol <dir1> <dir2>) -match '^..!' -replace '^....'
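The symbolized output can also be summarized rather than filtered. The sketch below counts three kinds of lines with awk. It assumes each line consists of the three symbol characters, one separator character, and then the path, as in the example above. The function name summarize_symbols is hypothetical and not part of compare-dir.

```shell
# Hypothetical helper: summarize symbolized output read from stdin.
# Usage: compare-dir -o symbol DIR1 DIR2 | summarize_symbols
summarize_symbols() {
    awk '
        { first = substr($0, 1, 1); third = substr($0, 3, 1) }
        first == ">" { only1++ }                    # only in dir1
        first == "<" { only2++ }                    # only in dir2
        first == "=" && third == "!" { differ++ }   # same size, contents differ
        END {
            printf "only in dir1: %d\n", only1
            printf "only in dir2: %d\n", only2
            printf "same size, contents differ: %d\n", differ
        }
    '
}
```

The third position is inspected only for files present in both directories, so the sketch makes no assumption about what is printed there for files that exist on one side only.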
Find Changed or Corrupted Files
compare-dir can find changed files
by comparing hashes with the previously saved hashes in the hash cache.
This is useful when there could be possible corruptions,
such as after unexpected power down or RAID rebuild.
First, the hash cache needs to be created. There are two ways to do this.
- Comparing directories creates it automatically
  when it runs with the -c hash option (default).
- Use the -c update option.
  compare-dir -c update <dir>
Then the -c check option can find changed files.
compare-dir -c check <dir>
This prints the same output format as compare directories,
as if it compares the current files
against the files when the hash cache was created.
For example, with the -s option (the output format is symbols):
< file1
=<< file2
=<! file3
This means that:
- file1 was added.
- file2 became newer and larger.
- file3 became newer and its content changed, but its size didn't change.
The -c check option doesn't update the hash cache,
so that you can run it multiple times.
If you want to update the hash cache,
use the -c update option instead.
This option prints the same output as -c check,
but also updates the hash cache.
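One pattern that may deserve special attention is a file whose modified time and size are unchanged but whose contents differ, because ordinary edits usually update the modified time. Assuming the symbol format described earlier, such lines start with ==! followed by one separator character. The following sketch extracts their paths. The function name suspect_corruption is hypothetical.

```shell
# Hypothetical helper: read symbolized check output from stdin and print the
# paths whose modified time and size are unchanged but whose contents differ.
# Usage: compare-dir -c check -o symbol DIR | suspect_corruption
suspect_corruption() {
    sed -n 's/^==! //p'
}
```

Whether such a file is actually corrupted still needs to be confirmed by hand, for example by comparing it against a backup copy.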
Find Duplicated Files
compare-dir discovers exact duplicated files
with the -c dup option.
compare-dir -c dup <dir>
Finding duplicated files from multiple directories is also supported.
compare-dir -c dup <dir1> <dir2> <dir3>
Output Formats
| --out | Alias | Meaning |
|---|---|---|
| default | d | Human-readable output. |
| yaml | y, yml | YAML format. |
The --out yaml option (or -o yaml / -o y / -o yml) outputs the results in YAML format.
You can use other tools such as yq to
convert the YAML results to JSON or other formats.
compare-dir -o yaml -c dup <dir> | yq -o json
Backup Strategies
When backing up, there are two strategies you can take.
Strategy 1: Exclude .hash_cache
1. Back up, excluding the hash cache file.
   rsync -av --delete --exclude .hash_cache /path/to/source/ /path/to/backup/
   On Windows, using robocopy:
   robocopy \path\to\source \path\to\backup /MIR /XF .hash_cache
2. Use compare-dir <dir1> <dir2> to verify.
   compare-dir /path/to/source/ /path/to/backup/
3. Check that backup files have not changed or been corrupted
   since the last comparison (step 2 above).
   compare-dir -c check /path/to/backup
This method is suitable for incremental backups, because step 2 computes hashes only for updated files.
Strategy 2: Include .hash_cache
1. Update the hash cache in the source directory.
   compare-dir -c update /path/to/source
2. Back up all files, including the hash cache file.
   rsync -av --delete /path/to/source/ /path/to/backup/
3. Verify that the hashes of the backup files match the cached hashes.
   compare-dir -c check /path/to/backup
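The three steps of Strategy 2 could be wrapped in a small script along these lines. This is a sketch, not part of compare-dir: the names run, backup_with_cache, and DRY_RUN are hypothetical. It makes no assumption about the exit status of compare-dir -c check when differences are found, so the output of the last step should still be read.

```shell
# Hypothetical wrapper around the three steps of Strategy 2.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

backup_with_cache() {
    # Strip one trailing slash so the rsync paths below are built consistently.
    src=${1%/}
    dst=${2%/}
    run compare-dir -c update "$src" &&
    run rsync -av --delete "$src/" "$dst/" &&
    run compare-dir -c check "$dst"
}
```

Running it with DRY_RUN=1 first shows the exact commands before anything is copied or deleted.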
Hash
compare-dir uses the BLAKE3 hash algorithm.
Hash Cache
File hashes are saved to a file named .hash_cache.
How the cache is used
depends on the --compare option.
| --compare | Hash cache usage |
|---|---|
| full, size | Not used. |
| hash, dup | Used if modified time doesn't change, updated otherwise. |
| rehash | Updated. |
| check | Used. |
| update | Used and updated. |
Invalidation
When comparing files with the -c hash option (default),
hashes in the hash cache are used if the modified time doesn't change.
If file contents are changed without changing their modified time,
the cache needs to be invalidated.
You can invalidate the hash cache
by the -c rehash option,
or by deleting the cache file.
The following example shows a scenario where differing contents are found, a new backup copy is made, and the cache is rehashed.
% compare-dir /master /backup
dir1/dir2/file: Contents differ
% cp /master/dir1/dir2/file /backup/dir1/dir2
% compare-dir -c rehash /master/dir1/dir2/file /backup/dir1/dir2
[!NOTE] When the first argument is a file, not a directory, only the specified file is compared. The -c rehash option in this mode invalidates the hash cache only for that file, retaining hash caches for other files in the directory and its subdirectories.
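To see how contents can change while the modified time stays the same, which is the case that requires invalidation, the following sketch uses the standard touch -r command. It is a demonstration only: the file names are hypothetical and compare-dir is not involved.

```shell
# Demonstration only: change a file's contents while keeping its modified time.
dir=$(mktemp -d)
printf 'original' > "$dir/file"
touch -r "$dir/file" "$dir/stamp"    # remember the modified time
printf 'changed!' > "$dir/file"      # same size, different contents
touch -r "$dir/stamp" "$dir/file"    # restore the old modified time
```

After these steps the file has the same size and modified time as before, so a cache keyed on the modified time would still be considered valid even though the contents changed.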
Hash Cache Directory
If a .hash_cache cannot be found in the specified directory,
compare-dir will walk up the file tree until it finds one.
If none is found, a new .hash_cache is created in the specified directory.
For example:
All three runs of compare-dir use
the same hash cache file at ~/data/.hash_cache.
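The lookup described in this section can be sketched in shell as follows. This only illustrates the described behavior and is not compare-dir's actual code. The function name find_hash_cache is hypothetical.

```shell
# Hypothetical sketch of the lookup described above; not compare-dir's code.
# Prints the path of the nearest .hash_cache at or above the given directory,
# or the path where a new one would be created if none exists.
find_hash_cache() {
    start=$(cd "$1" && pwd -P) || return 1
    dir=$start
    while :; do
        if [ -f "${dir%/}/.hash_cache" ]; then
            echo "${dir%/}/.hash_cache"
            return 0
        fi
        # Stop once the root directory has been checked.
        [ "$dir" = "/" ] && break
        dir=$(dirname "$dir")
    done
    echo "${start%/}/.hash_cache"
}
```

With a cache file at ~/data/.hash_cache, calling the function on ~/data or on any directory below it that has no cache of its own prints that same path.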