compare-dir
A command line tool to compare two directories and show the differences. It can also find changed, corrupted, and duplicated files.
Installation
Prerequisites: Install Rust if it's not installed yet.
Then the following command installs compare-dir
from crates.io:
cargo install compare-dir
See Releases for the change history.
Usage
compare-dir supports the following features:
Compare Directories
When you back up important data, keep in mind that the backup may not have completed correctly, or that the backup copy may become corrupted later. Both happen more often than you might imagine.
The following example compares two directories. Files are compared first by modified time and size. If the file sizes are the same, their contents are compared as well, to verify that backup copies are not corrupted. See Compare Files for more details.
compare-dir <dir1> <dir2>
Compare Files
When comparing files, a byte-by-byte comparison is faster if you compare them only once, but comparing hashes is faster if you compare them multiple times, because hashes are saved in the hash cache.
The --compare (or -c) option can change
how files are compared.
| --compare | Meaning |
|---|---|
| size | Compare only by file sizes. |
| hash | Compare file contents by their hashes. |
| rehash | Same as hash, but recompute hashes without using the data in the hash cache. |
| full | Compare file contents byte-by-byte. |
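To illustrate the idea of checking sizes first and contents only when the sizes match, here is a minimal shell sketch. It is not compare-dir's implementation: the function name compare_files and its output strings are made up for this example, and it relies only on the standard wc and cmp utilities.

```shell
# Illustrative sketch only: compare two files by size, then byte-by-byte.
# compare_files and its messages are hypothetical, not part of compare-dir.
compare_files() {
  local a=$1 b=$2 size_a size_b
  size_a=$(wc -c < "$a") || return 2
  size_b=$(wc -c < "$b") || return 2
  if [ "$size_a" -gt "$size_b" ]; then
    echo "first is larger"
  elif [ "$size_a" -lt "$size_b" ]; then
    echo "second is larger"
  elif cmp -s "$a" "$b"; then
    # Same size and identical bytes.
    echo "same"
  else
    # Same size, but at least one byte differs.
    echo "contents differ"
  fi
}
```

The byte-by-byte step runs only when the sizes are equal, which is the case where a silent corruption would otherwise go unnoticed.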
Symbols
When comparing two directories,
the output is human-readable by default.
The --symbol (or -s) option changes the output format to be symbolized,
which is easier for programs to read.
| Position | Character | Meaning |
|---|---|---|
| 1st | = | In both directories. |
| | > | Only in dir1. |
| | < | Only in dir2. |
| 2nd | = | Modified times are the same. |
| | > | dir1 is newer. |
| | < | dir2 is newer. |
| | | Modified times are not comparable. |
| 3rd | = | Same file sizes and contents. |
| | ! | Same file sizes but contents differ. |
| | > | dir1 is larger. |
| | < | dir2 is larger. |
| | | Sizes are not comparable. |
For example:
=>= dir/path
means that dir/path in dir1 is newer than the file in dir2,
but they have the same file sizes and contents.
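A script consuming this format might decode each line along the following lines. This is a sketch under assumptions: the function name describe_symbols and its wording are hypothetical, the path is assumed to start at the fifth column (three symbols plus one separator), and any character not listed in the table is treated as "not comparable" in the 2nd and 3rd positions.

```shell
# Illustrative sketch: describe one line of `compare-dir -s` output.
# describe_symbols is a hypothetical helper, not part of compare-dir.
describe_symbols() {
  local line=$1 presence time size path
  presence=$(printf '%s' "$line" | cut -c1)
  time=$(printf '%s' "$line" | cut -c2)
  size=$(printf '%s' "$line" | cut -c3)
  # Assumption: the path begins at column 5.
  path=$(printf '%s' "$line" | cut -c5-)
  case $presence in
    '=') presence="in both" ;;
    '>') presence="only in dir1" ;;
    '<') presence="only in dir2" ;;
    *)   presence="unknown" ;;
  esac
  case $time in
    '=') time="same modified time" ;;
    '>') time="dir1 is newer" ;;
    '<') time="dir2 is newer" ;;
    *)   time="modified times not comparable" ;;
  esac
  case $size in
    '=') size="same size and contents" ;;
    '!') size="same size, contents differ" ;;
    '>') size="dir1 is larger" ;;
    '<') size="dir2 is larger" ;;
    *)   size="sizes not comparable" ;;
  esac
  printf '%s: %s; %s; %s\n' "$path" "$presence" "$time" "$size"
}
```

For instance, describe_symbols '=>= dir/path' reports that dir/path is in both directories, that dir1 is newer, and that the sizes and contents are the same.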
The following bash example lists the paths of files that have the same size but different contents. Such files often indicate copy failures or corruption.
compare-dir -s <dir1> <dir2> | grep '^..!' | cut -c5-
If you prefer sed over cut:
compare-dir -s <dir1> <dir2> | sed -n 's/^..!.//p'
To do this in PowerShell:
compare-dir -s <dir1> <dir2> | sls '^..!' | %{$_ -replace '^....',''}
Find Changed or Corrupted Files
compare-dir can find changed files
by comparing hashes with the previously saved hashes in the hash cache.
This is useful when corruption is possible,
such as after an unexpected power loss or a RAID rebuild.
First, the hash cache needs to be created. There are two ways to do this.
- Comparing directories creates it automatically when it runs with the -c hash option (default).
- Use the -c update option.
compare-dir -c update <dir>
Then the -c check option can find changed files.
compare-dir -c check <dir>
It prints a symbol, followed by the path.
| Symbol | Meaning |
|---|---|
| + | The file isn't in the hash cache. |
| ! | The file is changed. |
The -c check option doesn't update the hash cache,
so you can run it multiple times.
To update the hash cache,
use the -c update option instead.
It prints the same output as -c check,
but also updates the hash cache.
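Because each output line starts with a symbol, the result is easy to summarize in a script. The following sketch counts new and changed files. The helper name summarize_check is hypothetical, and it assumes the symbol is the first character of each line.

```shell
# Illustrative sketch: count '+' (not in cache) and '!' (changed) lines
# read from standard input. summarize_check is a hypothetical helper.
summarize_check() {
  awk '
    /^\+/ { new++ }
    /^!/  { changed++ }
    END   { printf "new=%d changed=%d\n", new, changed }
  '
}
```

It could be used as compare-dir -c check <dir> | summarize_check, for example in a scheduled job that raises an alert when the changed count is not zero.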
Find Duplicates
compare-dir finds exact duplicate files
with the -c dup option.
compare-dir -c dup <dir>
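The general idea of finding duplicates by hash can be sketched with standard tools. This is only an illustration and not how compare-dir works internally: it uses sha256sum as a stand-in for blake3, does not use any cache, and assumes file names without backslashes or newlines. The function name find_dups is hypothetical.

```shell
# Illustrative sketch: print paths of files whose contents hash identically.
# Uses sha256sum as a stand-in; compare-dir itself uses blake3.
find_dups() {
  find "$1" -type f -exec sha256sum {} + |
    sort |
    awk '{
      hash = substr($0, 1, 64)   # SHA-256 hex digest
      path = substr($0, 67)      # path follows the two-character separator
      if (hash == prev) {
        if (!shown) print prev_path
        print path
        shown = 1
      } else {
        shown = 0
      }
      prev = hash
      prev_path = path
    }'
}
```

Sorting groups equal hashes together, so a single pass over the sorted list is enough to print every file that shares its hash with a neighbor.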
Hash
compare-dir uses the blake3 hash algorithm.
Hash Cache
File hashes are saved to a file named .hash_cache.
How the cache is used
depends on the --compare option.
| --compare | Hash cache usage |
|---|---|
| full, size | Not used. |
| hash, dup | Used if modified time doesn't change, updated otherwise. |
| rehash | Updated. |
| check | Used. |
| update | Used and updated. |
Invalidation
When comparing files with the -c hash option (default),
hashes in the hash cache are used if the modified time doesn't change.
If file contents are changed without changing their modified time,
the cache needs to be invalidated.
You can invalidate the hash cache
by the -c rehash option,
or by deleting the cache file.
The following example shows a scenario where differing contents are found, a new backup copy is made, and the cache is rehashed.
% compare-dir /master /backup
dir1/dir2/file: Contents differ
% cp /master/dir1/dir2/file /backup/dir1/dir2
% compare-dir -c rehash /master/dir1/dir2/file /backup/dir1/dir2
> [!NOTE]
> When the first argument is a file, not a directory, only the specified file is compared. The -c rehash option in this mode invalidates the hash cache only for that file, retaining the cached hashes for other files in the directory and its subdirectories.
Backup Strategy
When backing up, there are two strategies you can take.
Exclude .hash_cache
1. Exclude .hash_cache when backing up.
2. Use compare-dir <dir1> <dir2> to verify.
3. Once the comparison is complete, the hash caches are updated for both directories. You can use compare-dir -c check <backup-dir> to verify that the backup data hasn't changed or been corrupted since the last comparison.

This method is suitable for incremental backups, because step 2 computes hashes only for updated files.
Include .hash_cache
1. Update the cache in the source with compare-dir -c update <source-dir>.
2. Include .hash_cache when backing up.
3. Use compare-dir -c check <backup-dir> to verify.
Hash Cache Directory
If a .hash_cache cannot be found in the specified directory,
compare-dir will walk up the file tree until it finds one.
If none is found, a new .hash_cache is created in the specified directory.
For example:
All three runs of compare-dir use
the same hash cache file at ~/data/.hash_cache.
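The lookup rule described above can be sketched in shell as follows. This is an illustration of the rule, not compare-dir's code. The function name find_hash_cache_dir is hypothetical, and it prints the directory whose .hash_cache would be used.

```shell
# Illustrative sketch: walk up from a directory until a .hash_cache is found.
# If none is found, fall back to the starting directory, where a new cache
# would be created. find_hash_cache_dir is a hypothetical helper.
find_hash_cache_dir() {
  local start dir
  start=$(cd "$1" && pwd) || return 1
  dir=$start
  while :; do
    if [ -f "$dir/.hash_cache" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Stop once the file system root has been checked.
    [ "$dir" = "/" ] && break
    dir=$(dirname "$dir")
  done
  printf '%s\n' "$start"
}
```

With a cache file at ~/data/.hash_cache, calling the function on ~/data or on any directory below it prints the same ~/data path.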