compare-dir 0.8.0

A high-performance directory comparison tool and library

compare-dir

A command line tool to compare two directories and show the differences. It can also find changed, corrupted, and duplicated files.

Installation

Prerequisites: Install Rust if it's not installed yet.

Then the following command installs compare-dir from crates.io:

cargo install compare-dir

See Releases for the change history.

Usage

compare-dir supports the following features:

Compare Directories

When you back up important data, keep in mind that the backup may not complete correctly, or that the backup copy may become corrupted later. Both happen more often than you might imagine.

The following example compares two directories. Files are compared first by modified time and size. If the sizes are the same, the file contents are also compared, to verify that backup copies are not corrupted. See Compare Files for more details.

compare-dir <dir1> <dir2>

Compare Files

When comparing files, a byte-by-byte comparison is faster if you compare them only once, but comparing hashes is faster if you compare them multiple times, because the hashes are saved in the hash cache.

The --compare (or -c) option can change how files are compared.

| --compare | Meaning |
| --- | --- |
| size | Compare only by file sizes. |
| hash | Compare file contents by their hashes. |
| rehash | Same as hash, but recompute hashes without using the data in the hash cache. |
| full | Compare file contents byte-by-byte. |

Symbols

When comparing two directories, the output is human-readable by default. The --symbol (or -s) option changes the output format to be symbolized, which is easier for programs to read.

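| Position | Character | Meaning |
| --- | --- | --- |
| 1st | = | In both directories. |
| 1st | > | Only in dir1. |
| 1st | < | Only in dir2. |
| 2nd | = | Modified times are the same. |
| 2nd | > | dir1 is newer. |
| 2nd | < | dir2 is newer. |
| 2nd | | Modified times are not comparable. |
| 3rd | = | Same file sizes and contents. |
| 3rd | ! | Same file sizes but contents differ. |
| 3rd | > | dir1 is larger. |
| 3rd | < | dir2 is larger. |
| 3rd | | Sizes are not comparable. |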

For example:

=>= dir/path

means that dir/path in dir1 is newer than the file in dir2, but both have the same size and contents.

The following bash example lists the paths of files that have the same size but different contents. Such files often indicate copy failures or corruption.

compare-dir -s <dir1> <dir2> | grep '^..!' | cut -c 5-

If you prefer sed over cut:

compare-dir -s <dir1> <dir2> | grep '^..!' | sed 's/^....//'

To do this in PowerShell:

compare-dir -s <dir1> <dir2> | sls '^..!' | %{$_ -replace '^....',''}
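Beyond filtering a single case, the three symbol positions can be interpreted in a small script. The following is a minimal sketch, not part of compare-dir: `sample_output` and `classify` are hypothetical names, and the sample lines are made up so that the snippet runs without the tool. It assumes, as the examples above do, that the path starts at the fifth character. In real use, replace `sample_output` with `compare-dir -s <dir1> <dir2>`.

```bash
# Hypothetical sample of symbolized output (made up for illustration).
sample_output() {
  cat <<'EOF'
=== docs/readme.txt
=>= photos/a.jpg
==! photos/b.jpg
=<> data/log.txt
EOF
}

# Read symbolized lines from stdin and describe each one.
classify() {
  while IFS= read -r line; do
    path=${line#????}   # drop the three symbols and the separator
    case $line in
      '==='*)   echo "identical: $path" ;;
      '='?'='*) echo "same contents, different modified time: $path" ;;
      '='?'!'*) echo "same size, contents differ: $path" ;;
      *)        echo "other difference: $path" ;;
    esac
  done
}

sample_output | classify
```

The `case` patterns are checked in order, so the most specific pattern (`===`) comes first and the catch-all comes last.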

Find Changed or Corrupted Files

compare-dir can find changed files by comparing their hashes with the hashes previously saved in the hash cache. This is useful when corruption is possible, such as after an unexpected power loss or a RAID rebuild.

First, the hash cache needs to be created. There are two ways to do this.

  • Comparing directories creates it automatically, when it runs with the -c hash option (default).
  • Use the -c update option.
    compare-dir -c update <dir>
    

Then the -c check option can find changed files.

compare-dir -c check <dir>

It prints a symbol, followed by the path.

Symbol Meaning
+ The file isn't in the hash cache.
! The file is changed.

The -c check option doesn't update the hash cache, so you can run it multiple times. If you want to update the hash cache, use the -c update option instead. It prints the same output as -c check, but also updates the hash cache.
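Because each output line starts with a symbol, the result is easy to summarize in a script, for example in a scheduled verification job. The following is a minimal sketch under stated assumptions: `sample_check_output` and `report_changes` are hypothetical names, and the sample lines (including the single-space separator) are made up so that the snippet runs without the tool. Only the first character of each line is inspected. In real use, replace `sample_check_output` with `compare-dir -c check <dir>`.

```bash
# Hypothetical sample of `-c check` output (made up for illustration).
sample_check_output() {
  cat <<'EOF'
+ photos/new.jpg
! photos/b.jpg
! docs/report.pdf
EOF
}

# Read check output from stdin, print a summary, and return non-zero
# if any file is reported as changed ("!").
report_changes() {
  input=$(cat)
  changed=$(printf '%s\n' "$input" | grep -c '^!' || true)
  untracked=$(printf '%s\n' "$input" | grep -c '^+' || true)
  echo "changed: $changed, not in cache: $untracked"
  [ "$changed" -eq 0 ]
}

sample_check_output | report_changes || echo "verification failed"
```

Files marked `+` are counted but do not fail the run, since they are new to the cache rather than changed.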

Find Duplicates

compare-dir finds exact duplicate files with the -c dup option.

compare-dir -c dup <dir>

Hash

compare-dir uses the BLAKE3 hash algorithm.

Hash Cache

File hashes are saved to a file named .hash_cache.

How the hash cache is used depends on the --compare option.

| --compare | Hash cache usage |
| --- | --- |
| full, size | Not used. |
| hash, dup | Used if modified time doesn't change, updated otherwise. |
| rehash | Updated. |
| check | Used. |
| update | Used and updated. |

Invalidation

When comparing files with the -c hash option (default), hashes in the hash cache are used if the modified time hasn't changed.

If file contents are changed without changing their modified time, the cache needs to be invalidated. You can invalidate the hash cache by the -c rehash option, or by deleting the cache file.
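To see why a cache keyed on the modified time can miss such a change, the following sketch reproduces the situation with standard tools only. It does not call compare-dir. All file names are throwaway, and the two contents are chosen to have the same length, so the size also stays the same.

```bash
# Work in a throwaway directory.
tmp=$(mktemp -d)
printf 'original\n' > "$tmp/file"
cp -p "$tmp/file" "$tmp/ref"     # -p preserves the modified time

# Change the contents, then restore the original modified time.
printf 'modified\n' > "$tmp/file"
touch -r "$tmp/ref" "$tmp/file"

# Neither file is newer than the other, yet the contents differ.
if [ ! "$tmp/file" -nt "$tmp/ref" ] && [ ! "$tmp/ref" -nt "$tmp/file" ]; then
  same_mtime=yes
else
  same_mtime=no
fi
if cmp -s "$tmp/file" "$tmp/ref"; then
  same_contents=yes
else
  same_contents=no
fi
echo "same modified time: $same_mtime, same contents: $same_contents"
rm -r "$tmp"
```

A change like this is invisible to any check that relies on the modified time and size alone, which is the case the -c rehash option is meant for.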

The following example shows a scenario in which differing contents are found, a new backup copy is made, and the hashes for that file are then recomputed.

% compare-dir /master /backup
dir1/dir2/file: Contents differ
% cp /master/dir1/dir2/file /backup/dir1/dir2
% compare-dir -c rehash /master/dir1/dir2/file /backup/dir1/dir2

[!NOTE] When the first argument is a file, not a directory, only the specified file is compared. In this mode, the -c rehash option invalidates the hash cache only for that file, retaining the cached hashes for other files in the directory and its subdirectories.

Backup Strategy

When backing up, there are two strategies you can take.

Exclude .hash_cache

  1. Exclude .hash_cache when backing up.
  2. Use compare-dir <dir1> <dir2> to verify.
  3. Once the comparison is complete, the hash caches are updated for both directories. You can use compare-dir -c check <backup-dir> to verify that the backup data hasn't changed or been corrupted since the last comparison.

This method is suitable for incremental backups, because step 2 computes hashes only for updated files.

Include .hash_cache

  1. Update the cache in the source by running compare-dir -c update <source-dir>.
  2. Include .hash_cache when backing up.
  3. Use compare-dir -c check <backup-dir> to verify.

Hash Cache Directory

If a .hash_cache cannot be found in the specified directory, compare-dir will walk up the file tree until it finds one. If none is found, a new .hash_cache is created in the specified directory.

For example:

touch /data/.hash_cache
compare-dir /data/dir
compare-dir /data/dir2
compare-dir /data

All three runs of compare-dir use the same hash cache file at /data/.hash_cache.
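The lookup described above can be sketched in a few lines of shell. This is an illustration of the documented behavior, not the tool's actual implementation: `find_hash_cache` is a hypothetical name, and details such as symbolic link handling are assumptions. The usage part builds a throwaway tree that mirrors the example, under a temporary directory instead of /data.

```bash
# Print the path of the nearest .hash_cache at or above the given
# directory. Returns non-zero if none is found.
find_hash_cache() {
  dir=$(cd "$1" && pwd -P) || return 1
  while :; do
    if [ -f "$dir/.hash_cache" ]; then
      echo "$dir/.hash_cache"
      return 0
    fi
    [ "$dir" = / ] && return 1   # reached the root without finding one
    dir=$(dirname "$dir")
  done
}

# Usage with a throwaway tree.
root=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$root/data/dir" "$root/data/dir2"
touch "$root/data/.hash_cache"
find_hash_cache "$root/data/dir"
find_hash_cache "$root/data/dir2"
find_hash_cache "$root/data"
rm -r "$root"
```

All three calls print the same path, ending in data/.hash_cache.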