hashdeep-compare

hashdeep-compare is a comparison tool for log files generated by the Hashdeep file storage auditing tool.

Why use hashdeep-compare? Isn't Hashdeep enough?

Hashdeep can generate a hash digest report for every file in a storage volume or directory, making it suitable for forensic recording, bit-rot detection, or confirming directory and file changes before committing them to backups. Multiple logs of Hashdeep's output can be saved to form a historical record of a storage volume's state. Hashdeep supports the comparison of a log file to a live storage volume, but does not support comparison between log files. hashdeep-compare was created to provide this log comparison capability.

By saving log files and using hashdeep-compare, the contents of storage volumes can be compared regardless of their current availability. This allows retrospective analysis of historical hashdeep logs to compare file and directory states at different times, which supports new use cases such as determining when a file was moved, or confirming that a modified or corrupted file was still intact at a certain date.

If you're concerned about file archive bit-rot or just want to compare archived records of the content of an important directory, using Hashdeep and hashdeep-compare may be a convenient solution.

How to use hashdeep-compare

hashdeep-compare is a command-line tool with three functions:

hash: invokes hashdeep and generates a log file compatible with hashdeep-compare.

hashdeep-compare hash path/to/target_dir path/to/output_log.txt

This function is optional, but recommended to ensure log compatibility. The above function call is equivalent to directly calling hashdeep -l -r -o f path/to/target_dir > path/to/output_log.txt 2> path/to/output_log.txt.errors. Note that if the output file or the error file already exists, the command will be aborted (hashdeep-compare will not overwrite existing files).

sort: sorts the entries in a hashdeep log by file path.

hashdeep-compare sort path/to/unsorted_input.txt path/to/sorted_output.txt

hashdeep does not guarantee ordering of log entries, and ordering tends to be inconsistent between runs in practice. Sorting allows comparison of hashdeep logs in a text-diff tool, which may be the easiest way to compare logs with uncomplicated differences. Note that if the output file already exists, the command will be aborted (hashdeep-compare will not overwrite existing files).

part: partitions all entries of two log files into sets that efficiently describe the similarities and differences between the logs. This is the real power of hashdeep-compare.

hashdeep-compare part path/to/first_log.txt path/to/second_log.txt path/to/output_file_base

The output file base path will be used to name the output files by adding suffixes that describe the log entries represented within; it may include subdirectories. Nonexistent subdirectories will not be created; if one is specified, the command will be aborted. Note that if any of the resulting output files already exist, the command will be aborted (hashdeep-compare will not overwrite existing files).
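The sort function and the no-overwrite rule described above can be sketched in Python. This is an illustration, not hashdeep-compare's actual implementation: the names entry_path, sort_entries, and write_new_file are hypothetical, and the sketch assumes every input line is an entry line of the form size,md5,sha256,path (log header lines are not handled).

```python
def entry_path(line: str) -> str:
    """Return the file path field: everything after the third comma."""
    # maxsplit=3 keeps commas that are part of the file path intact.
    return line.split(",", 3)[3]


def sort_entries(lines: list[str]) -> list[str]:
    """Sort entry lines by file path (hypothetical helper)."""
    return sorted(lines, key=entry_path)


def write_new_file(path: str, lines: list[str]) -> None:
    """Write lines to path, failing if the file already exists."""
    # Mode "x" raises FileExistsError instead of overwriting.
    with open(path, "x", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)
```

Opening the output in exclusive-creation mode is one simple way to get the "abort rather than overwrite" behavior; the real tool may check for existing files differently.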

The partitioning algorithm

When invoked with the recommended settings, Hashdeep creates a one-line log entry for each file that looks something like this:

3364240,aff470b119f69a7ad5e6999e5e6a3346,bf4fdd9d86cf23e66b456827b5dfe6e2ae52ebc9f32c7de6623aca7b665b3337,./path/example_filename.ext

This is a comma-separated string of the file's attributes: its size in bytes, the MD5 hash, the SHA256 hash, and the file path. The first three items identify the file's contents (with two separate hash algorithms to protect against hash collisions). If all three are the same for the entries of two different files, hashdeep-compare determines that the files have the same content. If at least one is different, they have different content.
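As a minimal sketch of the comparison described above, the following Python parses an entry line and compares two entries by content. The names Entry, parse_entry, and same_content are hypothetical and are not taken from hashdeep-compare's source.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    size: int
    md5: str
    sha256: str
    path: str


def parse_entry(line: str) -> Entry:
    """Parse one 'size,md5,sha256,path' log line."""
    # Split on the first three commas only, so commas in the path survive.
    size, md5, sha256, path = line.rstrip("\n").split(",", 3)
    return Entry(int(size), md5, sha256, path)


def same_content(a: Entry, b: Entry) -> bool:
    """True if size, MD5, and SHA256 all agree; the path is ignored."""
    return (a.size, a.md5, a.sha256) == (b.size, b.md5, b.sha256)
```

Two entries that differ only in their path compare as the same content, which is what later allows a moved or renamed file to be recognized.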

Definitions:

entry: a single line in a Hashdeep log which records a single file from its target volume
hashes: an entry's file size, MD5, and SHA256 (the first 3 parts of the entry line)
name: an entry's file path (the last part of the entry line)
match: a selection of entries matched by the algorithm
match pair: a match of exactly one entry from each of the two input files
match group: any match that is not a match pair; its entries may come from either or both input files

The hashdeep-compare partitioning algorithm compares all of the file entries from the two input logs and organizes them based on matching hashes and/or names.

When the partitioning algorithm starts, all of the entries in both input logs are loaded into a working set. Match rules are applied in a fixed order, and as matches are identified, the matched entries are removed from the working set. When the algorithm finishes, every entry will have been partitioned into exactly one match, or into one of two special sets of unmatched leftover entries.

When input logs 1 and 2 are earlier and later (respectively) records of the same file volume, these match types can imply the type of file change that was made between the times the two logs were created.

Match rules, in order, with implied file changes:

Full match pairs: unchanged files
Full match groups: should never happen (duplicate names imply invalid Hashdeep logs)
Name match pairs: modified files
Name match groups: should never happen (duplicate names imply invalid Hashdeep logs)
Hashes match pairs: moved/renamed files
Hashes match groups (entries from both logs): ambiguous rename/move/copy/delete
Hashes match groups (entries only from log 1): duplicate files deleted
Hashes match groups (entries only from log 2): duplicate files created

After the match rules have been run, no more matching names or hashes will exist among the remaining entries.

unmatchable (entry from log 1): deleted files
unmatchable (entry from log 2): created files

Because each log entry is represented in exactly one match or unmatchable set, the algorithm results represent the total content of the two input logs.
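The rule order above can be sketched in Python as follows. This is a simplified reading of the description, not hashdeep-compare's actual code: the names match_by, is_pair, and partition are hypothetical, each log is assumed to be a list of (hashes, name) tuples, and groups are not split further by which log their entries came from.

```python
from collections import defaultdict

# Working-set entries are (log_number, hashes, name),
# where hashes is the (size, md5, sha256) tuple.


def match_by(working, key):
    """Bucket entries by key; return (matches, leftover entries)."""
    buckets = defaultdict(list)
    for entry in working:
        buckets[key(entry)].append(entry)
    matches = [b for b in buckets.values() if len(b) > 1]
    leftover = [b[0] for b in buckets.values() if len(b) == 1]
    return matches, leftover


def is_pair(match):
    """A match pair has exactly one entry from each log."""
    return len(match) == 2 and {e[0] for e in match} == {1, 2}


def partition(log1, log2):
    working = [(1, h, n) for h, n in log1] + [(2, h, n) for h, n in log2]
    rules = [
        ("full", lambda e: (e[1], e[2])),  # same hashes and same name
        ("name", lambda e: e[2]),          # same name only
        ("hashes", lambda e: e[1]),        # same hashes only
    ]
    result = {}
    for rule, key in rules:
        # Matched entries leave the working set before the next rule runs.
        matches, working = match_by(working, key)
        result[rule + "_pairs"] = [m for m in matches if is_pair(m)]
        result[rule + "_groups"] = [m for m in matches if not is_pair(m)]
    result["unmatched_1"] = [e for e in working if e[0] == 1]
    result["unmatched_2"] = [e for e in working if e[0] == 2]
    return result
```

Because every entry is either consumed by exactly one rule or left in the working set at the end, the result accounts for each input entry exactly once.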

The results are stored in separate files for each match rule, plus two files for unmatchable entries. These files are created by adding the following suffixes to the output file base parameter supplied to the part command:

_full_match_pairs
_full_match_groups_file1_only
_full_match_groups_file2_only
_full_match_groups_file1_and_file2
_name_match_pairs
_name_match_groups_file1_only
_name_match_groups_file2_only
_name_match_groups_file1_and_file2
_hashes_match_pairs
_hashes_match_groups_file1_only
_hashes_match_groups_file2_only
_hashes_match_groups_file1_and_file2
_no_match_entries_file1
_no_match_entries_file2

Because each category is written to its own output file, you can use any text editor to analyze the results, and quickly confirm that any category that should be empty actually is (that is, its output file is empty).
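The emptiness check can also be scripted. The Python sketch below lists any "should never happen" result files that are non-empty. The function name unexpected_results is hypothetical; the suffixes are the full match group and name match group suffixes listed above.

```python
from pathlib import Path

# Categories that should be empty for valid Hashdeep logs.
SHOULD_BE_EMPTY = [
    "_full_match_groups_file1_only",
    "_full_match_groups_file2_only",
    "_full_match_groups_file1_and_file2",
    "_name_match_groups_file1_only",
    "_name_match_groups_file2_only",
    "_name_match_groups_file1_and_file2",
]


def unexpected_results(base: str) -> list[Path]:
    """Return should-be-empty result files that exist and contain data."""
    paths = [Path(base + suffix) for suffix in SHOULD_BE_EMPTY]
    return [p for p in paths if p.exists() and p.stat().st_size > 0]
```

Here base is the same output file base that was passed to the part command, for example unexpected_results("path/to/output_file_base").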

Supplemental: handling of partially-invalid input logs

When reading a hashdeep log, hashdeep-compare performs two content checks:

In the log header: the line count, hashdeep version, and recorded log format are confirmed. If these are not identical to what the hashdeep-compare test suite uses, a warning is issued. This warns the user that the log may have been generated by a different version of hashdeep (or by something else), which might lead to unexpected results.
Each log entry line is checked for correct formatting: incorrectly-formatted lines are ignored by hashdeep-compare. If any are found, the number of these ignored lines is reported in a warning message.

(Note: These checks are here for extra safety. I've never seen hashdeep generate an invalid line: if you have one of these, you should probably figure out why before you rely on the output.)
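To illustrate the entry-line check, here is a minimal Python sketch that filters out badly formatted lines and counts them. The exact rules hashdeep-compare applies are not documented here, so the pattern below is an assumption based on the example entry shown earlier (decimal size, 32 lowercase hex digits for MD5, 64 for SHA256, then a non-empty path). The name split_valid is hypothetical, and header lines are assumed to have been removed already.

```python
import re
import sys

# Assumed entry format: size,md5,sha256,path
ENTRY_RE = re.compile(r"^\d+,[0-9a-f]{32},[0-9a-f]{64},.+$")


def split_valid(lines: list[str]) -> tuple[list[str], int]:
    """Return (valid entry lines, number of ignored lines)."""
    valid = [line for line in lines if ENTRY_RE.match(line)]
    ignored = len(lines) - len(valid)
    if ignored:
        # Report the count, mirroring the warning described above.
        print(f"warning: ignored {ignored} invalid line(s)", file=sys.stderr)
    return valid, ignored
```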