Crate backup_deduplicator
§Backup Deduplicator
A tool to deduplicate backups. It builds a hash tree of all files and folders in a target directory, optionally also traversing into archives like zip or tar files (feature in development). The hash tree is then used to find duplicate files and folders. The output is a minimal duplicated set, so the tool discovers entire duplicated folder structures, not just single files.
Backup Deduplicator solves the problem of having multiple backups of the same data, where some parts of the data are duplicated. Duplicates can be reviewed and removed to save disk space (feature in development).
§Features
- Multi threading: The tool is able to use multiple threads to speed up the hash calculation process.
- Pause and resume: The tool can be paused (killed) and resumed at any time. The current state is saved to disk and can be loaded later. This is useful for long analysis processes (large directories).
- Cache and resume: The tool can be run at a later point reusing the cache from a previous run. This is useful for re-analyzing a directory after some changes have been made.
- Follow or not follow symlinks: The tool can be configured to follow symlinks or not.
- Hash collision robustness: The tool uses hashes to detect duplicates, so there is a probability of hash collisions. For the final duplicate detection, not only the hash but also the file size and file type are compared to reduce the probability of false positives. When choosing a weak hash function (which yields many collision candidates), the tool may run slower.
§Planned features
- Archive support: The tool will be able to traverse into archives like zip or tar files to find duplicated structures there.
- CUI: A command line user interface will be added to allow easy duplicate processing (removal/exclusion/…).
- Multi machine analysis: The tool will be able to analyze a (shared) directory on multiple machines in parallel to speed up the analysis process.
- Merge: The tool will be able to merge analysis files such that analysis results from different machines can be combined.
- Hardlinks: The tool will be able to detect hardlinks and treat them as non-duplicates (if set via flags).
- Evaluation modes: Different analysis modes, allowing, for example, a directory of truth (archival directory) to be set to compare against. Every file/folder already present in the truth directory that is found elsewhere will be marked as a duplicate to remove.
§Usage
The tool is a command line tool. There are two stages: `build` and `analyze`.
- Build: The tool builds a hash tree of the target directory. This is done by running `backup-deduplicator build [OPTIONS] <target>`. The hash tree is saved to disk and is used by the next stage.
- Analyze: The tool analyzes the hash tree to find duplicates. This is done by running `backup-deduplicator analyze [OPTIONS]`. The tool will output a list of duplicated structures to an analysis result file.
§Build
Exemplary usage to build a hash tree of a directory:

```text
backup-deduplicator
    --threads 16
    build
    --working-directory /parent
    --output /parent/hash.bdd
    /parent/target
```

This will build a hash tree of the directory /parent/target and save it to hash.bdd in the parent directory. The tool will use 16 threads to split the hash calculation work.
§Analyze
Exemplary usage to analyze a hash tree:

```text
backup-deduplicator
    analyze
    --output /parent/analysis.bdd
    /parent/hash.bdd
```

This will analyze the hash tree in hash.bdd and save the analysis result to analysis.bdd.
The analysis file will then contain a list of JSON objects (one per line),
each representing a found duplicated structure.
Further processing with this tool is in development.
§Installation
The tool is written in Rust and can be installed using cargo:

```text
cargo install backup-deduplicator
```

Precompiled binaries are available for download on the release page: https://github.com/0xCCF4/BackupDeduplicator/releases.
§Feature Flags
The tool uses Rust feature flags to enable or disable certain features. The following flags are available:
- `hash-sha1`: Use the sha1 module to enable the SHA1 hash function
- `hash-sha2`: Use the sha2 module to enable the SHA512 and SHA256 hash functions
- `hash-xxh`: Use the xxhash-rust module to enable the XXH3 (32/64) hash functions
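For example, if only the xxHash backend is wanted, cargo's feature selection flags could be used at install time. Whether the other hash features are enabled by default is not stated here, so the use of --no-default-features below is an assumption:

```text
cargo install backup-deduplicator --no-default-features --features hash-xxh
```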
§Contribution
Contributions to the project are welcome! If you have a feature request, bug report, or want to contribute to the code, please open an issue or a pull request.
§License
This project is licensed under the GPLv3 license. See the LICENSE file for details.
§Inner-workings
The tool is run in four stages:
Input Execution Output
┌───────────┐ ┌┐
│ HashTree ◄─────────┼┼────────┐
│ │ ││ │
│(optional) ├──┐ ┌────▼▼────┐ ┌┴────────────────┐
└───────────┘ └─► │ │ │
│ Build ├──► HashTree │
┌───────────┐ ┌─► │ │ │
│ Folder ├──┘ └────┬┬────┘ └┬────────────────┘
│ -file │ ││ │
│ -file │ ┌───────┼┼────────┘
└───┬────┬──┘ │ ││
│ │ │ ┌────▼▼────┐ ┌─────────────────┐
│ │ │ │ │ │ │
│ │ └──► Analyze ├──► Duplicate Sets │
│ │ │ │ │ │
│ │ └────┬┬────┘ └┬────────────────┘
│ │ ││ │ Basic functionality complete
----│----│----┌───────┼┼────────┘----------------------------------
│ │ │ ││ Implementation in progress
│ │ │ ┌────▼▼────┐ ┌─────────────────┐
│ │ └──► │ │ │
│ │ │ Dedup ├──► Change commands │
│ └───────► │ │ │
│ └────┬┬────┘ └┬────────────────┘
│ ││ │
│ ┌───────┼┼────────┘
│ │ ││
│ │ ┌────▼▼────┐
│ └──► │
│ │ Execute ├──►Deduplicated files
└────────────► │
└──────────┘
- Build: The tool reads a folder and builds a hash tree of all files in it.
- Analyze: The tool analyzes the hash tree and finds duplicate files.
- Dedup: The tool determines which steps to take to deduplicate the files. This can be done in a semi-automatic or manual way.
- Execute: The tool executes the deduplication steps (Deleting/Hardlinking/…).
Dedup and Execute are in development and currently not (fully) implemented.
§Build
- Input: Folder with files, HashTree (optional) to update or continue from.
- Output: HashTree
- Execution: Fully automatic, no user interaction required, multithreaded.
§HashTree file format
The HashTree is stored in a file with the following format:
```text
HEADER [newline]
ENTRY [newline]
ENTRY [newline]
...
```
See HashTreeFileEntry for the exact format of an entry. In short, each entry is a JSON object containing all the information about an analyzed file or directory that is needed for later stages:
- File path
- File type
- Last modified time
- File size
- Hash of the file
- Children hashes (if it is a directory)
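As a rough illustration of such an entry, the fields above could be modeled like the serde-style struct below. The struct and field names are assumptions for illustration only; see HashTreeFileEntry for the actual format.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical sketch of one hash tree entry; names and types are assumptions.
#[derive(Serialize, Deserialize)]
struct ExampleHashTreeEntry {
    /// Path of the analyzed file or directory.
    path: String,
    /// File type, e.g. "file", "directory" or "symlink".
    file_type: String,
    /// Last modified time, as seconds since the Unix epoch.
    modified: u64,
    /// File size in bytes.
    size: u64,
    /// Hash of the file contents (or of the children for a directory).
    hash: String,
    /// Hashes of the children; only present for directories.
    children: Option<Vec<String>>,
}
```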
While analyzing, entries are only appended to the file. After the analysis is done, the file is fed into the clean command, which removes all entries that are outdated or no longer exist, rewriting the entire file (but only shrinking it). The clean command can also be run manually.
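Conceptually, cleaning keeps an entry only while its file still exists and its recorded modification time still matches. The following is a minimal sketch of such a check, not the crate's actual implementation:

```rust
use std::fs;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Illustrative check mirroring the described clean behaviour: an entry is
/// kept only if its path still exists and the recorded modification time
/// (seconds since the Unix epoch) is unchanged.
fn entry_is_current(path: &Path, recorded_mtime_secs: u64) -> bool {
    match fs::metadata(path) {
        Ok(meta) => meta
            .modified()
            .ok()
            .and_then(|t| t.duration_since(UNIX_EPOCH).ok())
            .map(|d| d.as_secs() == recorded_mtime_secs)
            .unwrap_or(false),
        // The path no longer exists, so the entry is outdated.
        Err(_) => false,
    }
}
```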
§Analyze
- Input: HashTree
- Output: Duplicate sets
- Execution: Fully automatic, no user interaction required, multithreaded file parsing, single-threaded duplication detection.
§Analysis results
The analysis results are stored in a file with the following format:
```text
[ENTRY] [newline]
[ENTRY] [newline]
...
```
See ResultEntry for the exact format of an entry. In short, each entry is a JSON object containing:
- File type
- Hash
- Size (0 if it is a directory, else the file size of one of the files)
- Conflicting Set (a set of all files that are duplicates of each other)
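Since each line is a standalone JSON object, the result file can be consumed line by line. The sketch below parses it generically with serde_json; the path and the printed summary are illustrative only, and the exact field names are defined by ResultEntry:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // Read the analysis file produced by `backup-deduplicator analyze`;
    // each non-empty line is one JSON object describing a duplicate set.
    let reader = BufReader::new(File::open("/parent/analysis.bdd")?);
    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        if line.trim().is_empty() {
            continue;
        }
        // Parse generically; see ResultEntry for the concrete field names.
        let entry: serde_json::Value =
            serde_json::from_str(&line).expect("each line should be valid JSON");
        println!("duplicate set {}: {}", i + 1, entry);
    }
    Ok(())
}
```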
§Dedup
- Input: Duplicate sets
- Output: Set of commands to execute to deduplicate the files
- Execution: Manual or semi-automatic, user interaction required.
Implementation in progress. At present, the duplicate sets must be processed manually.
§Execute
- Input: Set of commands
- Output: Deduplicated files
- Execution: Fully automatic, user interaction only on errors.
Implementation in progress.