# Duplicate File Finder

`check_hashes` 1.0.0

A fast, parallelized tool that detects duplicate files in a directory by hashing file content.

---

## Features


- **Partial Hashing** for quick initial grouping (reads only the first 4 KB of each file; see the sketch after this list).
- **Full Hashing** for final confirmation (full file read or memory-mapped).
- **Parallelized** using Rayon for high performance.
- **Progress Bars** for visual feedback.
- **Supports large datasets** and very large files.
- **Colored terminal output** for better readability.
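
The partial-hash step is small enough to sketch here. The helper below is illustrative (the function name and error handling are assumptions, not the project's actual code); it hashes only the first 4 KB of a file with `blake3`, which is enough to rule out most non-duplicates without reading whole files:

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

/// Hash only the first 4 KB of a file, so most non-duplicates can be
/// ruled out without reading whole files. (Helper name is illustrative.)
fn partial_hash(path: &Path) -> std::io::Result<blake3::Hash> {
    let file = File::open(path)?;
    let mut prefix = Vec::with_capacity(4096);
    // `take` caps the read at 4096 bytes even for very large files.
    file.take(4096).read_to_end(&mut prefix)?;
    Ok(blake3::hash(&prefix))
}
```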

---

## Usage


### 1. Install Rust (if you don't have it)


```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

### 2. Clone and build the project


```bash
git clone https://github.com/yourusername/duplicate-file-finder.git
cd duplicate-file-finder
cargo build --release
```

### 3. Run the program


```bash
cargo run -- --path /path/to/your/directory
```

Or using the compiled release binary:

```bash
./target/release/duplicate-file-finder --path /path/to/your/directory
```

---

## Example


```bash
cargo run -- --path ./Downloads
```

Sample output:

```
Scanning files...
Found 5321 files. Computing partial hashes...
Grouping files by partial hash...
421 candidate files after partial hashing. Computing full hashes...
Grouping by full hash...

❌ Duplicates found:

Group 1 (2 files) - Hash: d2f1d7e91c8b...
  /path/to/file1.jpg
  /path/to/file1_copy.jpg

Group 2 (3 files) - Hash: a34e1b1fe98d...
  /path/to/doc1.pdf
  /path/to/backup/doc1.pdf
  /path/to/archive/old/doc1.pdf

Found 2 duplicate groups.

Summary: Scanned 5321 files in 1m 12s.
```

---

## Command-Line Arguments


| Argument | Description                  | Example                          |
|:---------|:------------------------------|:---------------------------------|
| `--path` or `-p` | Directory to scan recursively | `--path ./Documents`              |
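
For reference, a minimal `clap` (v4, with the `derive` feature) definition that produces this flag could look like the sketch below; the struct and field names are illustrative and may differ from the project's actual code:

```rust
use clap::Parser;
use std::path::PathBuf;

/// Minimal argument definition matching the table above.
#[derive(Parser)]
struct Args {
    /// Directory to scan recursively for duplicate files.
    #[arg(short = 'p', long = "path")]
    path: PathBuf,
}

fn main() {
    let args = Args::parse();
    println!("Scanning {}", args.path.display());
}
```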

---

## How It Works


- **Step 1**: Scan all files under the given directory recursively.
- **Step 2**: Compute a partial hash (first 4 KB) of each file.
- **Step 3**: Group files with identical partial hashes.
- **Step 4**: Compute full hashes for the candidate groups.
- **Step 5**: Report groups of true duplicates based on full file content.

This two-step approach keeps the scan **very fast** even for very large folders, because full hashes are only computed for files whose first 4 KB already match; the pipeline is sketched below.
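
A condensed sketch of that pipeline is shown below. It reuses the illustrative `partial_hash` helper from the Features section and reads whole files for the final hash (the real tool may memory-map large ones); it is an assumption about structure, not the project's actual implementation:

```rust
use rayon::prelude::*;
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use walkdir::WalkDir;

/// Sketch of the pipeline: scan, partial-hash, group, full-hash, report.
fn find_duplicates(root: &Path) -> Vec<Vec<PathBuf>> {
    // Step 1: collect regular files under the root directory.
    let files: Vec<PathBuf> = WalkDir::new(root)
        .into_iter()
        .filter_map(Result::ok)
        .filter(|e| e.file_type().is_file())
        .map(|e| e.into_path())
        .collect();

    // Steps 2-3: partial-hash in parallel, then group identical prefixes.
    let mut by_partial: HashMap<blake3::Hash, Vec<PathBuf>> = HashMap::new();
    let partials: Vec<_> = files
        .par_iter()
        .filter_map(|p| partial_hash(p).ok().map(|h| (h, p.clone())))
        .collect();
    for (hash, path) in partials {
        by_partial.entry(hash).or_default().push(path);
    }

    // Steps 4-5: full-hash only the candidates that shared a partial hash.
    let candidates: Vec<PathBuf> = by_partial
        .into_values()
        .filter(|group| group.len() > 1)
        .flatten()
        .collect();
    let mut by_full: HashMap<blake3::Hash, Vec<PathBuf>> = HashMap::new();
    let fulls: Vec<_> = candidates
        .par_iter()
        .filter_map(|p| std::fs::read(p).ok().map(|bytes| (blake3::hash(&bytes), p.clone())))
        .collect();
    for (hash, path) in fulls {
        by_full.entry(hash).or_default().push(path);
    }

    // A group with more than one path is a set of true duplicates.
    by_full.into_values().filter(|g| g.len() > 1).collect()
}
```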

---

## Dependencies


This project uses:

- [`blake3`](https://docs.rs/blake3/latest/blake3/) for fast cryptographic hashing.
- [`clap`](https://docs.rs/clap/latest/clap/) for argument parsing.
- [`rayon`](https://docs.rs/rayon/latest/rayon/) for parallel processing.
- [`indicatif`](https://docs.rs/indicatif/latest/indicatif/) for progress bars.
- [`colored`](https://docs.rs/colored/latest/colored/) for colored terminal output.
- [`walkdir`](https://docs.rs/walkdir/latest/walkdir/) for recursive file walking.
- [`memmap2`](https://docs.rs/memmap2/latest/memmap2/) for memory-mapping large files (see the sketch below).

All dependencies are installed automatically when you run `cargo build`.
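
As an illustration of the memory-mapping mentioned above, a full-file hash helper built on `memmap2` might look like the sketch below; the helper name and the 16 MB cutoff are assumptions, not the project's actual values:

```rust
use memmap2::Mmap;
use std::fs::File;
use std::path::Path;

/// Hash an entire file, memory-mapping it when it is large.
/// (Helper name and size cutoff are illustrative.)
fn full_hash(path: &Path) -> std::io::Result<blake3::Hash> {
    let file = File::open(path)?;
    let len = file.metadata()?.len();
    if len >= 16 * 1024 * 1024 {
        // SAFETY: the mapping is read-only and dropped before the file handle;
        // concurrent modification of the file would still be undefined behaviour.
        let mmap = unsafe { Mmap::map(&file)? };
        Ok(blake3::hash(&mmap))
    } else {
        // Small files are cheaper to read into memory directly.
        Ok(blake3::hash(&std::fs::read(path)?))
    }
}
```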

---

## License


This project is licensed under the MIT License. See [`LICENSE`](LICENSE) for more information.