
# Technical Feasibility Document

## RustDupe Advanced Features Assessment

> **Document Version:** 1.0  
> **Date:** 2026-02-05  
> **Project:** RustDupe - Smart Duplicate File Finder  
> **Architecture:** Rust 1.85+, synchronous with rayon parallelism  

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Methodology](#2-methodology)
3. [Feasibility Evaluation Matrix](#3-feasibility-evaluation-matrix)
4. [Detailed Feasibility Analysis](#4-detailed-feasibility-analysis)
5. [Architecture Impact Assessment](#5-architecture-impact-assessment)
6. [Dependency Analysis](#6-dependency-analysis)
7. [Risk Assessment Matrix](#7-risk-assessment-matrix)
8. [Recommendations](#8-recommendations)

---

## 1. Executive Summary

This document assesses the technical feasibility of 12 proposed advanced features for RustDupe. Each feature is evaluated against the current synchronous, rayon-based architecture, which uses BLAKE3 hashing, SQLite caching, and a ratatui TUI.

### Key Findings

| Category | Count | Features |
|----------|-------|----------|
| **Tier 1 (Recommended)** | 4 | Memory-Mapped I/O, Bloom Filters, Multi-Directory Scanning, Database Backend Options |
| **Tier 2 (Conditional)** | 4 | Perceptual Image Hashing, Progressive Results, Real-Time Monitoring, Fuzzy Text Detection |
| **Tier 3 (Future Consideration)** | 2 | Audio Fingerprinting, GPU-Accelerated Hashing |
| **Not Recommended** | 2 | Cloud Storage Integration, GUI Application |

### Critical Dependencies

- **Immediate Wins:** `memmap2`, `bloom`, `sled`/`redb` - Low effort, high value
- **Medium Complexity:** `image_hasher`, `notify`, `crossbeam-channel` - Moderate effort, solid ecosystem
- **High Complexity:** `chromaprint` (FFI), `rustacuda` - Significant integration challenges

---

## 2. Methodology

### Evaluation Criteria

Each feature was evaluated using the following framework:

#### Complexity Assessment
- **Low:** Pure Rust crate, straightforward API, minimal architectural changes
- **Medium:** Requires new module, some API design, moderate testing
- **High:** Complex integration, FFI bindings, or significant architectural refactoring
- **Very High:** New paradigm (async/GPU), major subsystem, external dependencies

#### Risk Factors
1. **Technical Risk:** Implementation uncertainty, performance concerns
2. **Integration Risk:** Compatibility with existing sync+rayon architecture
3. **Maintenance Risk:** Crate maintenance status, API stability
4. **License Risk:** Compatibility with MIT license

#### Value Assessment
- **High:** Significant user benefit, competitive advantage
- **Medium:** Nice-to-have feature
- **Low:** Limited use cases, or effort that outweighs the benefit

### Data Sources

- **Research Reports:**
  - `docs/research/advanced-duplicate-finder-features-2025-02-05.md`
  - `docs/research/cross-platform-file-management-2026-02-05.md`
- **Crates.io:** Download statistics, maintenance activity, MSRV compatibility
- **Architecture Review:** Current codebase analysis
- **Benchmark Data:** BLAKE3, perceptual hashing, database performance comparisons

---

## 3. Feasibility Evaluation Matrix

| Feature | Feasibility | Effort (Weeks) | Risk | Dependencies | Recommendation |
|---------|-------------|----------------|------|--------------|----------------|
| 1. Perceptual Image Hashing | Medium | 2-3 | Medium | `image_hasher`, `image` | **Tier 2** |
| 2. Audio Fingerprinting | Low | 4-6 | High | `chromaprint` (FFI) | **Tier 3** |
| 3. Fuzzy/Near-Duplicate Text | Medium | 2-3 | Medium | `simhash`, `serde_json` | **Tier 2** |
| 4. GPU-Accelerated Hashing | Low | 6-8 | High | `rustacuda`, `cust` | **Tier 3** |
| 5. Memory-Mapped File I/O | **High** | **1** | Low | `memmap2` | **Tier 1** |
| 6. Bloom Filters | **High** | **1** | Low | `bloom` | **Tier 1** |
| 7. Real-Time Monitoring | Medium | 2-3 | Medium | `notify` | **Tier 2** |
| 8. Cloud Storage Integration | Low | 6-10 | High | Multiple OAuth clients | **Not Recommended** |
| 9. GUI Application | Low | 8-12 | Very High | `egui`, `iced`, or `tauri` | **Not Recommended** |
| 10. Multi-Directory Scanning | **High** | **1-2** | Low | None (refactoring) | **Tier 1** |
| 11. Database Backend Options | **High** | **2** | Low | `sled`, `redb` | **Tier 1** |
| 12. Progressive/Streaming Results | Medium | 2-3 | Medium | `crossbeam-channel` | **Tier 2** |

---

## 4. Detailed Feasibility Analysis

### Feature 1: Perceptual Image Hashing

**Detect visually similar images despite compression, resizing, or minor edits.**

#### Technical Approach

Implement a new `scanner::perceptual` module that computes perceptual hashes for image files alongside BLAKE3 content hashes. The perceptual hash acts as a secondary grouping mechanism.

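```rust
// Proposed API
use std::path::Path;

use image::ImageError;
use image_hasher::{HasherConfig, ImageHash};

pub struct PerceptualHasher {
    config: HasherConfig,
}

impl PerceptualHasher {
    /// Decodes the image at `path` and computes its perceptual hash.
    pub fn compute_hash(&self, path: &Path) -> Result<ImageHash, ImageError> {
        let image = image::open(path)?;
        let hasher = self.config.to_hasher();
        Ok(hasher.hash_image(&image))
    }

    /// Returns true when the Hamming distance between the two hashes
    /// is at most `threshold` bits.
    pub fn is_similar(&self, hash1: &ImageHash, hash2: &ImageHash, threshold: u32) -> bool {
        hash1.dist(hash2) <= threshold
    }
}
```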

#### Required Crates

| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `image_hasher` | 2.0 | Perceptual hashing algorithms | MIT | 500K+ |
| `image` | 0.25 | Image decoding (JPEG, PNG, etc.) | MIT/Apache-2.0 | 100M+ |
| `bk-tree` | 0.5 | Efficient similarity search | MIT | 1M+ |

#### Integration Points

1. **Scanner Module:** Add `perceptual_hash` field to `FileEntry`
2. **Duplicate Finder:** Extend grouping logic to support perceptual similarity
3. **CLI:** Add `--similar-images` flag with threshold parameter
4. **Cache:** Add perceptual hash column to SQLite schema
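
The similarity grouping in integration point 2 depends on two small operations: computing a difference hash and comparing hashes by Hamming distance. The following is a minimal, dependency-free sketch of both, assuming the image has already been decoded and downscaled to a 9×8 grayscale grid. It is not the `image_hasher` implementation; the names `dhash_9x8`, `hamming`, and `is_similar` are hypothetical and exist only for illustration.

```rust
/// Computes a 64-bit difference hash (dHash) from a grayscale grid that is
/// 9 pixels wide and 8 pixels tall. Each bit records whether a pixel is
/// brighter than its right-hand neighbour.
pub fn dhash_9x8(pixels: &[[u8; 9]; 8]) -> u64 {
    let mut hash = 0u64;
    for row in pixels.iter() {
        for x in 0..8 {
            hash <<= 1;
            if row[x] > row[x + 1] {
                hash |= 1;
            }
        }
    }
    hash
}

/// Hamming distance: the number of bit positions where the hashes differ.
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Treats two images as similar when their hashes differ in at most
/// `threshold` bits.
pub fn is_similar(a: u64, b: u64, threshold: u32) -> bool {
    hamming(a, b) <= threshold
}
```

Because each bit encodes only the relative brightness of neighbouring pixels, a uniform brightness shift leaves the hash unchanged, which is the property that makes dHash tolerant of minor edits.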

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| Image decoding failures | Medium | Graceful error handling, skip corrupted images |
| Memory usage for large images | Medium | Limit image size, use streaming when possible |
| False positive rate | Medium | Configurable threshold (default: dHash with threshold=2) |
| Performance on large image sets | Low | Parallelize with rayon (existing infrastructure) |

#### Effort Estimate Breakdown

- **Core implementation:** 3-4 days
- **CLI integration:** 1-2 days
- **Cache integration:** 2-3 days
- **Testing (unit + integration):** 2-3 days
- **Documentation:** 1 day
- **Total:** 2-3 weeks

#### Prototype/POC Suggestion

```rust
// 1-day proof of concept
use std::path::Path;

use image_hasher::{HasherConfig, HashAlg};

fn demo_perceptual_hash(image_path: &Path) {
    let hasher = HasherConfig::new()
        .hash_alg(HashAlg::Gradient)  // dHash variant
        .hash_size(8, 8)
        .to_hasher();
    
    let image = image::open(image_path).unwrap();
    let hash = hasher.hash_image(&image);
    println!("Hash: {:?}", hash.to_base64());
}
```

#### Go/No-Go Recommendation

🟡 **GO with conditions:**
- Start with dHash algorithm only (simpler than pHash)
- Require explicit opt-in via flag
- Document performance implications
- Consider as "beta" feature initially

---

### Feature 2: Audio Fingerprinting

**Detect duplicate music files even across different encodings or bitrates.**

#### Technical Approach

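Use Chromaprint (AcoustID) to compute audio fingerprints, either through FFI bindings to the C library or by invoking its `fpcalc` command-line tool.

```rust
use std::error::Error;
use std::path::Path;
use std::process::Command;

// Two approaches:

// Option A: FFI bindings (complex). Signature elided; shown for orientation only.
// unsafe extern "C" {
//     fn chromaprint_decode_fingerprint(...);
// }

// Option B: CLI wrapper (simpler)
fn get_audio_fingerprint(path: &Path) -> Result<String, Box<dyn Error>> {
    // Pass the path as an OsStr so non-UTF-8 file names do not panic.
    let output = Command::new("fpcalc").arg("-json").arg(path).output()?;
    if !output.status.success() {
        return Err(format!("fpcalc exited with {}", output.status).into());
    }

    let json: serde_json::Value = serde_json::from_slice(&output.stdout)?;
    json["fingerprint"]
        .as_str()
        .map(str::to_owned)
        .ok_or_else(|| "fpcalc output has no `fingerprint` field".into())
}
```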

#### Required Crates

| Crate | Version | Purpose | License | Notes |
|-------|---------|---------|---------|-------|
| `chromaprint` | N/A | Audio fingerprinting | LGPL | No pure Rust crate available |
| `chromaprint-sys` | N/A | FFI bindings | - | Would need to create/maintain |
| `serde_json` | 1.x | Parse fpcalc output | MIT/Apache | Already in dependencies |

#### Integration Points

1. **New Module:** `scanner::audio` for audio-specific processing
2. **Feature Flag:** `audio` (optional compilation)
3. **Dependency:** Runtime dependency on `fpcalc` binary
4. **Cache:** New column for audio fingerprint

#### Risks and Challenges

| Risk | Impact | Notes |
|------|--------|----------|
| No mature Rust crate | High | Requires FFI or external binary |
| LGPL licensing | Medium | Chromaprint is LGPL (compatible but complex) |
| External binary dependency | High | Users must install fpcalc |
| Cross-platform distribution | High | Complex packaging requirements |
| Performance (decoding audio) | Medium | ~100ms per file, needs parallelization |

#### Effort Estimate Breakdown

- **FFI investigation:** 3-5 days
- **Chromaprint integration:** 5-7 days
- **Cross-platform build setup:** 3-5 days
- **Testing across platforms:** 3-5 days
- **Documentation/packaging:** 2-3 days
- **Total:** 4-6 weeks

#### Go/No-Go Recommendation

🔴 **NO-GO for now:**

**Rationale:**
- No pure Rust solution available
- Requires external C library integration
- LGPL licensing complications
- High maintenance burden

**Alternative:** Document workaround using external tools (e.g., run `fpcalc` separately, import results)

---

### Feature 3: Fuzzy/Near-Duplicate Text Detection

**Detect similar documents using text-based similarity algorithms.**

#### Technical Approach

Extract text from documents (PDF, DOCX, TXT) and compute SimHash or MinHash fingerprints for near-duplicate detection.

```rust
pub struct DocumentHasher {
    algorithm: DocumentHashAlgorithm,
}

pub enum DocumentHashAlgorithm {
    SimHash,    // 64-bit fingerprint, Hamming distance
    MinHash,    // Set similarity, Jaccard index
}

impl DocumentHasher {
    pub fn hash_document(&self, path: &Path) -> Result<DocumentHash> {
        let text = extract_text(path)?;
        match self.algorithm {
            DocumentHashAlgorithm::SimHash => self.simhash(&text),
            DocumentHashAlgorithm::MinHash => self.minhash(&text),
        }
    }
}
```
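
To illustrate what the SimHash branch computes, here is a minimal sketch that uses only the standard library. It assumes whitespace tokenization with equal token weights and uses `DefaultHasher` as the per-token hash; none of this reflects the API of the `simhash` crate, and the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes one token to 64 bits. `DefaultHasher::new()` uses fixed keys, so
/// results are stable within a build, but the algorithm is not guaranteed
/// across Rust releases; a persistent cache would need a pinned hash.
fn token_hash(token: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    token.hash(&mut hasher);
    hasher.finish()
}

/// 64-bit SimHash over lowercase, whitespace-separated tokens.
/// Every token votes +1 or -1 on each bit; the sign of the tally decides
/// the corresponding output bit.
pub fn simhash(text: &str) -> u64 {
    let mut votes = [0i32; 64];
    for token in text.split_whitespace() {
        let h = token_hash(&token.to_lowercase());
        for (bit, vote) in votes.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *vote += 1;
            } else {
                *vote -= 1;
            }
        }
    }
    let mut fingerprint = 0u64;
    for (bit, vote) in votes.iter().enumerate() {
        if *vote > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of differing bits between two fingerprints.
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

With equal weights the fingerprint ignores word order and case, so reordered text maps to the same value. Documents that share most tokens tend to differ in only a few bits, which is why a configurable Hamming distance serves as the similarity threshold.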

#### Required Crates

| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `simhash` | 0.2 | SimHash implementation | MIT | 100K+ |
| `pdf-extract` | 0.7 | PDF text extraction | MIT | 200K+ |
| `docx-rs` | 0.4 | DOCX parsing | MIT | 300K+ |
| `loole` | 0.1 | Shingling support | MIT | 50K+ |

#### Integration Points

1. **New Module:** `scanner::document`
2. **Text Extraction Pipeline:** PDF → DOCX → Plain text
3. **Similarity Threshold:** Configurable Hamming distance
4. **CLI:** `--similar-documents` flag

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| Text extraction failures | Medium | Fallback to content hash |
| Document format support | Medium | Start with PDF/TXT, add DOCX later |
| Unicode normalization | Low | Use existing `unicode-normalization` crate |
| Performance | Low | Rayon parallelization |

#### Effort Estimate Breakdown

- **Text extraction framework:** 3-4 days
- **SimHash implementation:** 2-3 days
- **Similarity matching:** 2-3 days
- **CLI integration:** 1-2 days
- **Testing:** 2-3 days
- **Total:** 2-3 weeks

#### Go/No-Go Recommendation

🟡 **GO with scope limitation:**

**Recommended scope:**
- Start with plain text and PDF only
- Use SimHash (simpler than MinHash)
- Require opt-in via `--similar-documents`
- Document limitations

---

### Feature 4: GPU-Accelerated Hashing

**Use CUDA/OpenCL to accelerate hash computation for large files.**

#### Technical Approach

Offload BLAKE3 hashing to GPU for large files. However, BLAKE3 is already extremely fast on CPU with SIMD (~8.4 GB/s single-thread).

```rust
#[cfg(feature = "cuda")]
pub struct GpuHasher {
    context: cuda::Context,
}

impl GpuHasher {
    pub fn hash_file(&self, path: &Path) -> Result<Hash> {
        // Transfer file to GPU memory
        // Launch BLAKE3 kernel
        // Retrieve result
        todo!("sketch only: no GPU kernel exists yet")
    }
}
```

#### Required Crates

| Crate | Version | Purpose | License | Notes |
|-------|---------|---------|---------|-------|
| `rustacuda` | 0.2 | CUDA bindings | MIT/Apache | Limited maintenance |
| `cust` | 0.3 | Safe CUDA wrapper | MIT/Apache | Newer alternative |
| `opencl3` | 0.9 | OpenCL bindings | MIT/Apache | Cross-platform |

#### Integration Points

1. **New Module:** `scanner::gpu`
2. **Feature Flag:** `cuda` or `opencl`
3. **Runtime Detection:** Check GPU availability
4. **Fallback:** CPU hashing if GPU unavailable

#### Risks and Challenges

| Risk | Impact | Notes |
|------|--------|----------|
| Limited benefit | Very High | BLAKE3 @ 8.4 GB/s already saturates most storage |
| Cross-platform complexity | High | Different GPU vendors, drivers |
| Build complexity | High | Requires CUDA toolkit for compilation |
| Runtime dependencies | High | Users need GPU drivers |
| Maintenance | High | GPU crate ecosystem is immature |

#### Performance Reality Check

| Scenario | CPU BLAKE3 | GPU Potential | Net Benefit |
|----------|-----------|---------------|-------------|
| NVMe SSD (3.5 GB/s) | 8.4 GB/s | Limited by storage | Minimal |
| RAM disk (10 GB/s) | 8.4 GB/s | ~50 GB/s | Moderate |
| Network storage (1 Gbps ≈ 0.125 GB/s) | 8.4 GB/s | Limited by network | None |
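
The reasoning behind this table is a simple bottleneck model: end-to-end throughput is capped by the slower of the storage read and the hash computation. The sketch below encodes that model under the table's figures. It deliberately ignores host-to-device transfer cost, which would only reduce the GPU's advantage; the function names are hypothetical.

```rust
/// End-to-end hashing throughput in GB/s: the pipeline runs no faster than
/// its slowest stage (storage read or hash computation).
pub fn effective_throughput(storage_gbps: f64, hasher_gbps: f64) -> f64 {
    storage_gbps.min(hasher_gbps)
}

/// Speedup gained by replacing the CPU hasher with a GPU hasher on the same
/// storage. A value of 1.0 means no benefit.
pub fn gpu_speedup(storage_gbps: f64, cpu_gbps: f64, gpu_gbps: f64) -> f64 {
    effective_throughput(storage_gbps, gpu_gbps) / effective_throughput(storage_gbps, cpu_gbps)
}
```

With the figures above, `gpu_speedup(3.5, 8.4, 50.0)` is 1.0 for the NVMe case and `gpu_speedup(10.0, 8.4, 50.0)` is roughly 1.19 for the RAM disk case, which matches the "Minimal" and "Moderate" ratings.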

#### Effort Estimate Breakdown

- **CUDA research/setup:** 5-7 days
- **BLAKE3 GPU kernel:** 5-7 days
- **Integration with scanner:** 3-5 days
- **Cross-platform testing:** 5-7 days
- **Documentation:** 2-3 days
- **Total:** 6-8 weeks

#### Go/No-Go Recommendation

🔴 **NO-GO:**

**Rationale:**
- BLAKE3 on CPU is already faster than most storage
- Diminishing returns for typical use cases
- High complexity, immature Rust GPU ecosystem
- Maintenance burden not justified by benefit

---

### Feature 5: Memory-Mapped File I/O

**Use memory-mapped files for improved I/O performance.**

#### Technical Approach

Replace buffered file reading with memory-mapped I/O for large files. BLAKE3 supports `update_mmap` for this purpose.

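```rust
use std::fs::File;
use std::path::Path;

use memmap2::Mmap;

pub fn hash_file_mmap(path: &Path) -> Result<Hash> {
    let file = File::open(path)?;
    // SAFETY: the mapping is only sound while no other process truncates or
    // rewrites the file; this is why `Mmap::map` is an `unsafe` call.
    let mmap = unsafe { Mmap::map(&file)? };

    let mut hasher = blake3::Hasher::new();
    hasher.update(&mmap);
    Ok(*hasher.finalize().as_bytes())
}

// Even simpler with BLAKE3's built-in support
// (requires the `mmap` and `rayon` Cargo features of the `blake3` crate)
pub fn hash_file_mmap_rayon(path: &Path) -> Result<Hash> {
    let mut hasher = blake3::Hasher::new();
    hasher.update_mmap_rayon(path)?;  // Memory-mapped + parallel
    Ok(*hasher.finalize().as_bytes())
}
```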

#### Required Crates

| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `memmap2` | 0.9 | Memory-mapped files | MIT/Apache | 200M+ |

#### Integration Points

1. **Scanner Module:** Add `full_hash_mmap` method to `Hasher`
2. **Configuration:** Add `--use-mmap` CLI flag
3. **Heuristics:** Use mmap for files > 64KB (avoid overhead for small files)
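
The flag and the size heuristic can be combined into one decision function. The following is a minimal sketch, assuming a hypothetical `ReadStrategy` enum and a caller that already knows whether the file sits on a network filesystem (how that is detected is outside this sketch).

```rust
/// How a file's contents are read for hashing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadStrategy {
    Buffered,
    Mmap,
}

/// Files at or below this size are read with buffered I/O (64 KB threshold).
pub const MMAP_THRESHOLD_BYTES: u64 = 64 * 1024;

/// Chooses mmap only when the user opted in, the file is large enough to
/// amortize the mapping overhead, and the file is not on a network
/// filesystem.
pub fn choose_strategy(file_size: u64, use_mmap: bool, on_network_fs: bool) -> ReadStrategy {
    if use_mmap && !on_network_fs && file_size > MMAP_THRESHOLD_BYTES {
        ReadStrategy::Mmap
    } else {
        ReadStrategy::Buffered
    }
}
```

Keeping the decision in a pure function makes the threshold easy to unit test and to tune later without touching the hashing code.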

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| `unsafe` code requirement | Low | Well-established crate, minimal unsafe block |
| Network filesystem issues | Medium | Detect and fallback to buffered I/O |
| Small file overhead | Low | Only use mmap for files > threshold |
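| File modification during hash | Low | Not handled by the OS: a concurrent truncation can raise `SIGBUS` on Unix, so re-check size and mtime after hashing and fall back to buffered I/O |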

#### Effort Estimate Breakdown

- **Implementation:** 1-2 days
- **CLI integration:** 0.5 days
- **Testing:** 1-2 days
- **Documentation:** 0.5 days
- **Total:** 1 week

#### Go/No-Go Recommendation

🟢 **GO - High Priority:**

**Rationale:**
- Minimal effort (1 week)
- Well-established crate (`memmap2`)
- Potential performance improvement for large files
- BLAKE3 has built-in support
- Low risk

---

### Feature 6: Bloom Filters

**Use probabilistic data structures for quick rejection of non-duplicates.**

#### Technical Approach

Use Bloom filters to quickly determine if a file size or prehash has been seen before, avoiding unnecessary full hash computation.

```rust
// `ASMS` is the trait that provides `insert` and `contains`.
use bloom::{BloomFilter, ASMS};

pub struct DuplicateDetector {
    size_filter: BloomFilter,      // File sizes seen
    prehash_filter: BloomFilter,   // Prehashes seen
}

impl DuplicateDetector {
    // Takes `&mut self` because a miss inserts into the filter.
    // The first file of each size or prehash returns `false`, so the caller
    // must still retain it to pair with later matches.
    pub fn might_be_duplicate(&mut self, entry: &FileEntry, prehash: &Hash) -> bool {
        // Fast path: check if size exists
        if !self.size_filter.contains(&entry.size) {
            self.size_filter.insert(&entry.size);
            return false;  // Definitely not a duplicate
        }

        // Check prehash
        if !self.prehash_filter.contains(prehash) {
            self.prehash_filter.insert(prehash);
            return false;  // Definitely not a duplicate
        }

        true  // Possibly a duplicate, needs full hash
    }
}
```

#### Required Crates

| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `bloom` | 0.3 | Bloom filter | MIT | 5M+ |
| `growable-bloom-filter` | 2.0 | Auto-resizing variant | MIT | 100K+ |

#### Integration Points

1. **New Module:** `duplicates::bloom` or extend `finder.rs`
2. **Configuration:** False positive rate (default: 1%)
3. **Memory Management:** Estimate filter size based on expected file count
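
Filter sizing follows from the standard Bloom filter formulas: for `n` expected items and false positive rate `p`, the bit count is m = -n ln p / (ln 2)² and the hash count is k = (m / n) ln 2. The sketch below applies them and pairs them with a deliberately tiny filter built on the standard library. `TinyBloom` is a hypothetical illustration, not the `bloom` crate's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Returns (bits, hash functions) for `n` expected items at false positive
/// rate `p`, using m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
pub fn optimal_params(n: usize, p: f64) -> (usize, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil();
    let k = ((m / n as f64) * ln2).round().max(1.0);
    (m as usize, k as u32)
}

pub struct TinyBloom {
    bits: Vec<u64>,
    num_bits: usize,
    num_hashes: u32,
}

impl TinyBloom {
    pub fn new(expected_items: usize, fp_rate: f64) -> Self {
        let (num_bits, num_hashes) = optimal_params(expected_items.max(1), fp_rate);
        Self { bits: vec![0; (num_bits + 63) / 64], num_bits, num_hashes }
    }

    /// Derives the i-th bit index by double hashing: h1 + i * h2.
    fn index<T: Hash>(&self, item: &T, i: u32) -> usize {
        let mut first = DefaultHasher::new();
        item.hash(&mut first);
        let h1 = first.finish();
        let mut second = DefaultHasher::new();
        (item, 0x9E37_79B9u32).hash(&mut second);
        let h2 = second.finish() | 1;
        (h1.wrapping_add(u64::from(i).wrapping_mul(h2)) % self.num_bits as u64) as usize
    }

    pub fn insert<T: Hash>(&mut self, item: &T) {
        for i in 0..self.num_hashes {
            let idx = self.index(item, i);
            self.bits[idx / 64] |= 1u64 << (idx % 64);
        }
    }

    /// May return a false positive, never a false negative.
    pub fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.num_hashes).all(|i| {
            let idx = self.index(item, i);
            self.bits[idx / 64] & (1u64 << (idx % 64)) != 0
        })
    }
}
```

For one million files at the 1% default, `optimal_params(1_000_000, 0.01)` yields about 9.59 million bits (roughly 1.2 MB) and 7 hash functions, so the memory cost per filter is modest.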

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| Memory usage | Low | Configurable FP rate trades memory for accuracy |
| False positives | Low | 1% FP rate acceptable (full hash verifies) |
| Filter sizing | Low | Can use growable variant |

#### Effort Estimate Breakdown

- **Implementation:** 2-3 days
- **Integration:** 1 day
- **Testing:** 1-2 days
- **Documentation:** 0.5 days
- **Total:** 1 week

#### Go/No-Go Recommendation

🟢 **GO - High Priority:**

**Rationale:**
- Very low effort
- Significant performance improvement for large scans
- Reduces unnecessary hash computation
- Well-established algorithm and crates

---

### Feature 7: Real-Time Monitoring

**Watch directories for changes and detect new duplicates automatically.**

#### Technical Approach

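Use the `notify` crate for cross-platform filesystem monitoring. In notify 5.0 and later, debouncing is provided by the companion crates `notify-debouncer-mini` and `notify-debouncer-full`; the sketch below consumes the raw event stream.

```rust
use notify::{recommended_watcher, Event, EventKind, RecommendedWatcher, RecursiveMode, Watcher};
use std::path::{Path, PathBuf};
use std::sync::mpsc::{channel, Receiver};

pub struct DirectoryMonitor {
    // Kept alive so the OS-level watches are not dropped.
    _watcher: RecommendedWatcher,
    rx: Receiver<notify::Result<Event>>,
}

impl DirectoryMonitor {
    pub fn new(paths: &[PathBuf]) -> notify::Result<Self> {
        let (tx, rx) = channel();
        // A std `Sender` can be used directly as the event handler.
        let mut watcher = recommended_watcher(tx)?;

        for path in paths {
            watcher.watch(path, RecursiveMode::Recursive)?;
        }

        Ok(Self { _watcher: watcher, rx })
    }

    pub fn run(&self, mut callback: impl FnMut(&Path)) -> notify::Result<()> {
        // The loop ends when the watcher (and its sender) is dropped.
        for event in &self.rx {
            let event = event?;
            match event.kind {
                EventKind::Create(_) | EventKind::Modify(_) => {
                    for path in &event.paths {
                        callback(path);
                    }
                }
                EventKind::Remove(_) => {
                    // Invalidate cache entry
                }
                _ => {}
            }
        }
        Ok(())
    }
}
```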

#### Required Crates

| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `notify` | 7.0 | Cross-platform file watching | CC0-1.0 | 50M+ |

#### Integration Points

1. **New Module:** `monitor.rs` at crate root
2. **CLI:** `rustdupe monitor <paths...>` subcommand
3. **Session Integration:** Resume from cache on startup
4. **TUI Integration:** Real-time duplicate notifications

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| Platform differences | Medium | `notify` abstracts most differences |
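| High event volume | Medium | Debounce events; in notify 5.0 and later this is provided by the companion crates `notify-debouncer-mini` and `notify-debouncer-full` |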
| Recursive watching limits | Low | Document OS-specific limits |
| Battery impact (laptops) | Low | Only watch when explicitly requested |
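
To show what debouncing means for the "High event volume" risk, here is a minimal standard-library sketch that coalesces repeated events per path and releases a path only after it has been quiet for a set period. `PathDebouncer` is a hypothetical type for illustration and not part of `notify` or its debouncer crates; timestamps are passed in explicitly so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Coalesces bursts of events on the same path into one notification.
pub struct PathDebouncer {
    quiet_period: Duration,
    last_seen: HashMap<PathBuf, Instant>,
}

impl PathDebouncer {
    pub fn new(quiet_period: Duration) -> Self {
        Self { quiet_period, last_seen: HashMap::new() }
    }

    /// Records an event for `path` observed at `now`, restarting its quiet
    /// period.
    pub fn record(&mut self, path: &Path, now: Instant) {
        self.last_seen.insert(path.to_path_buf(), now);
    }

    /// Removes and returns every path whose most recent event is at least
    /// `quiet_period` old, sorted for deterministic output.
    pub fn drain_ready(&mut self, now: Instant) -> Vec<PathBuf> {
        let quiet = self.quiet_period;
        let mut ready: Vec<PathBuf> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) >= quiet)
            .map(|(path, _)| path.clone())
            .collect();
        for path in &ready {
            self.last_seen.remove(path);
        }
        ready.sort();
        ready
    }
}
```

A monitor loop would call `record` for each incoming event and `drain_ready` on a timer, rehashing only the paths it returns. A file that is written repeatedly during a copy is therefore hashed once, after the writes stop.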

#### Effort Estimate Breakdown

- **Core monitoring:** 2-3 days
- **Integration with scanner:** 2-3 days
- **CLI subcommand:** 1 day
- **Testing across platforms:** 2-3 days
- **Documentation:** 1 day
- **Total:** 2-3 weeks

#### Go/No-Go Recommendation

🟡 **GO - Medium Priority:**

**Rationale:**
- Well-established crate
- Useful for ongoing duplicate management
- Moderate effort
- Consider as background service feature

---

### Feature 8: Cloud Storage Integration

**Direct integration with Dropbox, OneDrive, Google Drive APIs.**

#### Technical Approach

Access cloud storage APIs to scan files without downloading locally.

```rust
pub enum CloudProvider {
    Dropbox,
    GoogleDrive,
    OneDrive,
}

pub struct CloudScanner {
    provider: CloudProvider,
    client: Box<dyn CloudClient>,
}

#[async_trait]
pub trait CloudClient {
    async fn list_files(&self, path: &str) -> Result<Vec<CloudFile>>;
    async fn get_hash(&self, file_id: &str) -> Result<Option<String>>;
}
```

#### Required Crates

| Crate | Version | Purpose | License | Complexity |
|-------|---------|---------|---------|------------|
| `reqwest` | 0.12 | HTTP client | MIT/Apache | High |
| `oauth2` | 4.4 | OAuth authentication | MIT/Apache | High |
| `serde_json` | 1.x | API response parsing | MIT/Apache | Already have |
| Various SDKs | - | Provider-specific APIs | Varies | Very High |

#### Integration Points

1. **New Module:** `cloud.rs` with provider submodules
2. **Authentication:** OAuth flow for each provider
3. **Async Runtime:** Requires `tokio` (conflicts with sync architecture)
4. **Cache:** Store cloud file metadata

#### Risks and Challenges

| Risk | Impact | Notes |
|------|--------|----------|
| Requires async runtime | Critical | Conflicts with sync+rayon architecture |
| OAuth complexity | High | User authentication flow |
| API rate limits | High | Throttling, quotas |
| Provider API changes | Medium | Breaking changes, maintenance |
| Security considerations | High | Token storage, encryption |
| Scope explosion | High | 3+ providers with different APIs |

#### Effort Estimate Breakdown

- **Architecture changes (async):** 2-3 weeks
- **OAuth implementation:** 1-2 weeks
- **Dropbox integration:** 1 week
- **Google Drive integration:** 1-2 weeks
- **OneDrive integration:** 1-2 weeks
- **Testing:** 1-2 weeks
- **Documentation:** 3-5 days
- **Total:** 6-10 weeks

#### Go/No-Go Recommendation

🔴 **NO-GO:**

**Rationale:**
- Requires async runtime (major architectural change)
- High complexity across multiple providers
- Rate limiting and API maintenance burden
- Alternative exists: scan local sync folders

**Alternative:**
- Document how to scan Dropbox/Google Drive local folders
- Users get same result with simpler implementation

---

### Feature 9: GUI Application

**Native GUI alongside TUI using egui, iced, or tauri.**

#### Technical Approach

Create a separate GUI binary or make TUI/GUI interchangeable.

**Option A: egui (Immediate Mode)**
```rust
use eframe::egui;

pub struct RustDupeApp {
    scan_state: ScanState,
    duplicates: Vec<DuplicateGroup>,
}

impl eframe::App for RustDupeApp {
    fn update(&mut self, ctx: &egui::Context, _frame: &mut eframe::Frame) {
        egui::CentralPanel::default().show(ctx, |ui| {
            ui.heading("RustDupe");
            // GUI implementation
        });
    }
}
```

**Option B: Tauri (Web-based)**
- React/Vue frontend
- Rust backend via Tauri commands

#### Required Crates

| Crate | Version | Purpose | License | Binary Size |
|-------|---------|---------|---------|-------------|
| `eframe` | 0.29 | egui framework | MIT/Apache | +2-5 MB |
| `iced` | 0.12 | Elm architecture | MIT | +3-6 MB |
| `tauri` | 2.0 | Web-based GUI | MIT/Apache | +5-15 MB |
| `dioxus` | 0.5 | React-like framework | MIT/Apache | +3-8 MB |

#### Integration Points

1. **Separate Binary:** `rustdupe-gui` crate
2. **Shared Library:** Core logic in `rustdupe-lib`
3. **UI Abstraction:** Common interface for TUI and GUI

#### Risks and Challenges

| Risk | Impact | Notes |
|------|--------|----------|
| Binary size increase | High | +50-200% size increase |
| Maintenance burden | Very High | Two UIs to maintain |
| Architecture complexity | High | Shared core abstraction |
| Feature parity | Medium | GUI may lag behind TUI |
| Cross-platform testing | High | GUI behavior varies by OS |

#### Effort Estimate Breakdown

- **Architecture refactoring:** 1-2 weeks
- **GUI framework setup:** 1 week
- **Core GUI implementation:** 3-4 weeks
- **Feature parity with TUI:** 2-3 weeks
- **Testing across platforms:** 2-3 weeks
- **Documentation:** 1 week
- **Total:** 8-12 weeks

#### Go/No-Go Recommendation

🔴 **NO-GO:**

**Rationale:**
- TUI is already excellent (ratatui is mature)
- High effort, high maintenance
- Binary size increase
- Would split development focus

**Alternative:**
- Improve TUI further (already planned)
- Consider GUI as separate future project

---

### Feature 10: Multi-Directory Scanning

**Scan multiple root directories simultaneously.**

#### Technical Approach

Extend CLI and scanner to accept multiple paths, merging results.

```rust
use rayon::prelude::*;

// Current
pub fn scan(path: &Path, config: &ScanConfig) -> Result<Vec<FileEntry>>;

// Proposed
pub fn scan_multiple(paths: &[PathBuf], config: &ScanConfig) -> Result<ScanResult> {
    let all_entries: Vec<FileEntry> = paths
        .par_iter()  // rayon parallel iteration
        .map(|path| scan_single(path, config))
        .collect::<Result<Vec<_>>>()?
        .into_iter()
        .flatten()
        .collect();
    
    // Deduplicate across all paths
    find_duplicates(&all_entries)
}
```

#### Required Crates

- **None** - Uses existing `rayon` infrastructure

#### Integration Points

1. **CLI:** Accept multiple `<PATH>` arguments
2. **Scanner:** Parallelize across directories
3. **Progress:** Aggregate progress from all paths
4. **Cache:** Per-directory or unified cache

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| Cross-device duplicates | Low | Handle different filesystems |
| Path conflicts | Low | Use canonical paths |
| Memory usage | Low | Stream results, don't buffer all |
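
The "Path conflicts" risk is most visible when one root is nested inside another, for example `/data` and `/data/photos`: without a guard, every file under the nested root is scanned twice and reported as a duplicate of itself. A minimal sketch of such a guard follows, assuming the inputs were already canonicalized (for example with `std::fs::canonicalize`); the function name `dedupe_roots` is hypothetical.

```rust
use std::path::PathBuf;

/// Removes repeated roots and roots nested inside another root.
/// Assumes every path is already canonical and absolute.
pub fn dedupe_roots(mut roots: Vec<PathBuf>) -> Vec<PathBuf> {
    // Sorting places every parent directory before its descendants.
    roots.sort();
    let mut kept: Vec<PathBuf> = Vec::new();
    for root in roots {
        // `Path::starts_with` compares whole components, so `/database`
        // is not treated as a child of `/data`.
        if !kept.iter().any(|parent| root.starts_with(parent)) {
            kept.push(root);
        }
    }
    kept
}
```

Running this before the parallel scan keeps the per-root work disjoint, so results can be merged without a second pass to filter out self-matches.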

#### Effort Estimate Breakdown

- **CLI changes:** 1-2 days
- **Scanner refactoring:** 2-3 days
- **Progress integration:** 1-2 days
- **Testing:** 1-2 days
- **Documentation:** 0.5 days
- **Total:** 1-2 weeks

#### Go/No-Go Recommendation

🟢 **GO - High Priority:**

**Rationale:**
- Very low effort
- High user value
- Natural extension of existing architecture
- No new dependencies

---

### Feature 11: Database Backend Options

**Support sled or redb as alternatives to SQLite.**

#### Technical Approach

Abstract cache backend to support multiple database implementations.

```rust
pub trait CacheBackend {
    fn get(&self, key: &CacheKey) -> Result<Option<CacheEntry>>;
    fn put(&self, key: &CacheKey, entry: &CacheEntry) -> Result<()>;
    fn delete(&self, key: &CacheKey) -> Result<()>;
}

pub enum CacheBackendType {
    Sqlite,  // Current
    Sled,    // Pure Rust key-value
    Redb,    // Rust-native B-tree
}

// Usage
pub struct HashCache {
    backend: Box<dyn CacheBackend>,
}
```
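
As a sanity check on the trait shape, the sketch below implements it for an in-memory backend, which could also serve as a test double. The field layouts of `CacheKey` and `CacheEntry` and the `CacheResult` alias are assumptions made for this example; the real types in the codebase may differ.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::Mutex;

// Hypothetical key and entry layouts, for illustration only.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub path: PathBuf,
    pub size: u64,
    pub mtime_secs: i64,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct CacheEntry {
    pub hash: [u8; 32],
}

// Stand-in for the crate's own `Result` alias.
pub type CacheResult<T> = std::result::Result<T, String>;

pub trait CacheBackend {
    fn get(&self, key: &CacheKey) -> CacheResult<Option<CacheEntry>>;
    fn put(&self, key: &CacheKey, entry: &CacheEntry) -> CacheResult<()>;
    fn delete(&self, key: &CacheKey) -> CacheResult<()>;
}

/// In-memory backend. The `Mutex` provides the interior mutability that
/// the `&self` methods of the trait require.
#[derive(Default)]
pub struct MemoryBackend {
    entries: Mutex<HashMap<CacheKey, CacheEntry>>,
}

impl CacheBackend for MemoryBackend {
    fn get(&self, key: &CacheKey) -> CacheResult<Option<CacheEntry>> {
        let map = self.entries.lock().map_err(|e| e.to_string())?;
        Ok(map.get(key).cloned())
    }

    fn put(&self, key: &CacheKey, entry: &CacheEntry) -> CacheResult<()> {
        let mut map = self.entries.lock().map_err(|e| e.to_string())?;
        map.insert(key.clone(), entry.clone());
        Ok(())
    }

    fn delete(&self, key: &CacheKey) -> CacheResult<()> {
        let mut map = self.entries.lock().map_err(|e| e.to_string())?;
        map.remove(key);
        Ok(())
    }
}
```

Because every method takes `&self`, each real backend has to manage its own synchronization, which is worth confirming for SQLite, sled, and redb before the trait is finalized.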

#### Required Crates

| Crate | Version | Purpose | License | Performance |
|-------|---------|---------|---------|-------------|
| `sled` | 0.34 | Pure Rust key-value | MIT/Apache | Very Fast |
| `redb` | 2.0 | Rust-native B-tree | MIT/Apache | Fast |
| `rusqlite` | 0.31 | Current SQLite | MIT | Good |

#### Comparison

| Aspect | SQLite | Sled | Redb |
|--------|--------|------|------|
| Pure Rust | No (C) | Yes | Yes |
| Performance | Good | Excellent | Very Good |
| Binary Size | +500KB | +300KB | +400KB |
| ACID | Yes | Yes | Yes |
| Maintenance | Excellent | Good | Good |

#### Integration Points

1. **Cache Module:** Refactor to trait-based backend
2. **Configuration:** `--cache-backend` option
3. **Migration:** Export/import between backends

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| Data migration | Medium | Provide export/import tools |
| API differences | Low | Abstract with trait |
| Testing matrix | Medium | Test all backends in CI |

#### Effort Estimate Breakdown

- **Trait abstraction:** 2-3 days
- **Sled implementation:** 2-3 days
- **Redb implementation:** 2-3 days
- **Migration tools:** 2-3 days
- **Testing:** 2-3 days
- **Documentation:** 1-2 days
- **Total:** 2 weeks

#### Go/No-Go Recommendation

🟢 **GO - High Priority:**

**Rationale:**
- Low to medium effort
- Allows builds without the C dependency (SQLite) when a pure-Rust backend is selected
- Better performance with sled/redb
- Educational value for Rust ecosystem

---

### Feature 12: Progressive/Streaming Results

**Show duplicate results as they're found instead of waiting for completion.**

#### Technical Approach

Use channels to stream results from scanner to UI in real-time.

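```rust
use crossbeam_channel::{bounded, Sender};
use std::path::PathBuf;
use std::thread;

pub struct StreamingScanner {
    result_tx: Sender<DuplicateGroup>,
}

impl StreamingScanner {
    pub fn scan_streaming(&self, paths: &[PathBuf]) -> Result<()> {
        // A bounded channel applies backpressure to the walker thread.
        let (file_tx, file_rx) = bounded(1024);

        // Spawn file walker thread; it owns its own copy of the paths.
        let paths = paths.to_vec();
        let walker = thread::spawn(move || {
            for file in walk_files(&paths) {
                // Stop walking once the receiver has gone away.
                if file_tx.send(file).is_err() {
                    break;
                }
            }
        });

        // Process and stream results
        for file in file_rx {
            if let Some(duplicate) = self.process_file(file) {
                // Stop if the consumer (e.g. the TUI) has shut down.
                if self.result_tx.send(duplicate).is_err() {
                    break;
                }
            }
        }

        walker.join().expect("file walker thread panicked");
        Ok(())
    }
}
```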

#### Required Crates

| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `crossbeam-channel` | 0.5 | Multi-producer multi-consumer channels | MIT/Apache | 200M+ |

#### Integration Points

1. **Scanner:** Streaming result channel
2. **TUI:** Real-time result updates
3. **Output:** Stream JSON/CSV as results arrive
4. **Progress:** Merge with result streaming

#### Risks and Challenges

| Risk | Impact | Mitigation |
|------|--------|------------|
| UI complexity | Medium | Ratatui supports dynamic updates |
| Ordering guarantees | Low | Accept eventual consistency |
| Memory pressure | Low | Bounded channels |
| Error handling | Medium | Send errors through channel |
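
The ordering and memory trade-offs in this table can be seen in a small, self-contained sketch. It uses the standard library's `sync_channel` as a stand-in for a bounded `crossbeam-channel`, takes already-hashed `(hash, path)` pairs as input, and emits an update each time a group gains a member. The names `DuplicateUpdate` and `stream_duplicates` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// A notification that a group of identical files has grown.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DuplicateUpdate {
    pub hash: u64,
    pub paths: Vec<String>,
}

/// Consumes `(hash, path)` pairs on a worker thread and sends an update
/// each time a hash gains its second or later member. The bounded channel
/// blocks the worker when the consumer falls behind.
pub fn stream_duplicates(files: Vec<(u64, String)>, capacity: usize) -> Receiver<DuplicateUpdate> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        let mut groups: HashMap<u64, Vec<String>> = HashMap::new();
        for (hash, path) in files {
            let group = groups.entry(hash).or_default();
            group.push(path);
            if group.len() >= 2 {
                let update = DuplicateUpdate { hash, paths: group.clone() };
                // Stop early if the consumer has been dropped.
                if tx.send(update).is_err() {
                    break;
                }
            }
        }
    });
    rx
}
```

A consumer iterates the returned receiver and redraws as updates arrive. Note that the same group is reported more than once as it grows, so the UI has to replace earlier entries by hash rather than append; this is the "eventual consistency" the table accepts.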

#### Effort Estimate Breakdown

- **Channel architecture:** 2-3 days
- **Scanner integration:** 2-3 days
- **TUI real-time updates:** 3-4 days
- **Output streaming:** 1-2 days
- **Testing:** 2-3 days
- **Documentation:** 1 day
- **Total:** 2-3 weeks

#### Go/No-Go Recommendation

🟡 **GO - Medium Priority:**

**Rationale:**
- Good user experience improvement
- Well-established channel patterns
- Moderate effort
- Can be implemented incrementally

---

## 5. Architecture Impact Assessment

### Current Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     CLI     │────▶│   Config    │────▶│   Scanner   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                                │
                       ┌─────────────┐         │
                       │    Cache    │◀────────┤
                       │  (SQLite)   │         │
                       └─────────────┘         │
                                                ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Output    │◀────│  Duplicate  │◀────│   Hasher    │
│ (JSON/CSV)  │     │   Finder    │     │  (BLAKE3)   │
└─────────────┘     └─────────────┘     └─────────────┘
                            │
                            ▼
                     ┌─────────────┐
                     │     TUI     │
                     │  (ratatui)  │
                     └─────────────┘
```

### Feature Impact Summary

| Feature | Module Changes | Breaking Changes | Backward Compatible |
|---------|---------------|------------------|---------------------|
| Perceptual Hashing | `scanner/`, `duplicates/` | No | Yes |
| Audio Fingerprinting | New `scanner/audio/` | No | Yes (feature flag) |
| Fuzzy Text Detection | New `scanner/document/` | No | Yes |
| GPU Hashing | New `scanner/gpu/` | No | Yes (feature flag) |
| Memory-Mapped I/O | `scanner/hasher.rs` | No | Yes (opt-in) |
| Bloom Filters | `duplicates/finder.rs` | No | Yes |
| Real-Time Monitoring | New `monitor.rs` | No | Yes |
| Cloud Integration | New `cloud/` + async | **Yes** | No |
| GUI Application | New crate `rustdupe-gui` | No | N/A |
| Multi-Directory | `cli.rs`, `scanner/` | No | Yes |
| Database Options | `cache/` | No | Yes |
| Progressive Results | `scanner/`, `tui/` | No | Yes |

### Critical Architecture Decision

⚠️ **Cloud Storage Integration requires an async runtime**, which conflicts with the current synchronous + rayon architecture. This is the only feature requiring fundamental architectural changes.

---

## 6. Dependency Analysis

### New Dependencies Required

#### Tier 1 Features (Immediate Wins)

| Crate | Version | License | Size | Maintenance |
|-------|---------|---------|------|-------------|
| `memmap2` | 0.9 | MIT/Apache | Small | Excellent |
| `bloom` | 0.3 | MIT | Tiny | Good |
| `sled` | 0.34 | MIT/Apache | Medium | Good |
| `redb` | 2.0 | MIT/Apache | Medium | Good |

#### Tier 2 Features (Medium Term)

| Crate | Version | License | Size | Notes |
|-------|---------|---------|------|-------|
| `image_hasher` | 2.0 | MIT/Apache | Medium | Active development |
| `image` | 0.25 | MIT/Apache | Large | Very popular |
| `notify` | 7.0 | CC0-1.0 | Small | Excellent |
| `crossbeam-channel` | 0.5 | MIT/Apache | Small | Essential |
| `simhash` | 0.2 | MIT | Tiny | Stable |
| `pdf-extract` | 0.7 | MIT | Medium | Moderate activity |

#### Tier 3 and Not Recommended Features (Future)

| Crate | Version | License | Size | Concerns |
|-------|---------|---------|------|----------|
| `chromaprint` | N/A | LGPL | - | FFI complexity |
| `rustacuda` | 0.2 | MIT/Apache | Large | Limited maintenance |
| `reqwest` | 0.12 | MIT/Apache | Large | Async required |
| `oauth2` | 4.4 | MIT/Apache | Medium | Async required |

### License Compatibility Matrix

| Feature | Crates | License | Compatible with MIT? |
|---------|--------|---------|---------------------|
| Memory-Mapped I/O | `memmap2` | MIT/Apache | ✅ Yes |
| Bloom Filters | `bloom` | MIT | ✅ Yes |
| Database Options | `sled`, `redb` | MIT/Apache | ✅ Yes |
| Perceptual Hashing | `image_hasher`, `image` | MIT/Apache | ✅ Yes |
| Real-Time Monitoring | `notify` | CC0-1.0 | ✅ Yes |
| Progressive Results | `crossbeam-channel` | MIT/Apache | ✅ Yes |
| Fuzzy Text | `simhash`, `pdf-extract` | MIT | ✅ Yes |
| Audio Fingerprinting | `chromaprint` | LGPL | ⚠️ Complex |
| GPU Hashing | `rustacuda` | MIT/Apache | ✅ Yes |
| Cloud Integration | `reqwest`, `oauth2` | MIT/Apache | ✅ Yes |

### Maintenance Quality Assessment

| Crate | Last Update | Downloads | Maintenance Rating |
|-------|-------------|-----------|-------------------|
| `memmap2` | 2024-11 | 200M+ | ⭐⭐⭐⭐⭐ |
| `bloom` | 2023-08 | 5M+ | ⭐⭐⭐ |
| `sled` | 2023-01 | 5M+ | ⭐⭐⭐ |
| `redb` | 2024-12 | 1M+ | ⭐⭐⭐⭐ |
| `image_hasher` | 2024-06 | 500K+ | ⭐⭐⭐⭐ |
| `image` | 2024-12 | 100M+ | ⭐⭐⭐⭐⭐ |
| `notify` | 2024-11 | 50M+ | ⭐⭐⭐⭐⭐ |
| `crossbeam-channel` | 2024-08 | 200M+ | ⭐⭐⭐⭐⭐ |
| `chromaprint` | N/A | N/A | ⭐ (FFI complexity) |
| `rustacuda` | 2022-06 | 200K+ | ⭐⭐ |

---

## 7. Risk Assessment Matrix

### Technical Risks

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| Image decoding failures | Medium | Low | Graceful fallback | Acceptable |
| Bloom filter sizing | Low | Low | Growable variant | Acceptable |
| Memory-mapped file safety | Low | Medium | `unsafe` audit | Acceptable |
| Cache backend migration | Medium | Medium | Export tools | Acceptable |
| Channel backpressure | Medium | Low | Bounded channels | Acceptable |
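
The Bloom filter sizing risk exists because a fixed-size filter has to be dimensioned before the number of files is known. As a rough guide, the standard formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n expected items and target false positive rate p. The helper below is a hypothetical sketch of that arithmetic; `BloomParams` and `bloom_params` are illustrative names and are not part of the `bloom` crate.

```rust
/// Sizing parameters for a standard (fixed-size) Bloom filter.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BloomParams {
    /// Number of bits in the filter (m).
    pub bits: u64,
    /// Number of hash functions (k).
    pub hashes: u32,
}

/// Computes m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
/// Hypothetical helper for illustration only.
pub fn bloom_params(expected_items: u64, false_positive_rate: f64) -> BloomParams {
    assert!(expected_items > 0, "expected_items must be positive");
    assert!(
        false_positive_rate > 0.0 && false_positive_rate < 1.0,
        "false_positive_rate must be in (0, 1)"
    );
    let n = expected_items as f64;
    let ln2 = std::f64::consts::LN_2;
    // Round the bit count up so the target rate is not exceeded.
    let bits = (-n * false_positive_rate.ln() / (ln2 * ln2)).ceil();
    // Use at least one hash function.
    let hashes = ((bits / n) * ln2).round().max(1.0);
    BloomParams {
        bits: bits as u64,
        hashes: hashes as u32,
    }
}

fn main() {
    let params = bloom_params(1_000_000, 0.01);
    println!(
        "{} bits (~{:.1} MiB), {} hash functions",
        params.bits,
        params.bits as f64 / 8.0 / 1024.0 / 1024.0,
        params.hashes
    );
}
```

For one million files at a 1% false positive rate this works out to roughly 9.6 million bits (a little over 1 MiB) and 7 hash functions, so over-provisioning for an unknown file count is cheap compared with the cost of unnecessary hashing.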

### Integration Risks

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| Sync/async conflict | High | Critical | Avoid async features | **Avoided** |
| Feature flag complexity | Medium | Low | CI testing | Acceptable |
| TUI library compatibility | Low | Low | Ratatui stability | Acceptable |
| Cross-platform path issues | Medium | Medium | Extensive testing | Acceptable |

### Maintenance Risks

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| Crate abandonment | Medium | Medium | Pin versions, fork if needed | Acceptable |
| API breaking changes | Medium | Low | Semantic versioning | Acceptable |
| Security vulnerabilities | Low | High | Regular audits, updates | Acceptable |

### License Risks

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| LGPL contamination | Low | High | Keep separate, dynamic linking | Acceptable |
| License incompatibility | Low | Critical | Review before adding | **Avoided** |

---

## 8. Recommendations

### Implementation Roadmap

#### Phase 1: Foundation (Weeks 1-3)

**Tier 1 Features - Immediate Wins:**

1. **Memory-Mapped File I/O** (Week 1)
   - Add `memmap2` dependency
   - Implement `full_hash_mmap` method
   - Add `--use-mmap` CLI flag
   - Target: 10%+ performance improvement on large files (>100MB)

2. **Bloom Filters** (Week 1-2)
   - Add `bloom` dependency
   - Implement size and prehash filters
   - Measure false positive rate
   - Target: 30%+ reduction in unnecessary hash computations

3. **Multi-Directory Scanning** (Week 2)
   - Refactor CLI to accept multiple paths
   - Parallelize directory traversal
   - Update progress reporting
   - Target: Seamless multi-path UX

4. **Database Backend Options** (Week 2-3)
   - Abstract cache backend trait
   - Implement sled backend
   - Benchmark vs SQLite
   - Target: 2x cache performance improvement
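
The "abstract cache backend trait" step in item 4 might look roughly like the sketch below. Everything here is an assumption made for illustration: `CacheKey`, `CacheBackend`, `MemoryCache`, and `hash_with_cache` are hypothetical names, and the in-memory backend merely stands in for SQLite, sled, or redb. The only fixed detail is the 32-byte value, which is the default BLAKE3 output size.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: path plus size and modification time, so that an
/// entry for a file that has since changed is never returned.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub path: String,
    pub size: u64,
    pub mtime_secs: i64,
}

/// Hypothetical storage-agnostic interface for the hash cache.
pub trait CacheBackend {
    /// Returns the cached 32-byte hash for `key`, if present.
    fn get(&self, key: &CacheKey) -> Option<[u8; 32]>;
    /// Stores `hash` under `key`, replacing any previous entry.
    fn put(&mut self, key: CacheKey, hash: [u8; 32]);
}

/// In-memory stand-in for a real backend.
#[derive(Default)]
pub struct MemoryCache {
    entries: HashMap<CacheKey, [u8; 32]>,
}

impl CacheBackend for MemoryCache {
    fn get(&self, key: &CacheKey) -> Option<[u8; 32]> {
        self.entries.get(key).copied()
    }

    fn put(&mut self, key: CacheKey, hash: [u8; 32]) {
        self.entries.insert(key, hash);
    }
}

/// Calling code depends only on the trait, so backends can be swapped.
/// `compute` runs only on a cache miss.
pub fn hash_with_cache(
    cache: &mut dyn CacheBackend,
    key: CacheKey,
    compute: impl FnOnce() -> [u8; 32],
) -> [u8; 32] {
    if let Some(hit) = cache.get(&key) {
        return hit;
    }
    let hash = compute();
    cache.put(key, hash);
    hash
}

fn main() {
    let mut cache = MemoryCache::default();
    let key = CacheKey { path: "a.bin".into(), size: 4, mtime_secs: 0 };
    let first = hash_with_cache(&mut cache, key.clone(), || [7u8; 32]);
    // The second lookup is served from the cache, so the closure never runs.
    let second = hash_with_cache(&mut cache, key, || unreachable!("served from cache"));
    assert_eq!(first, second);
}
```

A production trait would likely return `Result` from both methods, since disk-backed stores can fail on I/O; that is omitted here to keep the sketch short. Benchmarking sled against SQLite then reduces to running the same workload against two implementations of the trait.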

#### Phase 2: Enhancement (Weeks 4-7)

**Tier 2 Features - Medium Priority:**

5. **Perceptual Image Hashing** (Week 4-5)
   - Add `image_hasher` dependency
   - Implement `scanner/perceptual.rs`
   - Add `--similar-images` flag
   - Document threshold tuning

6. **Progressive/Streaming Results** (Week 5-6)
   - Add `crossbeam-channel` dependency
   - Implement streaming result channel
   - Update TUI for real-time updates
   - Target: First results visible within 5 seconds

7. **Real-Time Monitoring** (Week 6-7)
   - Add `notify` dependency
   - Implement `monitor.rs` module
   - Add `rustdupe monitor` subcommand
   - Target: Background duplicate detection

8. **Fuzzy Text Detection** (Week 7)
   - Add `simhash` and `pdf-extract` dependencies
   - Implement document text extraction
   - Add `--similar-documents` flag
   - Target: PDF and plain text support
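
Perceptual image hashes (item 5) and SimHash fingerprints (item 8) are both typically compared by Hamming distance against a tunable threshold, which is what "threshold tuning" refers to. The sketch below assumes 64-bit fingerprints stored as `u64`; the function names are hypothetical, and the `image_hasher` and `simhash` crates expose their own types and APIs that are not shown here.

```rust
/// Number of bit positions in which two 64-bit fingerprints differ.
pub fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Two fingerprints count as similar when at most `max_distance` bits differ.
/// A lower threshold means fewer false positives but more missed matches.
pub fn is_similar(a: u64, b: u64, max_distance: u32) -> bool {
    hamming_distance(a, b) <= max_distance
}

/// Returns index pairs (i, j) with i < j whose fingerprints are similar.
/// This is a quadratic scan, acceptable for a sketch but not for large sets.
pub fn similar_pairs(fingerprints: &[u64], max_distance: u32) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..fingerprints.len() {
        for j in (i + 1)..fingerprints.len() {
            if is_similar(fingerprints[i], fingerprints[j], max_distance) {
                pairs.push((i, j));
            }
        }
    }
    pairs
}

fn main() {
    let fingerprints = [
        0xFFFF_0000_FFFF_0000u64,
        0xFFFF_0000_FFFF_0003, // differs from the first in 2 bits
        0x0000_FFFF_0000_FFFF, // bitwise inverse of the first
    ];
    // prints [(0, 1)]
    println!("{:?}", similar_pairs(&fingerprints, 4));
}
```

Exposing `max_distance` as a user-facing option is one way to let users trade recall against the false positive rate; the right default would need to be measured on real image and document sets.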

#### Phase 3: Future Consideration

**Tier 3 Features - Deferred:**

9. **Audio Fingerprinting** - Revisit when a pure Rust solution is available
10. **GPU-Accelerated Hashing** - Revisit if the CPU becomes the bottleneck

**Not Recommended:**

11. **Cloud Storage Integration** - Use local sync folders instead
12. **GUI Application** - Maintain TUI focus

### Priority Summary

| Tier | Features | Effort | Value | Timeline |
|------|----------|--------|-------|----------|
| **Tier 1** | Mmap, Bloom, Multi-dir, DB Options | 4-5 weeks | Very High | Immediate |
| **Tier 2** | Perceptual, Streaming, Monitor, Fuzzy | 6-7 weeks | High | Months 2-3 |
| **Tier 3** | Audio, GPU | 10-14 weeks | Medium | Future |
| **Not Rec.** | Cloud, GUI | 14-22 weeks | Low | Never |

### Success Metrics

| Feature | Success Metric |
|---------|----------------|
| Memory-Mapped I/O | 10%+ perf improvement on >100MB files |
| Bloom Filters | 30%+ reduction in unnecessary hashes |
| Multi-Directory | Support 10+ directories seamlessly |
| Database Options | 2x+ cache performance vs SQLite |
| Perceptual Hashing | <5% false positive rate |
| Progressive Results | First results in <5 seconds |
| Real-Time Monitoring | <2 second detection latency |
| Fuzzy Text | Support PDF and plain-text (TXT) formats |

### Final Recommendation

**Proceed with all Tier 1 and Tier 2 features** over the next 3-4 months. These provide significant user value with manageable risk and effort. **Defer Tier 3 and Not Recommended features** pending ecosystem maturity or changed requirements.

---

## Appendix A: Crate Recommendations Summary

### Essential Crates (Add Immediately)

```toml
[dependencies]
memmap2 = "0.9"          # Memory-mapped files
bloom = "0.3"            # Bloom filters
sled = "0.34"            # Alternative cache backend
redb = "2.0"             # Alternative cache backend
```

### Recommended Crates (Add for Phase 2)

```toml
[dependencies]
image_hasher = "2.0"     # Perceptual image hashing
image = "0.25"           # Image decoding
notify = "7.0"           # File system monitoring
crossbeam-channel = "0.5" # Multi-producer channels
simhash = "0.2"          # Document similarity
pdf-extract = "0.7"      # PDF text extraction
```

### Not Recommended Crates

| Crate | Reason |
|-------|--------|
| `chromaprint` | No pure Rust solution, LGPL complexity |
| `rustacuda` | Immature, limited maintenance |
| `reqwest` + `oauth2` | Require async runtime |
| `eframe` / `iced` / `tauri` | GUI not aligned with project goals |

---

## Appendix B: References

### Internal Documents

- `docs/research/advanced-duplicate-finder-features-2025-02-05.md`
- `docs/research/cross-platform-file-management-2026-02-05.md`

### External Resources

1. **BLAKE3 Performance:** https://github.com/BLAKE3-team/BLAKE3
2. **image_hasher Crate:** https://crates.io/crates/image_hasher
3. **notify Crate:** https://docs.rs/notify
4. **sled Database:** https://docs.rs/sled
5. **redb Database:** https://docs.rs/redb
6. **memmap2 Crate:** https://docs.rs/memmap2

---

*Document generated: 2026-02-05*  
*Version: 1.0*  
*Next Review: After Phase 1 completion*