# Technical Feasibility Document
## RustDupe Advanced Features Assessment
> **Document Version:** 1.0
> **Date:** 2026-02-05
> **Project:** RustDupe - Smart Duplicate File Finder
> **Architecture:** Rust 1.85+, synchronous with rayon parallelism
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [Methodology](#2-methodology)
3. [Feasibility Evaluation Matrix](#3-feasibility-evaluation-matrix)
4. [Detailed Feasibility Analysis](#4-detailed-feasibility-analysis)
5. [Architecture Impact Assessment](#5-architecture-impact-assessment)
6. [Dependency Analysis](#6-dependency-analysis)
7. [Risk Assessment Matrix](#7-risk-assessment-matrix)
8. [Recommendations](#8-recommendations)
---
## 1. Executive Summary
This document assesses the technical feasibility of 12 proposed advanced features for RustDupe. Each feature is evaluated against the current synchronous, rayon-based architecture, which uses BLAKE3 hashing, SQLite caching, and a ratatui TUI.
### Key Findings
| Tier | Count | Features |
|------|-------|----------|
| **Tier 1 (Recommended)** | 4 | Memory-Mapped I/O, Bloom Filters, Multi-Directory Scanning, Database Backend Options |
| **Tier 2 (Conditional)** | 4 | Perceptual Image Hashing, Progressive Results, Real-Time Monitoring, Fuzzy Text Detection |
| **Tier 3 (Future Consideration)** | 2 | Audio Fingerprinting, GPU-Accelerated Hashing |
| **Not Recommended** | 2 | Cloud Storage Integration, GUI Application |
### Critical Dependencies
- **Immediate Wins:** `memmap2`, `bloom`, `sled`/`redb` - Low effort, high value
- **Medium Complexity:** `image_hasher`, `notify`, `crossbeam-channel` - Moderate effort, solid ecosystem
- **High Complexity:** `chromaprint` (FFI), `rustacuda` - Significant integration challenges
---
## 2. Methodology
### Evaluation Criteria
Each feature was evaluated using the following framework:
#### Complexity Assessment
- **Low:** Pure Rust crate, straightforward API, minimal architectural changes
- **Medium:** Requires new module, some API design, moderate testing
- **High:** Complex integration, FFI bindings, or significant architectural refactoring
- **Very High:** New paradigm (async/GPU), major subsystem, external dependencies
#### Risk Factors
1. **Technical Risk:** Implementation uncertainty, performance concerns
2. **Integration Risk:** Compatibility with existing sync+rayon architecture
3. **Maintenance Risk:** Crate maintenance status, API stability
4. **License Risk:** Compatibility with MIT license
#### Value Assessment
- **High:** Significant user benefit, competitive advantage
- **Medium:** Nice-to-have feature
- **Low:** Limited use cases or high effort/reward ratio
### Data Sources
- **Research Reports:**
  - `docs/research/advanced-duplicate-finder-features-2025-02-05.md`
  - `docs/research/cross-platform-file-management-2026-02-05.md`
- **Crates.io:** Download statistics, maintenance activity, MSRV compatibility
- **Architecture Review:** Current codebase analysis
- **Benchmark Data:** BLAKE3, perceptual hashing, database performance comparisons
---
## 3. Feasibility Evaluation Matrix
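| Feature | Value | Effort (weeks) | Complexity | Key Dependencies | Recommendation |
|---------|-------|----------------|------------|------------------|----------------|
| 1. Perceptual Image Hashing | Medium | 2-3 | Medium | `image_hasher`, `image` | **Tier 2** |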
| 2. Audio Fingerprinting | Low | 4-6 | High | `chromaprint` (FFI) | **Tier 3** |
| 3. Fuzzy/Near-Duplicate Text | Medium | 2-3 | Medium | `simhash`, `pdf-extract` | **Tier 2** |
| 4. GPU-Accelerated Hashing | Low | 6-8 | High | `rustacuda`, `cust` | **Tier 3** |
| 5. Memory-Mapped File I/O | **High** | **1** | Low | `memmap2` | **Tier 1** |
| 6. Bloom Filters | **High** | **1** | Low | `bloom` | **Tier 1** |
| 7. Real-Time Monitoring | Medium | 2-3 | Medium | `notify` | **Tier 2** |
| 8. Cloud Storage Integration | Low | 6-10 | High | Multiple OAuth clients | **Not Recommended** |
| 9. GUI Application | Low | 8-12 | Very High | `egui`, `iced`, or `tauri` | **Not Recommended** |
| 10. Multi-Directory Scanning | **High** | **1-2** | Low | None (refactoring) | **Tier 1** |
| 11. Database Backend Options | **High** | **2** | Low | `sled`, `redb` | **Tier 1** |
| 12. Progressive/Streaming Results | Medium | 2-3 | Medium | `crossbeam-channel` | **Tier 2** |
---
## 4. Detailed Feasibility Analysis
### Feature 1: Perceptual Image Hashing
**Detect visually similar images despite compression, resizing, or minor edits.**
#### Technical Approach
Implement a new `scanner::perceptual` module that computes perceptual hashes for image files alongside BLAKE3 content hashes. The perceptual hash acts as a secondary grouping mechanism.
```rust
// Proposed API
pub struct PerceptualHasher {
config: HasherConfig,
}
impl PerceptualHasher {
pub fn compute_hash(&self, path: &Path) -> Result<ImageHash, Error> {
let image = image::open(path)?;
let hasher = self.config.to_hasher();
Ok(hasher.hash_image(&image))
}
pub fn is_similar(&self, hash1: &ImageHash, hash2: &ImageHash, threshold: u32) -> bool {
hash1.dist(hash2) <= threshold
}
}
```
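The `is_similar` check above comes down to a Hamming distance between two bit strings. As a minimal sketch of that comparison, assuming an 8x8 hash packed into a `u64` (the real `ImageHash` type exposes its own `dist` method; `hamming_distance` and `similar_pairs` are hypothetical helper names):

```rust
/// Hamming distance between two 64-bit perceptual hashes:
/// the number of bit positions in which they differ.
pub fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Returns index pairs (i, j) with i < j whose hashes differ by at most
/// `threshold` bits. This is quadratic in the number of hashes; a BK-tree
/// would replace the nested loop for large collections.
pub fn similar_pairs(hashes: &[u64], threshold: u32) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if hamming_distance(hashes[i], hashes[j]) <= threshold {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

The pairwise loop is fine for a proof of concept; the `bk-tree` crate listed under Required Crates is the intended route once image counts grow.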
#### Required Crates
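| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `image_hasher` | 2.0 | Perceptual hashing algorithms | MIT | 500K+ |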
| `image` | 0.25 | Image decoding (JPEG, PNG, etc.) | MIT/Apache-2.0 | 100M+ |
| `bk-tree` | 0.5 | Efficient similarity search | MIT | 1M+ |
#### Integration Points
1. **Scanner Module:** Add `perceptual_hash` field to `FileEntry`
2. **Duplicate Finder:** Extend grouping logic to support perceptual similarity
3. **CLI:** Add `--similar-images` flag with threshold parameter
4. **Cache:** Add perceptual hash column to SQLite schema
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| Image decoding failures | Medium | Graceful error handling, skip corrupted images |
| Memory usage for large images | Medium | Limit image size, use streaming when possible |
| False positive rate | Medium | Configurable threshold (default: dHash with threshold=2) |
| Performance on large image sets | Low | Parallelize with rayon (existing infrastructure) |
#### Effort Estimate Breakdown
- **Core implementation:** 3-4 days
- **CLI integration:** 1-2 days
- **Cache integration:** 2-3 days
- **Testing (unit + integration):** 2-3 days
- **Documentation:** 1 day
- **Total:** 2-3 weeks
#### Prototype/POC Suggestion
```rust
// 1-day proof of concept
use image_hasher::{HashAlg, HasherConfig};
use std::path::Path;
fn demo_perceptual_hash(image_path: &Path) {
let hasher = HasherConfig::new()
.hash_alg(HashAlg::Gradient) // dHash variant
.hash_size(8, 8)
.to_hasher();
let image = image::open(image_path).unwrap();
let hash = hasher.hash_image(&image);
println!("Hash: {:?}", hash.to_base64());
}
```
#### Go/No-Go Recommendation
🟡 **GO with conditions:**
- Start with dHash algorithm only (simpler than pHash)
- Require explicit opt-in via flag
- Document performance implications
- Consider as "beta" feature initially
---
### Feature 2: Audio Fingerprinting
**Detect duplicate music files even across different encodings or bitrates.**
#### Technical Approach
Use Chromaprint (AcoustID) via FFI to compute audio fingerprints. This requires external C library or CLI tool integration.
```rust
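use std::path::Path;
use std::process::Command;

// Two approaches:
// Option A: FFI bindings (complex); signature elided
extern "C" {
    fn chromaprint_decode_fingerprint(...);
}

// Option B: CLI wrapper (simpler); requires `fpcalc` on PATH
fn get_audio_fingerprint(path: &Path) -> Result<String, Box<dyn std::error::Error>> {
    // Pass the path as an OsStr so non-UTF-8 file names cannot panic.
    let output = Command::new("fpcalc").arg("-json").arg(path).output()?;
    if !output.status.success() {
        return Err(format!("fpcalc failed: {}", output.status).into());
    }
    let json: serde_json::Value = serde_json::from_slice(&output.stdout)?;
    json["fingerprint"]
        .as_str()
        .map(str::to_owned)
        .ok_or_else(|| "fpcalc output has no `fingerprint` field".into())
}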
```
#### Required Crates
| Crate | Version | Purpose | License | Notes |
|-------|---------|---------|---------|-------|
| `chromaprint` | N/A | Audio fingerprinting | LGPL | No pure Rust crate available |
| `chromaprint-sys` | N/A | FFI bindings | - | Would need to create/maintain |
| `serde_json` | 1.x | Parse fpcalc output | MIT/Apache | Already in dependencies |
#### Integration Points
1. **New Module:** `scanner::audio` for audio-specific processing
2. **Feature Flag:** `audio` (optional compilation)
3. **Dependency:** Runtime dependency on `fpcalc` binary
4. **Cache:** New column for audio fingerprint
#### Risks and Challenges
| Risk | Severity | Mitigation / Notes |
|------|----------|--------------------|
| No mature Rust crate | High | Requires FFI or external binary |
| LGPL licensing | Medium | Chromaprint is LGPL (compatible but complex) |
| External binary dependency | High | Users must install fpcalc |
| Cross-platform distribution | High | Complex packaging requirements |
| Performance (decoding audio) | Medium | ~100ms per file, needs parallelization |
#### Effort Estimate Breakdown
- **FFI investigation:** 3-5 days
- **Chromaprint integration:** 5-7 days
- **Cross-platform build setup:** 3-5 days
- **Testing across platforms:** 3-5 days
- **Documentation/packaging:** 2-3 days
- **Total:** 4-6 weeks
#### Go/No-Go Recommendation
🔴 **NO-GO for now:**
**Rationale:**
- No pure Rust solution available
- Requires external C library integration
- LGPL licensing complications
- High maintenance burden
**Alternative:** Document workaround using external tools (e.g., run `fpcalc` separately, import results)
---
### Feature 3: Fuzzy/Near-Duplicate Text Detection
**Detect similar documents using text-based similarity algorithms.**
#### Technical Approach
Extract text from documents (PDF, DOCX, TXT) and compute SimHash or MinHash fingerprints for near-duplicate detection.
```rust
pub struct DocumentHasher {
algorithm: DocumentHashAlgorithm,
}
pub enum DocumentHashAlgorithm {
SimHash, // 64-bit fingerprint, Hamming distance
MinHash, // Set similarity, Jaccard index
}
impl DocumentHasher {
pub fn hash_document(&self, path: &Path) -> Result<DocumentHash> {
let text = extract_text(path)?;
match self.algorithm {
DocumentHashAlgorithm::SimHash => self.simhash(&text),
DocumentHashAlgorithm::MinHash => self.minhash(&text),
}
}
}
```
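To make the SimHash option concrete, here is a minimal std-only sketch of a 64-bit SimHash over word shingles. It is illustrative rather than the `simhash` crate's API: `simhash`, `simhash_distance`, and the shingle length `k` are hypothetical names, and `DefaultHasher` is used only for brevity (its output is not guaranteed stable across Rust releases, so a cached fingerprint would need a fixed hash function).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes one feature (a word shingle) to 64 bits.
fn feature_hash(feature: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    feature.hash(&mut h);
    h.finish()
}

/// 64-bit SimHash over overlapping word shingles of length `k`.
/// Each shingle votes +1 or -1 on every bit; the sign of the total
/// decides the corresponding bit of the fingerprint.
pub fn simhash(text: &str, k: usize) -> u64 {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return 0;
    }
    let k = k.clamp(1, words.len());
    let mut weights = [0i32; 64];
    for shingle in words.windows(k) {
        let h = feature_hash(shingle);
        for (bit, w) in weights.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *w += 1;
            } else {
                *w -= 1;
            }
        }
    }
    let mut fingerprint = 0u64;
    for (bit, w) in weights.iter().enumerate() {
        if *w > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of differing bits between two fingerprints (0 = identical).
pub fn simhash_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Two documents would then count as near-duplicates when `simhash_distance` falls at or below the configurable Hamming threshold mentioned under Integration Points.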
#### Required Crates
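| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `simhash` | 0.2 | SimHash implementation | MIT | 100K+ |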
| `pdf-extract` | 0.7 | PDF text extraction | MIT | 200K+ |
| `docx-rs` | 0.4 | DOCX parsing | MIT | 300K+ |
| (std) | - | Shingling via `slice::windows` over tokens; no extra crate needed (`loole` is a channel crate) | - | - |
#### Integration Points
1. **New Module:** `scanner::document`
2. **Text Extraction Pipeline:** PDF → DOCX → Plain text
3. **Similarity Threshold:** Configurable Hamming distance
4. **CLI:** `--similar-documents` flag
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| Text extraction failures | Medium | Fallback to content hash |
| Document format support | Medium | Start with PDF/TXT, add DOCX later |
| Unicode normalization | Low | Use existing `unicode-normalization` crate |
| Performance | Low | Rayon parallelization |
#### Effort Estimate Breakdown
- **Text extraction framework:** 3-4 days
- **SimHash implementation:** 2-3 days
- **Similarity matching:** 2-3 days
- **CLI integration:** 1-2 days
- **Testing:** 2-3 days
- **Total:** 2-3 weeks
#### Go/No-Go Recommendation
🟡 **GO with scope limitation:**
**Recommended scope:**
- Start with plain text and PDF only
- Use SimHash (simpler than MinHash)
- Require opt-in via `--similar-documents`
- Document limitations
---
### Feature 4: GPU-Accelerated Hashing
**Use CUDA/OpenCL to accelerate hash computation for large files.**
#### Technical Approach
Offload BLAKE3 hashing to GPU for large files. However, BLAKE3 is already extremely fast on CPU with SIMD (~8.4 GB/s single-thread).
```rust
#[cfg(feature = "cuda")]
pub struct GpuHasher {
context: cuda::Context,
}
impl GpuHasher {
pub fn hash_file(&self, path: &Path) -> Result<Hash> {
// Transfer file to GPU memory
// Launch BLAKE3 kernel
// Retrieve result
}
}
```
#### Required Crates
| Crate | Version | Purpose | License | Notes |
|-------|---------|---------|---------|-------|
| `rustacuda` | 0.2 | CUDA bindings | MIT/Apache | Limited maintenance |
| `cust` | 0.3 | Safe CUDA wrapper | MIT/Apache | Newer alternative |
| `opencl3` | 0.9 | OpenCL bindings | MIT/Apache | Cross-platform |
#### Integration Points
1. **New Module:** `scanner::gpu`
2. **Feature Flag:** `cuda` or `opencl`
3. **Runtime Detection:** Check GPU availability
4. **Fallback:** CPU hashing if GPU unavailable
#### Risks and Challenges
| Risk | Severity | Mitigation / Notes |
|------|----------|--------------------|
| Limited benefit | Very High | BLAKE3 @ 8.4 GB/s already saturates most storage |
| Cross-platform complexity | High | Different GPU vendors, drivers |
| Build complexity | High | Requires CUDA toolkit for compilation |
| Runtime dependencies | High | Users need GPU drivers |
| Maintenance | High | GPU crate ecosystem is immature |
#### Performance Reality Check
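| Storage Scenario | CPU Throughput | GPU Throughput | Benefit |
|------------------|----------------|----------------|---------|
| NVMe SSD (3.5 GB/s) | 8.4 GB/s | Limited by storage | Minimal |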
| RAM disk (10 GB/s) | 8.4 GB/s | ~50 GB/s | Moderate |
| Network storage | 1 Gbps | Not applicable | None |
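
The table follows from a simple bottleneck model: end-to-end throughput is the minimum of storage and hasher throughput. A small sketch of that arithmetic, using the figures from the table (function names are hypothetical, and the model ignores host-to-device transfer cost, which would only reduce the GPU benefit further):

```rust
/// End-to-end hashing throughput is capped by the slower of storage and hasher.
pub fn effective_gbps(storage_gbps: f64, hasher_gbps: f64) -> f64 {
    storage_gbps.min(hasher_gbps)
}

/// Speedup from swapping the CPU hasher for a GPU hasher on the same storage.
pub fn gpu_speedup(storage_gbps: f64, cpu_gbps: f64, gpu_gbps: f64) -> f64 {
    effective_gbps(storage_gbps, gpu_gbps) / effective_gbps(storage_gbps, cpu_gbps)
}
```

With an NVMe SSD at 3.5 GB/s the speedup is exactly 1.0, because storage is the bottleneck in both cases. On a 10 GB/s RAM disk it is 10 / 8.4, roughly 1.19x, which is the "Moderate" row.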
#### Effort Estimate Breakdown
- **CUDA research/setup:** 5-7 days
- **BLAKE3 GPU kernel:** 5-7 days
- **Integration with scanner:** 3-5 days
- **Cross-platform testing:** 5-7 days
- **Documentation:** 2-3 days
- **Total:** 6-8 weeks
#### Go/No-Go Recommendation
🔴 **NO-GO:**
**Rationale:**
- BLAKE3 on CPU is already faster than most storage
- Diminishing returns for typical use cases
- High complexity, immature Rust GPU ecosystem
- Maintenance burden not justified by benefit
---
### Feature 5: Memory-Mapped File I/O
**Use memory-mapped files for improved I/O performance.**
#### Technical Approach
Replace buffered file reading with memory-mapped I/O for large files. BLAKE3 supports `update_mmap` for this purpose.
```rust
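use memmap2::Mmap;
use std::fs::File;
use std::path::Path;

pub fn hash_file_mmap(path: &Path) -> Result<Hash> {
    let file = File::open(path)?;
    // SAFETY: the mapping is only sound while no other process truncates or
    // rewrites the file; see "File modification during hash" below.
    let mmap = unsafe { Mmap::map(&file)? };
    let mut hasher = blake3::Hasher::new();
    hasher.update(&mmap);
    Ok(*hasher.finalize().as_bytes())
}

// Even simpler with BLAKE3's built-in support
// (requires the `mmap` and `rayon` features of the `blake3` crate)
pub fn hash_file_mmap_rayon(path: &Path) -> Result<Hash> {
    let mut hasher = blake3::Hasher::new();
    hasher.update_mmap_rayon(path)?; // Memory-mapped + parallel
    Ok(*hasher.finalize().as_bytes())
}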
```
#### Required Crates
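| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `memmap2` | 0.9 | Memory-mapped files | MIT/Apache | 200M+ |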
#### Integration Points
1. **Scanner Module:** Add `full_hash_mmap` method to `Hasher`
2. **Configuration:** Add `--use-mmap` CLI flag
3. **Heuristics:** Use mmap for files > 64KB (avoid overhead for small files)
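
The heuristic can be isolated in one small decision function. The sketch below assumes the 64 KB threshold from the list above and the opt-in `--use-mmap` flag; `ReadStrategy`, `MMAP_THRESHOLD_BYTES`, and `choose_strategy` are hypothetical names, and how `on_network_fs` is detected is left open.

```rust
/// How a file's contents are fed to the hasher.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ReadStrategy {
    Buffered,
    Mmap,
}

/// Threshold from the heuristic above (hypothetical constant name).
pub const MMAP_THRESHOLD_BYTES: u64 = 64 * 1024;

/// Picks mmap only when the user opted in, the file is larger than the
/// threshold, and the file is not on a network filesystem.
pub fn choose_strategy(file_len: u64, use_mmap: bool, on_network_fs: bool) -> ReadStrategy {
    if use_mmap && !on_network_fs && file_len > MMAP_THRESHOLD_BYTES {
        ReadStrategy::Mmap
    } else {
        ReadStrategy::Buffered
    }
}
```

Keeping the decision separate from the hashing code makes the threshold easy to benchmark and tune.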
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| `unsafe` code requirement | Low | Well-established crate, minimal unsafe block |
| Network filesystem issues | Medium | Detect and fallback to buffered I/O |
| Small file overhead | Low | Only use mmap for files > threshold |
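| File modification during hash | Low | The OS does not guarantee consistency: on Unix, truncation by another process can raise SIGBUS on a live mapping. Re-check size and mtime after hashing and fall back to buffered I/O for files that change |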
#### Effort Estimate Breakdown
- **Implementation:** 1-2 days
- **CLI integration:** 0.5 days
- **Testing:** 1-2 days
- **Documentation:** 0.5 days
- **Total:** 1 week
#### Go/No-Go Recommendation
🟢 **GO - High Priority:**
**Rationale:**
- Minimal effort (1 week)
- Well-established crate (`memmap2`)
- Potential performance improvement for large files
- BLAKE3 has built-in support
- Low risk
---
### Feature 6: Bloom Filters
**Use probabilistic data structures for quick rejection of non-duplicates.**
#### Technical Approach
Use Bloom filters to quickly determine if a file size or prehash has been seen before, avoiding unnecessary full hash computation.
```rust
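use bloom::{BloomFilter, ASMS}; // ASMS provides `insert` and `contains`

pub struct DuplicateDetector {
    size_filter: BloomFilter,    // File sizes seen
    prehash_filter: BloomFilter, // Prehashes seen
}

impl DuplicateDetector {
    // `&mut self` because every call records the entry in both filters.
    pub fn might_be_duplicate(&mut self, entry: &FileEntry, prehash: &Hash) -> bool {
        let size_seen = self.size_filter.contains(&entry.size);
        let prehash_seen = self.prehash_filter.contains(prehash);
        // Always record both values; returning early on a size miss would
        // leave the first file's prehash unrecorded and hide its duplicates.
        self.size_filter.insert(&entry.size);
        self.prehash_filter.insert(prehash);
        // false: no earlier file matches (Bloom filters have no false negatives).
        // true: possibly a duplicate of an earlier file, needs full hash.
        // The earlier file of a pair returned false when it was seen, so it
        // must be revisited (e.g. in a second pass) once a later file matches.
        size_seen && prehash_seen
    }
}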
```
#### Required Crates
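| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `bloom` | 0.3 | Bloom filter | MIT | 5M+ |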
| `growable-bloom-filter` | 2.0 | Auto-resizing variant | MIT | 100K+ |
#### Integration Points
1. **New Module:** `duplicates::bloom` or extend `finder.rs`
2. **Configuration:** False positive rate (default: 1%)
3. **Memory Management:** Estimate filter size based on expected file count
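
Filter size can be estimated up front from the standard Bloom filter formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A minimal sketch of that estimate (`BloomParams` and `bloom_params` are hypothetical names, not part of the `bloom` crate):

```rust
/// Sizing for a standard Bloom filter.
#[derive(Debug, PartialEq, Eq)]
pub struct BloomParams {
    pub bits: u64,
    pub hash_functions: u32,
}

/// Computes m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash
/// functions for `expected_items` (n) and a target false positive rate
/// `fp_rate` (p, with 0 < p < 1).
pub fn bloom_params(expected_items: u64, fp_rate: f64) -> BloomParams {
    assert!(expected_items > 0 && fp_rate > 0.0 && fp_rate < 1.0);
    let n = expected_items as f64;
    let ln2 = std::f64::consts::LN_2;
    let bits = (-n * fp_rate.ln() / (ln2 * ln2)).ceil();
    let k = ((bits / n) * ln2).round().max(1.0);
    BloomParams {
        bits: bits as u64,
        hash_functions: k as u32,
    }
}
```

For one million files at the default 1% rate this gives about 9.6 million bits (roughly 1.2 MB) and 7 hash functions, so memory stays small even for large scans.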
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| Memory usage | Low | Configurable FP rate trades memory for accuracy |
| False positives | Low | 1% FP rate acceptable (full hash verifies) |
| Filter sizing | Low | Can use growable variant |
#### Effort Estimate Breakdown
- **Implementation:** 2-3 days
- **Integration:** 1 day
- **Testing:** 1-2 days
- **Documentation:** 0.5 days
- **Total:** 1 week
#### Go/No-Go Recommendation
🟢 **GO - High Priority:**
**Rationale:**
- Very low effort
- Significant performance improvement for large scans
- Reduces unnecessary hash computation
- Well-established algorithm and crates
---
### Feature 7: Real-Time Monitoring
**Watch directories for changes and detect new duplicates automatically.**
#### Technical Approach
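Use the `notify` crate for cross-platform filesystem monitoring. Since `notify` 5.0 the core crate delivers raw events; debouncing is provided by the companion crates `notify-debouncer-mini` and `notify-debouncer-full`.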
```rust
use notify::{recommended_watcher, Event, EventKind, RecommendedWatcher, RecursiveMode, Watcher};
use std::path::{Path, PathBuf};
use std::sync::mpsc::{channel, Receiver};

pub struct DirectoryMonitor {
    // Held so the watcher is not dropped while events are being received.
    _watcher: RecommendedWatcher,
    rx: Receiver<notify::Result<Event>>,
}

impl DirectoryMonitor {
    pub fn new(paths: &[PathBuf]) -> notify::Result<Self> {
        let (tx, rx) = channel();
        // notify 5+ API: raw events are sent to `tx`; debouncing lives in
        // the separate notify-debouncer-* crates.
        let mut watcher = recommended_watcher(tx)?;
        for path in paths {
            watcher.watch(path, RecursiveMode::Recursive)?;
        }
        Ok(Self { _watcher: watcher, rx })
    }

    pub fn run(&self, mut callback: impl FnMut(&Path)) -> notify::Result<()> {
        for event in &self.rx {
            let event = event?;
            match event.kind {
                EventKind::Create(_) | EventKind::Modify(_) => {
                    for path in &event.paths {
                        callback(path);
                    }
                }
                EventKind::Remove(_) => {
                    // Invalidate cache entries for `event.paths`
                }
                _ => {}
            }
        }
        Ok(())
    }
}
```
#### Required Crates
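| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `notify` | 7.0 | Cross-platform file watching | CC0-1.0 | 50M+ |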
#### Integration Points
1. **New Module:** `monitor.rs` at crate root
2. **CLI:** `rustdupe monitor <paths...>` subcommand
3. **Session Integration:** Resume from cache on startup
4. **TUI Integration:** Real-time duplicate notifications
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| Platform differences | Medium | `notify` abstracts most differences |
| High event volume | Medium | Debounce events (companion crate `notify-debouncer-full`, or a small in-process debouncer) |
| Recursive watching limits | Low | Document OS-specific limits |
| Battery impact (laptops) | Low | Only watch when explicitly requested |
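
The high-event-volume mitigation relies on debouncing: a burst of events for one path should trigger a single rescan. The sketch below shows the idea independently of any watcher crate, with timestamps passed in explicitly so it can be tested; `Debouncer` and its methods are hypothetical names, not the `notify` API.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Coalesces bursts of events per path: a path becomes ready once no new
/// event has arrived for it within the `quiet` period.
pub struct Debouncer {
    quiet: Duration,
    last_seen: HashMap<PathBuf, Instant>,
}

impl Debouncer {
    pub fn new(quiet: Duration) -> Self {
        Self { quiet, last_seen: HashMap::new() }
    }

    /// Records an event for `path` observed at time `now`.
    pub fn record(&mut self, path: &Path, now: Instant) {
        self.last_seen.insert(path.to_path_buf(), now);
    }

    /// Removes and returns (sorted) every path that has been quiet for at
    /// least the configured period.
    pub fn drain_ready(&mut self, now: Instant) -> Vec<PathBuf> {
        let quiet = self.quiet;
        let mut ready: Vec<PathBuf> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) >= quiet)
            .map(|(path, _)| path.clone())
            .collect();
        for path in &ready {
            self.last_seen.remove(path);
        }
        ready.sort();
        ready
    }
}
```

A monitor loop would call `record` for every incoming event and `drain_ready` on a timer, rescanning only the returned paths.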
#### Effort Estimate Breakdown
- **Core monitoring:** 2-3 days
- **Integration with scanner:** 2-3 days
- **CLI subcommand:** 1 day
- **Testing across platforms:** 2-3 days
- **Documentation:** 1 day
- **Total:** 2-3 weeks
#### Go/No-Go Recommendation
🟡 **GO - Medium Priority:**
**Rationale:**
- Well-established crate
- Useful for ongoing duplicate management
- Moderate effort
- Consider as background service feature
---
### Feature 8: Cloud Storage Integration
**Direct integration with Dropbox, OneDrive, Google Drive APIs.**
#### Technical Approach
Access cloud storage APIs to scan files without downloading locally.
```rust
pub enum CloudProvider {
Dropbox,
GoogleDrive,
OneDrive,
}
pub struct CloudScanner {
provider: CloudProvider,
client: Box<dyn CloudClient>,
}
#[async_trait]
pub trait CloudClient {
async fn list_files(&self, path: &str) -> Result<Vec<CloudFile>>;
async fn get_hash(&self, file_id: &str) -> Result<Option<String>>;
}
```
#### Required Crates
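| Crate | Version | Purpose | License | Complexity |
|-------|---------|---------|---------|------------|
| `reqwest` | 0.12 | HTTP client | MIT/Apache | High |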
| `oauth2` | 4.4 | OAuth authentication | MIT/Apache | High |
| `serde_json` | 1.x | API response parsing | MIT/Apache | Already have |
| Various SDKs | - | Provider-specific APIs | Varies | Very High |
#### Integration Points
1. **New Module:** `cloud.rs` with provider submodules
2. **Authentication:** OAuth flow for each provider
3. **Async Runtime:** Requires `tokio` (conflicts with sync architecture)
4. **Cache:** Store cloud file metadata
#### Risks and Challenges
| Risk | Severity | Mitigation / Notes |
|------|----------|--------------------|
| Requires async runtime | Critical | Conflicts with sync+rayon architecture |
| OAuth complexity | High | User authentication flow |
| API rate limits | High | Throttling, quotas |
| Provider API changes | Medium | Breaking changes, maintenance |
| Security considerations | High | Token storage, encryption |
| Scope explosion | High | 3+ providers with different APIs |
#### Effort Estimate Breakdown
- **Architecture changes (async):** 2-3 weeks
- **OAuth implementation:** 1-2 weeks
- **Dropbox integration:** 1 week
- **Google Drive integration:** 1-2 weeks
- **OneDrive integration:** 1-2 weeks
- **Testing:** 1-2 weeks
- **Documentation:** 3-5 days
- **Total:** 6-10 weeks
#### Go/No-Go Recommendation
🔴 **NO-GO:**
**Rationale:**
- Requires async runtime (major architectural change)
- High complexity across multiple providers
- Rate limiting and API maintenance burden
- Alternative exists: scan local sync folders
**Alternative:**
- Document how to scan Dropbox/Google Drive local folders
- Users get same result with simpler implementation
---
### Feature 9: GUI Application
**Native GUI alongside TUI using egui, iced, or tauri.**
#### Technical Approach
Create a separate GUI binary or make TUI/GUI interchangeable.
**Option A: egui (Immediate Mode)**
```rust
use eframe::egui;
pub struct RustDupeApp {
scan_state: ScanState,
duplicates: Vec<DuplicateGroup>,
}
impl eframe::App for RustDupeApp {
fn update(&mut self, ctx: &egui::Context, _frame: &mut eframe::Frame) {
egui::CentralPanel::default().show(ctx, |ui| {
ui.heading("RustDupe");
// GUI implementation
});
}
}
```
**Option B: Tauri (Web-based)**
- React/Vue frontend
- Rust backend via Tauri commands
#### Required Crates
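| Crate | Version | Description | License | Binary Size Impact |
|-------|---------|-------------|---------|--------------------|
| `eframe` | 0.29 | egui framework | MIT/Apache | +2-5 MB |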
| `iced` | 0.12 | Elm architecture | MIT | +3-6 MB |
| `tauri` | 2.0 | Web-based GUI | MIT/Apache | +5-15 MB |
| `dioxus` | 0.5 | React-like framework | MIT/Apache | +3-8 MB |
#### Integration Points
1. **Separate Binary:** `rustdupe-gui` crate
2. **Shared Library:** Core logic in `rustdupe-lib`
3. **UI Abstraction:** Common interface for TUI and GUI
#### Risks and Challenges
| Risk | Severity | Mitigation / Notes |
|------|----------|--------------------|
| Binary size increase | High | +50-200% size increase |
| Maintenance burden | Very High | Two UIs to maintain |
| Architecture complexity | High | Shared core abstraction |
| Feature parity | Medium | GUI may lag behind TUI |
| Cross-platform testing | High | GUI behavior varies by OS |
#### Effort Estimate Breakdown
- **Architecture refactoring:** 1-2 weeks
- **GUI framework setup:** 1 week
- **Core GUI implementation:** 3-4 weeks
- **Feature parity with TUI:** 2-3 weeks
- **Testing across platforms:** 2-3 weeks
- **Documentation:** 1 week
- **Total:** 8-12 weeks
#### Go/No-Go Recommendation
🔴 **NO-GO:**
**Rationale:**
- TUI is already excellent (ratatui is mature)
- High effort, high maintenance
- Binary size increase
- Would split development focus
**Alternative:**
- Improve TUI further (already planned)
- Consider GUI as separate future project
---
### Feature 10: Multi-Directory Scanning
**Scan multiple root directories simultaneously.**
#### Technical Approach
Extend CLI and scanner to accept multiple paths, merging results.
```rust
use rayon::prelude::*; // brings `par_iter` into scope

// Current
pub fn scan(path: &Path, config: &ScanConfig) -> Result<Vec<FileEntry>>;
// Proposed
pub fn scan_multiple(paths: &[PathBuf], config: &ScanConfig) -> Result<ScanResult> {
let all_entries: Vec<FileEntry> = paths
.par_iter() // rayon parallel iteration
.map(|path| scan_single(path, config))
.collect::<Result<Vec<_>>>()?
.into_iter()
.flatten()
.collect();
// Deduplicate across all paths
find_duplicates(&all_entries)
}
```
#### Required Crates
- **None** - Uses existing `rayon` infrastructure
#### Integration Points
1. **CLI:** Accept multiple `<PATH>` arguments
2. **Scanner:** Parallelize across directories
3. **Progress:** Aggregate progress from all paths
4. **Cache:** Per-directory or unified cache
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| Cross-device duplicates | Low | Handle different filesystems |
| Path conflicts | Low | Use canonical paths |
| Memory usage | Low | Stream results, don't buffer all |
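
Path conflicts matter because overlapping roots (for example a directory and one of its subdirectories) would walk the same files twice and report each file as a duplicate of itself. A minimal sketch of pruning such roots before scanning, assuming the paths have already been canonicalized (e.g. with `std::fs::canonicalize`); `prune_nested_roots` is a hypothetical helper name:

```rust
use std::path::PathBuf;

/// Drops every root that is equal to, or nested inside, another root, so
/// that no file is walked twice. Paths are assumed to be canonical already;
/// this sketch does not touch the filesystem.
pub fn prune_nested_roots(mut roots: Vec<PathBuf>) -> Vec<PathBuf> {
    // Sorting places a parent before its descendants.
    roots.sort();
    let mut kept: Vec<PathBuf> = Vec::new();
    for root in roots {
        // `Path::starts_with` compares whole components, so "/data2" is
        // not treated as being inside "/data".
        if !kept.iter().any(|k| root.starts_with(k)) {
            kept.push(root);
        }
    }
    kept
}
```

The pruned list can then be handed to the parallel scan, which keeps the cross-path deduplication step free of self-matches.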
#### Effort Estimate Breakdown
- **CLI changes:** 1-2 days
- **Scanner refactoring:** 2-3 days
- **Progress integration:** 1-2 days
- **Testing:** 1-2 days
- **Documentation:** 0.5 days
- **Total:** 1-2 weeks
#### Go/No-Go Recommendation
🟢 **GO - High Priority:**
**Rationale:**
- Very low effort
- High user value
- Natural extension of existing architecture
- No new dependencies
---
### Feature 11: Database Backend Options
**Support sled or redb as alternatives to SQLite.**
#### Technical Approach
Abstract cache backend to support multiple database implementations.
```rust
// `Send + Sync` so a boxed backend can be shared across rayon worker threads.
pub trait CacheBackend: Send + Sync {
fn get(&self, key: &CacheKey) -> Result<Option<CacheEntry>>;
fn put(&self, key: &CacheKey, entry: &CacheEntry) -> Result<()>;
fn delete(&self, key: &CacheKey) -> Result<()>;
}
pub enum CacheBackendType {
Sqlite, // Current
Sled, // Pure Rust key-value
Redb, // Rust-native B-tree
}
// Usage
pub struct HashCache {
backend: Box<dyn CacheBackend>,
}
```
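One benefit of the trait is that backends become interchangeable in tests and in migration code. The following std-only sketch shows an in-memory backend and a key-by-key copy between two backends. `CacheKey`, `CacheEntry`, and `CacheResult` are hypothetical stand-ins for the real cache types, and because the proposed trait has no iteration method, `migrate` takes the list of keys as an argument.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-ins for the real cache types.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey(pub String);

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CacheEntry {
    pub size: u64,
    pub hash: [u8; 32],
}

pub type CacheResult<T> = Result<T, String>;

pub trait CacheBackend: Send + Sync {
    fn get(&self, key: &CacheKey) -> CacheResult<Option<CacheEntry>>;
    fn put(&self, key: &CacheKey, entry: &CacheEntry) -> CacheResult<()>;
    fn delete(&self, key: &CacheKey) -> CacheResult<()>;
}

/// In-memory backend, usable as a test double for SQLite, sled, or redb.
#[derive(Default)]
pub struct MemoryBackend {
    entries: Mutex<HashMap<CacheKey, CacheEntry>>,
}

impl CacheBackend for MemoryBackend {
    fn get(&self, key: &CacheKey) -> CacheResult<Option<CacheEntry>> {
        let map = self.entries.lock().map_err(|e| e.to_string())?;
        Ok(map.get(key).cloned())
    }
    fn put(&self, key: &CacheKey, entry: &CacheEntry) -> CacheResult<()> {
        let mut map = self.entries.lock().map_err(|e| e.to_string())?;
        map.insert(key.clone(), entry.clone());
        Ok(())
    }
    fn delete(&self, key: &CacheKey) -> CacheResult<()> {
        let mut map = self.entries.lock().map_err(|e| e.to_string())?;
        map.remove(key);
        Ok(())
    }
}

/// Copies the given keys from one backend to another and returns how many
/// entries were found and copied.
pub fn migrate(
    from: &dyn CacheBackend,
    to: &dyn CacheBackend,
    keys: &[CacheKey],
) -> CacheResult<usize> {
    let mut copied = 0;
    for key in keys {
        if let Some(entry) = from.get(key)? {
            to.put(key, &entry)?;
            copied += 1;
        }
    }
    Ok(copied)
}
```

A real export/import tool would likely need an additional method for enumerating keys, which is a design decision the trait above leaves open.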
#### Required Crates
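| Crate | Version | Description | License | Performance |
|-------|---------|-------------|---------|-------------|
| `sled` | 0.34 | Pure Rust key-value | MIT/Apache | Very Fast |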
| `redb` | 2.0 | Rust-native B-tree | MIT/Apache | Fast |
| `rusqlite` | 0.31 | Current SQLite | MIT | Good |
#### Comparison
| Property | SQLite | sled | redb |
|----------|--------|------|------|
| Pure Rust | No (C) | Yes | Yes |
| Performance | Good | Excellent | Very Good |
| Binary Size | +500KB | +300KB | +400KB |
| ACID | Yes | Yes | Yes |
| Maintenance | Excellent | Good | Good |
#### Integration Points
1. **Cache Module:** Refactor to trait-based backend
2. **Configuration:** `--cache-backend` option
3. **Migration:** Export/import between backends
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| Data migration | Medium | Provide export/import tools |
| API differences | Low | Abstract with trait |
| Testing matrix | Medium | Test all backends in CI |
#### Effort Estimate Breakdown
- **Trait abstraction:** 2-3 days
- **Sled implementation:** 2-3 days
- **Redb implementation:** 2-3 days
- **Migration tools:** 2-3 days
- **Testing:** 2-3 days
- **Documentation:** 1-2 days
- **Total:** 2 weeks
#### Go/No-Go Recommendation
🟢 **GO - High Priority:**
**Rationale:**
- Low to medium effort
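- Makes a pure Rust build possible (the SQLite C dependency becomes optional)
- Potentially better cache performance with sled/redb, to be confirmed by the planned benchmark against SQLite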
- Educational value for Rust ecosystem
---
### Feature 12: Progressive/Streaming Results
**Show duplicate results as they're found instead of waiting for completion.**
#### Technical Approach
Use channels to stream results from scanner to UI in real-time.
```rust
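use crossbeam_channel::{bounded, Sender};
use std::path::PathBuf;
use std::thread;

pub struct StreamingScanner {
    result_tx: Sender<DuplicateGroup>,
}

impl StreamingScanner {
    pub fn scan_streaming(&self, paths: &[PathBuf]) -> Result<()> {
        // Bounded so a slow consumer applies backpressure
        // (the capacity is a placeholder value).
        let (file_tx, file_rx) = bounded(1024);
        // A spawned thread cannot borrow `paths`, so give it an owned copy.
        let paths = paths.to_vec();
        // Spawn file walker thread
        thread::spawn(move || {
            for file in walk_files(&paths) {
                // Stop walking once the receiver has been dropped.
                if file_tx.send(file).is_err() {
                    break;
                }
            }
        });
        // Process and stream results
        for file in file_rx {
            if let Some(duplicate) = self.process_file(file) {
                // The UI side has gone away; stop scanning.
                if self.result_tx.send(duplicate).is_err() {
                    break;
                }
            }
        }
        Ok(())
    }
}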
```
#### Required Crates
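| Crate | Version | Purpose | License | Downloads |
|-------|---------|---------|---------|-----------|
| `crossbeam-channel` | 0.5 | Multi-producer multi-consumer channels | MIT/Apache | 200M+ |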
#### Integration Points
1. **Scanner:** Streaming result channel
2. **TUI:** Real-time result updates
3. **Output:** Stream JSON/CSV as results arrive
4. **Progress:** Merge with result streaming
#### Risks and Challenges
| Risk | Severity | Mitigation |
|------|----------|------------|
| UI complexity | Medium | Ratatui supports dynamic updates |
| Ordering guarantees | Low | Accept eventual consistency |
| Memory pressure | Low | Bounded channels |
| Error handling | Medium | Send errors through channel |
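
Two of these mitigations, bounded channels and sending errors through the channel, can be shown together. The sketch below uses the standard library's `sync_channel` as a stand-in (`crossbeam_channel::bounded` behaves similarly); `ScanItem` and `stream_results` are hypothetical names, and the "empty path" error merely simulates a per-file failure.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical item type: a result or an error for one input.
pub type ScanItem = Result<String, String>;

/// Runs a producer on its own thread and streams items through a bounded
/// channel. `capacity` caps how many items may be queued, so a slow
/// consumer applies backpressure instead of letting memory grow.
pub fn stream_results(inputs: Vec<String>, capacity: usize) -> Vec<ScanItem> {
    let (tx, rx) = sync_channel::<ScanItem>(capacity);
    let producer = thread::spawn(move || {
        for input in inputs {
            // Errors travel through the same channel as results.
            let item = if input.is_empty() {
                Err("empty path".to_string())
            } else {
                Ok(input)
            };
            // A send error means the receiver is gone; stop producing.
            if tx.send(item).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the consumer loop below.
    });
    // In the TUI, this loop would update the view as each item arrives.
    let received: Vec<ScanItem> = rx.iter().collect();
    producer.join().expect("producer thread panicked");
    received
}
```

Because errors arrive in-band, the consumer can display a failed file and keep going rather than aborting the whole scan.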
#### Effort Estimate Breakdown
- **Channel architecture:** 2-3 days
- **Scanner integration:** 2-3 days
- **TUI real-time updates:** 3-4 days
- **Output streaming:** 1-2 days
- **Testing:** 2-3 days
- **Documentation:** 1 day
- **Total:** 2-3 weeks
#### Go/No-Go Recommendation
🟡 **GO - Medium Priority:**
**Rationale:**
- Good user experience improvement
- Well-established channel patterns
- Moderate effort
- Can be implemented incrementally
---
## 5. Architecture Impact Assessment
### Current Architecture
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLI │────▶│ Config │────▶│ Scanner │
└─────────────┘ └─────────────┘ └──────┬──────┘
│
┌─────────────┐ │
│ Cache │◀────────┤
│ (SQLite) │ │
└─────────────┘ │
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Output │◀────│ Duplicate │◀────│ Hasher │
│ (JSON/CSV) │ │ Finder │ │ (BLAKE3) │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ TUI │
│ (ratatui) │
└─────────────┘
```
### Feature Impact Summary
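| Feature | Modules Affected | Architecture Change Required | Backward Compatible |
|---------|------------------|------------------------------|---------------------|
| Perceptual Hashing | `scanner/`, `duplicates/` | No | Yes |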
| Audio Fingerprinting | New `scanner/audio/` | No | Yes (feature flag) |
| Fuzzy Text Detection | New `scanner/document/` | No | Yes |
| GPU Hashing | New `scanner/gpu/` | No | Yes (feature flag) |
| Memory-Mapped I/O | `scanner/hasher.rs` | No | Yes (opt-in) |
| Bloom Filters | `duplicates/finder.rs` | No | Yes |
| Real-Time Monitoring | New `monitor.rs` | No | Yes |
| Cloud Integration | New `cloud/` + async | **Yes** | No |
| GUI Application | New crate `rustdupe-gui` | No | N/A |
| Multi-Directory | `cli.rs`, `scanner/` | No | Yes |
| Database Options | `cache/` | No | Yes |
| Progressive Results | `scanner/`, `tui/` | No | Yes |
### Critical Architecture Decision
⚠️ **Cloud Storage Integration requires async runtime**, which conflicts with the current synchronous + rayon architecture. This is the only feature requiring fundamental architectural changes.
---
## 6. Dependency Analysis
### New Dependencies Required
#### Tier 1 Features (Immediate Wins)
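| Crate | Version | License | Size Impact | Maintenance |
|-------|---------|---------|-------------|-------------|
| `memmap2` | 0.9 | MIT/Apache | Small | Excellent |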
| `bloom` | 0.3 | MIT | Tiny | Good |
| `sled` | 0.34 | MIT/Apache | Medium | Good |
| `redb` | 2.0 | MIT/Apache | Medium | Good |
#### Tier 2 Features (Medium Term)
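| Crate | Version | License | Size Impact | Status |
|-------|---------|---------|-------------|--------|
| `image_hasher` | 2.0 | MIT | Medium | Active development |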
| `image` | 0.25 | MIT/Apache | Large | Very popular |
| `notify` | 7.0 | CC0-1.0 | Small | Excellent |
| `crossbeam-channel` | 0.5 | MIT/Apache | Small | Essential |
| `simhash` | 0.2 | MIT | Tiny | Stable |
| `pdf-extract` | 0.7 | MIT | Medium | Moderate activity |
#### Tier 3 Features (Future)
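| Crate | Version | License | Size Impact | Notes |
|-------|---------|---------|-------------|-------|
| `chromaprint` | N/A | LGPL | - | FFI complexity |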
| `rustacuda` | 0.2 | MIT/Apache | Large | Limited maintenance |
| `reqwest` | 0.12 | MIT/Apache | Large | Async required |
| `oauth2` | 4.4 | MIT/Apache | Medium | Async required |
### License Compatibility Matrix
| Feature | Crates | License | MIT Compatible |
|---------|--------|---------|----------------|
| Memory-Mapped I/O | `memmap2` | MIT/Apache | ✅ Yes |
| Bloom Filters | `bloom` | MIT | ✅ Yes |
| Database Options | `sled`, `redb` | MIT/Apache | ✅ Yes |
| Perceptual Hashing | `image_hasher`, `image` | MIT/Apache | ✅ Yes |
| Real-Time Monitoring | `notify` | CC0-1.0 | ✅ Yes |
| Progressive Results | `crossbeam-channel` | MIT/Apache | ✅ Yes |
| Fuzzy Text | `simhash`, `pdf-extract` | MIT | ✅ Yes |
| Audio Fingerprinting | `chromaprint` | LGPL | ⚠️ Complex |
| GPU Hashing | `rustacuda` | MIT/Apache | ✅ Yes |
| Cloud Integration | `reqwest`, `oauth2` | MIT/Apache | ✅ Yes |
### Maintenance Quality Assessment
| Crate | Last Release | Downloads | Rating |
|-------|--------------|-----------|--------|
| `memmap2` | 2024-11 | 200M+ | ⭐⭐⭐⭐⭐ |
| `bloom` | 2023-08 | 5M+ | ⭐⭐⭐ |
| `sled` | 2023-01 | 5M+ | ⭐⭐⭐ |
| `redb` | 2024-12 | 1M+ | ⭐⭐⭐⭐ |
| `image_hasher` | 2024-06 | 500K+ | ⭐⭐⭐⭐ |
| `image` | 2024-12 | 100M+ | ⭐⭐⭐⭐⭐ |
| `notify` | 2024-11 | 50M+ | ⭐⭐⭐⭐⭐ |
| `crossbeam-channel` | 2024-08 | 200M+ | ⭐⭐⭐⭐⭐ |
| `chromaprint` | N/A | N/A | ⭐ (FFI complexity) |
| `rustacuda` | 2022-06 | 200K+ | ⭐⭐ |
---
## 7. Risk Assessment Matrix
### Technical Risks
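| Risk | Likelihood | Impact | Mitigation | Status |
|------|------------|--------|------------|--------|
| Image decoding failures | Medium | Low | Graceful fallback | Acceptable |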
| Bloom filter sizing | Low | Low | Growable variant | Acceptable |
| Memory-mapped file safety | Low | Medium | `unsafe` audit | Acceptable |
| Cache backend migration | Medium | Medium | Export tools | Acceptable |
| Channel backpressure | Medium | Low | Bounded channels | Acceptable |
### Integration Risks
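| Risk | Likelihood | Impact | Mitigation | Status |
|------|------------|--------|------------|--------|
| Sync/async conflict | High | Critical | Avoid async features | **Avoided** |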
| Feature flag complexity | Medium | Low | CI testing | Acceptable |
| TUI library compatibility | Low | Low | Ratatui stability | Acceptable |
| Cross-platform path issues | Medium | Medium | Extensive testing | Acceptable |
### Maintenance Risks
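| Risk | Likelihood | Impact | Mitigation | Status |
|------|------------|--------|------------|--------|
| Crate abandonment | Medium | Medium | Pin versions, fork if needed | Acceptable |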
| API breaking changes | Medium | Low | Semantic versioning | Acceptable |
| Security vulnerabilities | Low | High | Regular audits, updates | Acceptable |
### License Risks
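| Risk | Likelihood | Impact | Mitigation | Status |
|------|------------|--------|------------|--------|
| LGPL contamination | Low | High | Keep separate, dynamic linking | Acceptable |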
| License incompatibility | Low | Critical | Review before adding | **Avoided** |
---
## 8. Recommendations
### Implementation Roadmap
#### Phase 1: Foundation (Weeks 1-3)
**Tier 1 Features - Immediate Wins:**
1. **Memory-Mapped File I/O** (Week 1)
   - Add `memmap2` dependency
   - Implement `full_hash_mmap` method
   - Add `--use-mmap` CLI flag
   - Target: 10% performance improvement on large files
2. **Bloom Filters** (Weeks 1-2)
   - Add `bloom` dependency
   - Implement size and prehash filters
   - Measure false positive rate
   - Target: 30% reduction in hash computation
3. **Multi-Directory Scanning** (Week 2)
   - Refactor CLI to accept multiple paths
   - Parallelize directory traversal
   - Update progress reporting
   - Target: Seamless multi-path UX
4. **Database Backend Options** (Weeks 2-3)
   - Abstract cache backend trait
   - Implement sled backend
   - Benchmark vs SQLite
   - Target: 2x cache performance improvement
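
The Bloom filter step above ("Implement size and prehash filters") could take roughly the following shape. This is a from-scratch sketch using only the standard library, not the API of the `bloom` crate. `BloomFilter`, `hash_pair`, and the sizing parameters are assumptions chosen for illustration. Bit positions are derived by double hashing (position `i` is `h1 + i * h2` modulo the filter size).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative Bloom filter; names and layout are hypothetical.
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

/// Derives two hash values for `item`; the second is salted and forced odd.
fn hash_pair<T: Hash>(item: &T) -> (u64, u64) {
    let mut first = DefaultHasher::new();
    item.hash(&mut first);
    let mut second = DefaultHasher::new();
    0x9E37_79B9_7F4A_7C15_u64.hash(&mut second); // arbitrary salt
    item.hash(&mut second);
    (first.finish(), second.finish() | 1)
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        assert!(num_bits > 0 && num_hashes > 0, "filter must be non-empty");
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Maps the i-th probe to a (word index, bit mask) pair.
    fn bit_index(&self, a: u64, b: u64, i: u32) -> (usize, u64) {
        let bit = a.wrapping_add(u64::from(i).wrapping_mul(b)) % self.num_bits;
        ((bit / 64) as usize, 1u64 << (bit % 64))
    }

    pub fn insert<T: Hash>(&mut self, item: &T) {
        let (a, b) = hash_pair(item);
        for i in 0..self.num_hashes {
            let (word, mask) = self.bit_index(a, b, i);
            self.bits[word] |= mask;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    pub fn might_contain<T: Hash>(&self, item: &T) -> bool {
        let (a, b) = hash_pair(item);
        (0..self.num_hashes).all(|i| {
            let (word, mask) = self.bit_index(a, b, i);
            self.bits[word] & mask != 0
        })
    }
}
```

Used as a size prefilter, a file whose size yields `false` from `might_contain` has no earlier file of the same size and needs no hashing yet. A `true` result is only a hint and still requires a real comparison. The false positive rate can be estimated by querying keys that were never inserted and counting the `true` answers.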
#### Phase 2: Enhancement (Weeks 4-7)
**Tier 2 Features - Medium Priority:**
5. **Perceptual Image Hashing** (Weeks 4-5)
   - Add `image_hasher` dependency
   - Implement `scanner/perceptual.rs`
   - Add `--similar-images` flag
   - Document threshold tuning
6. **Progressive/Streaming Results** (Weeks 5-6)
   - Add `crossbeam-channel` dependency
   - Implement streaming result channel
   - Update TUI for real-time updates
   - Target: Results visible within 5 seconds
7. **Real-Time Monitoring** (Weeks 6-7)
   - Add `notify` dependency
   - Implement `monitor.rs` module
   - Add `rustdupe monitor` subcommand
   - Target: Background duplicate detection
8. **Fuzzy Text Detection** (Week 7)
   - Add `simhash` and `pdf-extract` dependencies
   - Implement document text extraction
   - Add `--similar-documents` flag
   - Target: PDF and plain text support
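
The threshold tuning noted for perceptual image hashing comes down to choosing a maximum Hamming distance between two hashes. The sketch below assumes each hash has already been reduced to a `u64`; it does not call `image_hasher`, and `hamming_distance`, `is_similar`, and `similar_pairs` are hypothetical helper names.

```rust
/// Number of differing bits between two 64-bit perceptual hashes.
pub fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Treats two images as similar when at most `max_distance` bits differ.
pub fn is_similar(a: u64, b: u64, max_distance: u32) -> bool {
    hamming_distance(a, b) <= max_distance
}

/// Returns index pairs `(i, j)` with `i < j` whose hashes are similar.
/// This is a quadratic pairwise scan, kept simple for illustration.
pub fn similar_pairs(hashes: &[u64], max_distance: u32) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if is_similar(hashes[i], hashes[j], max_distance) {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

A threshold of 0 matches only identical hashes. Raising it catches more resized or recompressed copies but also raises the false positive rate, so the threshold would need tuning against a labelled sample of images.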
#### Phase 3: Future Consideration
**Tier 3 Features - Deferred:**
9. **Audio Fingerprinting** - Revisit when pure Rust solution available
10. **GPU-Accelerated Hashing** - Revisit if CPU becomes bottleneck
**Not Recommended:**
11. **Cloud Storage Integration** - Use local sync folders instead
12. **GUI Application** - Maintain TUI focus
### Priority Summary
| Tier | Features | Effort | Value | Timeline |
|------|----------|--------|-------|----------|
| **Tier 1** | Mmap, Bloom, Multi-dir, DB Options | 4-5 weeks | Very High | Immediate |
| **Tier 2** | Perceptual, Streaming, Monitor, Fuzzy | 6-7 weeks | High | Months 2-3 |
| **Tier 3** | Audio, GPU | 10-14 weeks | Medium | Future |
| **Not Rec.** | Cloud, GUI | 14-22 weeks | Low | Never |
### Success Metrics
| Feature | Success Criterion |
|---------|-------------------|
| Memory-Mapped I/O | 10%+ perf improvement on >100MB files |
| Bloom Filters | 30%+ reduction in unnecessary hashes |
| Multi-Directory | Support 10+ directories seamlessly |
| Database Options | 2x+ cache performance vs SQLite |
| Perceptual Hashing | <5% false positive rate |
| Progressive Results | First results in <5 seconds |
| Real-Time Monitoring | <2 second detection latency |
| Fuzzy Text | Support PDF and plain-text (TXT) formats |
### Final Recommendation
**Proceed with all Tier 1 and Tier 2 features** over the next 3-4 months. These provide significant user value with manageable risk and effort. **Defer Tier 3 features** pending ecosystem maturity, and **do not pursue the Not Recommended features** unless requirements change.
---
## Appendix A: Crate Recommendations Summary
### Essential Crates (Add Immediately)
```toml
[dependencies]
memmap2 = "0.9" # Memory-mapped files
bloom = "0.3" # Bloom filters
sled = "0.34" # Alternative cache backend
redb = "2.0" # Alternative cache backend
```
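
Offering `sled` or `redb` alongside the existing SQLite cache implies a backend abstraction (the "Abstract cache backend trait" step in Phase 1). One possible shape is sketched below with an in-memory backend standing in for a real store. `CacheBackend`, `CacheEntry`, `MemoryBackend`, and `lookup_valid` are hypothetical names, and a production trait would likely return `Result` to surface I/O errors.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical cached metadata for one file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CacheEntry {
    pub size: u64,
    pub mtime_secs: i64,
    pub hash: [u8; 32], // a BLAKE3 digest is 32 bytes
}

/// Minimal interface a cache backend would need to satisfy.
pub trait CacheBackend {
    fn get(&self, path: &Path) -> Option<CacheEntry>;
    fn put(&mut self, path: &Path, entry: CacheEntry);
    /// Returns `true` if an entry existed and was removed.
    fn remove(&mut self, path: &Path) -> bool;
}

/// In-memory backend, useful for tests and as a reference implementation.
#[derive(Default)]
pub struct MemoryBackend {
    entries: HashMap<PathBuf, CacheEntry>,
}

impl CacheBackend for MemoryBackend {
    fn get(&self, path: &Path) -> Option<CacheEntry> {
        self.entries.get(path).cloned()
    }

    fn put(&mut self, path: &Path, entry: CacheEntry) {
        self.entries.insert(path.to_path_buf(), entry);
    }

    fn remove(&mut self, path: &Path) -> bool {
        self.entries.remove(path).is_some()
    }
}

/// Returns the cached hash only if size and mtime still match the file.
pub fn lookup_valid(
    cache: &dyn CacheBackend,
    path: &Path,
    size: u64,
    mtime_secs: i64,
) -> Option<[u8; 32]> {
    cache
        .get(path)
        .filter(|entry| entry.size == size && entry.mtime_secs == mtime_secs)
        .map(|entry| entry.hash)
}
```

Because callers such as `lookup_valid` depend only on the trait, the same benchmark harness could be run against each backend when comparing them with SQLite.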
### Recommended Crates (Add for Phase 2)
```toml
[dependencies]
image_hasher = "2.0" # Perceptual image hashing
image = "0.25" # Image decoding
notify = "7.0" # File system monitoring
crossbeam-channel = "0.5" # Multi-producer, multi-consumer channels
simhash = "0.2" # Document similarity
pdf-extract = "0.7" # PDF text extraction
```
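
For readers unfamiliar with the technique behind the `simhash` dependency, the idea can be sketched without the crate. The code below is a from-scratch 64-bit SimHash over whitespace-separated tokens; it is not the `simhash` crate's API, and `simhash64` and `similarity` are hypothetical names. It uses `DefaultHasher`, whose algorithm is not guaranteed to stay the same across Rust releases, so fingerprints produced this way should not be persisted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Computes a 64-bit SimHash fingerprint of `text`.
/// Tokens are lowercased and split on whitespace.
pub fn simhash64(text: &str) -> u64 {
    // One signed counter per output bit.
    let mut weights = [0i32; 64];
    for token in text.split_whitespace() {
        let mut hasher = DefaultHasher::new();
        token.to_lowercase().hash(&mut hasher);
        let h = hasher.finish();
        // Each token votes +1 or -1 on every bit position.
        for (bit, weight) in weights.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *weight += 1;
            } else {
                *weight -= 1;
            }
        }
    }
    // A bit is set in the fingerprint when its votes are net positive.
    weights
        .iter()
        .enumerate()
        .fold(0u64, |acc, (bit, &w)| if w > 0 { acc | (1u64 << bit) } else { acc })
}

/// Fraction of matching bits between two fingerprints, from 0.0 to 1.0.
pub fn similarity(a: u64, b: u64) -> f64 {
    1.0 - f64::from((a ^ b).count_ones()) / 64.0
}
```

Documents that share most of their tokens tend to produce fingerprints that differ in only a few bits, so a `--similar-documents` mode could compare `similarity` against a configurable cutoff. Text extracted from PDFs would be passed to `simhash64` the same way as plain text.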
### Not Recommended Crates
| Crate | Reason |
|-------|--------|
| `chromaprint` | No pure Rust solution, LGPL complexity |
| `rustacuda` | Immature, limited maintenance |
| `reqwest` + `oauth2` | Require async runtime |
| `eframe` / `iced` / `tauri` | GUI not aligned with project goals |
---
## Appendix B: References
### Internal Documents
- `docs/research/advanced-duplicate-finder-features-2025-02-05.md`
- `docs/research/cross-platform-file-management-2026-02-05.md`
### External Resources
1. **BLAKE3 Performance:** https://github.com/BLAKE3-team/BLAKE3
2. **image_hasher Crate:** https://crates.io/crates/image_hasher
3. **notify Crate:** https://docs.rs/notify
4. **sled Database:** https://docs.rs/sled
5. **redb Database:** https://docs.rs/redb
6. **memmap2 Crate:** https://docs.rs/memmap2
---
*Document generated: 2026-02-05*
*Version: 1.0*
*Next Review: After Phase 1 completion*