# PolyDup CLI

Command-line interface for **PolyDup**, the cross-language duplicate code detector (`dupe-cli` crate, version 0.1.0).

## Installation

### From Source

```bash
cd crates/dupe-cli
cargo build --release

# Binary will be at: target/release/polydup
```

### System-wide Installation

```bash
cargo install --path crates/dupe-cli

# Or from the workspace root:
cargo install --path .
```

## Usage

### Basic Scan

```bash
polydup ./src
```

### Scan Multiple Paths

```bash
polydup ./src ./lib ./tests
```

### Adjust Detection Parameters

```bash
# Set minimum block size (default: 50 tokens)
polydup ./src --threshold 30

# Set similarity threshold (default: 0.85 = 85%)
polydup ./src --similarity 0.9

# Combine both
polydup ./src --threshold 30 --similarity 0.9
```

### Output Formats

**Text output (default):**
```bash
polydup ./src
```

Output:
```
📊 Scan Results
═══════════════════════════════════════════════════════════
Files scanned:      4
Functions analyzed: 45
Duplicates found:   0

✅ No duplicates found!
```

**JSON output (for scripting):**
```bash
polydup ./src --format json
```

Output:
```json
{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {
    "total_lines": 0,
    "total_tokens": 3665,
    "unique_hashes": 2666,
    "duration_ms": 8
  }
}
```
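The JSON report can be consumed by any JSON parser. Here is a minimal Python sketch; the field names are taken from the example output above (this README does not define a formal schema):

```python
import json

# Sample report using the field names from the example output above.
report_text = """{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {"total_lines": 0, "total_tokens": 3665,
            "unique_hashes": 2666, "duration_ms": 8}
}"""

def summarize(report: dict) -> str:
    """Return a one-line summary of a PolyDup JSON report."""
    n = len(report["duplicates"])
    return (f"{n} duplicate(s) across {report['files_scanned']} files "
            f"in {report['stats']['duration_ms']} ms")

print(summarize(json.loads(report_text)))
```

In a real pipeline you would read the file produced by `polydup ./src --format json > duplicates.json` instead of an inline string.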

### Verbose Mode

Show additional performance metrics:

```bash
polydup ./src --verbose
```

Output includes:
- Total tokens processed
- Number of unique hashes
- Scan duration

## Command-Line Options

```
polydup [OPTIONS] <PATHS>...

Arguments:
  <PATHS>...  Paths to scan (files or directories)

Options:
  -f, --format <FORMAT>
          Output format [default: text] [possible values: text, json]
  
  -t, --threshold <MIN_BLOCK_SIZE>
          Minimum code block size in tokens [default: 50]
  
  -s, --similarity <SIMILARITY>
          Similarity threshold (0.0-1.0) [default: 0.85]
  
  -v, --verbose
          Show verbose output
  
  -h, --help
          Print help
  
  -V, --version
          Print version
```

## Exit Codes

- **0**: No duplicates found
- **1**: Duplicates found (or error occurred)

This allows usage in CI/CD pipelines:

```bash
#!/bin/bash
if ! polydup ./src --threshold 100; then
    echo "❌ Duplicates detected!"
    exit 1
fi
echo "✅ No duplicates!"
```

## Examples

### CI/CD Integration

**GitHub Actions:**

```yaml
name: Check Duplicates

on: [push, pull_request]

jobs:
  check-dupes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
      
      - name: Install PolyDup
        run: cargo install --path crates/dupe-cli
      
      - name: Check for duplicates
        run: |
          polydup ./src --threshold 50 --similarity 0.85 --format json > duplicates.json
          
      - name: Upload results
        uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: duplicate-report
          path: duplicates.json
```

### Pre-commit Hook

```bash
#!/bin/bash
# .git/hooks/pre-commit

echo "Checking for duplicate code..."
if ! polydup ./src --threshold 100 --similarity 0.9; then
    echo "❌ Large code duplicates detected!"
    echo "Review the duplicates above and consider refactoring."
    exit 1
fi
```

### Makefile Integration

```makefile
.PHONY: check-dupes
check-dupes:
	@echo "Scanning for duplicates..."
	@polydup ./src ./lib --threshold 50 --similarity 0.85

.PHONY: dupes-json
dupes-json:
	@polydup ./src --format json > duplicates.json
	@echo "Report saved to duplicates.json"
```

### Shell Script for Multiple Projects

```bash
#!/bin/bash
# scan-all-projects.sh

projects=(
    "project1/src"
    "project2/lib"
    "project3/backend"
)

for project in "${projects[@]}"; do
    echo "Scanning $project..."
    polydup "$project" --format json > "${project//\//-}-report.json"
done

echo "✅ All scans complete!"
```

## Performance Tuning

### Fast Scan (Lower Accuracy)

```bash
# Large block size = fewer comparisons = faster
polydup ./src --threshold 100 --similarity 0.7
```

### Thorough Scan (Higher Accuracy)

```bash
# Small block size = more comparisons = slower but catches smaller duplicates
polydup ./src --threshold 20 --similarity 0.95
```

### Recommended Settings

| Use Case | Threshold | Similarity |
|----------|-----------|------------|
| **Quick check** | 100 | 0.85 |
| **Standard scan** | 50 | 0.85 |
| **Thorough analysis** | 30 | 0.90 |
| **Refactoring prep** | 20 | 0.95 |
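The table above can be folded into a small wrapper that assembles the matching command line. The preset names below are this sketch's own shorthand, not flags the CLI provides:

```python
# Presets mirroring the "Recommended Settings" table above.
PRESETS = {
    "quick":       (100, 0.85),
    "standard":    (50,  0.85),
    "thorough":    (30,  0.90),
    "refactoring": (20,  0.95),
}

def polydup_args(path: str, preset: str = "standard") -> list[str]:
    """Build an argv list for a polydup scan using a named preset."""
    threshold, similarity = PRESETS[preset]
    return ["polydup", path,
            "--threshold", str(threshold),
            "--similarity", str(similarity)]

# e.g. pass the result to subprocess.run(...) in a real script
print(polydup_args("./src", "thorough"))
```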

## Troubleshooting

### No Duplicates Found (But You Expected Some)

- **Lower the threshold**: Try `--threshold 20` to catch smaller duplicates
- **Lower similarity**: Try `--similarity 0.7` for looser matching
- **Check file types**: Only Rust, Python, and JavaScript/TypeScript are supported

### Too Many False Positives

- **Raise the threshold**: Try `--threshold 100` to only catch large duplicates
- **Raise similarity**: Try `--similarity 0.95` for stricter matching

### Slow Performance

- **Increase threshold**: Larger blocks = fewer comparisons
- **Scan fewer files**: Be more specific with paths
- **Use release build**: `cargo build --release` (already done if installed)

## Supported Languages

- **Rust**: `.rs` files
- **Python**: `.py` files  
- **JavaScript/TypeScript**: `.js`, `.jsx`, `.ts`, `.tsx` files

More languages coming soon!

## Algorithm

PolyDup uses:
1. **Tree-sitter** for AST-based parsing
2. **Token normalization** (identifiers → `$$ID`, strings → `$$STR`, numbers → `$$NUM`)
3. **Rabin-Karp rolling hash** with window size 50
4. **Parallel processing** via Rayon for multi-core performance

See [architecture-research.md](../../docs/architecture-research.md) for details.
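To make steps 2 and 3 concrete, here is a toy Python sketch of token normalization plus a Rabin-Karp rolling hash. The window is shortened to 4 for readability, and the constants, keyword set, and tokenization are illustrative assumptions, not PolyDup's actual implementation:

```python
import zlib

KEYWORDS = {"def", "return", "if", "else", "for", "fn", "let"}

def normalize(token: str) -> str:
    """Crude normalization: identifiers -> $$ID, strings -> $$STR,
    numbers -> $$NUM; keywords and punctuation pass through."""
    if token in KEYWORDS:
        return token
    if token[0] in "\"'":
        return "$$STR"
    if token[0].isdigit():
        return "$$NUM"
    if token[0].isalpha() or token[0] == "_":
        return "$$ID"
    return token

def rolling_hashes(tokens, window=4, base=257, mod=(1 << 61) - 1):
    """Rabin-Karp hashes of every `window`-token slice, computed in O(n)."""
    vals = [zlib.crc32(normalize(t).encode()) for t in tokens]
    if len(vals) < window:
        return []
    high = pow(base, window - 1, mod)   # weight of the outgoing token
    h = 0
    for v in vals[:window]:             # hash of the first window
        h = (h * base + v) % mod
    hashes = [h]
    for i in range(window, len(vals)):  # slide one token at a time
        h = ((h - vals[i - window] * high) * base + vals[i]) % mod
        hashes.append(h)
    return hashes

# Two functions that differ only in identifier names hash identically:
a = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
b = ["def", "sum2", "(", "x", ",", "y", ")", ":", "return", "x", "+", "y"]
print(rolling_hashes(a) == rolling_hashes(b))  # True
```

Because each window hash is updated from the previous one, hashing all windows stays linear in the token count, which is what makes scanning large codebases tractable.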

## License

MIT OR Apache-2.0

## Links

- **Core Library**: [dupe-core](../dupe-core)
- **Node.js Bindings**: [dupe-node](../dupe-node)
- **Python Bindings**: [dupe-py](../dupe-py)
- **GitHub**: https://github.com/wiesnerbernard/polydup