PolyDup CLI

Command-line interface for PolyDup, the cross-language duplicate code detector.

Installation

From Source

cd crates/dupe-cli
cargo build --release

# Binary will be at: target/release/polydup

System-wide Installation

cargo install --path crates/dupe-cli

# Or from the workspace root:
cargo install --path .
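
After installation, you can confirm the binary is on your PATH (the -V/--version flag is listed under Command-Line Options below):

polydup --version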

Usage

Basic Scan

polydup ./src

Scan Multiple Paths

polydup ./src ./lib ./tests

Adjust Detection Parameters

# Set minimum block size (default: 50 tokens)
polydup ./src --threshold 30

# Set similarity threshold (default: 0.85 = 85%)
polydup ./src --similarity 0.9

# Combine both
polydup ./src --threshold 30 --similarity 0.9

Output Formats

Text output (default):

polydup ./src

Output:

📊 Scan Results
═══════════════════════════════════════════════════════════
Files scanned:      4
Functions analyzed: 45
Duplicates found:   0

✅ No duplicates found!

JSON output (for scripting):

polydup ./src --format json

Output:

{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {
    "total_lines": 0,
    "total_tokens": 3665,
    "unique_hashes": 2666,
    "duration_ms": 8
  }
}
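
The JSON format is convenient to post-process in scripts. For example, with jq (assuming jq is available on your system) you can pull out just the pieces you care about from the fields shown above:

# Number of entries in the duplicates array
polydup ./src --format json | jq '.duplicates | length'

# Summarize a scan
polydup ./src --format json | jq '{functions: .functions_analyzed, duration_ms: .stats.duration_ms}'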

Verbose Mode

Show additional performance metrics:

polydup ./src --verbose

Output includes:

  • Total tokens processed
  • Number of unique hashes
  • Scan duration

Command-Line Options

polydup [OPTIONS] <PATHS>...

Arguments:
  <PATHS>...  Paths to scan (files or directories)

Options:
  -f, --format <FORMAT>
          Output format [default: text] [possible values: text, json]
  
  -t, --threshold <MIN_BLOCK_SIZE>
          Minimum code block size in tokens [default: 50]
  
  -s, --similarity <SIMILARITY>
          Similarity threshold (0.0-1.0) [default: 0.85]
  
  -v, --verbose
          Show verbose output
  
  -h, --help
          Print help
  
  -V, --version
          Print version

Exit Codes

  • 0: No duplicates found
  • 1: Duplicates found (or error occurred)

This allows usage in CI/CD pipelines:

#!/bin/bash
if ! polydup ./src --threshold 100; then
    echo "❌ Duplicates detected!"
    exit 1
fi
echo "✅ No duplicates!"

Examples

CI/CD Integration

GitHub Actions:

name: Check Duplicates

on: [push, pull_request]

jobs:
  check-dupes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
      
      - name: Install PolyDup
        run: cargo install --path crates/dupe-cli
      
      - name: Check for duplicates
        run: |
          polydup ./src --threshold 50 --similarity 0.85 --format json > duplicates.json
          
      - name: Upload results
        uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: duplicate-report
          path: duplicates.json

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

echo "Checking for duplicate code..."
if ! polydup ./src --threshold 100 --similarity 0.9; then
    echo "❌ Large code duplicates detected!"
    echo "Review the duplicates above and consider refactoring."
    exit 1
fi
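
Don't forget to make the hook executable, or Git will skip it:

chmod +x .git/hooks/pre-commit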

Makefile Integration

.PHONY: check-dupes
check-dupes:
	@echo "Scanning for duplicates..."
	@polydup ./src ./lib --threshold 50 --similarity 0.85

.PHONY: dupes-json
dupes-json:
	@polydup ./src --format json > duplicates.json
	@echo "Report saved to duplicates.json"

Shell Script for Multiple Projects

#!/bin/bash
# scan-all-projects.sh

projects=(
    "project1/src"
    "project2/lib"
    "project3/backend"
)

for project in "${projects[@]}"; do
    echo "Scanning $project..."
    polydup "$project" --format json > "${project//\//-}-report.json"
done

echo "✅ All scans complete!"

Performance Tuning

Fast Scan (Lower Accuracy)

# Large block size = fewer comparisons = faster
polydup ./src --threshold 100 --similarity 0.7

Thorough Scan (Higher Accuracy)

# Small block size = more comparisons = slower but catches smaller duplicates
polydup ./src --threshold 20 --similarity 0.95

Recommended Settings

Use Case             Threshold   Similarity
Quick check          100         0.85
Standard scan        50          0.85
Thorough analysis    30          0.90
Refactoring prep     20          0.95
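
If you use these presets often, one option is to wrap them in small shell functions (the function names below are purely illustrative, not part of the tool):

# Illustrative wrappers around the recommended settings
polydup-quick()    { polydup "$@" --threshold 100 --similarity 0.85; }
polydup-standard() { polydup "$@" --threshold 50  --similarity 0.85; }
polydup-thorough() { polydup "$@" --threshold 30  --similarity 0.90; }

# Usage: polydup-thorough ./src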

Troubleshooting

No Duplicates Found (But You Expected Some)

  • Lower the threshold: Try --threshold 20 to catch smaller duplicates
  • Lower similarity: Try --similarity 0.7 for looser matching
  • Check file types: Only Rust, Python, and JavaScript/TypeScript are supported
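
For example, a looser scan that combines the first two suggestions:

polydup ./src --threshold 20 --similarity 0.7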

Too Many False Positives

  • Raise the threshold: Try --threshold 100 to only catch large duplicates
  • Raise similarity: Try --similarity 0.95 for stricter matching

Slow Performance

  • Increase threshold: Larger blocks = fewer comparisons
  • Scan fewer files: Be more specific with paths
  • Use a release build: cargo build --release (cargo install builds in release mode by default)

Supported Languages

  • Rust: .rs files
  • Python: .py files
  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx files

More languages coming soon!

Algorithm

PolyDup uses:

  1. Tree-sitter for AST-based parsing
  2. Token normalization (identifiers → $$ID, strings → $$STR, numbers → $$NUM)
  3. Rabin-Karp rolling hash with window size 50
  4. Parallel processing via Rayon for multi-core performance

See architecture-research.md for details.

License

MIT OR Apache-2.0

Links