PolyDup CLI

Command-line interface for PolyDup, the cross-language duplicate code detector.

Installation

From Source

cd crates/dupe-cli
cargo build --release

# Binary will be at: target/release/polydup

System-wide Installation

cargo install --path crates/dupe-cli

# Or from the workspace root:
cargo install --path .
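
After installation, you can confirm the binary is on your PATH (the -V/--version flag is listed under Command-Line Options below):

polydup --version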

Usage

Basic Scan

polydup ./src

Scan Multiple Paths

polydup ./src ./lib ./tests

Adjust Detection Parameters

# Set minimum block size (default: 50 tokens)
polydup ./src --threshold 30

# Set similarity threshold (default: 0.85 = 85%)
polydup ./src --similarity 0.9

# Combine both
polydup ./src --threshold 30 --similarity 0.9

Output Formats

Text output (default):

polydup ./src

Output:

📊 Scan Results
═══════════════════════════════════════════════════════════
Files scanned:      4
Functions analyzed: 45
Duplicates found:   0

✅ No duplicates found!

JSON output (for scripting):

polydup ./src --format json

Output:

{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {
    "total_lines": 0,
    "total_tokens": 3665,
    "unique_hashes": 2666,
    "duration_ms": 8
  }
}
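
The JSON format is convenient to post-process in scripts. For example, with jq (assuming jq is available on your system) you can pull out just the pieces you care about from the fields shown above:

# Number of entries in the duplicates array
polydup ./src --format json | jq '.duplicates | length'

# Summarize a scan
polydup ./src --format json | jq '{functions: .functions_analyzed, duration_ms: .stats.duration_ms}'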

Verbose Mode

Show additional performance metrics:

polydup ./src --verbose

Output includes:

  • Total tokens processed
  • Number of unique hashes
  • Scan duration

Command-Line Options

polydup [OPTIONS] <PATHS>...

Arguments:
  <PATHS>...  Paths to scan (files or directories)

Options:
  -f, --format <FORMAT>
          Output format [default: text] [possible values: text, json]
  
  -t, --threshold <MIN_BLOCK_SIZE>
          Minimum code block size in tokens [default: 50]
  
  -s, --similarity <SIMILARITY>
          Similarity threshold (0.0-1.0) [default: 0.85]
  
  -v, --verbose
          Show verbose output
  
  -h, --help
          Print help
  
  -V, --version
          Print version

Exit Codes

  • 0: No duplicates found
  • 1: Duplicates found (or error occurred)

This allows usage in CI/CD pipelines:

#!/bin/bash
if ! polydup ./src --threshold 100; then
    echo "❌ Duplicates detected!"
    exit 1
fi
echo "✅ No duplicates!"

Examples

CI/CD Integration

GitHub Actions:

name: Check Duplicates

on: [push, pull_request]

jobs:
  check-dupes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
      
      - name: Install PolyDup
        run: cargo install --path crates/dupe-cli
      
      - name: Check for duplicates
        run: |
          polydup ./src --threshold 50 --similarity 0.85 --format json > duplicates.json
          
      - name: Upload results
        uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: duplicate-report
          path: duplicates.json

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

echo "Checking for duplicate code..."
if ! polydup ./src --threshold 100 --similarity 0.9; then
    echo "❌ Large code duplicates detected!"
    echo "Review the duplicates above and consider refactoring."
    exit 1
fi
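
Don't forget to make the hook executable, or Git will skip it:

chmod +x .git/hooks/pre-commit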

Makefile Integration

.PHONY: check-dupes
check-dupes:
	@echo "Scanning for duplicates..."
	@polydup ./src ./lib --threshold 50 --similarity 0.85

.PHONY: dupes-json
dupes-json:
	@polydup ./src --format json > duplicates.json
	@echo "Report saved to duplicates.json"

Shell Script for Multiple Projects

#!/bin/bash
# scan-all-projects.sh

projects=(
    "project1/src"
    "project2/lib"
    "project3/backend"
)

for project in "${projects[@]}"; do
    echo "Scanning $project..."
    polydup "$project" --format json > "${project//\//-}-report.json"
done

echo "✅ All scans complete!"

Performance Tuning

Fast Scan (Lower Accuracy)

# Large block size = fewer comparisons = faster
polydup ./src --threshold 100 --similarity 0.7

Thorough Scan (Higher Accuracy)

# Small block size = more comparisons = slower but catches smaller duplicates
polydup ./src --threshold 20 --similarity 0.95

Recommended Settings

Use Case             Threshold   Similarity
Quick check          100         0.85
Standard scan        50          0.85
Thorough analysis    30          0.90
Refactoring prep     20          0.95
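
If you use these presets often, one option is to wrap them in small shell functions (the function names below are purely illustrative, not part of the tool):

# Illustrative wrappers around the recommended settings
polydup-quick()    { polydup "$@" --threshold 100 --similarity 0.85; }
polydup-standard() { polydup "$@" --threshold 50  --similarity 0.85; }
polydup-thorough() { polydup "$@" --threshold 30  --similarity 0.90; }

# Usage: polydup-thorough ./src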

Troubleshooting

No Duplicates Found (But You Expected Some)

  • Lower the threshold: Try --threshold 20 to catch smaller duplicates
  • Lower similarity: Try --similarity 0.7 for looser matching
  • Check file types: Only Rust, Python, and JavaScript/TypeScript are supported
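
For example, a looser scan that combines the first two suggestions:

polydup ./src --threshold 20 --similarity 0.7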

Too Many False Positives

  • Raise the threshold: Try --threshold 100 to only catch large duplicates
  • Raise similarity: Try --similarity 0.95 for stricter matching

Slow Performance

  • Increase threshold: Larger blocks = fewer comparisons
  • Scan fewer files: Be more specific with paths
  • Use a release build: cargo build --release (cargo install builds in release mode by default)

Supported Languages

  • Rust: .rs files
  • Python: .py files
  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx files

More languages coming soon!

Algorithm

PolyDup uses:

  1. Tree-sitter for AST-based parsing
  2. Token normalization (identifiers → $$ID, strings → $$STR, numbers → $$NUM)
  3. Rabin-Karp rolling hash with window size 50
  4. Parallel processing via Rayon for multi-core performance

See architecture-research.md for details.

License

MIT OR Apache-2.0

Links