polydup 0.8.1

Cross-language duplicate code detector - find copy-pasted code across JavaScript, TypeScript, Python, and Rust

PolyDup CLI

Command-line interface for PolyDup, the cross-language duplicate code detector.

Installation

From Source

cd crates/polydup-cli
cargo build --release

# Binary will be at: target/release/polydup

System-wide Installation

cargo install --path crates/polydup-cli

# Or from the workspace root:
cargo install --path .

Usage

Basic Scan

polydup ./src

Scan Multiple Paths

polydup ./src ./lib ./tests

Adjust Detection Parameters

# Set minimum block size (default: 50 tokens)
polydup ./src --threshold 30

# Set similarity threshold (default: 0.85 = 85%)
polydup ./src --similarity 0.9

# Combine both
polydup ./src --threshold 30 --similarity 0.9

Exclude Files (e.g., Tests)

By default, PolyDup excludes common test file patterns:

  • **/*.test.{ts,js,tsx,jsx}
  • **/*.spec.{ts,js,tsx,jsx}
  • **/__tests__/**
  • **/*.test.py

To use custom exclusions (replaces defaults):

# Exclude specific patterns
polydup ./src --exclude "**/*.generated.ts" --exclude "**/*.mock.js"

# Exclude multiple patterns
polydup ./src -e "**/*.test.ts" -e "**/*.spec.js" -e "**/fixtures/**"

# No exclusions (scan everything including tests)
polydup ./src --exclude ""

Output Formats

Text output (default):

polydup ./src

Output:

Scan Results
═══════════════════════════════════════════════════════════
Files scanned:      4
Functions analyzed: 45
Duplicates found:   0

No duplicates found!

JSON output (for scripting):

polydup ./src --format json

Output:

{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {
    "total_lines": 0,
    "total_tokens": 3665,
    "unique_hashes": 2666,
    "duration_ms": 8
  }
}
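
In CI, the JSON report can be post-processed to gate a pipeline on the duplicate count. A minimal sketch using the field names from the example above — the inline report string stands in for real `polydup ./src --format json` output:

```python
import json

# In CI you would capture real output: polydup ./src --format json > report.json
# Here a canned report (matching the fields shown above) stands in for it.
report = json.loads("""
{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {"total_lines": 0, "total_tokens": 3665, "unique_hashes": 2666, "duration_ms": 8}
}
""")

count = len(report["duplicates"])
print(f"{count} duplicate(s) across {report['files_scanned']} file(s)")
if count > 0:
    raise SystemExit(1)  # fail the pipeline, mirroring polydup's own exit code
```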

Verbose Mode

Show additional performance metrics:

polydup ./src --verbose

Output includes:

  • Total tokens processed
  • Number of unique hashes
  • Scan duration

Command-Line Options

polydup [OPTIONS] <PATHS>...

Arguments:
  <PATHS>...  Paths to scan (files or directories)

Options:
  -f, --format <FORMAT>
          Output format [default: text] [possible values: text, json]

  -t, --threshold <MIN_BLOCK_SIZE>
          Minimum code block size in tokens [default: 50]

  -s, --similarity <SIMILARITY>
          Similarity threshold (0.0-1.0) [default: 0.85]

  -e, --exclude <PATTERN>
          Glob pattern to exclude (may be repeated; replaces the default
          exclusions)

  -v, --verbose
          Show verbose output

  -h, --help
          Print help

  -V, --version
          Print version

Managing False Positives

PolyDup provides an ignore system to suppress false positives while keeping them documented.

Adding Ignore Entries

Add a duplicate to the ignore list:

# Add by ID (from scan output)
polydup ignore add abc123def --files "src/utils.rs:10-30,src/helpers.rs:45-65" --reason "Intentional code reuse"

# Interactive mode
polydup ignore add
# You'll be prompted for files and reason

Listing Ignored Duplicates

# List all ignored duplicates
polydup ignore list

# Verbose output (shows file paths)
polydup ignore list --verbose

# JSON output for scripting
polydup ignore list --format json

Example output:

Ignored Duplicates (2)

1. abc123def456
   Reason: Boilerplate initialization code
   Added by: alice
   Added at: 2025-12-26 10:30:15 UTC
   Files: 2 file(s)

2. xyz789abc123
   Reason: Required by framework convention
   Added by: bob
   Added at: 2025-12-26 11:45:30 UTC
   Files: 3 file(s)

Removing Ignore Entries

# Remove by ID
polydup ignore remove abc123def456

Ignore File Format

Ignored duplicates are stored in .polydup-ignore (TOML format):

version = 1

[[ignores]]
id = "abc123def456"
reason = "Intentional code reuse"
added_by = "alice"
added_at = "2025-12-26T10:30:15Z"

[[ignores.files]]
file = "src/utils.rs"
start_line = 10
end_line = 30

[[ignores.files]]
file = "src/helpers.rs"
start_line = 45
end_line = 65

Tip: Commit .polydup-ignore to version control to share ignore decisions with your team!

Git-Diff Mode (PR Review)

Report only duplicates that involve files changed in a git diff range, perfect for PR checks:

# Scan files changed in current branch vs main
polydup scan . --git-diff origin/main..HEAD

# Scan files changed in last commit
polydup scan . --git-diff HEAD~1..HEAD

# Scan with custom similarity threshold
polydup scan . --git-diff main..feature-branch --similarity 0.9

How It Works

  1. Fast: focuses on the files in your diff (10-100x faster for large repos)
  2. Smart: still scans the entire codebase, but reports only duplicates involving changed files
  3. Accurate: catches changed code that duplicates unchanged code
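
The filtering step above can be sketched as follows. This is illustrative only, not PolyDup's actual code: in practice the changed-file set would come from `git diff --name-only`, and the duplicate records here are made up for the demo.

```python
# Changed files in the diff range (e.g. from `git diff --name-only origin/main..HEAD`)
changed = {"src/handler.rs", "src/utils.rs"}

# Hypothetical duplicates found by scanning the *entire* codebase
duplicates = [
    {"files": ["src/handler.rs", "src/legacy.rs"]},  # involves a changed file -> kept
    {"files": ["src/old_a.rs", "src/old_b.rs"]},     # only untouched files -> dropped
]

# Keep only duplicates that touch at least one changed file
relevant = [d for d in duplicates if any(f in changed for f in d["files"])]
print(len(relevant))  # 1
```

Note that the second duplicate is suppressed even though it exists: it pre-dates the PR, so it is not the PR author's problem.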

Example Output:

 Git-Diff Mode: Only scanning files changed in origin/main..HEAD
  Git diff filter: Added 3 file(s) -> Modified/Renamed 2 file(s)
  Changed files (2):
     src/handler.rs
     src/utils.rs

  Git-diff filter: 4 duplicate(s) involve changed files

Combined with Ignore Rules and Directives

Git-diff mode works seamlessly with ignore management:

# PR check with directives
polydup scan . --git-diff origin/main..HEAD --enable-directives

# PR check with ignore rules loaded from .polydup-ignore
polydup scan . --git-diff HEAD~1..HEAD --verbose

CI/CD Example:

# .github/workflows/pr-check.yml
- name: Check for duplicates in PR
  run: polydup scan . --git-diff origin/${{ github.base_ref }}..HEAD
  # Only fails if new duplicates introduced in this PR

Benefits:

  • ✅ Focuses review on relevant changes
  • ✅ Respects existing ignore rules
  • ✅ Works with inline directives
  • ✅ No baseline files to manage

Exit Codes

  • 0: No duplicates found
  • 1: Duplicates found (or error occurred)

This allows usage in CI/CD pipelines:

#!/bin/bash
if ! polydup ./src --threshold 100; then
    echo "❌ Duplicates detected!"
    exit 1
fi
echo "No duplicates!"

Examples

CI/CD Integration

GitHub Actions:

name: Check Duplicates

on: [push, pull_request]

jobs:
  check-dupes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable

      - name: Install PolyDup
        run: cargo install --path crates/polydup-cli

      - name: Check for duplicates
        run: |
          polydup ./src --threshold 50 --similarity 0.85 --format json > duplicates.json

      - name: Upload results
        uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: duplicate-report
          path: duplicates.json

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

echo "Checking for duplicate code..."
if ! polydup ./src --threshold 100 --similarity 0.9; then
    echo "❌ Large code duplicates detected!"
    echo "Review the duplicates above and consider refactoring."
    exit 1
fi

Makefile Integration

.PHONY: check-dupes
check-dupes:
	@echo "Scanning for duplicates..."
	@polydup ./src ./lib --threshold 50 --similarity 0.85

.PHONY: dupes-json
dupes-json:
	@polydup ./src --format json > duplicates.json
	@echo "Report saved to duplicates.json"

Shell Script for Multiple Projects

#!/bin/bash
# scan-all-projects.sh

projects=(
    "project1/src"
    "project2/lib"
    "project3/backend"
)

for project in "${projects[@]}"; do
    echo "Scanning $project..."
    polydup "$project" --format json > "${project//\//-}-report.json"
done

echo "All scans complete!"

Performance Tuning

Fast Scan (Lower Accuracy)

# Large block size = fewer comparisons = faster
polydup ./src --threshold 100 --similarity 0.7

Thorough Scan (Higher Accuracy)

# Small block size = more comparisons = slower but catches smaller duplicates
polydup ./src --threshold 20 --similarity 0.95

Recommended Settings

Use Case             Threshold   Similarity
Quick check          100         0.85
Standard scan        50          0.85
Thorough analysis    30          0.90
Refactoring prep     20          0.95
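
If you switch between these presets often, a small wrapper can build the command line from the table above. The `polydup_args` helper and preset names are hypothetical, not part of PolyDup:

```python
# Hypothetical helper mapping the recommended presets to CLI arguments
PRESETS = {
    "quick":    (100, 0.85),
    "standard": (50, 0.85),
    "thorough": (30, 0.90),
    "refactor": (20, 0.95),
}

def polydup_args(preset, path="./src"):
    threshold, similarity = PRESETS[preset]
    return ["polydup", path, "--threshold", str(threshold),
            "--similarity", str(similarity)]

print(" ".join(polydup_args("thorough")))
# polydup ./src --threshold 30 --similarity 0.9
```

The argument list can be passed directly to `subprocess.run` to execute the scan.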

Troubleshooting

No Duplicates Found (But You Expected Some)

  • Lower the threshold: Try --threshold 20 to catch smaller duplicates
  • Lower similarity: Try --similarity 0.7 for looser matching
  • Check file types: Only Rust, Python, and JavaScript/TypeScript are supported

Too Many False Positives

  • Raise the threshold: Try --threshold 100 to only catch large duplicates
  • Raise similarity: Try --similarity 0.95 for stricter matching

Slow Performance

  • Increase threshold: Larger blocks = fewer comparisons
  • Scan fewer files: Be more specific with paths
  • Use release build: cargo build --release (already done if installed)

Supported Languages

  • Rust: .rs files
  • Python: .py files
  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx files

More languages coming soon!

Algorithm

PolyDup uses:

  1. Tree-sitter for AST-based parsing
  2. Token normalization (identifiers → $$ID, strings → $$STR, numbers → $$NUM)
  3. Rabin-Karp rolling hash with window size 50
  4. Parallel processing via Rayon for multi-core performance
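
Steps 2 and 3 can be sketched in miniature. This is a toy illustration, not PolyDup's actual code: the keyword set, hash base, and modulus are made up, and the window is shrunk from 50 to 4 tokens for brevity.

```python
import re

BASE = 257
MOD = (1 << 61) - 1
W = 4  # PolyDup's real window is 50 tokens; 4 keeps the demo short

KEYWORDS = {"def", "return"}  # tiny stand-in for a real keyword table

def normalize(tokens):
    """Map identifiers, strings, and numbers to placeholder tokens."""
    out = []
    for t in tokens:
        if re.fullmatch(r"\d+", t):
            out.append("$$NUM")
        elif t[0] in "\"'":
            out.append("$$STR")
        elif re.fullmatch(r"[A-Za-z_]\w*", t) and t not in KEYWORDS:
            out.append("$$ID")
        else:
            out.append(t)
    return out

def window_hashes(tokens, w=W):
    """Rabin-Karp rolling hash over every w-token window."""
    vals = [hash(t) % MOD for t in tokens]
    pw = pow(BASE, w - 1, MOD)
    hashes, h = [], 0
    for i, v in enumerate(vals):
        if i >= w:
            h = (h - vals[i - w] * pw) % MOD  # drop the token leaving the window
        h = (h * BASE + v) % MOD
        if i >= w - 1:
            hashes.append(h)
    return hashes

# Two functions with different names but identical structure
a = normalize("def add ( a , b ) : return a + b".split())
b = normalize("def total ( x , y ) : return x + y".split())
print(a == b)                                # True: same shape after normalization
print(window_hashes(a) == window_hashes(b))  # True: every window collides
```

Because normalization erases the names, the two functions produce identical token streams, and every rolling-hash window collides — which is exactly how structurally duplicated code is flagged across renames.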

See architecture-research.md for details.

License

MIT OR Apache-2.0

Links