# PolyDup CLI
Command-line interface for PolyDup, the cross-language duplicate code detector.
## Installation

### From Source

```bash
cargo build --release
# Binary will be at: target/release/polydup
```

### System-wide Installation

```bash
# From the crate directory:
cargo install --path .
# Or from the workspace root:
cargo install --path crates/polydup-cli
```
## Usage

### Basic Scan

```bash
polydup ./src
```

### Scan Multiple Paths

```bash
polydup ./src ./lib ./tests
```
### Adjust Detection Parameters

```bash
# Set minimum block size (default: 50 tokens)
polydup ./src --threshold 100

# Set similarity threshold (default: 0.85 = 85%)
polydup ./src --similarity 0.9

# Combine both
polydup ./src --threshold 100 --similarity 0.9
```
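To build intuition for what the similarity threshold means, here is a rough sketch using Jaccard set similarity over normalized tokens. This is an illustrative metric only; PolyDup's actual scoring lives in polydup-core and may differ.

```python
def token_similarity(a, b):
    """Jaccard similarity between two token sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Two blocks that differ in a single operator after normalization
# (identifiers already collapsed to $ID, numbers to $NUM):
block_a = ["fn", "$ID", "(", "$ID", ")", "{", "return", "$ID", "+", "$NUM", "}"]
block_b = ["fn", "$ID", "(", "$ID", ")", "{", "return", "$ID", "*", "$NUM", "}"]

score = token_similarity(block_a, block_b)
print(round(score, 2))  # 0.8: below the default 0.85, so not reported
```

Raising `--similarity` toward 1.0 demands near-identical blocks; lowering it accepts looser matches like the pair above.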
### Exclude Files (e.g., Tests)

By default, PolyDup excludes common test file patterns:

- `**/*.test.{ts,js,tsx,jsx}`
- `**/*.spec.{ts,js,tsx,jsx}`
- `**/__tests__/**`
- `**/*.test.py`
To use custom exclusions (replaces defaults; the `--exclude` usage shown below is assumed, so confirm with `polydup --help`):

```bash
# Exclude specific patterns
polydup ./src --exclude "**/generated/**"

# Exclude multiple patterns
polydup ./src --exclude "**/generated/**" --exclude "**/vendor/**"

# No exclusions (scan everything including tests)
polydup ./src --exclude ""
```
## Output Formats

**Text output** (default):

```bash
polydup ./src
```

Output:

```text
Scan Results
═══════════════════════════════════════════════════════════
Files scanned: 4
Functions analyzed: 45
Duplicates found: 0

No duplicates found!
```
**JSON output** (for scripting):

```bash
polydup ./src --format json
```
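The JSON report is convenient for scripting. Below is a minimal post-processing sketch; the `duplicates`, `files_scanned`, and `functions_analyzed` field names are assumptions, not a documented schema, so inspect your actual report first.

```python
import json

# Stand-in for the output of `polydup ./src --format json`;
# field names here are hypothetical.
report_text = '{"files_scanned": 4, "functions_analyzed": 45, "duplicates": []}'

report = json.loads(report_text)
dupes = report.get("duplicates", [])
print(f"{len(dupes)} duplicate group(s) found")

# In CI you might exit nonzero, mirroring the CLI's own exit codes:
exit_code = 1 if dupes else 0
```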
### Verbose Mode

Show additional performance metrics:

```bash
polydup ./src --verbose
```
Output includes:
- Total tokens processed
- Number of unique hashes
- Scan duration
## Command-Line Options

```text
polydup [OPTIONS] <PATHS>...

Arguments:
  <PATHS>...  Paths to scan (files or directories)

Options:
  -f, --format <FORMAT>
          Output format [default: text] [possible values: text, json]
  -t, --threshold <MIN_BLOCK_SIZE>
          Minimum code block size in tokens [default: 50]
  -s, --similarity <SIMILARITY>
          Similarity threshold (0.0-1.0) [default: 0.85]
  -v, --verbose
          Show verbose output
  -h, --help
          Print help
  -V, --version
          Print version
```
## Managing False Positives
PolyDup provides an ignore system to suppress false positives while keeping them documented.
### Adding Ignore Entries

Add a duplicate to the ignore list (the `ignore add` subcommand syntax shown is assumed; confirm with `polydup --help`):

```bash
# Add by ID (from scan output)
polydup ignore add abc123def456

# Interactive mode
polydup ignore add
# You'll be prompted for files and reason
```
### Listing Ignored Duplicates

The `ignore list` invocations below are assumed; confirm with `polydup --help`:

```bash
# List all ignored duplicates
polydup ignore list

# Verbose output (shows file paths)
polydup ignore list --verbose

# JSON output for scripting
polydup ignore list --format json
```
Example output:

```text
Ignored Duplicates (2)

1. abc123def456
   Reason:   Boilerplate initialization code
   Added by: alice
   Added at: 2025-12-26 10:30:15 UTC
   Files:    2 file(s)

2. xyz789abc123
   Reason:   Required by framework convention
   Added by: bob
   Added at: 2025-12-26 11:45:30 UTC
   Files:    3 file(s)
```
### Removing Ignore Entries

```bash
# Remove by ID (subcommand syntax assumed; confirm with polydup --help)
polydup ignore remove abc123def456
```
### Ignore File Format

Ignored duplicates are stored in `.polydup-ignore` (TOML; the key names below are illustrative, as only the values survive in this document):

```toml
version = 1

[[ignores]]
id = "abc123def456"
reason = "Intentional code reuse"
added_by = "alice"
added_at = "2025-12-26T10:30:15Z"

[[ignores.files]]
path = "src/utils.rs"
start_line = 10
end_line = 30

[[ignores.files]]
path = "src/helpers.rs"
start_line = 45
end_line = 65
```
**Tip:** Commit `.polydup-ignore` to version control to share ignore decisions with your team!
## Git-Diff Mode (PR Review)

Scan only files changed in a git diff range, which is perfect for PR checks:

```bash
# Scan files changed in current branch vs main
polydup scan . --git-diff main..HEAD

# Scan files changed in last commit
polydup scan . --git-diff HEAD~1..HEAD

# Scan with custom similarity threshold
polydup scan . --git-diff main..HEAD --similarity 0.9
```
### How It Works
- Fast: Only scans files in your diff (10-100x faster for large repos)
- Smart: Scans entire codebase but reports only duplicates involving changed files
- Accurate: Detects when changed code duplicates with unchanged code
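The reporting rule above can be pictured as a post-filter over duplicate pairs: a pair survives if either endpoint lies in a changed file. This is a sketch of the semantics, not the actual implementation.

```python
# Files touched by the diff range (e.g. `git diff --name-only main..HEAD`)
changed = {"src/new_feature.rs"}

# Candidate duplicate pairs found by scanning the whole codebase
duplicate_pairs = [
    ("src/new_feature.rs", "src/legacy.rs"),   # changed vs unchanged: reported
    ("src/legacy.rs", "src/other_legacy.rs"),  # neither changed: suppressed
]

# Keep a pair if EITHER side is a changed file, so duplication between
# new code and old code is still caught.
reported = [pair for pair in duplicate_pairs if changed & set(pair)]
print(reported)  # [('src/new_feature.rs', 'src/legacy.rs')]
```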
### Combined with Ignore Rules and Directives

Git-diff mode works seamlessly with ignore management:

```bash
# PR check: inline directives and rules from .polydup-ignore are applied automatically
polydup scan . --git-diff main..HEAD
```
**CI/CD Example:**

```yaml
# .github/workflows/pr-check.yml
- name: Check for duplicates in PR
  run: polydup scan . --git-diff origin/${{ github.base_ref }}..HEAD
  # Only fails if new duplicates introduced in this PR
```
Benefits:
- ✅ Focuses review on relevant changes
- ✅ Respects existing ignore rules
- ✅ Works with inline directives
- ✅ No baseline files to manage
## Exit Codes
- 0: No duplicates found
- 1: Duplicates found (or error occurred)
This allows usage in CI/CD pipelines:

```bash
#!/bin/bash
if ! polydup ./src; then
  echo "Duplicate code detected"
  exit 1
fi
```
## Examples

### CI/CD Integration

**GitHub Actions:**
```yaml
name: Check Duplicates

on: [push, pull_request]  # trigger list assumed; adjust to taste

jobs:
  check-dupes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
      - name: Install PolyDup
        run: cargo install --path crates/polydup-cli
      - name: Check for duplicates
        run: |
          polydup ./src --threshold 50 --similarity 0.85 --format json > duplicates.json
      - name: Upload results
        uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: duplicate-report
          path: duplicates.json
```
### Pre-commit Hook

```bash
#!/bin/bash
# .git/hooks/pre-commit
if ! polydup ./src; then
  echo "Commit blocked: duplicate code detected"
  exit 1
fi
```
### Makefile Integration

```makefile
# Target names are illustrative
.PHONY: check-dupes
check-dupes:
	polydup ./src --threshold 50 --similarity 0.85

lint: check-dupes
```
### Shell Script for Multiple Projects

```bash
#!/bin/bash
# scan-all-projects.sh

projects=(
  "project1/src"
  "project2/lib"
  "project3/backend"
)

for project in "${projects[@]}"; do
  echo "Scanning $project..."
  polydup "$project"
done
```
## Performance Tuning

### Fast Scan (Lower Accuracy)

```bash
# Large block size = fewer comparisons = faster
polydup ./src --threshold 100
```

### Thorough Scan (Higher Accuracy)

```bash
# Small block size = more comparisons = slower but catches smaller duplicates
polydup ./src --threshold 20
```
### Recommended Settings
| Use Case | Threshold | Similarity |
|---|---|---|
| Quick check | 100 | 0.85 |
| Standard scan | 50 | 0.85 |
| Thorough analysis | 30 | 0.90 |
| Refactoring prep | 20 | 0.95 |
## Troubleshooting

### No Duplicates Found (But You Expected Some)

- **Lower the threshold**: Try `--threshold 20` to catch smaller duplicates
- **Lower similarity**: Try `--similarity 0.7` for looser matching
- **Check file types**: Only Rust, Python, and JavaScript/TypeScript are supported

### Too Many False Positives

- **Raise the threshold**: Try `--threshold 100` to only catch large duplicates
- **Raise similarity**: Try `--similarity 0.95` for stricter matching

### Slow Performance

- **Increase threshold**: Larger blocks = fewer comparisons
- **Scan fewer files**: Be more specific with paths
- **Use release build**: `cargo build --release` (already done if installed)
## Supported Languages

- **Rust**: `.rs` files
- **Python**: `.py` files
- **JavaScript/TypeScript**: `.js`, `.jsx`, `.ts`, `.tsx` files

More languages coming soon!
## Algorithm

PolyDup uses:

- **Tree-sitter** for AST-based parsing
- **Token normalization** (identifiers → `$ID`, strings → `$STR`, numbers → `$NUM`)
- **Rabin-Karp rolling hash** with window size 50
- **Parallel processing** via Rayon for multi-core performance

See architecture-research.md for details.
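A compact sketch of the normalize-then-hash pipeline, using a toy keyword list, a toy per-token code, and a window size of 4 for readability (the real window size is 50, and the real hash lives in polydup-core):

```python
BASE, MOD = 257, (1 << 61) - 1
KEYWORDS = {"let", "fn", "return", "if", "else"}  # toy keyword list

def normalize(token):
    """Collapse rename-able tokens into placeholder classes ($ID/$STR/$NUM)."""
    if token in KEYWORDS:
        return token
    if token.isidentifier():
        return "$ID"
    if token.isdigit():
        return "$NUM"
    if token.startswith('"'):
        return "$STR"
    return token

def rolling_hashes(tokens, window=4):
    """Rabin-Karp: each window's hash is updated in O(1) from the previous."""
    codes = [sum(map(ord, t)) for t in tokens]  # toy per-token code
    if len(codes) < window:
        return []
    power = pow(BASE, window - 1, MOD)  # weight of the token sliding out
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    out = [h]
    for i in range(window, len(codes)):
        h = ((h - codes[i - window] * power) * BASE + codes[i]) % MOD
        out.append(h)
    return out

# Two lines that differ only in identifier/literal names hash identically:
a = [normalize(t) for t in ["let", "total", "=", "count", "+", "1", ";"]]
b = [normalize(t) for t in ["let", "sum", "=", "n", "+", "2", ";"]]
print(rolling_hashes(a) == rolling_hashes(b))  # True: renames were collapsed
```

Matching hash values mark candidate duplicates, which a similarity check then confirms or rejects.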
## License
MIT OR Apache-2.0
## Links
- Core Library: polydup-core
- Node.js Bindings: polydup-node
- Python Bindings: polydup-py
- GitHub: https://github.com/wiesnerbernard/polydup