PolyDup
Cross-language duplicate code detector powered by Tree-sitter and Rust.
Features
- Blazing Fast: Parallel processing with Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
- Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
- Accurate: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
- Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
- Configurable: Adjust thresholds and block sizes for your needs
- Efficient: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
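The Rabin-Karp rolling hash mentioned above is what makes window-by-window comparison cheap: each window's hash is derived from the previous one in O(1) instead of rehashing from scratch. A minimal Python sketch of the idea (constants and names here are illustrative, not polydup-core's actual implementation):

```python
# Illustrative sketch of a Rabin-Karp rolling hash over a token stream.
# polydup-core applies this to normalized Tree-sitter tokens; the constants
# and helper names below are assumptions for demonstration only.

BASE = 257
MOD = (1 << 61) - 1  # large Mersenne prime keeps accidental collisions rare

def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for each window-sized span in O(n) total."""
    if len(tokens) < window:
        return
    h = 0
    for tok in tokens[:window]:
        h = (h * BASE + hash(tok) % MOD) % MOD
    yield 0, h
    power = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    for i in range(window, len(tokens)):
        h = (h - (hash(tokens[i - window]) % MOD) * power) % MOD  # drop oldest
        h = (h * BASE + hash(tokens[i]) % MOD) % MOD              # add newest
        yield i - window + 1, h

def find_duplicate_spans(tokens, window):
    """Group window positions that share a hash: candidate duplicate blocks."""
    by_hash = {}
    for start, h in rolling_hashes(tokens, window):
        by_hash.setdefault(h, []).append(start)
    return [starts for starts in by_hash.values() if len(starts) > 1]
```

Spans that share a hash are only candidates; a real detector verifies the underlying tokens afterward, so hash collisions cannot produce false positives.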
Architecture
Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.
```
┌─────────────────────────────────────────────┐
│              polydup-core (Rust)            │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
      ▲           ▲            ▲
      │           │            │
┌─────┴───┐  ┌────┴────┐  ┌────┴───┐
│   CLI   │  │ Node.js │  │ Python │
│ (Rust)  │  │(napi-rs)│  │ (PyO3) │
└─────────┘  └─────────┘  └────────┘
```
Crates:
- polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
- polydup (CLI): Standalone CLI tool (`cargo install polydup`)
- polydup-node: Node.js library bindings via napi-rs (`npm install @polydup/core`)
- polydup-py: Python library bindings via PyO3 (`pip install polydup`)
Installation
Important: PolyDup is available in multiple forms for different use cases:
- CLI Tool: `cargo install polydup` (command-line scanning)
- Python Library: `pip install polydup` (Python API bindings, NOT a CLI)
- Node.js Library: `npm install @polydup/core` (Node.js API bindings, NOT a CLI)

If you want to run `polydup` from the command line, use `cargo install polydup`.
GitHub Action (Easiest for CI/CD) 🚀
The fastest way to add duplicate detection to your workflow:
```yaml
name: Code Quality
on:
  pull_request:
    branches:

permissions:
  contents: read
  pull-requests: write # Required for PR comments

jobs:
  duplicate-detection:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Required for git-diff mode
      - uses: wiesnerbernard/polydup-action@v0.2.1
        with:
          threshold: 50
          similarity: '0.85'
          fail-on-duplicates: true
          github-token: ${{ secrets.GITHUB_TOKEN }}
```
Benefits:
- 🚀 10-100x faster (only scans changed files in PR)
- 💬 Automatic PR comments with duplicate reports
- ✅ Zero configuration needed
- 🔒 Secure (no data leaves your repository)
Action Inputs
| Input | Default | Description |
|---|---|---|
| `threshold` | `50` | Minimum code block size in tokens |
| `similarity` | `0.85` | Similarity threshold (0.0-1.0) |
| `fail-on-duplicates` | `true` | Fail the check if duplicates are found |
| `format` | `text` | Output format: `text` or `json` |
| `base-ref` | auto | Base git reference (auto-detected from the PR) |
| `github-token` | - | Token for PR comments |
| `comment-on-pr` | `true` | Post results as a PR comment |
Action Outputs
| Output | Description |
|---|---|
| `duplicates-found` | Number of duplicate code blocks found |
| `files-scanned` | Number of files scanned |
| `exit-code` | Exit code (0 = no duplicates, 1 = duplicates) |
Using Outputs in Workflows
```yaml
- uses: wiesnerbernard/polydup-action@v0.2.1
  id: polydup
  with:
    fail-on-duplicates: false
    github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Check results
  run: |
    echo "Files scanned: ${{ steps.polydup.outputs.files-scanned }}"
    echo "Duplicates found: ${{ steps.polydup.outputs.duplicates-found }}"
    if [ "${{ steps.polydup.outputs.duplicates-found }}" -gt 10 ]; then
      echo "Too many duplicates!"
      exit 1
    fi
```
Example PR Comment
When duplicates are found, the action posts a comment like:
```markdown
## PolyDup Duplicate Code Report

**Found 3 duplicate code block(s)**

- Files scanned: 12
- Threshold: 50 tokens
- Similarity: 0.85

<details>
<summary>View Details</summary>

[Detailed scan output...]

</details>

**Tip**: Consider refactoring duplicated code to improve maintainability.
```
See polydup-action for full documentation.
Rust CLI (Recommended for Local Development)
The fastest way to use PolyDup locally is via the CLI tool:
```sh
# Install from crates.io
cargo install polydup

# Verify installation
polydup --version

# Scan for duplicates
polydup scan ./src
```
System Requirements:
- Rust 1.70+ (if building from source)
- macOS, Linux, or Windows
Note: Homebrew tap coming soon! (`brew install polydup`)
Pre-built Binaries:
Download pre-compiled binaries from GitHub Releases:
```sh
# macOS (Apple Silicon)
# macOS (Intel)
# Linux (x86_64)
# Linux (x86_64 static - musl)
# Windows (x86_64)
# Download polydup-windows-x86_64.exe from the releases page and add it to PATH
```
Node.js/npm (Library Only)
Note: This is a library package for integrating duplicate detection into Node.js applications. It does NOT provide a CLI. For command-line usage, use `cargo install polydup`.
Install as a project dependency:

```sh
npm install @polydup/core
```

Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage (the exported function name below is illustrative; check the package's type definitions for the exact API):

```js
// Function and option names are assumptions, not the verified API.
const { scanDuplicates } = require('@polydup/core');

const duplicates = scanDuplicates(['./src'], { threshold: 0.85 });
console.log(duplicates);
```
Python/pip (Library Only)
Note: This is a library package for integrating duplicate detection into Python applications. It does NOT provide a CLI. For command-line usage, use `cargo install polydup`. Running `python -m polydup` will display installation guidance.
Install from PyPI:

```sh
# Using pip
pip install polydup

# Using uv (recommended for faster installs)
uv pip install polydup
```

Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage (the function name below is illustrative; check the package documentation for the exact API):

```python
import polydup

# Scan for duplicates (function name is an assumption, not the verified API)
duplicates = polydup.scan_duplicates(["./src"], threshold=0.85)
```
Rust Library
Use the core library in your Rust project:
```toml
[dependencies]
polydup-core = "0.1"
```

```rust
// Item names below are illustrative; see the polydup-core docs for the exact API.
use polydup_core::detect_duplicates;
use std::path::PathBuf;
```
Building from Source
CLI
Node.js
Python
CLI Usage
Quick Start with polydup init
The fastest way to get started is with the interactive initialization wizard:
```sh
# Run the initialization wizard
polydup init

# Non-interactive mode (use defaults)
polydup init --yes

# Force overwrite existing configuration
polydup init --force
```
The wizard will:
- Auto-detect your project environment (Node.js, Rust, Python, etc.)
- Generate `.polyduprc.toml` with environment-specific defaults
- Create a GitHub Actions workflow (optional)
- Show install instructions tailored to your environment
- Provide next steps for local usage
Configuration File (.polyduprc.toml)
After running `polydup init`, you'll have a `.polyduprc.toml` file. The section and key names below are reconstructed for illustration; the file generated by `polydup init` is the canonical reference:

```toml
[detection]
min_block_size = 50
similarity = 0.85

[ignore]
patterns = [
  "**/node_modules/**",
  "**/__pycache__/**",
  "**/*.test.js",
  "**/*.test.py",
]

[output]
format = "text"
verbose = false

[features]
enable_type3 = false
color = true
```
Configuration Discovery:
- PolyDup searches for `.polyduprc.toml` in the current directory and parent directories
- CLI arguments override config file settings
- Perfect for monorepos with shared configuration at root
Basic Commands
```sh
# Scan a directory
polydup scan ./src

# Scan multiple directories
polydup scan ./src ./lib

# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.95

# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 20

# JSON output for scripting
polydup scan ./src --format json
```
Examples
Quick scan for severe duplicates:

```sh
polydup scan ./src --threshold 0.95
```

Deep scan for similar code:

```sh
polydup scan ./src --threshold 0.75 --enable-type3
```

Scan specific file types:

```sh
# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src
```

CI/CD integration:

```sh
# Exit with error if duplicates found
polydup scan ./src || exit 1
```
Output Formats
Text (default): Human-readable colored output with file paths, line numbers, and similarity scores
JSON: Machine-readable format for scripting and tooling integration
Commands
PolyDup supports the following subcommands:
| Command | Description | Example |
|---|---|---|
| `scan` | Scan for duplicate code (default command) | `polydup scan ./src` |
| `init` | Interactive setup wizard | `polydup init` |
Scan Command Options:
The scan command accepts all options listed below. When no subcommand is specified, scan is assumed for backward compatibility.
```sh
# These are equivalent:
polydup scan ./src
polydup ./src
```
Init Command Options:
| Option | Description |
|---|---|
| `--yes`, `-y` | Skip interactive prompts, use defaults |
| `--force` | Overwrite existing `.polyduprc.toml` |
CLI Options
| Option | Type | Default | Description |
|---|---|---|---|
| `--threshold` | float | 0.9 | Similarity threshold (0.0-1.0) |
| `--min-block-size` | int | 10 | Minimum lines per code block |
| `--format` | `text`\|`json` | `text` | Output format |
| `--output` | path | - | Write report to file |
| `--only-type` | types | - | Filter by clone type (type-1, type-2, type-3) |
| `--exclude-type` | types | - | Exclude clone types |
| `--group-by` | criterion | - | Group results (file, similarity, type, size) |
| `--verbose` | flag | false | Show performance statistics |
| `--no-color` | flag | false | Disable colored output |
| `--debug` | flag | false | Enable debug mode with detailed traces |
| `--enable-type3` | flag | false | Enable Type-3 gap-tolerant detection |
| `--save-baseline` | path | - | Save scan results as a baseline for future comparisons |
| `--compare-to` | path | - | Compare against a baseline (show only new duplicates) |
| `--git-diff` | range | - | Only scan files changed in the git diff range (e.g., origin/main..HEAD) ⚡ Recommended for CI |
Performance Tip: For large codebases (>50K LOC), increase --min-block-size to 20-50 for faster scans with less noise.
Baseline/Snapshot Mode
The most powerful feature for CI/CD: Block new duplicates without failing on legacy code.
Use Case: "We have existing duplicates, but block any NEW ones"
Many codebases have legacy duplication that's not worth fixing immediately. Baseline mode lets you:
- ✅ Accept existing duplicates as-is
- ✅ Fail CI/CD only when new duplicates are introduced
- ✅ Gradually reduce technical debt without blocking development
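Conceptually, baseline comparison is a set difference over duplicate fingerprints: anything present in the current scan but not in the baseline counts as new. A Python sketch of that logic (the record shape here is illustrative, not the actual baseline JSON schema):

```python
def fingerprint(dup):
    """Stable identity for a duplicate group (field names are illustrative)."""
    return tuple(sorted((loc["file"], loc["line"]) for loc in dup["locations"]))

def new_duplicates(current, baseline):
    """Return duplicates present in the current scan but not in the baseline."""
    known = {fingerprint(d) for d in baseline}
    return [d for d in current if fingerprint(d) not in known]

# Legacy duplicate captured on main; accepted as-is.
baseline = [{"locations": [{"file": "src/utils.ts", "line": 45},
                           {"file": "src/legacy.ts", "line": 10}]}]
# Feature branch re-reports the legacy one plus a new copy-paste.
current = baseline + [{"locations": [{"file": "src/new-feature.ts", "line": 12},
                                     {"file": "src/utils.ts", "line": 45}]}]

fresh = new_duplicates(current, baseline)
exit_code = 1 if fresh else 0  # CI fails only on duplicates introduced since baseline
```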
Quick Start
Step 1: Create baseline from your main branch
```sh
# On main/master branch: capture current state
polydup scan ./src --save-baseline .polydup-baseline.json
```
Step 2: Use in CI/CD to block new duplicates
```yaml
# .github/workflows/polydup.yml
- name: Check for new duplicates
  run: |
    polydup scan ./src --compare-to .polydup-baseline.json
    # Exits with code 1 if NEW duplicates found
    # Exits with code 0 if no new duplicates (CI passes)
```
Step 3: See it in action on a PR
```sh
# Developer adds duplicate code in feature branch
polydup scan ./src --compare-to .polydup-baseline.json
```
Output:

```
ℹ Comparing against baseline: .polydup-baseline.json
11 total duplicates, 3 new since baseline

Duplicates
═══════════════════════════════════════════
1. Type-2 (renamed) | Similarity: 100.0% | Length: 59 tokens
   ├─ src/new-feature.ts:12
   └─ src/utils.ts:45

❌ 3 new duplicates found since baseline
```

Exit code: 1 (CI fails, PR blocked)
Advanced Baseline Workflows
Incremental improvement: update the baseline after cleanup

```sh
# Team cleans up 10 duplicates, then refreshes the baseline
polydup scan ./src --save-baseline .polydup-baseline.json
```

Combining with filters

```sh
# Save baseline excluding Type-3 (noisy matches)
polydup scan ./src --exclude-type type-3 --save-baseline .polydup-baseline.json

# Only block new Type-1 and Type-2 duplicates
polydup scan ./src --exclude-type type-3 --compare-to .polydup-baseline.json
```

Manual review mode

```sh
# See what duplicates are NEW (no CI failure, just info)
polydup scan ./src --compare-to .polydup-baseline.json || true
```
Real-world Example: PR Comments
Use with GitHub Actions to comment on PRs:
```yaml
- name: Check duplicates
  id: polydup
  run: |
    OUTPUT=$(polydup scan ./src --compare-to .polydup-baseline.json --format json || true)
    NEW_COUNT=$(echo "$OUTPUT" | jq '.duplicates | length')
    echo "new_duplicates=$NEW_COUNT" >> $GITHUB_OUTPUT
- name: Comment on PR
  if: steps.polydup.outputs.new_duplicates > 0
  uses: actions/github-script@v7
  with:
    script: |
      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: '⚠️ This PR introduces ${{ steps.polydup.outputs.new_duplicates }} new code duplicates. Please refactor before merging.'
      })
```
Git-Diff Mode 🚀 RECOMMENDED FOR CI/CD
The fastest, simplest way to check for duplicates in Pull Requests.
Why Git-Diff Mode?
Advantages over Baseline Mode:
- ✅ 10-100x faster - Only scans files changed in the diff
- ✅ No file management - No baseline file to commit/sync
- ✅ Universal - Works on all CI platforms (GitHub, GitLab, Jenkins, etc.)
- ✅ Simpler - Just specify a git range, no baseline setup needed
- ✅ Accurate - Works in shallow clones (common in CI environments)
Quick Start
Single command to check duplicates in a PR:
```sh
# Scan only files changed between main and current branch
polydup scan . --git-diff origin/main..HEAD
```
CI/CD Integration (GitHub Actions):
```yaml
# .github/workflows/polydup.yml
name: Check for Duplicates
on:
  pull_request:
    branches:

jobs:
  duplicate-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Need full history for git diff
      - name: Install PolyDup
        run: cargo install polydup
      - name: Check for duplicates in PR
        run: |
          polydup scan . --git-diff origin/main..HEAD
          # Exits with code 1 if duplicates found in the diff
          # Exits with code 0 if no duplicates (CI passes)
```
Common Usage Patterns
1. Check uncommitted changes:

```sh
polydup scan . --git-diff HEAD
```

2. Compare branches:

```sh
polydup scan . --git-diff main..feature-branch
```

3. Check last N commits:

```sh
polydup scan . --git-diff HEAD~5..HEAD
```

4. JSON output for tooling:

```sh
polydup scan . --git-diff origin/main..HEAD --format json
```
How It Works
- Runs `git diff --name-only --diff-filter=ACMR <range>`
- Gets the list of Added, Copied, Modified, and Renamed files
- Filters out deleted files (can't scan what doesn't exist)
- Scans only those files for duplicates
- Exits with code 1 if duplicates are found, 0 otherwise
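The file-selection steps above can be approximated in a few lines of Python; the function names here are mine, not PolyDup's internals:

```python
import subprocess

# Extensions PolyDup can parse (from the Supported Languages section).
SUPPORTED = (".js", ".jsx", ".ts", ".tsx", ".py", ".rs", ".vue", ".svelte")

def scannable(paths):
    """Keep only files with a supported extension."""
    return [p for p in paths if p.endswith(SUPPORTED)]

def changed_files(range_spec):
    """Replicate the --git-diff selection: Added/Copied/Modified/Renamed files
    in the range. Deleted files never appear because the filter omits D."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACMR", range_spec],
        capture_output=True, text=True, check=True,
    ).stdout
    return scannable(out.splitlines())
```

Only the handful of paths this returns are parsed and hashed, which is where the speedup on large repositories comes from.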
Real-World CI Example
```sh
# Before: scanning the entire codebase (50K LOC, 500 files)
polydup scan .

# After: git-diff mode (PR with 5 changed files)
polydup scan . --git-diff origin/main..HEAD
```
10-100x speedup on large codebases with focused PRs!
Edge Cases Handled
- ✅ Deleted files - Automatically filtered out (can't scan deleted code)
- ✅ Renamed files - Detected via `--diff-filter=R` and scanned correctly
- ✅ Shallow clones - Works in CI environments with `fetch-depth: 0`
- ✅ Invalid ranges - Clear error message with suggestions
When to Use Git-Diff vs Baseline
Use Git-Diff Mode (recommended):
- ✅ Pull Request checks in CI/CD
- ✅ Fast feedback on code changes
- ✅ Git-based workflows
Use Baseline Mode when:
- ✅ Non-git workflows (Perforce, SVN, etc.)
- ✅ Tracking historical debt reduction
- ✅ Explicit acceptance of legacy duplicates
Advanced Features
Filtering by Clone Type
Focus on specific types of duplicates for targeted refactoring:
```sh
# Show only exact duplicates (highest priority)
polydup scan ./src --only-type type-1

# Show only renamed duplicates
polydup scan ./src --only-type type-2

# Show both Type-1 and Type-2
polydup scan ./src --only-type type-1 --only-type type-2

# Exclude noisy Type-3 matches
polydup scan ./src --exclude-type type-3
```
Use cases:
- `--only-type type-1`: Quick wins for immediate refactoring
- `--only-type type-2`: Identify abstraction opportunities
- `--exclude-type type-3`: Reduce false positives in large codebases
Grouping Results
Organize duplicates for different workflows:
```sh
# Group by file (refactoring prioritization)
polydup scan ./src --group-by file

# Group by similarity (quality triage)
polydup scan ./src --group-by similarity

# Group by clone type (targeted cleanup)
polydup scan ./src --group-by type

# Group by size (impact analysis)
polydup scan ./src --group-by size
```
Grouping strategies:
- file: See which files need refactoring most
- similarity: Prioritize high-confidence matches
- type: Handle Type-1 separately from Type-2
- size: Focus on large duplicates for maximum impact
Output Options
```sh
# Save report to file
polydup scan ./src --output report.txt

# JSON for CI/CD pipelines
polydup scan ./src --format json --output report.json

# Disable colors for logs
polydup scan ./src --no-color

# Or use the NO_COLOR environment variable
NO_COLOR=1 polydup scan ./src

# Verbose mode with performance stats
polydup scan ./src --verbose
```
Debug Mode
Enhanced error messages with actionable suggestions:
```sh
# Enable debug mode for troubleshooting
polydup scan ./src --debug

# Debug mode shows:
# - Current working directory
# - File access permissions
# - Parser errors with context
# - Configuration validation details
```
Example error output:
```
Error: Path does not exist: /nonexistent/path

Suggestion: Check the path spelling and ensure it exists
Example: polydup scan ./src
         polydup scan /absolute/path/to/project

Debug Info: Current directory: /Users/you/project
```
Combining Features
Mix and match for powerful workflows:
```sh
# High-priority refactoring targets
polydup scan ./src --only-type type-1 --group-by size

# CI/CD duplicate gate
polydup scan ./src --format json --output report.json || exit 1

# Deep analysis with verbose stats
polydup scan ./src --enable-type3 --verbose

# Quick triage without noise
polydup scan ./src --exclude-type type-3 --group-by similarity
```
Dashboard Output
PolyDup provides a professional dashboard with actionable insights:
```
╔═══════════════════════════════════════════════════════════╗
║                      Scan Results                         ║
╠═══════════════════════════════════════════════════════════╣
║ Files scanned:       142                                  ║
║ Functions analyzed:  287                                  ║
║ Duplicates found:    15                                   ║
║ Estimated savings:   ~450 lines                           ║
╠═══════════════════════════════════════════════════════════╣
║ Clone Type Breakdown:                                     ║
║   Type-1 (exact):    5 groups │ Critical priority         ║
║   Type-2 (renamed):  8 groups │ High priority             ║
║   Type-3 (modified): 2 groups │ Medium priority           ║
╠═══════════════════════════════════════════════════════════╣
║ Top Offenders:                                            ║
║   1. src/handlers.ts        8 duplicates                  ║
║   2. lib/utils.ts           5 duplicates                  ║
║   3. components/Form.tsx    3 duplicates                  ║
╚═══════════════════════════════════════════════════════════╝

Duplicate #1 (Type-2: Renamed identifiers)
Location: src/auth.ts:45-68 ↔ src/admin.ts:120-143
Similarity: 94.2% | Length: 24 lines
...
```
Dashboard features:
- Lines saved estimation: Potential code reduction
- Top offenders: Files needing most attention
- Similarity range: Quality distribution (min-max)
- Priority labels: Critical (Type-1), High (Type-2), Medium (Type-3)
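The estimated-savings figure is simple arithmetic: every copy beyond the first in a duplicate group could be removed after extracting a shared helper. A sketch under an assumed group representation:

```python
def estimated_savings(groups):
    """Lines removable if each duplicate group kept a single copy.

    Each group is (copies, lines_per_copy); this tuple representation is an
    assumption for illustration, not PolyDup's internal data model.
    """
    return sum((copies - 1) * lines for copies, lines in groups)

# Three groups: 2 copies of 24 lines, 3 copies of 10 lines, 2 copies of 8 lines
savings = estimated_savings([(2, 24), (3, 10), (2, 8)])  # 24 + 20 + 8 = 52
```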
Exit Codes
PolyDup uses semantic exit codes for CI/CD integration:
| Exit Code | Meaning | Use Case |
|---|---|---|
| 0 | No duplicates found | Clean codebase ✓ |
| 1 | Duplicates detected | Quality gate (expected) |
| 2 | Error occurred | Configuration/runtime issue |
CI/CD examples:
```sh
# Fail build if duplicates found
polydup scan ./src || exit 1

# Warning only (report but don't fail)
polydup scan ./src || echo "Duplicates found (non-blocking)"

# Strict quality gate (fail on any duplicates)
if polydup scan ./src; then
  echo "No duplicates found"
else
  echo "Duplicates detected; failing build"
  exit 1
fi
```
Supported Languages
- JavaScript/TypeScript: `.js`, `.jsx`, `.ts`, `.tsx`
- Python: `.py`
- Rust: `.rs`
- Vue: `.vue`
- Svelte: `.svelte`
More languages coming soon (Java, Go, C/C++, Ruby, PHP)
Clone Types
PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:
Type-1: Exact Clones
Identical code fragments except for whitespace, comments, and formatting.
Example (illustrative):

```js
// File 1
function add(a, b) {
  return a + b;
}

// File 2 (Type-1 clone - only formatting differs)
function add(a,b) { return a + b; }
```
Why they exist: Direct copy-paste without any modifications.
Type-2: Renamed/Parameterized Clones
Structurally identical code with renamed identifiers, changed literals, or different types.
Example (illustrative):

```js
// File 1
function sumPrices(prices) {
  let sum = 0;
  for (const p of prices) sum += p;
  return sum;
}

// File 2 (Type-2 clone - renamed variables, same logic)
function sumWeights(weights) {
  let total = 0;
  for (const w of weights) total += w;
  return total;
}
```
Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.
Detection: PolyDup normalizes identifiers and literals (e.g., sum → @@ID, 0 → @@NUM) to detect structural similarity.
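That normalization step can be imitated with Python's own tokenizer. PolyDup walks the Tree-sitter AST instead, but the principle is the same: identifiers and literals collapse to placeholders, so renamed copies produce identical token streams and hash to the same value. (This sketch also collapses keywords, which a real implementation would distinguish.)

```python
import io
import tokenize

def normalize(source):
    """Collapse identifiers and literals so Type-2 clones compare equal."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:      # identifiers (and keywords, here)
            out.append("@@ID")
        elif tok.type == tokenize.NUMBER:  # numeric literals
            out.append("@@NUM")
        elif tok.type == tokenize.STRING:  # string literals
            out.append("@@STR")
        elif tok.type == tokenize.OP:      # operators/punctuation kept as-is
            out.append(tok.string)
    return out

a = "total = 0\nfor x in xs:\n    total += x\n"
b = "acc = 0\nfor item in items:\n    acc += item\n"
assert normalize(a) == normalize(b)  # Type-2 clones normalize identically
```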
Type-3: Near-Miss Clones (Not Yet Implemented)
Similar code with minor modifications like inserted/deleted statements or changed expressions.
Example (illustrative):

```js
// File 1
function applyDiscount(price) {
  return price * 0.9;
}

// File 2 (Type-3 clone - added logging, changed discount logic)
function applyDiscount(price) {
  console.log(`Discounting ${price}`);
  return price > 100 ? price * 0.85 : price * 0.9;
}
```
Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
Type-4: Semantic Clones (Not Yet Implemented)
Functionally equivalent code with different implementations.
Example (illustrative):

```js
// File 1 - Imperative loop
function sum(xs) {
  let total = 0;
  for (const x of xs) total += x;
  return total;
}

// File 2 - Functional approach
const sum = (xs) => xs.reduce((acc, x) => acc + x, 0);

// File 3 - Recursive
function sum(xs) {
  return xs.length === 0 ? 0 : xs[0] + sum(xs.slice(1));
}
```
Why they exist: Different programming paradigms or styles achieving the same result.
Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.
Understanding Your Results
When PolyDup reports duplicates, the clone type indicates:
- Type-1: Exact copy-paste → Quick win for extraction into shared utilities
- Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
- Type-3: Modified duplicates → May require refactoring with strategy patterns
- Type-4: Semantic equivalence → Consider standardizing on one implementation
Typical Real-World Distribution:
- Type-1: 5-10% (rare in mature codebases)
- Type-2: 60-70% (most common - copy-paste-modify)
- Type-3: 20-30% (evolved duplicates)
- Type-4: <5% (requires specialized detection)
Performance Note: PolyDup efficiently handles codebases up to 100K LOC. Tested on real-world projects with detection times under 5 seconds for most repos.
Troubleshooting
Common Issues
"No duplicates found" but you expect some
Possible causes:
- Threshold too high: Try lowering `--threshold` to 0.70-0.80
- Block size too large: Reduce `--min-block-size` to 5-10 lines
- Type-3 not enabled: Add `--enable-type3` for gap-tolerant matching
```sh
# More sensitive scan
polydup scan ./src --threshold 0.75 --min-block-size 5 --enable-type3
```
"Too many false positives"
Solutions:
- Increase threshold: Use `--threshold 0.95` for high-confidence matches
- Exclude Type-3: Add `--exclude-type type-3` to remove noisy matches
- Increase block size: Use `--min-block-size 50` for substantial duplicates only
```sh
# Strict, high-quality scan
polydup scan ./src --threshold 0.95 --min-block-size 50 --exclude-type type-3
```
"Permission denied" errors
Fix:
```sh
# Check file permissions
ls -la ./src

# Use --debug to see detailed error info
polydup scan ./src --debug
```
"Unsupported file type" warnings
Explanation: PolyDup currently supports JavaScript, TypeScript, Python, Rust, Vue, and Svelte. Other file types are skipped automatically.
Workaround:
- Wait for language support (check GitHub issues)
- Contribute a parser (see CONTRIBUTING.md)
Colors not working in CI/CD
Solution:
```sh
# Disable colors explicitly
polydup scan ./src --no-color

# Or use the environment variable
NO_COLOR=1 polydup scan ./src
```
"Out of memory" on large codebases
Solutions:
```sh
# Increase minimum block size to reduce memory usage
polydup scan ./src --min-block-size 50

# Scan directories separately
polydup scan ./src
polydup scan ./lib

# Exclude generated/vendor code:
# create .polyduprc.toml with exclude patterns
```
Performance Tips
For large codebases (>50K LOC):
- Use `--min-block-size 50-100` to focus on substantial duplicates
- Disable Type-3 detection (it's more computationally expensive)
- Use `--exclude-type type-3` to skip gap-tolerant matching
- Increase `--threshold` to 0.95 to reduce candidate matches
For monorepos:
- Create `.polyduprc.toml` at the root with shared configuration
- Use `--group-by file` to organize results by module
- Exclude `node_modules`, `dist`, `target`, etc. in the config
For CI/CD:
- Cache the `polydup` binary to speed up the pipeline
- Use `--format json` for machine-readable output
- Set appropriate exit-code handling (0 = clean, 1 = duplicates, 2 = error)
Getting Help
Debug Mode:
```sh
# Enable detailed error traces
polydup scan ./src --debug
```
Verbose Output:
```sh
# Show performance statistics
polydup scan ./src --verbose
```
Report an Issue:
- Check existing issues
- Include:
  - PolyDup version (`polydup --version`)
  - Operating system and architecture
  - Command that failed
  - Error message with the `--debug` flag
  - Sample code if applicable (anonymized)
Community:
- GitHub Discussions: Ask questions
- GitHub Issues: Report bugs
Development
Building from Source
Prerequisites:
- Rust 1.70+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- Node.js 16+ (for Node.js bindings)
- Python 3.8-3.12 (for Python bindings)
CLI:
Node.js bindings:
Python bindings:
Run tests:
```sh
# All tests
cargo test --workspace

# Specific crate
cargo test -p polydup-core

# With coverage
```
Creating a Release
Recommended: Create releases directly from GitHub UI - fully automated, no local tools required!
- Go to Releases → New Release
- Create a new tag (e.g., `v0.2.7`)
- Click "Publish release"
- Everything happens automatically (~5-7 minutes):
- Syncs version files (Cargo.toml, package.json, pyproject.toml)
- Updates CHANGELOG.md with release entry
- Moves tag to version-synced commit (if needed)
- Builds binaries for all 5 platforms (macOS/Linux/Windows)
- Publishes to crates.io, npm, and PyPI
- Creates release with binary assets
- Zero manual steps required - truly one-click releases!
Alternative: Use the release script locally:
See docs/RELEASE.md for detailed instructions.
Pre-commit Hooks
Install pre-commit hooks to automatically run linting and tests:
```sh
# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install

# Run manually on all files
pre-commit run --all-files
```
The hooks will automatically run:
- On commit: `cargo fmt`, `cargo clippy`, file checks (trailing whitespace, YAML/TOML validation)
- On push: Full test suite with `cargo test`
To skip hooks temporarily:

```sh
git commit --no-verify
```
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Install pre-commit hooks (`pre-commit install`)
- Make your changes and ensure tests pass (`cargo test --workspace`)
- Run clippy (`cargo clippy --workspace --all-targets -- -D warnings`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
License
MIT OR Apache-2.0