PolyDup
Cross-language duplicate code detector powered by Tree-sitter and Rust.
Features
- Blazing Fast: Parallel processing with Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
- Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
- Accurate: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
- Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
- Configurable: Adjust thresholds and block sizes for your needs
- Efficient: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
Architecture
Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.
┌─────────────────────────────────────────────┐
│             polydup-core (Rust)             │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
       ▲               ▲               ▲
       │               │               │
  ┌────┴────┐     ┌────┴────┐     ┌────┴────┐
  │   CLI   │     │ Node.js │     │ Python  │
  │ (Rust)  │     │(napi-rs)│     │ (PyO3)  │
  └─────────┘     └─────────┘     └─────────┘
Crates:
- polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
- polydup-cli: Standalone CLI tool (cargo install polydup-cli)
- polydup-node: Node.js native addon via napi-rs (npm install @polydup/core)
- polydup-py: Python extension module via PyO3 (pip install polydup)
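To make the Rabin-Karp hashing step listed for polydup-core more concrete, here is a minimal Python sketch of a rolling hash over fixed-size token windows. It is not PolyDup's Rust implementation: the function names, the BASE and MOD constants, and the use of zlib.crc32 as the per-token hash are all assumptions chosen for this illustration.

```python
import zlib
from collections import defaultdict

BASE = 257               # assumed multiplier for the polynomial hash
MOD = (1 << 61) - 1      # assumed modulus (a large Mersenne prime)


def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for every run of `window` consecutive tokens."""
    if window <= 0 or len(tokens) < window:
        return
    values = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        yield i - window + 1, h


def find_repeated_windows(tokens, window):
    """Return sorted (first_start, later_start) pairs of identical windows."""
    buckets = defaultdict(list)
    for start, h in rolling_hashes(tokens, window):
        buckets[h].append(start)
    repeats = []
    for starts in buckets.values():
        first = starts[0]
        for other in starts[1:]:
            # Compare the tokens directly to rule out hash collisions.
            if tokens[first:first + window] == tokens[other:other + window]:
                repeats.append((first, other))
    return sorted(repeats)
```

Each window hash is derived from the previous one in constant time, so hashing a file costs time proportional to its token count rather than to token count times window size.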
Installation
Rust CLI (Recommended)
The fastest way to use PolyDup is via the CLI tool:
# Install from crates.io
cargo install polydup-cli
# Verify installation
polydup --version
# Scan for duplicates
polydup scan ./src
System Requirements:
- Rust 1.70+ (if building from source)
- macOS, Linux, or Windows
Note: A Homebrew tap is coming soon (brew install polydup).
Pre-built Binaries:
Download pre-compiled binaries from GitHub Releases:
# macOS (Apple Silicon)
# macOS (Intel)
# Linux (x86_64)
# Linux (x86_64 static - musl)
# Windows (x86_64)
# Download polydup-windows-x86_64.exe from releases page and add to PATH
Node.js/npm
Install as a project dependency or globally:
# Project dependency
npm install @polydup/core
# Global installation
npm install -g @polydup/core
Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)
Usage:
const = require;
const duplicates = ;
console.log;
duplicates.;
Python/pip
Install from PyPI:
# Using pip
pip install polydup
# Using uv (recommended for faster installs)
uv pip install polydup
Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)
Usage:
# Scan for duplicates
=
Rust Library
Use the core library in your Rust project:
[]
= "0.1"
use ;
use PathBuf;
Building from Source
CLI
Node.js
Python
CLI Usage
Quick Start with polydup init
The fastest way to get started is with the interactive initialization wizard:
# Run the initialization wizard
# Non-interactive mode (use defaults)
# Force overwrite existing configuration
The wizard will:
- Auto-detect your project environment (Node.js, Rust, Python, etc.)
- Generate .polyduprc.toml with environment-specific defaults
- Create a GitHub Actions workflow (optional)
- Show install instructions tailored to your environment
- Provide next steps for local usage
Example workflow:
=============================
)
)
Configuration File (.polyduprc.toml)
After running polydup init, you'll have a .polyduprc.toml file:
[]
= 50
= 0.85
[]
= [
"**/node_modules/**",
"**/__pycache__/**",
"**/*.test.js",
"**/*.test.py",
]
[]
= "text"
= false
[]
= false
= true
Configuration Discovery:
- PolyDup searches for .polyduprc.toml in the current directory and its parent directories
- CLI arguments override config file settings
- Perfect for monorepos with shared configuration at root
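The discovery rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior (nearest config wins, CLI values override file values), not PolyDup's actual code: find_config, merge_settings, and the setting keys used in the usage comment are hypothetical names.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".polyduprc.toml"


def find_config(start: Path) -> Optional[Path]:
    """Return the nearest config file at or above `start`, or None."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def merge_settings(config: dict, cli: dict) -> dict:
    """Overlay CLI values on config values; unset CLI options (None) are ignored."""
    merged = dict(config)
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


# Hypothetical usage: keys are illustrative, not PolyDup's real setting names.
# merge_settings({"threshold": 0.85}, {"threshold": 0.95}) -> {"threshold": 0.95}
```

In a monorepo, running a scan from packages/app would pick up a config in packages/app first and fall back to the one at the repository root.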
Basic Commands
# Scan a directory
# Scan multiple directories
# Custom threshold (0.0-1.0, higher = stricter)
# Adjust minimum block size (lines)
# JSON output for scripting
Examples
Quick scan for severe duplicates:
Deep scan for similar code:
Scan specific file types:
# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
CI/CD integration:
# Exit with error if duplicates found
||
Output Formats
Text (default): Human-readable colored output with file paths, line numbers, and similarity scores
JSON: Machine-readable format for scripting and tooling integration
|
CLI Options
| Option | Type | Default | Description |
|---|---|---|---|
| --threshold | float | 0.9 | Similarity threshold (0.0-1.0) |
| --min-block-size | int | 10 | Minimum lines per code block |
| --format | text\|json | text | Output format |
| --output | path | - | Write report to file |
| --only-type | types | - | Filter by clone type (type-1, type-2, type-3) |
| --exclude-type | types | - | Exclude clone types |
| --group-by | criterion | - | Group results (file, similarity, type, size) |
| --verbose | flag | false | Show performance statistics |
| --no-color | flag | false | Disable colored output |
| --debug | flag | false | Enable debug mode with detailed traces |
| --enable-type3 | flag | false | Enable Type-3 gap-tolerant detection |
Performance Tip: For large codebases (>50K LOC), increase --min-block-size to 20-50 for faster scans with less noise.
Advanced Features
Filtering by Clone Type
Focus on specific types of duplicates for targeted refactoring:
# Show only exact duplicates (highest priority)
# Show only renamed duplicates
# Show both Type-1 and Type-2
# Exclude noisy Type-3 matches
Use cases:
- --only-type type-1: Quick wins for immediate refactoring
- --only-type type-2: Identify abstraction opportunities
- --exclude-type type-3: Reduce false positives in large codebases
Grouping Results
Organize duplicates for different workflows:
# Group by file (refactoring prioritization)
# Group by similarity (quality triage)
# Group by clone type (targeted cleanup)
# Group by size (impact analysis)
Grouping strategies:
- file: See which files need refactoring most
- similarity: Prioritize high-confidence matches
- type: Handle Type-1 separately from Type-2
- size: Focus on large duplicates for maximum impact
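As a rough illustration of the file strategy, the sketch below groups duplicate records by the files they touch and orders files by how many duplicates involve them. The record shape (file1 and file2 keys) is a hypothetical stand-in, since the real report schema is not shown here, and group_by_file is not a PolyDup function.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_file(duplicates: List[Dict[str, str]]) -> List[Tuple[str, List[dict]]]:
    """Group duplicate records per file, most-affected files first."""
    groups = defaultdict(list)
    for dup in duplicates:
        # A set avoids double-counting a duplicate found within a single file.
        for path in {dup["file1"], dup["file2"]}:
            groups[path].append(dup)
    # Sort by descending count, then by path for a stable order.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
```

The same pattern extends to the other strategies by swapping the grouping key, for example a clone-type field instead of the file path.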
Output Options
# Save report to file
# JSON for CI/CD pipelines
# Disable colors for logs
# Or use NO_COLOR environment variable
NO_COLOR=1
# Verbose mode with performance stats
Debug Mode
Enhanced error messages with actionable suggestions:
# Enable debug mode for troubleshooting
# Debug mode shows:
# - Current working directory
# - File access permissions
# - Parser errors with context
# - Configuration validation details
Example error output:
Error: Path does not exist: /nonexistent/path
Suggestion: Check the path spelling and ensure it exists
Example: polydup scan ./src
polydup scan /absolute/path/to/project
Debug Info: Current directory: /Users/you/project
Combining Features
Mix and match for powerful workflows:
# High-priority refactoring targets
# CI/CD duplicate gate
# Deep analysis with verbose stats
# Quick triage without noise
Dashboard Output
PolyDup provides a professional dashboard with actionable insights:
╔═══════════════════════════════════════════════════════════╗
║ Scan Results ║
╠═══════════════════════════════════════════════════════════╣
║ Files scanned: 142 ║
║ Functions analyzed: 287 ║
║ Duplicates found: 15 ║
║ Estimated savings: ~450 lines ║
╠═══════════════════════════════════════════════════════════╣
║ Clone Type Breakdown: ║
║ Type-1 (exact): 5 groups │ Critical priority ║
║ Type-2 (renamed): 8 groups │ High priority ║
║ Type-3 (modified): 2 groups │ Medium priority ║
╠═══════════════════════════════════════════════════════════╣
║ Top Offenders: ║
║ 1. src/handlers.ts 8 duplicates ║
║ 2. lib/utils.ts 5 duplicates ║
║ 3. components/Form.tsx 3 duplicates ║
╚═══════════════════════════════════════════════════════════╝
Duplicate #1 (Type-2: Renamed identifiers)
Location: src/auth.ts:45-68 ↔ src/admin.ts:120-143
Similarity: 94.2% | Length: 24 lines
...
Dashboard features:
- Lines saved estimation: Potential code reduction
- Top offenders: Files needing most attention
- Similarity range: Quality distribution (min-max)
- Priority labels: Critical (Type-1), High (Type-2), Medium (Type-3)
Exit Codes
PolyDup uses semantic exit codes for CI/CD integration:
| Exit Code | Meaning | Use Case |
|---|---|---|
| 0 | No duplicates found | Clean codebase ✓ |
| 1 | Duplicates detected | Quality gate (expected) |
| 2 | Error occurred | Configuration/runtime issue |
CI/CD examples:
# Fail build if duplicates found
||
# Warning only (report but don't fail)
||
# Strict quality gate (fail on any duplicates)
if ; then
else
fi
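For pipelines driven from a script rather than shell, the exit codes in the table above could be handled along these lines. This is a sketch under assumptions: ScanStatus, should_fail_build, and run_gate are hypothetical helper names, and the invocation mirrors the polydup scan ./src form shown elsewhere in this README.

```python
import subprocess
from enum import IntEnum


class ScanStatus(IntEnum):
    CLEAN = 0        # no duplicates found
    DUPLICATES = 1   # duplicates detected
    ERROR = 2        # configuration or runtime issue


def should_fail_build(exit_code: int, strict: bool = True) -> bool:
    """Decide whether a pipeline step should fail for a given exit code."""
    if exit_code == ScanStatus.CLEAN:
        return False
    if exit_code == ScanStatus.DUPLICATES:
        return strict  # warning-only mode reports duplicates without failing
    return True  # errors and any unexpected code always fail


def run_gate(path: str, strict: bool = True) -> bool:
    """Run a scan and return True if the build should fail (requires polydup on PATH)."""
    result = subprocess.run(["polydup", "scan", path])
    return should_fail_build(result.returncode, strict)
```

Treating every code other than 0 and 1 as a failure keeps the gate safe if the tool crashes or is missing from the runner.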
Supported Languages
- JavaScript/TypeScript: .js, .jsx, .ts, .tsx
- Python: .py
- Rust: .rs
- Vue: .vue
- Svelte: .svelte
More languages coming soon (Java, Go, C/C++, Ruby, PHP)
Clone Types
PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:
Type-1: Exact Clones
Identical code fragments except for whitespace, comments, and formatting.
Example:
// File 1
// File 2 (Type-1 clone - only formatting differs)
Why they exist: Direct copy-paste without any modifications.
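One way to picture a Type-1 comparison is to compare token streams after discarding comments and layout. The sketch below does this for Python source using the standard tokenize module. It is an illustration only: PolyDup parses with Tree-sitter, and code_tokens and is_type1_clone are hypothetical names.

```python
import io
import tokenize
from typing import List

# Token kinds that carry only comments or line layout.
IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER}


def code_tokens(source: str) -> List[str]:
    """Return the tokens of `source`, ignoring comments and formatting."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in IGNORED:
            continue
        if tok.type == tokenize.INDENT:
            tokens.append("<INDENT>")   # keep block structure, drop indent width
        elif tok.type == tokenize.DEDENT:
            tokens.append("<DEDENT>")
        else:
            tokens.append(tok.string)
    return tokens


def is_type1_clone(a: str, b: str) -> bool:
    """True when two fragments differ only in whitespace, comments, or layout."""
    return code_tokens(a) == code_tokens(b)
```

Two definitions of the same function that differ only in spacing, indentation width, and a trailing comment compare equal under this check, while changing an operator does not.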
Type-2: Renamed/Parameterized Clones
Structurally identical code with renamed identifiers, changed literals, or different types.
Example:
// File 1
// File 2 (Type-2 clone - renamed variables, same logic)
Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.
Detection: PolyDup normalizes identifiers and literals (e.g., sum → @@ID, 0 → @@NUM) to detect structural similarity.
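The normalization idea can be sketched with a small regex tokenizer for JavaScript-like code. This is a simplified illustration, not PolyDup's Tree-sitter-based implementation: the KEYWORDS set is a partial, assumed list, string literals and comments are not handled, and only the @@ID and @@NUM placeholders come from the description above.

```python
import re
from typing import List

# Assumed, incomplete keyword list for the example.
KEYWORDS = {"function", "let", "const", "var", "for", "of", "return", "if", "else"}

# Numbers, identifiers, or any single non-space character (operators, punctuation).
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_$][\w$]*|\S")


def normalize(source: str) -> List[str]:
    """Replace identifiers with @@ID and numeric literals with @@NUM."""
    tokens = []
    for tok in TOKEN.findall(source):
        if tok[0].isdigit():
            tokens.append("@@NUM")
        elif (tok[0].isalpha() or tok[0] in "_$") and tok not in KEYWORDS:
            tokens.append("@@ID")
        else:
            tokens.append(tok)  # keywords, operators, and punctuation stay as-is
    return tokens
```

After normalization, two functions that differ only in names and constants yield identical token sequences, so they can be matched with the same hashing used for exact clones.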
Type-3: Near-Miss Clones (Not Yet Implemented)
Similar code with minor modifications like inserted/deleted statements or changed expressions.
Example:
// File 1
// File 2 (Type-3 clone - added logging, changed discount logic)
Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
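To show what tolerating gaps means in practice, the sketch below scores two fragments by line-level similarity using Python's difflib. It is a stand-in technique for illustration and makes no claim about how PolyDup's own Type-3 matching works; the function names and the 0.85 default threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def significant_lines(source: str) -> List[str]:
    """Strip indentation and drop blank lines."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def line_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]: twice the matched lines divided by the total lines."""
    matcher = SequenceMatcher(
        None, significant_lines(a), significant_lines(b), autojunk=False
    )
    return matcher.ratio()


def is_near_miss(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the fragments match apart from a few inserted or deleted lines."""
    return line_similarity(a, b) >= threshold
```

With a five-line fragment and a copy that has one extra logging line, five lines match out of eleven in total, giving a ratio of 10/11 (about 0.91), which clears the assumed threshold.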
Type-4: Semantic Clones (Not Yet Implemented)
Functionally equivalent code with different implementations.
Example:
// File 1 - Imperative loop
// File 2 - Functional approach
// File 3 - Recursive
Why they exist: Different programming paradigms or styles achieving the same result.
Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.
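As a concrete picture of the three styles named in the example, here are three hypothetical Python functions that sum a list imperatively, functionally, and recursively. They share almost no tokens or structure, which is why token-based or AST-based matching would not pair them.

```python
from functools import reduce
from typing import Sequence


def total_loop(values: Sequence[int]) -> int:
    """Imperative loop."""
    total = 0
    for value in values:
        total += value
    return total


def total_functional(values: Sequence[int]) -> int:
    """Functional approach using a fold."""
    return reduce(lambda acc, value: acc + value, values, 0)


def total_recursive(values: Sequence[int]) -> int:
    """Recursive definition: first element plus the sum of the rest."""
    if not values:
        return 0
    return values[0] + total_recursive(values[1:])
```

All three return the same result for the same input, so the only reliable way to relate them is to reason about behavior rather than text.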
Understanding Your Results
When PolyDup reports duplicates, the clone type indicates:
- Type-1: Exact copy-paste → Quick win for extraction into shared utilities
- Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
- Type-3: Modified duplicates → May require refactoring with strategy patterns
- Type-4: Semantic equivalence → Consider standardizing on one implementation
Typical Real-World Distribution:
- Type-1: 5-10% (rare in mature codebases)
- Type-2: 60-70% (most common - copy-paste-modify)
- Type-3: 20-30% (evolved duplicates)
- Type-4: <5% (requires specialized detection)
Performance Note: PolyDup efficiently handles codebases up to 100K LOC. Tested on real-world projects with detection times under 5 seconds for most repos.
Troubleshooting
Common Issues
"No duplicates found" but you expect some
Possible causes:
- Threshold too high: Try lowering --threshold to 0.70-0.80
- Block size too large: Reduce --min-block-size to 5-10 lines
- Type-3 not enabled: Add --enable-type3 for gap-tolerant matching
# More sensitive scan
"Too many false positives"
Solutions:
- Increase threshold: Use --threshold 0.95 for high-confidence matches
- Exclude Type-3: Add --exclude-type type-3 to remove noisy matches
- Increase block size: Use --min-block-size 50 for substantial duplicates only
# Strict, high-quality scan
"Permission denied" errors
Fix:
# Check file permissions
# Run with proper permissions
# Use --debug to see detailed error info
"Unsupported file type" warnings
Explanation: PolyDup currently supports JavaScript, TypeScript, Python, Rust, Vue, and Svelte. Other file types are skipped automatically.
Workaround:
- Wait for language support (check GitHub issues)
- Contribute a parser (see CONTRIBUTING.md)
Colors not working in CI/CD
Solution:
# Disable colors explicitly
# Or use environment variable
NO_COLOR=1
"Out of memory" on large codebases
Solutions:
# Increase minimum block size to reduce memory usage
# Scan directories separately
# Exclude generated/vendor code
# Create .polyduprc.toml with exclude patterns
Performance Tips
For large codebases (>50K LOC):
- Set --min-block-size to a value between 50 and 100 to focus on substantial duplicates
- Disable Type-3 detection (it is more computationally expensive)
- Use --exclude-type type-3 to skip gap-tolerant matching
- Increase --threshold to 0.95 to reduce candidate matches
For monorepos:
- Create .polyduprc.toml at the root with shared configuration
- Use --group-by file to organize results by module
- Exclude node_modules, dist, target, etc. in the config
For CI/CD:
- Cache the polydup binary to speed up the pipeline
- Use --format json for machine-readable output
- Set appropriate exit code handling (0=clean, 1=duplicates, 2=error)
Getting Help
Debug Mode:
# Enable detailed error traces
Verbose Output:
# Show performance statistics
Report an Issue:
- Check existing issues
- Include:
  - PolyDup version (polydup --version)
  - Operating system and architecture
  - Command that failed
  - Error message with the --debug flag
  - Sample code if applicable (anonymized)
Community:
- GitHub Discussions: Ask questions
- GitHub Issues: Report bugs
Development
Building from Source
Prerequisites:
- Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
- Node.js 16+ (for Node.js bindings)
- Python 3.8-3.12 (for Python bindings)
CLI:
Node.js bindings:
Python bindings:
Run tests:
# All tests
# Specific crate
# With coverage
Creating a Release
Recommended: Create releases directly from GitHub UI - fully automated, no local tools required!
- Go to Releases → New Release
- Create a new tag (e.g., v0.2.7)
- Click "Publish release"
- Everything happens automatically (~5-7 minutes):
- Syncs version files (Cargo.toml, package.json, pyproject.toml)
- Updates CHANGELOG.md with release entry
- Moves tag to version-synced commit (if needed)
- Builds binaries for all 5 platforms (macOS/Linux/Windows)
- Publishes to crates.io, npm, and PyPI
- Creates release with binary assets
- Zero manual steps required - truly one-click releases!
Alternative: Use the release script locally:
See docs/RELEASE.md for detailed instructions.
Pre-commit Hooks
Install pre-commit hooks to automatically run linting and tests:
# Install pre-commit (if not already installed)
pip install pre-commit
# Install the git hooks
pre-commit install
# Run manually on all files
pre-commit run --all-files
The hooks will automatically run:
- On commit: cargo fmt, cargo clippy, file checks (trailing whitespace, YAML/TOML validation)
- On push: Full test suite with cargo test
To skip hooks temporarily:
git commit --no-verify
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Install pre-commit hooks (pre-commit install)
- Make your changes and ensure tests pass (cargo test --workspace)
- Run clippy (cargo clippy --workspace --all-targets -- -D warnings)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
License
MIT OR Apache-2.0