PolyDup
Cross-language duplicate code detector powered by Tree-sitter and Rust.
Features
- Blazing Fast: Parallel file processing with the Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
- Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
- Accurate: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
- Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
- Configurable: Adjust thresholds and block sizes for your needs
- Efficient: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
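The Rabin-Karp rolling hash mentioned in the feature list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PolyDup's Rust implementation: the function names, the base, and the modulus are hypothetical choices made for the example.

```python
from collections import defaultdict

BASE = 257                # assumed polynomial base
MOD = (1 << 61) - 1       # assumed modulus (a large Mersenne prime)


def token_value(token):
    """Map a token to a stable integer (independent of Python's hash seed)."""
    return int.from_bytes(token.encode("utf-8"), "big") % MOD


def rolling_hashes(tokens, window):
    """Return (start_index, hash) for every run of `window` consecutive tokens."""
    if window <= 0 or len(tokens) < window:
        return []
    values = [token_value(t) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token that leaves the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    hashes = [(0, h)]
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        hashes.append((i - window + 1, h))
    return hashes


def find_repeated_windows(tokens, window):
    """Group start indices of windows that hold identical token runs."""
    by_hash = defaultdict(list)
    for start, h in rolling_hashes(tokens, window):
        by_hash[h].append(start)
    repeats = []
    for starts in by_hash.values():
        if len(starts) < 2:
            continue
        # Compare the actual tokens so a hash collision is never reported.
        first = tokens[starts[0]:starts[0] + window]
        same = [s for s in starts if tokens[s:s + window] == first]
        if len(same) > 1:
            repeats.append(same)
    return repeats
```

Each window hash is derived from the previous one in constant time, which is why the technique scales to long token streams. For example, `find_repeated_windows("a b c x a b c y".split(), 3)` groups the two occurrences of `a b c`.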
Architecture
Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.
┌─────────────────────────────────────────────┐
│ polydup-core (Rust) │
│ • Tree-sitter parsing │
│ • Rabin-Karp hashing │
│ • Parallel file scanning │
│ • Duplicate detection │
└─────────────────────────────────────────────┘
▲ ▲ ▲
│ │ │
┌─────┴───┐ ┌───┴────┐ ┌─┴─────┐
│ CLI │ │ Node.js│ │ Python│
│ (Rust) │ │(napi-rs)│ │(PyO3) │
└─────────┘ └────────┘ └───────┘
Crates:
- polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
- polydup-cli: Standalone CLI tool (cargo install polydup-cli)
- polydup-node: Node.js native addon via napi-rs (npm install @polydup/core)
- polydup-py: Python extension module via PyO3 (pip install polydup)
Installation
Rust CLI (Recommended)
The fastest way to use PolyDup is via the CLI tool:
# Install from crates.io
cargo install polydup-cli
# Verify installation
# Scan for duplicates
System Requirements:
- Rust 1.70+ (if building from source)
- macOS, Linux, or Windows
Note: Homebrew tap coming soon! (brew install polydup)
Pre-built Binaries:
Download pre-compiled binaries from GitHub Releases:
# macOS (Apple Silicon)
# macOS (Intel)
# Linux (x86_64)
# Linux (x86_64 static - musl)
# Windows (x86_64)
# Download polydup-windows-x86_64.exe from releases page and add to PATH
Node.js/npm
Install as a project dependency or globally:
# Project dependency
npm install @polydup/core
# Global installation
Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)
Usage:
const = require;
const duplicates = ;
console.log;
duplicates.;
Python/pip
Install from PyPI:
# Using pip
pip install polydup
# Using uv (recommended for faster installs)
Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)
Usage:
# Scan for duplicates
=
Rust Library
Use the core library in your Rust project:
[]
= "0.1"
use ;
use PathBuf;
Building from Source
CLI
Node.js
Python
CLI Usage
Quick Start with polydup init
The fastest way to get started is with the interactive initialization wizard:
# Run the initialization wizard
# Non-interactive mode (use defaults)
# Force overwrite existing configuration
The wizard will:
- Auto-detect your project environment (Node.js, Rust, Python, etc.)
- Generate .polyduprc.toml with environment-specific defaults
- Create GitHub Actions workflow (optional)
- Show install instructions tailored to your environment
- Provide next steps for local usage
Example workflow:
=============================
)
)
Configuration File (.polyduprc.toml)
After running polydup init, you'll have a .polyduprc.toml file:
[]
= 50
= 0.85
[]
= [
"**/node_modules/**",
"**/__pycache__/**",
"**/*.test.js",
"**/*.test.py",
]
[]
= "text"
= false
[]
= false
= true
Configuration Discovery:
- PolyDup searches for .polyduprc.toml in the current directory and its parent directories
- CLI arguments override config file settings
- Perfect for monorepos with shared configuration at root
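The discovery rules above (walk upward until a .polyduprc.toml is found, then let CLI arguments win) could look roughly like the following Python sketch. The helper names `find_config` and `merge_settings` and the setting keys used in the usage note are hypothetical; PolyDup's actual lookup is implemented in Rust and may differ in detail.

```python
from pathlib import Path

CONFIG_NAME = ".polyduprc.toml"


def find_config(start):
    """Return the nearest config file at or above `start`, or None."""
    current = Path(start).resolve()
    for directory in (current, *current.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def merge_settings(file_settings, cli_overrides):
    """Overlay CLI values on file values; None means 'flag not passed'."""
    merged = dict(file_settings)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged
```

In a monorepo, calling `find_config("packages/app")` would return the root-level file when no closer one exists, and `merge_settings({"threshold": 0.85}, {"threshold": 0.95})` would yield a threshold of 0.95.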
Basic Commands
# Scan a directory
# Scan multiple directories
# Custom threshold (0.0-1.0, higher = stricter)
# Adjust minimum block size (lines)
# JSON output for scripting
Examples
Quick scan for severe duplicates:
Deep scan for similar code:
Scan specific file types:
# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
CI/CD integration:
# Exit with error if duplicates found
||
Output Formats
Text (default): Human-readable colored output with file paths, line numbers, and similarity scores
JSON: Machine-readable format for scripting and tooling integration
|
CLI Options
| Option | Type | Default | Description |
|---|---|---|---|
| --threshold | float | 0.9 | Similarity threshold (0.0-1.0) |
| --min-block-size | int | 10 | Minimum lines per code block |
| --format | text\|json | text | Output format |
Performance Tip: For large codebases (>50K LOC), increase --min-block-size to 20-50 for faster scans with less noise.
Supported Languages
- JavaScript/TypeScript: .js, .jsx, .ts, .tsx
- Python: .py
- Rust: .rs
- Vue: .vue
- Svelte: .svelte
More languages coming soon (Java, Go, C/C++, Ruby, PHP)
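Extension-based language detection of the kind listed above can be pictured as a simple lookup table. The sketch below is illustrative only: the table mirrors the extensions in this README, while `detect_language` and `group_by_language` are hypothetical helpers, and exact-case matching of extensions is an assumption.

```python
from collections import defaultdict
from pathlib import Path

# Extensions taken from the list above; the grouping labels follow it too.
LANGUAGE_BY_EXTENSION = {
    ".js": "JavaScript/TypeScript",
    ".jsx": "JavaScript/TypeScript",
    ".ts": "JavaScript/TypeScript",
    ".tsx": "JavaScript/TypeScript",
    ".py": "Python",
    ".rs": "Rust",
    ".vue": "Vue",
    ".svelte": "Svelte",
}


def detect_language(path):
    """Return the language label for a path, or None if unsupported."""
    return LANGUAGE_BY_EXTENSION.get(Path(path).suffix)


def group_by_language(paths):
    """Bucket supported files by language and silently skip the rest."""
    groups = defaultdict(list)
    for path in paths:
        language = detect_language(path)
        if language is not None:
            groups[language].append(str(path))
    return dict(groups)
```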
Clone Types
PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:
Type-1: Exact Clones
Identical code fragments except for whitespace, comments, and formatting.
Example:
// File 1
// File 2 (Type-1 clone - only formatting differs)
Why they exist: Direct copy-paste without any modifications.
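To make the Type-1 definition concrete, here is a minimal Python sketch that treats two snippets as exact clones when their token streams match after comments, blank lines, and spacing are ignored. It relies on Python's standard `tokenize` module and therefore only handles Python source; PolyDup itself uses Tree-sitter, and the helper names here are hypothetical.

```python
import io
import tokenize

IGNORED = (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER)
STRUCTURAL = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)


def significant_tokens(source):
    """Tokenize Python source, dropping comments and formatting details."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in IGNORED:
            continue
        if tok.type in STRUCTURAL:
            # Keep block structure, but not the exact whitespace used for it.
            tokens.append(tokenize.tok_name[tok.type])
        else:
            tokens.append(tok.string)
    return tokens


def is_type1_clone(a, b):
    """True when the two snippets differ only in comments and formatting."""
    return significant_tokens(a) == significant_tokens(b)


FILE_1 = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
FILE_2 = "def total( xs ):  # sum values\n  s=0\n\n  for x in xs:\n      s  +=  x\n  return s\n"
```

Here `is_type1_clone(FILE_1, FILE_2)` is true, because the second snippet only changes indentation width, spacing, and a comment.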
Type-2: Renamed/Parameterized Clones
Structurally identical code with renamed identifiers, changed literals, or different types.
Example:
// File 1
// File 2 (Type-2 clone - renamed variables, same logic)
Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.
Detection: PolyDup normalizes identifiers and literals (e.g., sum → @@ID, 0 → @@NUM) to detect structural similarity.
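The normalization step can be illustrated with a small Python sketch. It replaces identifiers with @@ID and numeric literals with @@NUM, as described above, using the standard `tokenize` module on Python source. This is an assumption-laden illustration rather than PolyDup's Tree-sitter-based implementation: the `normalize` helper is hypothetical, and string literals are left unchanged here for simplicity.

```python
import io
import keyword
import tokenize

IGNORED = (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER)
STRUCTURAL = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)


def normalize(source):
    """Return a token list with identifiers and numbers replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in IGNORED:
            continue
        if tok.type == tokenize.NAME:
            # Keywords carry structure, so only user-chosen names are masked.
            out.append(tok.string if keyword.iskeyword(tok.string) else "@@ID")
        elif tok.type == tokenize.NUMBER:
            out.append("@@NUM")
        elif tok.type in STRUCTURAL:
            out.append(tokenize.tok_name[tok.type])
        else:
            out.append(tok.string)
    return out


FILE_1 = "def sum_list(items):\n    sum = 0\n    for item in items:\n        sum += item\n    return sum\n"
FILE_2 = "def add_all(values):\n    total = 10\n    for v in values:\n        total += v\n    return total\n"
```

Both snippets normalize to the same token list, so a hash over the normalized stream would match them. Note that this simple scheme forgets which identifiers were equal to each other, so `a + a` and `a + b` also normalize identically.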
Type-3: Near-Miss Clones (Not Yet Implemented)
Similar code with minor modifications like inserted/deleted statements or changed expressions.
Example:
// File 1
// File 2 (Type-3 clone - added logging, changed discount logic)
Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
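Since Type-3 detection is not yet implemented, the following Python sketch only shows one generic way such near-miss matching is often approximated: a line-level similarity ratio from the standard `difflib` module compared against a threshold. The helper names and the sample snippets are hypothetical, and nothing here describes how PolyDup computes its own similarity score.

```python
from difflib import SequenceMatcher


def meaningful_lines(source):
    """Strip indentation and drop blank lines before comparing."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def line_similarity(a, b):
    """Ratio in [0, 1]: 2 * matched_lines / total_lines across both snippets."""
    matcher = SequenceMatcher(None, meaningful_lines(a), meaningful_lines(b), autojunk=False)
    return matcher.ratio()


def is_near_miss(a, b, threshold=0.85):
    """Treat snippets as near-miss clones when similarity reaches the threshold."""
    return line_similarity(a, b) >= threshold


ORIGINAL = """
price = item.price
tax = price * rate
total = price + tax
return total
"""

MODIFIED = """
price = item.price
tax = price * rate
total = price + tax
log(total)
return total
"""
```

With one inserted statement, four of the nine lines on each side pair up, giving a ratio of 8/9 (about 0.89). That passes a 0.85 threshold but not a 0.9 one, which shows how a stricter threshold filters out modified copies.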
Type-4: Semantic Clones (Not Yet Implemented)
Functionally equivalent code with different implementations.
Example:
// File 1 - Imperative loop
// File 2 - Functional approach
// File 3 - Recursive
Why they exist: Different programming paradigms or styles achieving the same result.
Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.
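As an illustration of the three styles named in the example (imperative, functional, recursive), the Python functions below all compute the same sum while sharing almost no syntax. They are hypothetical examples written for this section, not output from PolyDup.

```python
from functools import reduce


def total_loop(xs):
    """Imperative loop."""
    total = 0
    for x in xs:
        total += x
    return total


def total_functional(xs):
    """Functional approach using a fold."""
    return reduce(lambda acc, x: acc + x, xs, 0)


def total_recursive(xs):
    """Recursive definition: head plus the sum of the tail."""
    if not xs:
        return 0
    return xs[0] + total_recursive(xs[1:])
```

Token-based or AST-based comparison sees three unrelated shapes here, which is why Type-4 detection needs the heavier analyses listed above.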
Understanding Your Results
When PolyDup reports duplicates, the clone type indicates:
- Type-1: Exact copy-paste → Quick win for extraction into shared utilities
- Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
- Type-3: Modified duplicates → May require refactoring with strategy patterns
- Type-4: Semantic equivalence → Consider standardizing on one implementation
Typical Real-World Distribution:
- Type-1: 5-10% (rare in mature codebases)
- Type-2: 60-70% (most common - copy-paste-modify)
- Type-3: 20-30% (evolved duplicates)
- Type-4: <5% (requires specialized detection)
Performance Note: PolyDup efficiently handles codebases up to 100K LOC. It has been tested on real-world projects, with detection times under 5 seconds for most repos.
Development
Building from Source
Prerequisites:
- Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
- Node.js 16+ (for Node.js bindings)
- Python 3.8-3.12 (for Python bindings)
CLI:
Node.js bindings:
Python bindings:
Run tests:
# All tests
# Specific crate
# With coverage
Creating a Release
Recommended: Create releases directly from GitHub UI - fully automated, no local tools required!
- Go to Releases → New Release
- Create a new tag (e.g., v0.2.7)
- Click "Publish release"
- Everything happens automatically (~5-7 minutes):
  - Syncs version files (Cargo.toml, package.json, pyproject.toml)
  - Updates CHANGELOG.md with release entry
  - Moves tag to version-synced commit (if needed)
  - Builds binaries for all 5 platforms (macOS/Linux/Windows)
  - Publishes to crates.io, npm, and PyPI
  - Creates release with binary assets
- Zero manual steps required - truly one-click releases!
Alternative: Use the release script locally:
See docs/RELEASE.md for detailed instructions.
Pre-commit Hooks
Install pre-commit hooks to automatically run linting and tests:
# Install pre-commit (if not already installed)
# Install the git hooks
# Run manually on all files
The hooks will automatically run:
- On commit: cargo fmt, cargo clippy, file checks (trailing whitespace, YAML/TOML validation)
- On push: Full test suite with cargo test
To skip hooks temporarily:
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Install pre-commit hooks (pre-commit install)
- Make your changes and ensure tests pass (cargo test --workspace)
- Run clippy (cargo clippy --workspace --all-targets -- -D warnings)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
License
MIT OR Apache-2.0