# PolyDup
[](https://crates.io/crates/polydup-cli)
[](https://www.npmjs.com/package/@polydup/core)
[](https://pypi.org/project/polydup/)
[](https://github.com/wiesnerbernard/polydup/actions/workflows/ci.yml)
[](LICENSE)
Cross-language duplicate code detector powered by Tree-sitter and Rust.
## Features
- **Blazing Fast**: Parallel processing with Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
- **Cross-Language**: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
- **Accurate**: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
- **Multi-Platform**: CLI, Node.js npm package, Python pip package, Rust library
- **Configurable**: Adjust thresholds and block sizes for your needs
- **Efficient**: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
## Architecture
**Shared Core Architecture**: All duplicate detection logic lives in Rust, exposed via FFI bindings.
```
┌─────────────────────────────────────────────┐
│ polydup-core (Rust) │
│ • Tree-sitter parsing │
│ • Rabin-Karp hashing │
│ • Parallel file scanning │
│ • Duplicate detection │
└─────────────────────────────────────────────┘
▲ ▲ ▲
│ │ │
┌─────┴───┐ ┌───┴────┐ ┌─┴─────┐
│ CLI │ │ Node.js│ │ Python│
│ (Rust) │ │(napi-rs)│ │(PyO3) │
└─────────┘ └────────┘ └───────┘
```
**Crates:**
- **polydup-core**: Pure Rust library with Tree-sitter parsing, hashing, and reporting
- **polydup-cli**: Standalone CLI tool (`cargo install polydup-cli`)
- **polydup-node**: Node.js native addon via napi-rs (`npm install @polydup/core`)
- **polydup-py**: Python extension module via PyO3 (`pip install polydup`)
## Installation
### Rust CLI (Recommended)
The fastest way to use PolyDup is via the CLI tool:
```bash
# Install from crates.io
cargo install polydup-cli
# Verify installation
polydup --version
# Scan for duplicates
polydup scan ./src
```
**System Requirements:**
- Rust 1.70+ (if building from source)
- macOS, Linux, or Windows
> **Note**: Homebrew tap coming soon! (`brew install polydup`)
**Pre-built Binaries:**
Download pre-compiled binaries from [GitHub Releases](https://github.com/wiesnerbernard/polydup/releases):
```bash
# macOS (Apple Silicon)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-aarch64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/
# macOS (Intel)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/
# Linux (x86_64)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/
# Linux (x86_64 static - musl)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64-musl -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/
# Windows (x86_64)
# Download polydup-windows-x86_64.exe from releases page and add to PATH
```
### Node.js/npm
Install as a project dependency or globally:
```bash
# Project dependency
npm install @polydup/core
# Global installation
npm install -g @polydup/core
```
**Requirements:** Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)
**Usage:**
```javascript
const { findDuplicates } = require('@polydup/core');
const duplicates = findDuplicates(
['src/', 'tests/'], // Paths to scan
10, // Minimum block size (lines)
0.85 // Similarity threshold (0.0-1.0)
);
console.log(`Found ${duplicates.length} duplicates`);
duplicates.forEach(dup => {
console.log(`${dup.file1}:${dup.start_line1} ↔ ${dup.file2}:${dup.start_line2}`);
console.log(`Similarity: ${(dup.similarity * 100).toFixed(1)}%`);
});
```
### Python/pip
Install from PyPI:
```bash
# Using pip
pip install polydup
# Using uv (recommended for faster installs)
uv pip install polydup
```
**Requirements:** Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)
**Usage:**
```python
import polydup
# Scan for duplicates
duplicates = polydup.find_duplicates(
paths=['src/', 'tests/'],
min_block_size=10,
similarity_threshold=0.85
)
print(f"Found {len(duplicates)} duplicates")
for dup in duplicates:
print(f"{dup['file1']}:{dup['start_line1']} ↔ {dup['file2']}:{dup['start_line2']}")
print(f"Similarity: {dup['similarity']*100:.1f}%")
```
### Rust Library
Use the core library in your Rust project:
```toml
[dependencies]
polydup-core = "0.1"
```
```rust
use polydup_core::{Scanner, find_duplicates};
use std::path::PathBuf;
fn main() -> anyhow::Result<()> {
let scanner = Scanner::with_config(10, 0.85)?;
let report = scanner.scan(vec![PathBuf::from("src")])?;
println!("Found {} duplicates", report.duplicates.len());
Ok(())
}
```
## Building from Source
### CLI
```bash
cargo build --release -p polydup-cli
./target/release/polydup scan ./src
```
### Node.js
```bash
cd crates/polydup-node
npm install
npm run build
```
### Python
```bash
cd crates/polydup-py
maturin develop
python -c "import polydup; print(polydup.version())"
```
## CLI Usage
### Quick Start with `polydup init`
The fastest way to get started is with the interactive initialization wizard:
```bash
# Run the initialization wizard
polydup init
# Non-interactive mode (use defaults)
polydup init --yes
# Force overwrite existing configuration
polydup init --force
```
The wizard will:
- **Auto-detect your project environment** (Node.js, Rust, Python, etc.)
- **Generate `.polyduprc.toml`** with environment-specific defaults
- **Create GitHub Actions workflow** (optional)
- **Show install instructions** tailored to your environment
- **Provide next steps** for local usage
**Example workflow:**
```bash
$ polydup init
PolyDup Initialization Wizard
=============================
Detected environments:
- Node.js
- Python
✔ Select similarity threshold: Standard (0.85)
✔ Select minimum block size: Medium (50 lines)
✔ Add custom exclude patterns? · No
✔ Would you like to create a GitHub Actions workflow? · Yes
Configuration saved to: .polyduprc.toml
GitHub Actions workflow created: .github/workflows/polydup.yml
Next Steps:
1. Install: npm install -g @polydup/core
2. Scan: polydup scan ./src
```
### Configuration File (`.polyduprc.toml`)
After running `polydup init`, you'll have a `.polyduprc.toml` file:
```toml
[scan]
min_block_size = 50
similarity_threshold = 0.85
[scan.exclude]
patterns = [
"**/node_modules/**",
"**/__pycache__/**",
"**/*.test.js",
"**/*.test.py",
]
[output]
format = "text"
verbose = false
[ci]
enabled = false
fail_on_duplicates = true
```
**Configuration Discovery:**
- PolyDup searches for `.polyduprc.toml` in current directory and parent directories
- CLI arguments override config file settings
- Perfect for monorepos with shared configuration at root
### Basic Commands
```bash
# Scan a directory
polydup scan ./src
# Scan multiple directories
polydup scan ./src ./tests ./lib
# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.85
# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 50
# JSON output for scripting
polydup scan ./src --format json > duplicates.json
```
### Examples
**Quick scan for severe duplicates:**
```bash
polydup scan ./src --threshold 0.95 --min-block-size 20
```
**Deep scan for similar code:**
```bash
polydup scan ./src --threshold 0.70 --min-block-size 5
```
**Scan specific file types:**
```bash
# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src # Scans all supported languages
```
**CI/CD integration:**
```bash
# Exit with error if duplicates found
### Output Formats
**Text (default):** Human-readable colored output with file paths, line numbers, and similarity scores
**JSON:** Machine-readable format for scripting and tooling integration
```bash
### CLI Options
| `--threshold` | float | 0.9 | Similarity threshold (0.0-1.0) |
| `--min-block-size` | int | 10 | Minimum lines per code block |
| `--format` | text\|json | text | Output format |
**Performance Tip**: For large codebases (>50K LOC), increase `--min-block-size` to 20-50 for faster scans with less noise.
## Supported Languages
- **JavaScript/TypeScript**: `.js`, `.jsx`, `.ts`, `.tsx`
- **Python**: `.py`
- **Rust**: `.rs`
- **Vue**: `.vue`
- **Svelte**: `.svelte`
More languages coming soon (Java, Go, C/C++, Ruby, PHP)
## Clone Types
PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:
### Type-1: Exact Clones
Identical code fragments except for whitespace, comments, and formatting.
**Example:**
```javascript
// File 1
function calculateTotal(items) {
let sum = 0;
for (let i = 0; i < items.length; i++) {
sum += items[i].price;
}
return sum;
}
// File 2 (Type-1 clone - only formatting differs)
function calculateTotal(items) {
let sum = 0;
for (let i = 0; i < items.length; i++) { sum += items[i].price; }
return sum;
}
```
**Why they exist:** Direct copy-paste without any modifications.
### Type-2: Renamed/Parameterized Clones
Structurally identical code with renamed identifiers, changed literals, or different types.
**Example:**
```javascript
// File 1
function calculateTotal(items) {
let sum = 0;
for (let i = 0; i < items.length; i++) {
sum += items[i].price;
}
return sum;
}
// File 2 (Type-2 clone - renamed variables, same logic)
function computeSum(products) {
let total = 0;
for (let j = 0; j < products.length; j++) {
total += products[j].cost;
}
return total;
}
```
**Why they exist:** Copy-paste-modify pattern where developers adapt code for different contexts.
**Detection:** PolyDup normalizes identifiers and literals (e.g., `sum` → `@@ID`, `0` → `@@NUM`) to detect structural similarity.
### Type-3: Near-Miss Clones (Not Yet Implemented)
Similar code with minor modifications like inserted/deleted statements or changed expressions.
**Example:**
```javascript
// File 1
function processOrder(order) {
validateOrder(order);
let total = calculateTotal(order.items);
applyDiscount(total, order.coupon);
return total;
}
// File 2 (Type-3 clone - added logging, changed discount logic)
function processOrder(order) {
validateOrder(order);
console.log("Processing order:", order.id); // ADDED
let total = calculateTotal(order.items);
let discount = order.coupon ? 0.1 : 0; // MODIFIED
total *= (1 - discount); // MODIFIED
return total;
}
```
**Why they exist:** Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
### Type-4: Semantic Clones (Not Yet Implemented)
Functionally equivalent code with different implementations.
**Example:**
```javascript
// File 1 - Imperative loop
function sum(arr) {
let total = 0;
for (let i = 0; i < arr.length; i++) {
total += arr[i];
}
return total;
}
// File 2 - Functional approach
function sum(arr) {
return arr.reduce((acc, val) => acc + val, 0);
}
// File 3 - Recursive
function sum(arr, i = 0) {
if (i >= arr.length) return 0;
return arr[i] + sum(arr, i + 1);
}
```
**Why they exist:** Different programming paradigms or styles achieving the same result.
**Detection Challenge:** Requires semantic analysis, control-flow graphs, or ML-based approaches.
### Understanding Your Results
When PolyDup reports duplicates, the clone type indicates:
- **Type-1**: Exact copy-paste → Quick win for extraction into shared utilities
- **Type-2**: Adapted copy-paste → Candidate for parameterized functions or generics
- **Type-3**: Modified duplicates → May require refactoring with strategy patterns
- **Type-4**: Semantic equivalence → Consider standardizing on one implementation
**Typical Real-World Distribution:**
- Type-1: 5-10% (rare in mature codebases)
- Type-2: 60-70% (most common - copy-paste-modify)
- Type-3: 20-30% (evolved duplicates)
- Type-4: <5% (requires specialized detection)
**Performance Note**: PolyDup efficiently handles codebases up to 100K LOC. Tested on real-world projects with detection times under 5 seconds for most repos.
## Development
### Building from Source
**Prerequisites:**
- Rust 1.70+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- Node.js 16+ (for Node.js bindings)
- Python 3.8-3.12 (for Python bindings)
**CLI:**
```bash
git clone https://github.com/wiesnerbernard/polydup.git
cd polydup
cargo build --release -p polydup-cli
./target/release/polydup scan ./src
```
**Node.js bindings:**
```bash
cd crates/polydup-node
npm install
npm run build
npm test
```
**Python bindings:**
```bash
cd crates/polydup-py
pip install maturin
maturin develop
python -c "import polydup; print(polydup.version())"
```
**Run tests:**
```bash
# All tests
cargo test --workspace
# Specific crate
cargo test -p polydup-core
# With coverage
cargo install cargo-tarpaulin
cargo tarpaulin --workspace
```
### Creating a Release
**Recommended**: Create releases directly from GitHub UI - fully automated, no local tools required!
1. Go to [Releases → New Release](https://github.com/wiesnerbernard/polydup/releases/new)
2. Create a new tag (e.g., `v0.2.7`)
3. Click "Publish release"
4. **Everything happens automatically** (~5-7 minutes):
- Syncs version files (Cargo.toml, package.json, pyproject.toml)
- Updates CHANGELOG.md with release entry
- Moves tag to version-synced commit (if needed)
- Builds binaries for all 5 platforms (macOS/Linux/Windows)
- Publishes to crates.io, npm, and PyPI
- Creates release with binary assets
- **Zero manual steps required - truly one-click releases!**
**Alternative**: Use the release script locally:
```bash
./scripts/release.sh 0.2.5
```
See [docs/RELEASE.md](docs/RELEASE.md) for detailed instructions.
### Pre-commit Hooks
Install pre-commit hooks to automatically run linting and tests:
```bash
# Install pre-commit (if not already installed)
pip install pre-commit
# Install the git hooks
pre-commit install
pre-commit install -t pre-push
# Run manually on all files
pre-commit run --all-files
```
The hooks will automatically run:
- **On commit**: `cargo fmt`, `cargo clippy`, file checks (trailing whitespace, YAML/TOML validation)
- **On push**: Full test suite with `cargo test`
To skip hooks temporarily:
```bash
git commit --no-verify
```
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Install pre-commit hooks (`pre-commit install`)
4. Make your changes and ensure tests pass (`cargo test --workspace`)
5. Run clippy (`cargo clippy --workspace --all-targets -- -D warnings`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines.
## License
MIT OR Apache-2.0