PolyDup

Cross-language duplicate code detector powered by Tree-sitter and Rust.

Features

Blazing Fast: Parallel processing with Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
Accurate: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
Configurable: Adjust thresholds and block sizes for your needs
Efficient: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
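
To illustrate the Rabin-Karp technique named in the feature list, here is a minimal Python sketch of a rolling hash over fixed-size token windows. It is not PolyDup's implementation (the real logic lives in the Rust core); BASE, MOD, token_code, rolling_hashes, and candidate_matches are hypothetical names chosen for this example.

```python
import zlib
from collections import defaultdict

# Illustrative constants (assumptions, not PolyDup's actual parameters).
BASE = 257
MOD = (1 << 61) - 1


def token_code(token):
    """Map a token to a deterministic, non-zero integer."""
    return zlib.crc32(token.encode("utf-8")) + 1


def rolling_hashes(tokens, window):
    """Return [(start_index, hash)] for every window of `window` tokens."""
    if window <= 0 or len(tokens) < window:
        return []
    codes = [token_code(t) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for code in codes[:window]:
        h = (h * BASE + code) % MOD
    hashes = [(0, h)]
    for end in range(window, len(codes)):
        h = (h - codes[end - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + codes[end]) % MOD           # append the newest token
        hashes.append((end - window + 1, h))
    return hashes


def candidate_matches(tokens, window):
    """Return sorted (start_a, start_b) pairs whose windows are identical."""
    buckets = defaultdict(list)
    for start, h in rolling_hashes(tokens, window):
        buckets[h].append(start)
    matches = []
    for starts in buckets.values():
        for i, a in enumerate(starts):
            for b in starts[i + 1:]:
                # Compare the tokens themselves to rule out hash collisions.
                if tokens[a:a + window] == tokens[b:b + window]:
                    matches.append((a, b))
    return sorted(matches)
```

Each window hash is derived from the previous one in constant time, which is why the approach scales to long token streams. Equal hashes are only treated as candidates until the tokens themselves are compared.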

Architecture

Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.

┌─────────────────────────────────────────────┐
│           polydup-core (Rust)               │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
          ▲          ▲          ▲
          │          │          │
    ┌─────┴───┐  ┌───┴─────┐  ┌─┴─────┐
    │ CLI     │  │ Node.js │  │ Python│
    │ (Rust)  │  │(napi-rs)│  │(PyO3) │
    └─────────┘  └─────────┘  └───────┘

Crates:

polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
polydup-cli: Standalone CLI tool (cargo install polydup-cli)
polydup-node: Node.js native addon via napi-rs (npm install @polydup/core)
polydup-py: Python extension module via PyO3 (pip install polydup)

Installation

Rust CLI (Recommended)

The fastest way to use PolyDup is via the CLI tool:

# Install from crates.io
cargo install polydup-cli

# Verify installation
polydup --version

# Scan for duplicates
polydup scan ./src

System Requirements:

Rust 1.70+ (if building from source)
macOS, Linux, or Windows

Note: A Homebrew tap is planned but not yet available (brew install polydup).

Pre-built Binaries:

Download pre-compiled binaries from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-aarch64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64 static - musl)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64-musl -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Windows (x86_64)
# Download polydup-windows-x86_64.exe from releases page and add to PATH

Node.js/npm

Install as a project dependency or globally:

# Project dependency
npm install @polydup/core

# Global installation
npm install -g @polydup/core

Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

const { findDuplicates } = require('@polydup/core');

const duplicates = findDuplicates(
  ['src/', 'tests/'],  // Paths to scan
  10,                  // Minimum block size (lines)
  0.85                 // Similarity threshold (0.0-1.0)
);

console.log(`Found ${duplicates.length} duplicates`);
duplicates.forEach(dup => {
  console.log(`${dup.file1}:${dup.start_line1} ↔ ${dup.file2}:${dup.start_line2}`);
  console.log(`Similarity: ${(dup.similarity * 100).toFixed(1)}%`);
});

Python/pip

Install from PyPI:

# Using pip
pip install polydup

# Using uv (recommended for faster installs)
uv pip install polydup

Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

import polydup

# Scan for duplicates
duplicates = polydup.find_duplicates(
    paths=['src/', 'tests/'],
    min_block_size=10,
    similarity_threshold=0.85
)

print(f"Found {len(duplicates)} duplicates")
for dup in duplicates:
    print(f"{dup['file1']}:{dup['start_line1']} ↔ {dup['file2']}:{dup['start_line2']}")
    print(f"Similarity: {dup['similarity']*100:.1f}%")

Rust Library

Use the core library in your Rust project:

[dependencies]
polydup-core = "0.3"
anyhow = "1"

use polydup_core::Scanner;
use std::path::PathBuf;

fn main() -> anyhow::Result<()> {
    let scanner = Scanner::with_config(10, 0.85)?;
    let report = scanner.scan(vec![PathBuf::from("src")])?;

    println!("Found {} duplicates", report.duplicates.len());
    Ok(())
}

Building from Source

CLI

cargo build --release -p polydup-cli
./target/release/polydup scan ./src

Node.js

cd crates/polydup-node
npm install
npm run build

Python

cd crates/polydup-py
pip install maturin
maturin develop
python -c "import polydup; print(polydup.version())"

CLI Usage

Quick Start with `polydup init`

The fastest way to get started is with the interactive initialization wizard:

# Run the initialization wizard
polydup init

# Non-interactive mode (use defaults)
polydup init --yes

# Force overwrite existing configuration
polydup init --force

The wizard will:

Auto-detect your project environment (Node.js, Rust, Python, etc.)
Generate .polyduprc.toml with environment-specific defaults
Create GitHub Actions workflow (optional)
Show install instructions tailored to your environment
Provide next steps for local usage

Example workflow:

$ polydup init

PolyDup Initialization Wizard
=============================

Detected environments:
  - Node.js
  - Python

✔ Select similarity threshold: Standard (0.85)
✔ Select minimum block size: Medium (50 lines)
✔ Add custom exclude patterns? · No
✔ Would you like to create a GitHub Actions workflow? · Yes

Configuration saved to: .polyduprc.toml
GitHub Actions workflow created: .github/workflows/polydup.yml

Next Steps:
  1. Install: npm install -g @polydup/core
  2. Scan: polydup scan ./src

Configuration File (`.polyduprc.toml`)

After running polydup init, you'll have a .polyduprc.toml file:

[scan]
min_block_size = 50
similarity_threshold = 0.85

[scan.exclude]
patterns = [
    "**/node_modules/**",
    "**/__pycache__/**",
    "**/*.test.js",
    "**/*.test.py",
]

[output]
format = "text"
verbose = false

[ci]
enabled = false
fail_on_duplicates = true

Configuration Discovery:

PolyDup searches for .polyduprc.toml in the current directory and then in each parent directory
CLI arguments override config file settings
Perfect for monorepos with shared configuration at root
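
The discovery and override rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not PolyDup's actual code; find_config and effective_settings are hypothetical helper names, and the assumption that the nearest file wins is mine.

```python
from pathlib import Path

CONFIG_NAME = ".polyduprc.toml"


def find_config(start):
    """Return the nearest config file at or above `start`, or None."""
    current = Path(start).resolve()
    # Check the starting directory first, then walk up through its parents.
    for directory in [current, *current.parents]:
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def effective_settings(file_settings, cli_overrides):
    """Merge settings so that CLI values (when given) override the config file."""
    merged = dict(file_settings)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged
```

In a monorepo this means a package directory without its own config falls back to the one at the repository root, while a flag such as --threshold still takes precedence for a single run.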

Basic Commands

# Scan a directory
polydup scan ./src

# Scan multiple directories
polydup scan ./src ./tests ./lib

# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.85

# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 50

# JSON output for scripting
polydup scan ./src --format json > duplicates.json

Examples

Quick scan for severe duplicates:

polydup scan ./src --threshold 0.95 --min-block-size 20

Deep scan for similar code:

polydup scan ./src --threshold 0.70 --min-block-size 5

Scan all supported file types (detected by extension):

# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src  # Scans all supported languages

CI/CD integration:

# Exit with error if duplicates found
polydup scan ./src --threshold 0.90 || exit 1

Output Formats

Text (default): Human-readable colored output with file paths, line numbers, and similarity scores

JSON: Machine-readable format for scripting and tooling integration

polydup scan ./src --format json | jq '.duplicates | length'
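
The same report can be consumed from a script. The sketch below assumes only what the jq example shows, namely a top-level duplicates array; the similarity field on each entry is an assumption borrowed from the Python binding's result dictionaries, and load_duplicates and at_least are hypothetical helper names.

```python
import json


def load_duplicates(report_text):
    """Return the list under the top-level "duplicates" key (empty if absent)."""
    report = json.loads(report_text)
    return report.get("duplicates", [])


def at_least(duplicates, min_similarity):
    """Keep entries whose (assumed) "similarity" field meets the cutoff."""
    return [d for d in duplicates if d.get("similarity", 0.0) >= min_similarity]
```

For example, reading duplicates.json and calling at_least(load_duplicates(text), 0.95) would keep only the high-confidence matches, provided the entries carry that field.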

CLI Options

Option	Type	Default	Description
`--threshold`	float	0.9	Similarity threshold (0.0-1.0)
`--min-block-size`	int	10	Minimum lines per code block
`--format`	text|json	text	Output format
`--output`	path	-	Write report to file
`--only-type`	types	-	Filter by clone type (type-1, type-2, type-3)
`--exclude-type`	types	-	Exclude clone types
`--group-by`	criterion	-	Group results (file, similarity, type, size)
`--verbose`	flag	false	Show performance statistics
`--no-color`	flag	false	Disable colored output
`--debug`	flag	false	Enable debug mode with detailed traces
`--enable-type3`	flag	false	Enable Type-3 gap-tolerant detection

Performance Tip: For large codebases (>50K LOC), increase --min-block-size to 20-50 for faster scans with less noise.

Advanced Features

Filtering by Clone Type

Focus on specific types of duplicates for targeted refactoring:

# Show only exact duplicates (highest priority)
polydup scan ./src --only-type type-1

# Show only renamed duplicates
polydup scan ./src --only-type type-2

# Show both Type-1 and Type-2
polydup scan ./src --only-type type-1,type-2

# Exclude noisy Type-3 matches
polydup scan ./src --exclude-type type-3

Use cases:

--only-type type-1: Quick wins for immediate refactoring
--only-type type-2: Identify abstraction opportunities
--exclude-type type-3: Reduce false positives in large codebases

Grouping Results

Organize duplicates for different workflows:

# Group by file (refactoring prioritization)
polydup scan ./src --group-by file

# Group by similarity (quality triage)
polydup scan ./src --group-by similarity

# Group by clone type (targeted cleanup)
polydup scan ./src --group-by type

# Group by size (impact analysis)
polydup scan ./src --group-by size

Grouping strategies:

file: See which files need refactoring most
similarity: Prioritize high-confidence matches
type: Handle Type-1 separately from Type-2
size: Focus on large duplicates for maximum impact
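
As a rough picture of what grouping by file gives you, the Python sketch below counts how many duplicate pairs touch each file. It works on result dictionaries with the file1 and file2 keys shown in the Python usage example; duplicates_per_file is a hypothetical helper, and the CLI's own grouping may be implemented differently.

```python
from collections import Counter


def duplicates_per_file(duplicates):
    """Return [(path, count)] sorted by count (descending), then by path."""
    counts = Counter()
    for dup in duplicates:
        # A set avoids counting a pair twice when both blocks are in one file.
        for path in {dup["file1"], dup["file2"]}:
            counts[path] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Files at the top of the returned list are the ones involved in the most duplicate pairs, which is usually where refactoring pays off first.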

Output Options

# Save report to file
polydup scan ./src --output duplicates.txt

# JSON for CI/CD pipelines
polydup scan ./src --format json --output report.json

# Disable colors for logs
polydup scan ./src --no-color

# Or use NO_COLOR environment variable
NO_COLOR=1 polydup scan ./src

# Verbose mode with performance stats
polydup scan ./src --verbose

Debug Mode

Debug mode adds context to error messages and suggests how to fix them:

# Enable debug mode for troubleshooting
polydup scan ./src --debug

# Debug mode shows:
# - Current working directory
# - File access permissions
# - Parser errors with context
# - Configuration validation details

Example error output:

Error: Path does not exist: /nonexistent/path

Suggestion: Check the path spelling and ensure it exists
  Example: polydup scan ./src
           polydup scan /absolute/path/to/project

Debug Info: Current directory: /Users/you/project

Combining Features

Mix and match for powerful workflows:

# High-priority refactoring targets
polydup scan ./src \
  --only-type type-1 \
  --group-by file \
  --min-block-size 50 \
  --output refactor-priorities.txt

# CI/CD duplicate gate
polydup scan ./src \
  --threshold 0.95 \
  --exclude-type type-3 \
  --format json \
  --output duplicates.json

# Deep analysis with verbose stats
polydup scan ./src \
  --enable-type3 \
  --group-by similarity \
  --verbose

# Quick triage without noise
polydup scan ./src \
  --only-type type-1,type-2 \
  --group-by type \
  --no-color

Dashboard Output

PolyDup prints a summary dashboard ahead of the individual duplicate reports:

╔═══════════════════════════════════════════════════════════╗
║                      Scan Results                         ║
╠═══════════════════════════════════════════════════════════╣
║ Files scanned:       142                                  ║
║ Functions analyzed:  287                                  ║
║ Duplicates found:    15                                   ║
║ Estimated savings:   ~450 lines                           ║
╠═══════════════════════════════════════════════════════════╣
║ Clone Type Breakdown:                                     ║
║   Type-1 (exact):    5 groups  │ Critical priority       ║
║   Type-2 (renamed):  8 groups  │ High priority           ║
║   Type-3 (modified): 2 groups  │ Medium priority         ║
╠═══════════════════════════════════════════════════════════╣
║ Top Offenders:                                            ║
║   1. src/handlers.ts      8 duplicates                    ║
║   2. lib/utils.ts         5 duplicates                    ║
║   3. components/Form.tsx  3 duplicates                    ║
╚═══════════════════════════════════════════════════════════╝

Duplicate #1 (Type-2: Renamed identifiers)
  Location: src/auth.ts:45-68 ↔ src/admin.ts:120-143
  Similarity: 94.2% | Length: 24 lines
  ...

Dashboard features:

Lines saved estimation: Potential code reduction
Top offenders: Files needing most attention
Similarity range: Quality distribution (min-max)
Priority labels: Critical (Type-1), High (Type-2), Medium (Type-3)

Exit Codes

PolyDup uses semantic exit codes for CI/CD integration:

Exit Code	Meaning	Use Case
`0`	No duplicates found	Clean codebase ✓
`1`	Duplicates detected	Quality gate (expected)
`2`	Error occurred	Configuration/runtime issue

CI/CD examples:

# Fail build if duplicates found
polydup scan ./src || exit 1

# Warning only (report but don't fail)
polydup scan ./src || true

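# Strict quality gate that tells duplicates (exit 1) apart from errors (exit 2)
status=0
polydup scan ./src --threshold 0.95 || status=$?
if [ "$status" -eq 0 ]; then
  echo "No duplicates found"
elif [ "$status" -eq 1 ]; then
  echo "⚠️ Duplicates detected - please refactor"
  exit 1
else
  echo "PolyDup failed with exit code $status"
  exit "$status"
fi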
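
The exit-code table above also maps onto a small Python wrapper for pipelines that are driven from a script. This is a sketch under the documented codes (0, 1, 2); gate_status and run_gate are hypothetical names, and treating any other status as an error is an assumption.

```python
import subprocess


def gate_status(returncode, fail_on_duplicates=True):
    """Translate a polydup exit code into the status a CI step should return."""
    if returncode == 0:
        return 0  # clean codebase
    if returncode == 1:
        # Duplicates found: fail the step only when the gate is strict.
        return 1 if fail_on_duplicates else 0
    return 2  # documented error code, or any unexpected status


def run_gate(paths, fail_on_duplicates=True):
    """Run `polydup scan` on the given paths and apply gate_status."""
    result = subprocess.run(["polydup", "scan", *paths])
    return gate_status(result.returncode, fail_on_duplicates)
```

Calling run_gate(["./src"], fail_on_duplicates=False) gives a report-only step that still fails when PolyDup itself errors out, which `|| true` in the shell examples would hide.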

Supported Languages

JavaScript/TypeScript: .js, .jsx, .ts, .tsx
Python: .py
Rust: .rs
Vue: .vue
Svelte: .svelte

More languages coming soon (Java, Go, C/C++, Ruby, PHP)

Clone Types

PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:

Type-1: Exact Clones

Identical code fragments except for whitespace, comments, and formatting.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-1 clone - only formatting differs)
function calculateTotal(items) {
  let sum = 0;
  for (let i = 0; i < items.length; i++) { sum += items[i].price; }
  return sum;
}

Why they exist: Direct copy-paste without any modifications.

Type-2: Renamed/Parameterized Clones

Structurally identical code with renamed identifiers, changed literals, or different types.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-2 clone - renamed variables, same logic)
function computeSum(products) {
    let total = 0;
    for (let j = 0; j < products.length; j++) {
        total += products[j].cost;
    }
    return total;
}

Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.

Detection: PolyDup normalizes identifiers and literals (e.g., sum → @@ID, 0 → @@NUM) to detect structural similarity.
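
To make the normalization step concrete, here is a small Python sketch that replaces identifiers with @@ID and numeric literals with @@NUM. PolyDup works on Tree-sitter syntax trees; this example substitutes a plain regular-expression tokenizer to stay short, so it ignores string literals and comments. KEYWORDS, TOKEN_RE, and normalize are hypothetical names.

```python
import re

# A deliberately small keyword list for the JavaScript snippets above.
KEYWORDS = {"function", "let", "const", "var", "for", "while", "if", "else", "return"}

# Identifiers, numbers, or any single non-space character (operators, brackets).
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+(?:\.\d+)?|\S")


def normalize(source):
    """Return the token list with identifiers and numbers replaced by placeholders."""
    normalized = []
    for token in TOKEN_RE.findall(source):
        if token[0].isdigit():
            normalized.append("@@NUM")
        elif (token[0].isalpha() or token[0] in "_$") and token not in KEYWORDS:
            normalized.append("@@ID")
        else:
            normalized.append(token)  # keywords and punctuation stay as-is
    return normalized
```

Run on the two functions in the example above, calculateTotal and computeSum produce the same normalized token list, which is what makes them a Type-2 pair even though no identifier matches.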

Type-3: Near-Miss Clones (Not Yet Implemented)

Similar code with minor modifications like inserted/deleted statements or changed expressions.

Example:

// File 1
function processOrder(order) {
    validateOrder(order);
    let total = calculateTotal(order.items);
    applyDiscount(total, order.coupon);
    return total;
}

// File 2 (Type-3 clone - added logging, changed discount logic)
function processOrder(order) {
    validateOrder(order);
    console.log("Processing order:", order.id);  // ADDED
    let total = calculateTotal(order.items);
    let discount = order.coupon ? 0.1 : 0;      // MODIFIED
    total *= (1 - discount);                     // MODIFIED
    return total;
}

Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
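
One way to picture near-miss matching is a similarity ratio over statement sequences. The Python sketch below uses the standard library's difflib.SequenceMatcher as a stand-in; it is not PolyDup's algorithm, and statement_similarity is a hypothetical helper.

```python
from difflib import SequenceMatcher


def statement_similarity(a_lines, b_lines):
    """Return a 0.0-1.0 ratio of matching statements, tolerating gaps."""
    a = [line.strip() for line in a_lines if line.strip()]
    b = [line.strip() for line in b_lines if line.strip()]
    # ratio() is 2 * matches / (len(a) + len(b)); inserted or changed
    # statements lower the score without breaking the alignment.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

For the two processOrder bodies above, three statements match out of four and six, so the ratio is 2 * 3 / (4 + 6) = 0.6. A pair like this would fall below a 0.85 threshold and only surface with a lower one.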

Type-4: Semantic Clones (Not Yet Implemented)

Functionally equivalent code with different implementations.

Example:

// File 1 - Imperative loop
function sum(arr) {
    let total = 0;
    for (let i = 0; i < arr.length; i++) {
        total += arr[i];
    }
    return total;
}

// File 2 - Functional approach
function sum(arr) {
    return arr.reduce((acc, val) => acc + val, 0);
}

// File 3 - Recursive
function sum(arr, i = 0) {
    if (i >= arr.length) return 0;
    return arr[i] + sum(arr, i + 1);
}

Why they exist: Different programming paradigms or styles achieving the same result.

Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.

Understanding Your Results

When PolyDup reports duplicates, the clone type indicates:

Type-1: Exact copy-paste → Quick win for extraction into shared utilities
Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
Type-3: Modified duplicates → May require refactoring with strategy patterns
Type-4: Semantic equivalence → Consider standardizing on one implementation

Typical Real-World Distribution:

Type-1: 5-10% (rare in mature codebases)
Type-2: 60-70% (most common - copy-paste-modify)
Type-3: 20-30% (evolved duplicates)
Type-4: <5% (requires specialized detection)

Performance Note: PolyDup efficiently handles codebases up to 100K LOC. In tests on real-world projects, detection took under 5 seconds for most repositories.

Troubleshooting

Common Issues

"No duplicates found" but you expect some

Possible causes:

Threshold too high: Try lowering --threshold to 0.70-0.80
Block size too large: Reduce --min-block-size to 5-10 lines
Type-3 not enabled: Add --enable-type3 for gap-tolerant matching

# More sensitive scan
polydup scan ./src --threshold 0.70 --min-block-size 5 --enable-type3

"Too many false positives"

Solutions:

Increase threshold: Use --threshold 0.95 for high-confidence matches
Exclude Type-3: Add --exclude-type type-3 to remove noisy matches
Increase block size: Use --min-block-size 50 for substantial duplicates only

# Strict, high-quality scan
polydup scan ./src --threshold 0.95 --exclude-type type-3 --min-block-size 50

"Permission denied" errors

Fix:

# Check file permissions
ls -la /path/to/scan

# Grant read permission on the files to scan
chmod +r /path/to/files

# Use --debug to see detailed error info
polydup scan ./src --debug

"Unsupported file type" warnings

Explanation: PolyDup currently supports JavaScript, TypeScript, Python, Rust, Vue, and Svelte. Other file types are skipped automatically.

Workaround:

Wait for language support (check GitHub issues)
Contribute a parser (see CONTRIBUTING.md)

Colors not working in CI/CD

Solution:

# Disable colors explicitly
polydup scan ./src --no-color

# Or use environment variable
NO_COLOR=1 polydup scan ./src

"Out of memory" on large codebases

Solutions:

# Increase minimum block size to reduce memory usage
polydup scan ./src --min-block-size 100

# Scan directories separately
polydup scan ./src
polydup scan ./tests
polydup scan ./lib

# Exclude generated/vendor code
# Create .polyduprc.toml with exclude patterns

Performance Tips

For large codebases (>50K LOC):

Use --min-block-size 50-100 to focus on substantial duplicates
Leave Type-3 detection disabled (omit --enable-type3); it is more computationally expensive
Use --exclude-type type-3 to skip gap-tolerant matching
Increase --threshold to 0.95 to reduce candidate matches

For monorepos:

Create .polyduprc.toml at root with shared configuration
Use --group-by file to organize results by module
Exclude node_modules, dist, target, etc. in config

For CI/CD:

Cache the polydup binary to speed up pipeline
Use --format json for machine-readable output
Set appropriate exit code handling (0=clean, 1=duplicates, 2=error)

Getting Help

Debug Mode:

# Enable detailed error traces
polydup scan ./src --debug

Verbose Output:

# Show performance statistics
polydup scan ./src --verbose

Report an Issue:

Check existing issues
Include:
- PolyDup version (polydup --version)
- Operating system and architecture
- Command that failed
- Error message with --debug flag
- Sample code if applicable (anonymized)

Community:

GitHub Discussions: Ask questions
GitHub Issues: Report bugs

Development

Building from Source

Prerequisites:

Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
Node.js 16+ (for Node.js bindings)
Python 3.8-3.12 (for Python bindings)

CLI:

git clone https://github.com/wiesnerbernard/polydup.git
cd polydup
cargo build --release -p polydup-cli
./target/release/polydup scan ./src

Node.js bindings:

cd crates/polydup-node
npm install
npm run build
npm test

Python bindings:

cd crates/polydup-py
pip install maturin
maturin develop
python -c "import polydup; print(polydup.version())"

Run tests:

# All tests
cargo test --workspace

# Specific crate
cargo test -p polydup-core

# With coverage
cargo install cargo-tarpaulin
cargo tarpaulin --workspace

Creating a Release

Recommended: Create releases directly from GitHub UI - fully automated, no local tools required!

Go to Releases → New Release
Create a new tag (e.g., v0.2.7)
Click "Publish release"
Everything happens automatically (~5-7 minutes):
- Syncs version files (Cargo.toml, package.json, pyproject.toml)
- Updates CHANGELOG.md with release entry
- Moves tag to version-synced commit (if needed)
- Builds binaries for all 5 platforms (macOS/Linux/Windows)
- Publishes to crates.io, npm, and PyPI
- Creates release with binary assets
- Zero manual steps required - truly one-click releases!

Alternative: Use the release script locally:

./scripts/release.sh 0.2.5

See docs/RELEASE.md for detailed instructions.

Pre-commit Hooks

Install pre-commit hooks to automatically run linting and tests:

# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install
pre-commit install -t pre-push

# Run manually on all files
pre-commit run --all-files

The hooks will automatically run:

On commit: cargo fmt, cargo clippy, file checks (trailing whitespace, YAML/TOML validation)
On push: Full test suite with cargo test

To skip hooks temporarily:

git commit --no-verify

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Install pre-commit hooks (pre-commit install)
Make your changes and ensure tests pass (cargo test --workspace)
Run clippy (cargo clippy --workspace --all-targets -- -D warnings)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

License

MIT OR Apache-2.0

polydup-core 0.3.1
