polydup-core 0.3.3

Cross-language duplicate code detection library using Tree-sitter and Rabin-Karp

PolyDup

Cross-language duplicate code detector powered by Tree-sitter and Rust.

Features

  • Blazing Fast: Parallel processing with Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
  • Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
  • Accurate: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
  • Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
  • Configurable: Adjust thresholds and block sizes for your needs
  • Efficient: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
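
The Rabin-Karp technique named in the feature list can be illustrated with a minimal Python sketch. It hashes every window of k tokens in a single pass and groups windows whose hashes collide, then confirms each match token by token. The base, modulus, window size, and all function names here are illustrative assumptions, not PolyDup's actual parameters or API.

```python
from collections import defaultdict

BASE = 257                 # assumed polynomial base
MOD = (1 << 61) - 1        # assumed large prime modulus


def token_value(token: str) -> int:
    """Map a token to a stable integer (Python's built-in hash() is salted per run)."""
    value = 0
    for byte in token.encode("utf-8"):
        value = (value * 131 + byte) % MOD
    return value


def rolling_hashes(tokens, k):
    """Yield (start_index, hash) for every window of k tokens in O(n) time."""
    if k <= 0 or len(tokens) < k:
        return
    values = [token_value(t) for t in tokens]
    high = pow(BASE, k - 1, MOD)   # weight of the token that leaves the window
    h = 0
    for v in values[:k]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(k, len(values)):
        h = (h - values[i - k] * high) % MOD   # drop the oldest token
        h = (h * BASE + values[i]) % MOD       # append the newest token
        yield i - k + 1, h


def find_repeated_windows(tokens, k):
    """Group window start indices by hash, then verify candidates token by token."""
    buckets = defaultdict(list)
    for start, h in rolling_hashes(tokens, k):
        buckets[h].append(start)
    repeats = []
    for starts in buckets.values():
        first = starts[0]
        same = [s for s in starts if tokens[s:s + k] == tokens[first:first + k]]
        if len(same) > 1:
            repeats.append(same)
    return repeats


tokens = "a = b + c ; x = 1 ; a = b + c ;".split()
print(find_repeated_windows(tokens, 6))  # [[0, 10]]
```

The rolling update is what keeps the scan linear: each new window costs two multiplications instead of rehashing all k tokens.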

Architecture

Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.

┌─────────────────────────────────────────────┐
│           polydup-core (Rust)               │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
          ▲            ▲            ▲
          │            │            │
    ┌─────┴───┐  ┌─────┴─────┐  ┌───┴────┐
    │ CLI     │  │ Node.js   │  │ Python │
    │ (Rust)  │  │ (napi-rs) │  │ (PyO3) │
    └─────────┘  └───────────┘  └────────┘

Crates:

  • polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
  • polydup-cli: Standalone CLI tool (cargo install polydup-cli)
  • polydup-node: Node.js native addon via napi-rs (npm install @polydup/core)
  • polydup-py: Python extension module via PyO3 (pip install polydup)

Installation

Rust CLI (Recommended)

The fastest way to use PolyDup is via the CLI tool:

# Install from crates.io
cargo install polydup-cli

# Verify installation
polydup --version

# Scan for duplicates
polydup scan ./src

System Requirements:

  • Rust 1.70+ (if building from source)
  • macOS, Linux, or Windows

Note: A Homebrew tap (brew install polydup) is planned but not yet available.

Pre-built Binaries:

Download pre-compiled binaries from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-aarch64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64 static - musl)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64-musl -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Windows (x86_64)
# Download polydup-windows-x86_64.exe from releases page and add to PATH

Node.js/npm

Install as a project dependency or globally:

# Project dependency
npm install @polydup/core

# Global installation
npm install -g @polydup/core

Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

const { findDuplicates } = require('@polydup/core');

const duplicates = findDuplicates(
  ['src/', 'tests/'],  // Paths to scan
  10,                  // Minimum block size (lines)
  0.85                 // Similarity threshold (0.0-1.0)
);

console.log(`Found ${duplicates.length} duplicates`);
duplicates.forEach(dup => {
  console.log(`${dup.file1}:${dup.start_line1} ↔ ${dup.file2}:${dup.start_line2}`);
  console.log(`Similarity: ${(dup.similarity * 100).toFixed(1)}%`);
});

Python/pip

Install from PyPI:

# Using pip
pip install polydup

# Using uv (recommended for faster installs)
uv pip install polydup

Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

import polydup

# Scan for duplicates
duplicates = polydup.find_duplicates(
    paths=['src/', 'tests/'],
    min_block_size=10,
    similarity_threshold=0.85
)

print(f"Found {len(duplicates)} duplicates")
for dup in duplicates:
    print(f"{dup['file1']}:{dup['start_line1']} ↔ {dup['file2']}:{dup['start_line2']}")
    print(f"Similarity: {dup['similarity']*100:.1f}%")
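
As a follow-up to the usage above, here is a small sketch of post-processing the returned entries: keep only the strongest matches and list them in descending order of similarity. It assumes each entry is a dict with the keys used above (file1, start_line1, file2, start_line2, similarity). The top_matches helper and the sample data are hypothetical and hand-written so that the snippet runs without polydup installed.

```python
def top_matches(duplicates, min_similarity=0.9, limit=5):
    """Return the strongest matches first, dropping anything below the cutoff."""
    strong = [d for d in duplicates if d["similarity"] >= min_similarity]
    return sorted(strong, key=lambda d: d["similarity"], reverse=True)[:limit]


# Hand-written sample in the same shape as the entries printed above.
sample = [
    {"file1": "src/a.py", "start_line1": 10, "file2": "src/b.py",
     "start_line2": 42, "similarity": 0.91},
    {"file1": "src/c.py", "start_line1": 7, "file2": "tests/test_c.py",
     "start_line2": 30, "similarity": 0.86},
    {"file1": "src/a.py", "start_line1": 88, "file2": "src/d.py",
     "start_line2": 12, "similarity": 0.97},
]

for dup in top_matches(sample):
    print(f"{dup['similarity'] * 100:.1f}%  "
          f"{dup['file1']}:{dup['start_line1']} ↔ {dup['file2']}:{dup['start_line2']}")
```

In real use, the list returned by polydup.find_duplicates would take the place of sample.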

Rust Library

Use the core library in your Rust project:

# Cargo.toml
[dependencies]
polydup-core = "0.3"
anyhow = "1"  # needed for anyhow::Result in the example below

// src/main.rs
use polydup_core::Scanner;
use std::path::PathBuf;

fn main() -> anyhow::Result<()> {
    let scanner = Scanner::with_config(10, 0.85)?;
    let report = scanner.scan(vec![PathBuf::from("src")])?;

    println!("Found {} duplicates", report.duplicates.len());
    Ok(())
}

Building from Source

CLI

cargo build --release -p polydup-cli
./target/release/polydup scan ./src

Node.js

cd crates/polydup-node
npm install
npm run build

Python

cd crates/polydup-py
maturin develop
python -c "import polydup; print(polydup.version())"

CLI Usage

Quick Start with polydup init

The fastest way to get started is with the interactive initialization wizard:

# Run the initialization wizard
polydup init

# Non-interactive mode (use defaults)
polydup init --yes

# Force overwrite existing configuration
polydup init --force

The wizard will:

  • Auto-detect your project environment (Node.js, Rust, Python, etc.)
  • Generate .polyduprc.toml with environment-specific defaults
  • Create GitHub Actions workflow (optional)
  • Show install instructions tailored to your environment
  • Provide next steps for local usage

Example workflow:

$ polydup init

PolyDup Initialization Wizard
=============================

Detected environments:
  - Node.js
  - Python

Select similarity threshold: Standard (0.85)
Select minimum block size: Medium (50 lines)
Add custom exclude patterns? · No
Would you like to create a GitHub Actions workflow? · Yes

Configuration saved to: .polyduprc.toml
GitHub Actions workflow created: .github/workflows/polydup.yml

Next Steps:
  1. Install: npm install -g @polydup/core
  2. Scan: polydup scan ./src

Configuration File (.polyduprc.toml)

After running polydup init, you'll have a .polyduprc.toml file:

[scan]
min_block_size = 50
similarity_threshold = 0.85

[scan.exclude]
patterns = [
    "**/node_modules/**",
    "**/__pycache__/**",
    "**/*.test.js",
    "**/*.test.py",
]

[output]
format = "text"
verbose = false

[ci]
enabled = false
fail_on_duplicates = true

Configuration Discovery:

  • PolyDup searches for .polyduprc.toml in the current directory, then in each parent directory
  • CLI arguments override config file settings
  • Perfect for monorepos with shared configuration at root
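
The discovery and override rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior (walk upward until a .polyduprc.toml is found, then let CLI values win), not PolyDup's implementation. The find_config and effective_settings names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".polyduprc.toml"


def find_config(start: Path) -> Optional[Path]:
    """Return the nearest config file at or above `start`, or None if there is none."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def effective_settings(file_settings: dict, cli_overrides: dict) -> dict:
    """CLI values that were actually supplied (not None) win over file values."""
    merged = dict(file_settings)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


# Demo: a monorepo root holds the config; a nested package directory finds it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / CONFIG_NAME).write_text("[scan]\nmin_block_size = 50\n")
    nested = root / "packages" / "web" / "src"
    nested.mkdir(parents=True)
    print(find_config(nested) == root / CONFIG_NAME)  # True

print(effective_settings(
    {"min_block_size": 50, "similarity_threshold": 0.85},
    {"min_block_size": None, "similarity_threshold": 0.95},
))  # {'min_block_size': 50, 'similarity_threshold': 0.95}
```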

Basic Commands

# Scan a directory
polydup scan ./src

# Scan multiple directories
polydup scan ./src ./tests ./lib

# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.85

# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 50

# JSON output for scripting
polydup scan ./src --format json > duplicates.json

Examples

Quick scan for severe duplicates:

polydup scan ./src --threshold 0.95 --min-block-size 20

Deep scan for similar code:

polydup scan ./src --threshold 0.70 --min-block-size 5

Scan all supported file types:

# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src  # Scans all supported languages

CI/CD integration:

# Exit with error if duplicates found
polydup scan ./src --threshold 0.90 || exit 1

Output Formats

Text (default): Human-readable colored output with file paths, line numbers, and similarity scores

JSON: Machine-readable format for scripting and tooling integration

polydup scan ./src --format json | jq '.duplicates | length'

Commands

PolyDup supports the following subcommands:

Command  Description                                Example
-------  -----------------------------------------  ------------------
scan     Scan for duplicate code (default command)  polydup scan ./src
init     Interactive setup wizard                   polydup init

Scan Command Options:

The scan command accepts all options listed below. When no subcommand is specified, scan is assumed for backward compatibility.

# These are equivalent:
polydup scan ./src --threshold 0.95
polydup ./src --threshold 0.95

Init Command Options:

Option     Description
---------  --------------------------------------
--yes, -y  Skip interactive prompts, use defaults
--force    Overwrite existing .polyduprc.toml

CLI Options

Option            Type       Default  Description
----------------  ---------  -------  ---------------------------------------------
--threshold       float      0.9      Similarity threshold (0.0-1.0)
--min-block-size  int        10       Minimum lines per code block
--format          text|json  text     Output format
--output          path       -        Write report to file
--only-type       types      -        Filter by clone type (type-1, type-2, type-3)
--exclude-type    types      -        Exclude clone types
--group-by        criterion  -        Group results (file, similarity, type, size)
--verbose         flag       false    Show performance statistics
--no-color        flag       false    Disable colored output
--debug           flag       false    Enable debug mode with detailed traces
--enable-type3    flag       false    Enable Type-3 gap-tolerant detection

Performance Tip: For large codebases (>50K LOC), increase --min-block-size to 20-50 for faster scans with less noise.

Advanced Features

Filtering by Clone Type

Focus on specific types of duplicates for targeted refactoring:

# Show only exact duplicates (highest priority)
polydup scan ./src --only-type type-1

# Show only renamed duplicates
polydup scan ./src --only-type type-2

# Show both Type-1 and Type-2
polydup scan ./src --only-type type-1,type-2

# Exclude noisy Type-3 matches
polydup scan ./src --exclude-type type-3

Use cases:

  • --only-type type-1: Quick wins for immediate refactoring
  • --only-type type-2: Identify abstraction opportunities
  • --exclude-type type-3: Reduce false positives in large codebases

Grouping Results

Organize duplicates for different workflows:

# Group by file (refactoring prioritization)
polydup scan ./src --group-by file

# Group by similarity (quality triage)
polydup scan ./src --group-by similarity

# Group by clone type (targeted cleanup)
polydup scan ./src --group-by type

# Group by size (impact analysis)
polydup scan ./src --group-by size

Grouping strategies:

  • file: See which files need refactoring most
  • similarity: Prioritize high-confidence matches
  • type: Handle Type-1 separately from Type-2
  • size: Focus on large duplicates for maximum impact
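
For results obtained through the Python binding, a similar grouping can be approximated in a few lines. The sketch below mirrors the idea behind --group-by file and is not the CLI's implementation. It assumes result entries are dicts with file1 and file2 keys, as in the Python usage example. The group_by_file helper and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_file(duplicates):
    """Map each file to the duplicate entries it takes part in, busiest files first."""
    groups = defaultdict(list)
    for dup in duplicates:
        # A set avoids counting a duplicate twice when both sides are in one file.
        for path in {dup["file1"], dup["file2"]}:
            groups[path].append(dup)
    # Sort by descending count, then by path so ties are ordered deterministically.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


# Hand-written sample data for illustration only.
sample = [
    {"file1": "src/handlers.ts", "start_line1": 10,
     "file2": "lib/utils.ts", "start_line2": 5, "similarity": 0.96},
    {"file1": "src/handlers.ts", "start_line1": 80,
     "file2": "src/admin.ts", "start_line2": 120, "similarity": 0.91},
]

for path, dups in group_by_file(sample):
    print(f"{path}: {len(dups)} duplicate(s)")
```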

Output Options

# Save report to file
polydup scan ./src --output duplicates.txt

# JSON for CI/CD pipelines
polydup scan ./src --format json --output report.json

# Disable colors for logs
polydup scan ./src --no-color

# Or use NO_COLOR environment variable
NO_COLOR=1 polydup scan ./src

# Verbose mode with performance stats
polydup scan ./src --verbose

Debug Mode

Enhanced error messages with actionable suggestions:

# Enable debug mode for troubleshooting
polydup scan ./src --debug

# Debug mode shows:
# - Current working directory
# - File access permissions
# - Parser errors with context
# - Configuration validation details

Example error output:

Error: Path does not exist: /nonexistent/path

Suggestion: Check the path spelling and ensure it exists
  Example: polydup scan ./src
           polydup scan /absolute/path/to/project

Debug Info: Current directory: /Users/you/project

Combining Features

Mix and match for powerful workflows:

# High-priority refactoring targets
polydup scan ./src \
  --only-type type-1 \
  --group-by file \
  --min-block-size 50 \
  --output refactor-priorities.txt

# CI/CD duplicate gate
polydup scan ./src \
  --threshold 0.95 \
  --exclude-type type-3 \
  --format json \
  --output duplicates.json

# Deep analysis with verbose stats
polydup scan ./src \
  --enable-type3 \
  --group-by similarity \
  --verbose

# Quick triage without noise
polydup scan ./src \
  --only-type type-1,type-2 \
  --group-by type \
  --no-color

Dashboard Output

PolyDup provides a professional dashboard with actionable insights:

╔═══════════════════════════════════════════════════════════╗
║                      Scan Results                         ║
╠═══════════════════════════════════════════════════════════╣
║ Files scanned:       142                                  ║
║ Functions analyzed:  287                                  ║
║ Duplicates found:    15                                   ║
║ Estimated savings:   ~450 lines                           ║
╠═══════════════════════════════════════════════════════════╣
║ Clone Type Breakdown:                                     ║
║   Type-1 (exact):    5 groups  │ Critical priority        ║
║   Type-2 (renamed):  8 groups  │ High priority            ║
║   Type-3 (modified): 2 groups  │ Medium priority          ║
╠═══════════════════════════════════════════════════════════╣
║ Top Offenders:                                            ║
║   1. src/handlers.ts      8 duplicates                    ║
║   2. lib/utils.ts         5 duplicates                    ║
║   3. components/Form.tsx  3 duplicates                    ║
╚═══════════════════════════════════════════════════════════╝

Duplicate #1 (Type-2: Renamed identifiers)
  Location: src/auth.ts:45-68 ↔ src/admin.ts:120-143
  Similarity: 94.2% | Length: 24 lines
  ...

Dashboard features:

  • Lines saved estimation: Potential code reduction
  • Top offenders: Files needing most attention
  • Similarity range: Quality distribution (min-max)
  • Priority labels: Critical (Type-1), High (Type-2), Medium (Type-3)

Exit Codes

PolyDup uses semantic exit codes for CI/CD integration:

Exit Code  Meaning              Use Case
---------  -------------------  ---------------------------
0          No duplicates found  Clean codebase ✓
1          Duplicates detected  Quality gate (expected)
2          Error occurred       Configuration/runtime issue

CI/CD examples:

# Fail build if duplicates found
polydup scan ./src || exit 1

# Warning only (report but don't fail)
polydup scan ./src || true

# Strict quality gate (fail on any duplicates)
if polydup scan ./src --threshold 0.95; then
  echo "No duplicates found"
else
  echo "⚠️ Duplicates detected - please refactor"
  exit 1
fi
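
The shell examples above treat every non-zero exit code the same way, so a configuration error (exit code 2) looks like a duplicate finding. The Python sketch below is one way to keep the three documented codes apart in a CI wrapper. It uses only the scan subcommand and --threshold flag shown in this README. The gate_status and run_gate names are hypothetical.

```python
import subprocess
import sys


def gate_status(code: int, fail_on_duplicates: bool = True) -> int:
    """Map a polydup exit code to the exit code the CI job should report."""
    if code == 0:                      # no duplicates found
        return 0
    if code == 1:                      # duplicates detected
        return 1 if fail_on_duplicates else 0
    return 2                           # documented error code, or anything unexpected


def run_gate(paths, threshold=0.95, fail_on_duplicates=True) -> int:
    """Run `polydup scan` and translate its exit code with gate_status()."""
    cmd = ["polydup", "scan", *paths, "--threshold", str(threshold)]
    try:
        code = subprocess.run(cmd).returncode
    except FileNotFoundError:
        print("polydup is not installed or not on PATH", file=sys.stderr)
        return 2
    return gate_status(code, fail_on_duplicates)


# Show the mapping without invoking the CLI.
for code in (0, 1, 2):
    print(code, "->", gate_status(code), "| warn-only:", gate_status(code, False))

# In a CI script, the last line would be: sys.exit(run_gate(["./src"]))
```

With fail_on_duplicates=False the job only reports duplicates, yet a real error still fails the build.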

Supported Languages

  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx
  • Python: .py
  • Rust: .rs
  • Vue: .vue
  • Svelte: .svelte

More languages coming soon (Java, Go, C/C++, Ruby, PHP)

Clone Types

PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:

Type-1: Exact Clones

Identical code fragments except for whitespace, comments, and formatting.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-1 clone - only formatting differs)
function calculateTotal(items) {
  let sum = 0;
  for (let i = 0; i < items.length; i++) { sum += items[i].price; }
  return sum;
}

Why they exist: Direct copy-paste without any modifications.

Type-2: Renamed/Parameterized Clones

Structurally identical code with renamed identifiers, changed literals, or different types.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-2 clone - renamed variables, same logic)
function computeSum(products) {
    let total = 0;
    for (let j = 0; j < products.length; j++) {
        total += products[j].cost;
    }
    return total;
}

Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.

Detection: PolyDup normalizes identifiers and literals (e.g., sum → @@ID, 0 → @@NUM) to detect structural similarity.
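
To make the normalization idea concrete, here is a minimal Python sketch applied to the two functions above. It swaps in a regex tokenizer for PolyDup's Tree-sitter parsing, so it is a simplification and not the library's implementation. The keyword list and the normalize helper are assumptions made for illustration, and string literals and comments are not handled.

```python
import re

# Assumed subset of JavaScript keywords that must survive normalization.
KEYWORDS = {"function", "let", "const", "var", "for", "while", "if", "else", "return"}
# Identifier, number, or any single non-space character (operators, punctuation).
TOKEN_RE = re.compile(r"[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|\S")


def normalize(source: str):
    """Replace identifiers with @@ID and numeric literals with @@NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isdigit():
            tokens.append("@@NUM")
        elif tok[0].isalpha() or tok[0] in "_$":
            tokens.append("@@ID")
        else:
            tokens.append(tok)
    return tokens


file1 = """
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}
"""

file2 = """
function computeSum(products) {
    let total = 0;
    for (let j = 0; j < products.length; j++) {
        total += products[j].cost;
    }
    return total;
}
"""

print(" ".join(normalize(file1)[:10]))   # function @@ID ( @@ID ) { let @@ID = @@NUM
print(normalize(file1) == normalize(file2))  # True
```

Because every identifier maps to the same placeholder, the two token streams become identical, and a hash over token windows then matches them exactly as it would a Type-1 clone.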

Type-3: Near-Miss Clones (Opt-In via --enable-type3)

Similar code with minor modifications like inserted/deleted statements or changed expressions.

Example:

// File 1
function processOrder(order) {
    validateOrder(order);
    let total = calculateTotal(order.items);
    applyDiscount(total, order.coupon);
    return total;
}

// File 2 (Type-3 clone - added logging, changed discount logic)
function processOrder(order) {
    validateOrder(order);
    console.log("Processing order:", order.id);  // ADDED
    let total = calculateTotal(order.items);
    let discount = order.coupon ? 0.1 : 0;      // MODIFIED
    total *= (1 - discount);                     // MODIFIED
    return total;
}

Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
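
This README does not describe how PolyDup scores near-miss clones, so the following Python sketch uses the standard library's difflib.SequenceMatcher as a stand-in. It shows how inserted and modified statements lower a line-level similarity score for the two processOrder versions above. The line_similarity helper is hypothetical.

```python
from difflib import SequenceMatcher


def line_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] over stripped, non-empty lines (2 * matches / total lines)."""
    lines_a = [ln.strip() for ln in a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in b.splitlines() if ln.strip()]
    return SequenceMatcher(None, lines_a, lines_b, autojunk=False).ratio()


file1 = """
function processOrder(order) {
    validateOrder(order);
    let total = calculateTotal(order.items);
    applyDiscount(total, order.coupon);
    return total;
}
"""

file2 = """
function processOrder(order) {
    validateOrder(order);
    console.log("Processing order:", order.id);
    let total = calculateTotal(order.items);
    let discount = order.coupon ? 0.1 : 0;
    total *= (1 - discount);
    return total;
}
"""

score = line_similarity(file1, file2)
print(f"{score:.2f}")  # 0.71 (5 shared lines out of 6 + 8)
```

Under this stand-in metric the pair scores below a 0.85 threshold, which suggests why finding such clones generally calls for gap-tolerant matching and a lower threshold.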

Type-4: Semantic Clones (Not Yet Implemented)

Functionally equivalent code with different implementations.

Example:

// File 1 - Imperative loop
function sum(arr) {
    let total = 0;
    for (let i = 0; i < arr.length; i++) {
        total += arr[i];
    }
    return total;
}

// File 2 - Functional approach
function sum(arr) {
    return arr.reduce((acc, val) => acc + val, 0);
}

// File 3 - Recursive
function sum(arr, i = 0) {
    if (i >= arr.length) return 0;
    return arr[i] + sum(arr, i + 1);
}

Why they exist: Different programming paradigms or styles achieving the same result.

Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.

Understanding Your Results

When PolyDup reports duplicates, the clone type indicates:

  • Type-1: Exact copy-paste → Quick win for extraction into shared utilities
  • Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
  • Type-3: Modified duplicates → May require refactoring with strategy patterns
  • Type-4: Semantic equivalence (not yet detected by PolyDup) → Consider standardizing on one implementation

Typical Real-World Distribution:

  • Type-1: 5-10% (rare in mature codebases)
  • Type-2: 60-70% (most common - copy-paste-modify)
  • Type-3: 20-30% (evolved duplicates)
  • Type-4: <5% (requires specialized detection)

Performance Note: PolyDup efficiently handles codebases up to 100K LOC. Tested on real-world projects with detection times under 5 seconds for most repos.

Troubleshooting

Common Issues

"No duplicates found" but you expect some

Possible causes:

  • Threshold too high: Try lowering --threshold to 0.70-0.80
  • Block size too large: Reduce --min-block-size to 5-10 lines
  • Type-3 not enabled: Add --enable-type3 for gap-tolerant matching

# More sensitive scan
polydup scan ./src --threshold 0.70 --min-block-size 5 --enable-type3

"Too many false positives"

Solutions:

  • Increase threshold: Use --threshold 0.95 for high-confidence matches
  • Exclude Type-3: Add --exclude-type type-3 to remove noisy matches
  • Increase block size: Use --min-block-size 50 for substantial duplicates only

# Strict, high-quality scan
polydup scan ./src --threshold 0.95 --exclude-type type-3 --min-block-size 50

"Permission denied" errors

Fix:

# Check file permissions
ls -la /path/to/scan

# Run with proper permissions
chmod +r /path/to/files

# Use --debug to see detailed error info
polydup scan ./src --debug

"Unsupported file type" warnings

Explanation: PolyDup currently supports JavaScript, TypeScript, Python, Rust, Vue, and Svelte. Other file types are skipped automatically.

Workaround:

Colors not working in CI/CD

Solution:

# Disable colors explicitly
polydup scan ./src --no-color

# Or use environment variable
NO_COLOR=1 polydup scan ./src

"Out of memory" on large codebases

Solutions:

# Increase minimum block size to reduce memory usage
polydup scan ./src --min-block-size 100

# Scan directories separately
polydup scan ./src
polydup scan ./tests
polydup scan ./lib

# Exclude generated/vendor code
# Create .polyduprc.toml with exclude patterns

Performance Tips

For large codebases (>50K LOC):

  • Use --min-block-size 50-100 to focus on substantial duplicates
  • Leave Type-3 detection disabled by omitting --enable-type3 (it's more computationally expensive)
  • Use --exclude-type type-3 to skip gap-tolerant matching
  • Increase --threshold to 0.95 to reduce candidate matches

For monorepos:

  • Create .polyduprc.toml at root with shared configuration
  • Use --group-by file to organize results by module
  • Exclude node_modules, dist, target, etc. in config

For CI/CD:

  • Cache the polydup binary to speed up pipeline
  • Use --format json for machine-readable output
  • Set appropriate exit code handling (0=clean, 1=duplicates, 2=error)

Getting Help

Debug Mode:

# Enable detailed error traces
polydup scan ./src --debug

Verbose Output:

# Show performance statistics
polydup scan ./src --verbose

Report an Issue:

  1. Check existing issues
  2. Include:
    • PolyDup version (polydup --version)
    • Operating system and architecture
    • Command that failed
    • Error message with --debug flag
    • Sample code if applicable (anonymized)

Community:

Development

Building from Source

Prerequisites:

  • Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
  • Node.js 16+ (for Node.js bindings)
  • Python 3.8-3.12 (for Python bindings)

CLI:

git clone https://github.com/wiesnerbernard/polydup.git
cd polydup
cargo build --release -p polydup-cli
./target/release/polydup scan ./src

Node.js bindings:

cd crates/polydup-node
npm install
npm run build
npm test

Python bindings:

cd crates/polydup-py
pip install maturin
maturin develop
python -c "import polydup; print(polydup.version())"

Run tests:

# All tests
cargo test --workspace

# Specific crate
cargo test -p polydup-core

# With coverage
cargo install cargo-tarpaulin
cargo tarpaulin --workspace

Creating a Release

Recommended: Create releases directly from GitHub UI - fully automated, no local tools required!

  1. Go to Releases → New Release
  2. Create a new tag (e.g., v0.2.7)
  3. Click "Publish release"
  4. Everything happens automatically (~5-7 minutes):
    • Syncs version files (Cargo.toml, package.json, pyproject.toml)
    • Updates CHANGELOG.md with release entry
    • Moves tag to version-synced commit (if needed)
    • Builds binaries for all 5 platforms (macOS/Linux/Windows)
    • Publishes to crates.io, npm, and PyPI
    • Creates release with binary assets
    • Zero manual steps required - truly one-click releases!

Alternative: Use the release script locally:

./scripts/release.sh 0.2.5

See docs/RELEASE.md for detailed instructions.

Pre-commit Hooks

Install pre-commit hooks to automatically run linting and tests:

# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install
pre-commit install -t pre-push

# Run manually on all files
pre-commit run --all-files

The hooks will automatically run:

  • On commit: cargo fmt, cargo clippy, file checks (trailing whitespace, YAML/TOML validation)
  • On push: Full test suite with cargo test

To skip hooks temporarily:

git commit --no-verify

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Install pre-commit hooks (pre-commit install)
  4. Make your changes and ensure tests pass (cargo test --workspace)
  5. Run clippy (cargo clippy --workspace --all-targets -- -D warnings)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

License

MIT OR Apache-2.0