polydup-core 0.2.2

Cross-language duplicate code detection library using Tree-sitter and Rabin-Karp

PolyDup

Cross-language duplicate code detector powered by Tree-sitter and Rust.

Features

  • Blazing Fast: Parallel processing with Rabin-Karp rolling hash algorithm
  • Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
  • Accurate: Tree-sitter AST parsing for semantic-aware detection
  • Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
  • Configurable: Adjust thresholds and block sizes for your needs
  • Efficient: Zero-copy FFI bindings for minimal overhead

Architecture

Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.

┌─────────────────────────────────────────────┐
│           polydup-core (Rust)               │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
          ▲          ▲          ▲
          │          │          │
    ┌─────┴───┐  ┌───┴─────┐  ┌─┴─────┐
    │ CLI     │  │ Node.js │  │ Python│
    │ (Rust)  │  │(napi-rs)│  │(PyO3) │
    └─────────┘  └─────────┘  └───────┘

Crates:

  • polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
  • polydup-cli: Standalone CLI tool (cargo install polydup-cli)
  • polydup-node: Node.js native addon via napi-rs (npm install @polydup/core)
  • polydup-py: Python extension module via PyO3 (pip install polydup)
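
The architecture names Rabin-Karp hashing but does not show what it does. As a rough illustration only, the sketch below rolls a polynomial hash over fixed-size windows of tokens and groups windows that repeat. It is written in Python for brevity and is not the polydup-core implementation; `BASE`, `MOD`, `token_value`, `rolling_hashes`, and `find_repeated_windows` are hypothetical names and constants chosen for this example.

```python
from collections import defaultdict

BASE = 257            # assumed polynomial base
MOD = (1 << 61) - 1   # assumed large prime modulus


def token_value(token):
    """Map a token string to a deterministic integer."""
    v = 0
    for byte in token.encode("utf-8"):
        v = (v * 131 + byte) % MOD
    return v


def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for every window of `window` tokens."""
    if window <= 0 or len(tokens) < window:
        return
    values = [token_value(t) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        yield i - window + 1, h


def find_repeated_windows(tokens, window):
    """Return sorted lists of start indices whose token windows are identical."""
    buckets = defaultdict(list)
    for start, h in rolling_hashes(tokens, window):
        buckets[h].append(start)
    repeats = []
    for starts in buckets.values():
        # Compare the actual tokens so a hash collision is never reported.
        groups = defaultdict(list)
        for s in starts:
            groups[tuple(tokens[s:s + window])].append(s)
        repeats.extend(g for g in groups.values() if len(g) > 1)
    return sorted(repeats)


tokens = "a b c x a b c y".split()
print(find_repeated_windows(tokens, 3))
```

Each step updates the hash in constant time instead of rehashing the whole window, which is what makes scanning every window of a large file affordable.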

Installation

Rust CLI (Recommended)

The fastest way to use PolyDup is via the CLI tool:

# Install from crates.io
cargo install polydup-cli

# Verify installation
polydup --version

# Scan for duplicates
polydup scan ./src

System Requirements:

  • Rust 1.70+ (if building from source)
  • macOS, Linux, or Windows

Pre-built Binaries:

Download pre-compiled binaries from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-aarch64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64 static - musl)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64-musl -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Windows (x86_64)
# Download polydup-windows-x86_64.exe from releases page and add to PATH

Node.js/npm

Install as a project dependency or globally:

# Project dependency
npm install @polydup/core

# Global installation
npm install -g @polydup/core

Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

const { findDuplicates } = require('@polydup/core');

const duplicates = findDuplicates(
  ['src/', 'tests/'],  // Paths to scan
  10,                  // Minimum block size (lines)
  0.85                 // Similarity threshold (0.0-1.0)
);

console.log(`Found ${duplicates.length} duplicates`);
duplicates.forEach(dup => {
  console.log(`${dup.file1}:${dup.start_line1} <-> ${dup.file2}:${dup.start_line2}`);
  console.log(`Similarity: ${(dup.similarity * 100).toFixed(1)}%`);
});

Python/pip

Install from PyPI:

# Using pip
pip install polydup

# Using uv (recommended for faster installs)
uv pip install polydup

Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

import polydup

# Scan for duplicates
duplicates = polydup.find_duplicates(
    paths=['src/', 'tests/'],
    min_block_size=10,
    similarity_threshold=0.85
)

print(f"Found {len(duplicates)} duplicates")
for dup in duplicates:
    print(f"{dup['file1']}:{dup['start_line1']} <-> {dup['file2']}:{dup['start_line2']}")
    print(f"Similarity: {dup['similarity']*100:.1f}%")
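
For larger scans it can help to see which files take part in the most duplicate pairs. The sketch below assumes each result is a dict with the `file1` and `file2` keys used in the loop above; `files_by_duplicate_count` is a hypothetical helper, not part of the polydup package, and the sample data is hand-written rather than real scan output.

```python
from collections import Counter


def files_by_duplicate_count(duplicates):
    """Count how many reported pairs each file takes part in."""
    counts = Counter()
    for dup in duplicates:
        # A pair inside a single file counts once for that file.
        for path in {dup["file1"], dup["file2"]}:
            counts[path] += 1
    # Most involved files first; ties are ordered by path.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hand-written sample in the shape used above, not real scan output.
sample = [
    {"file1": "src/a.py", "start_line1": 10,
     "file2": "src/b.py", "start_line2": 42, "similarity": 0.91},
    {"file1": "src/a.py", "start_line1": 80,
     "file2": "tests/test_a.py", "start_line2": 5, "similarity": 0.88},
]

for path, count in files_by_duplicate_count(sample):
    print(f"{path}: {count}")
```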

Rust Library

Use the core library in your Rust project:

[dependencies]
polydup-core = "0.2"
anyhow = "1"  # the example below returns anyhow::Result

use polydup_core::{Scanner, find_duplicates};
use std::path::PathBuf;

fn main() -> anyhow::Result<()> {
    let scanner = Scanner::with_config(10, 0.85)?;
    let report = scanner.scan(vec![PathBuf::from("src")])?;

    println!("Found {} duplicates", report.duplicates.len());
    Ok(())
}

CLI Usage

Quick Start with polydup init

The fastest way to get started is with the interactive initialization wizard:

# Run the initialization wizard
polydup init

# Non-interactive mode (use defaults)
polydup init --yes

# Force overwrite existing configuration
polydup init --force

The wizard will:

  • Auto-detect your project environment (Node.js, Rust, Python, etc.)
  • Generate .polyduprc.toml with environment-specific defaults
  • Create GitHub Actions workflow (optional)
  • Show install instructions tailored to your environment
  • Provide next steps for local usage

Example workflow:

$ polydup init

PolyDup Initialization Wizard
=============================

Detected environments:
  - Node.js
  - Python

 Select similarity threshold: Standard (0.85)
 Select minimum block size: Medium (50 lines)
 Add custom exclude patterns? · No
 Would you like to create a GitHub Actions workflow? · Yes

Configuration saved to: .polyduprc.toml
GitHub Actions workflow created: .github/workflows/polydup.yml

Next Steps:
  1. Install: npm install -g @polydup/core
  2. Scan: polydup scan ./src

Configuration File (.polyduprc.toml)

After running polydup init, you'll have a .polyduprc.toml file:

[scan]
min_block_size = 50
similarity_threshold = 0.85

[scan.exclude]
patterns = [
    "**/node_modules/**",
    "**/__pycache__/**",
    "**/*.test.js",
    "**/*.test.py",
]

[output]
format = "text"
verbose = false

[ci]
enabled = false
fail_on_duplicates = true

Configuration Discovery:

  • PolyDup searches for .polyduprc.toml in the current directory, then in each parent directory
  • CLI arguments override config file settings
  • Perfect for monorepos with shared configuration at root
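
The discovery and override rules above can be sketched as follows. This is an illustration of the described behavior, not PolyDup's actual code: `find_config` and `effective_settings` are hypothetical names, and treating an option that was not passed on the command line as `None` is an assumption made for this example.

```python
from pathlib import Path

CONFIG_NAME = ".polyduprc.toml"


def find_config(start):
    """Return the nearest CONFIG_NAME at or above `start`, or None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def effective_settings(file_settings, cli_overrides):
    """Merge settings so that CLI values win over config-file values.

    Options that were not given on the command line are passed as None
    and fall back to the value from the file.
    """
    merged = dict(file_settings)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


print(effective_settings(
    {"min_block_size": 50, "similarity_threshold": 0.85},
    {"min_block_size": None, "similarity_threshold": 0.9},
))
```

Because the search walks upward, a single file at a monorepo root would apply to scans started from any package directory below it.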

Basic Commands

# Scan a directory
polydup scan ./src

# Scan multiple directories
polydup scan ./src ./tests ./lib

# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.85

# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 50

# JSON output for scripting
polydup scan ./src --format json > duplicates.json

Examples

Quick scan for severe duplicates:

polydup scan ./src --threshold 0.95 --min-block-size 20

Deep scan for similar code:

polydup scan ./src --threshold 0.70 --min-block-size 5

Scan specific file types:

# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src  # Scans all supported languages

CI/CD integration:

# Exit with error if duplicates found
polydup scan ./src --threshold 0.90 || exit 1

Output Formats

Text (default): Human-readable colored output with file paths, line numbers, and similarity scores

JSON: Machine-readable format for scripting and tooling integration

polydup scan ./src --format json | jq '.duplicates | length'
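
The same count can be taken from a script. The sketch below relies only on what the jq filter above implies, namely a top-level `duplicates` array; it assumes nothing else about the report schema. `count_duplicates` and `scan_and_count` are hypothetical helper names.

```python
import json
import subprocess


def count_duplicates(report_json):
    """Return the length of the top-level "duplicates" array."""
    report = json.loads(report_json)
    return len(report["duplicates"])


def scan_and_count(path):
    """Run the documented CLI command and count what it reports."""
    # check=False: per the CI/CD example, a scan that finds duplicates
    # may exit with a non-zero status, which is not a failure here.
    result = subprocess.run(
        ["polydup", "scan", path, "--format", "json"],
        capture_output=True, text=True, check=False,
    )
    return count_duplicates(result.stdout)


# Hand-written sample; only the "duplicates" key is taken from the docs.
sample = '{"duplicates": [{}, {}, {}]}'
print(count_duplicates(sample))
```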

CLI Options

Option            Type       Default  Description
--threshold       float      0.9      Similarity threshold (0.0-1.0)
--min-block-size  int        10       Minimum lines per code block
--format          text|json  text     Output format

Supported Languages

  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx
  • Python: .py
  • Rust: .rs
  • Vue: .vue
  • Svelte: .svelte

More languages coming soon (Java, Go, C/C++, Ruby, PHP)

Clone Types

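PolyDup classifies duplicates using the four-type clone taxonomy associated with the International Workshop on Software Clones (IWSC). Type-1 and Type-2 detection is implemented; Type-3 and Type-4 are described here for context but are not yet implemented.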

Type-1: Exact Clones

Identical code fragments except for whitespace, comments, and formatting.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-1 clone - only formatting differs)
function calculateTotal(items) {
  let sum = 0;
  for (let i = 0; i < items.length; i++) { sum += items[i].price; }
  return sum;
}

Why they exist: Direct copy-paste without any modifications.
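
To make "identical except for whitespace, comments, and formatting" concrete, the sketch below compares two fragments token by token after dropping comments and whitespace. PolyDup itself parses with Tree-sitter; the regex tokenizer here is a simplified stand-in for JavaScript-like code, and `TOKEN_RE`, `tokens`, and `is_type1_clone` are hypothetical names.

```python
import re

# Comments come first so they match as a unit; then strings, words,
# numbers, and finally any single punctuation character.
TOKEN_RE = re.compile(
    r"(?P<comment>//[^\n]*|/\*.*?\*/)"
    r"|(?P<string>\"(?:\\.|[^\"\\])*\"|'(?:\\.|[^'\\])*')"
    r"|(?P<word>[A-Za-z_$][A-Za-z0-9_$]*)"
    r"|(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<punct>[^\sA-Za-z0-9_$])",
    re.DOTALL,
)


def tokens(source):
    """Return token texts with whitespace and comments removed."""
    return [m.group() for m in TOKEN_RE.finditer(source)
            if m.lastgroup != "comment"]


def is_type1_clone(a, b):
    """True when both fragments have the same token sequence."""
    return tokens(a) == tokens(b)


file1 = """
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}
"""

file2 = """
// only formatting differs
function calculateTotal(items) {
  let sum = 0;
  for (let i = 0; i < items.length; i++) { sum += items[i].price; }
  return sum;
}
"""

print(is_type1_clone(file1, file2))
```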

Type-2: Renamed/Parameterized Clones

Structurally identical code with renamed identifiers, changed literals, or different types.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-2 clone - renamed variables, same logic)
function computeSum(products) {
    let total = 0;
    for (let j = 0; j < products.length; j++) {
        total += products[j].cost;
    }
    return total;
}

Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.

Detection: PolyDup normalizes identifiers and literals to placeholders (e.g., sum → @@ID, 0 → @@NUM) so that fragments differing only in names and values compare as structurally equal.
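
A minimal sketch of this kind of normalization is shown below, reusing the @@ID and @@NUM placeholders from the text. It is not PolyDup's implementation, which works on Tree-sitter syntax trees: the regex tokenizer, the small `KEYWORDS` set, and the `normalize` function are assumptions for illustration, and comments and string literals are not handled.

```python
import re

# A small, assumed subset of JavaScript keywords that must stay as-is.
KEYWORDS = {"function", "let", "const", "var", "for", "while",
            "if", "else", "return"}

# Words, numbers, or any single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+(?:\.\d+)?|[^\sA-Za-z0-9_$]")


def normalize(source):
    """Replace identifiers with @@ID and numeric literals with @@NUM."""
    out = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isdigit():
            out.append("@@NUM")
        elif (tok[0].isalpha() or tok[0] in "_$") and tok not in KEYWORDS:
            out.append("@@ID")
        else:
            out.append(tok)  # keywords and punctuation carry the structure
    return out


file1 = """
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}
"""

file2 = """
function computeSum(products) {
    let total = 0;
    for (let j = 0; j < products.length; j++) {
        total += products[j].cost;
    }
    return total;
}
"""

print(normalize("let sum = 0;"))
print(normalize(file1) == normalize(file2))
```

Mapping every identifier to the same placeholder is deliberately coarse: it makes renamed copies compare equal, at the cost of also matching fragments that merely share a shape.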

Type-3: Near-Miss Clones (Not Yet Implemented)

Similar code with minor modifications like inserted/deleted statements or changed expressions.

Example:

// File 1
function processOrder(order) {
    validateOrder(order);
    let total = calculateTotal(order.items);
    applyDiscount(total, order.coupon);
    return total;
}

// File 2 (Type-3 clone - added logging, changed discount logic)
function processOrder(order) {
    validateOrder(order);
    console.log("Processing order:", order.id);  // ADDED
    let total = calculateTotal(order.items);
    let discount = order.coupon ? 0.1 : 0;      // MODIFIED
    total *= (1 - discount);                     // MODIFIED
    return total;
}

Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
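
Since Type-3 detection is not yet implemented, the following is only an illustration of how a similarity score between 0.0 and 1.0 could be computed for near-miss fragments. It uses `difflib.SequenceMatcher` from the Python standard library over stripped lines, which is a swapped-in technique for this example and not something PolyDup is documented to use; `line_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher


def line_similarity(a, b):
    """Ratio in [0.0, 1.0] of matching lines.

    Indentation and blank lines are ignored before comparing.
    """
    lines_a = [ln.strip() for ln in a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in b.splitlines() if ln.strip()]
    return SequenceMatcher(None, lines_a, lines_b, autojunk=False).ratio()


original = """
function processOrder(order) {
    validateOrder(order);
    let total = calculateTotal(order.items);
    applyDiscount(total, order.coupon);
    return total;
}
"""

modified = """
function processOrder(order) {
    validateOrder(order);
    console.log("Processing order:", order.id);
    let total = calculateTotal(order.items);
    let discount = order.coupon ? 0.1 : 0;
    total *= (1 - discount);
    return total;
}
"""

print(f"{line_similarity(original, modified):.2f}")
```

Five of the lines are shared between the 6-line and 8-line fragments, so the ratio is 2 × 5 / (6 + 8), roughly 0.71. A pair like this would fall below a 0.85 threshold but above 0.70.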

Type-4: Semantic Clones (Not Yet Implemented)

Functionally equivalent code with different implementations.

Example:

// File 1 - Imperative loop
function sum(arr) {
    let total = 0;
    for (let i = 0; i < arr.length; i++) {
        total += arr[i];
    }
    return total;
}

// File 2 - Functional approach
function sum(arr) {
    return arr.reduce((acc, val) => acc + val, 0);
}

// File 3 - Recursive
function sum(arr, i = 0) {
    if (i >= arr.length) return 0;
    return arr[i] + sum(arr, i + 1);
}

Why they exist: Different programming paradigms or styles achieving the same result.

Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.

Understanding Your Results

When PolyDup reports duplicates, the clone type indicates:

  • Type-1: Exact copy-paste → Quick win for extraction into shared utilities
  • Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
  • Type-3: Modified duplicates → May require refactoring with strategy patterns
  • Type-4: Semantic equivalence → Consider standardizing on one implementation

Typical Real-World Distribution:

  • Type-1: 5-10% (rare in mature codebases)
  • Type-2: 60-70% (most common - copy-paste-modify)
  • Type-3: 20-30% (evolved duplicates)
  • Type-4: <5% (requires specialized detection)

Development

Building from Source

Prerequisites:

  • Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
  • Node.js 16+ (for Node.js bindings)
  • Python 3.8-3.12 (for Python bindings)

CLI:

git clone https://github.com/wiesnerbernard/polydup.git
cd polydup
cargo build --release -p polydup-cli
./target/release/polydup scan ./src

Node.js bindings:

cd crates/polydup-node
npm install
npm run build
npm test

Python bindings:

cd crates/polydup-py
pip install maturin
maturin develop
python -c "import polydup; print(polydup.version())"

Run tests:

# All tests
cargo test --workspace

# Specific crate
cargo test -p polydup-core

# With coverage
cargo install cargo-tarpaulin
cargo tarpaulin --workspace

Pre-commit Hooks

Install pre-commit hooks to automatically run linting and tests:

# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install
pre-commit install -t pre-push

# Run manually on all files
pre-commit run --all-files

The hooks will automatically run:

  • On commit: cargo fmt, cargo clippy, file checks (trailing whitespace, YAML/TOML validation)
  • On push: Full test suite with cargo test

To skip hooks temporarily:

git commit --no-verify

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Install pre-commit hooks (pre-commit install)
  4. Make your changes and ensure tests pass (cargo test --workspace)
  5. Run clippy (cargo clippy --workspace --all-targets -- -D warnings)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

License

MIT OR Apache-2.0