polydup-core 0.1.3

Cross-language duplicate code detection library using Tree-sitter and Rabin-Karp

PolyDup

Cross-language duplicate code detector powered by Tree-sitter and Rust.

Features

  • Fast: Parallel file scanning with a Rabin-Karp rolling hash
  • Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
  • Accurate: Tree-sitter AST parsing for semantic-aware detection
  • Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
  • Configurable: Adjust thresholds and block sizes for your needs
  • Efficient: Zero-copy FFI bindings for minimal overhead

Architecture

Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.

┌─────────────────────────────────────────────┐
│           polydup-core (Rust)               │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
         ▲            ▲            ▲
         │            │            │
    ┌────┴────┐  ┌────┴────┐  ┌────┴────┐
    │ CLI     │  │ Node.js │  │ Python  │
    │ (Rust)  │  │(napi-rs)│  │ (PyO3)  │
    └─────────┘  └─────────┘  └─────────┘

Crates:

  • polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
  • polydup-cli: Standalone CLI tool (cargo install polydup-cli)
  • polydup-node: Node.js native addon via napi-rs (npm install @polydup/core)
  • polydup-py: Python extension module via PyO3 (pip install polydup)
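
For readers unfamiliar with the Rabin-Karp hashing named in the core, the following is a minimal Python sketch of a rolling hash over a token sequence. It illustrates the general technique only and is not polydup-core's implementation: the base, modulus, window size, and the function names rolling_hashes and repeated_windows are all assumptions made for this example.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

BASE = 257           # assumed polynomial base
MOD = (1 << 61) - 1  # assumed modulus (a Mersenne prime)


def rolling_hashes(tokens: List[str], window: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_index, hash) for each run of `window` consecutive tokens."""
    if window <= 0 or len(tokens) < window:
        return
    # crc32 gives a deterministic integer per token (str hash() is salted per process).
    values = [zlib.crc32(tok.encode("utf-8")) for tok in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token that leaves the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        yield i - window + 1, h


def repeated_windows(tokens: List[str], window: int) -> List[Tuple[int, int]]:
    """Return (first_start, later_start) pairs whose token windows are identical."""
    starts_by_hash: Dict[int, List[int]] = defaultdict(list)
    for start, h in rolling_hashes(tokens, window):
        starts_by_hash[h].append(start)
    pairs = []
    for starts in starts_by_hash.values():
        first = starts[0]
        for later in starts[1:]:
            # Compare the tokens themselves to rule out hash collisions.
            if tokens[first:first + window] == tokens[later:later + window]:
                pairs.append((first, later))
    return pairs


tokens = "a = b + c ; x = 1 ; a = b + c ;".split()
print(repeated_windows(tokens, 6))  # prints [(0, 10)]
```

Each step updates the hash in constant time instead of rehashing the whole window, which is why this family of hashes suits scanning large files for repeated blocks.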

Installation

Rust CLI (Recommended)

The fastest way to use PolyDup is via the CLI tool:

# Install from crates.io
cargo install polydup-cli

# Verify installation
polydup --version

# Scan for duplicates
polydup scan ./src

System Requirements:

  • Rust 1.70+ (if building from source)
  • macOS, Linux, or Windows

Pre-built Binaries:

Download pre-compiled binaries from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-aarch64-apple-darwin.tar.gz | tar xz
sudo mv polydup /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-x86_64-apple-darwin.tar.gz | tar xz
sudo mv polydup /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv polydup /usr/local/bin/

# Windows (x86_64)
# Download from releases page and add to PATH

Node.js/npm

Install as a project dependency or globally:

# Project dependency
npm install @polydup/core

# Global installation
npm install -g @polydup/core

Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

const { findDuplicates } = require('@polydup/core');

const duplicates = findDuplicates(
  ['src/', 'tests/'],  // Paths to scan
  10,                  // Minimum block size (lines)
  0.85                 // Similarity threshold (0.0-1.0)
);

console.log(`Found ${duplicates.length} duplicates`);
duplicates.forEach(dup => {
  console.log(`${dup.file1}:${dup.start_line1} <-> ${dup.file2}:${dup.start_line2}`);
  console.log(`Similarity: ${(dup.similarity * 100).toFixed(1)}%`);
});

Python/pip

Install from PyPI:

# Using pip
pip install polydup

# Using uv (recommended for faster installs)
uv pip install polydup

Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

import polydup

# Scan for duplicates
duplicates = polydup.find_duplicates(
    paths=['src/', 'tests/'],
    min_block_size=10,
    similarity_threshold=0.85
)

print(f"Found {len(duplicates)} duplicates")
for dup in duplicates:
    print(f"{dup['file1']}:{dup['start_line1']} <-> {dup['file2']}:{dup['start_line2']}")
    print(f"Similarity: {dup['similarity']*100:.1f}%")
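
Building on the result shape used above (dictionaries with file1, start_line1, file2, start_line2, and similarity keys), here is a sketch of how one might rank files by the number of duplicate pairs they take part in. The helper name rank_files is hypothetical and the sample data is made up so the example runs without PolyDup installed.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_files(duplicates: List[Dict]) -> List[Tuple[str, int]]:
    """Count, per file, how many duplicate pairs it participates in."""
    counts: Counter = Counter()
    for dup in duplicates:
        # A set counts a file once per pair, even when both sides are the same file.
        for path in {dup["file1"], dup["file2"]}:
            counts[path] += 1
    # Most affected files first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample in the shape shown in the usage example.
sample = [
    {"file1": "src/a.py", "start_line1": 10, "file2": "src/b.py", "start_line2": 42, "similarity": 0.97},
    {"file1": "src/a.py", "start_line1": 80, "file2": "src/c.js", "start_line2": 5, "similarity": 0.88},
    {"file1": "src/b.py", "start_line1": 1, "file2": "src/b.py", "start_line2": 60, "similarity": 0.91},
]

for path, pair_count in rank_files(sample):
    print(f"{path}: {pair_count} duplicate pair(s)")
```

In real use, the list returned by polydup.find_duplicates would replace sample.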

Rust Library

Use the core library in your Rust project:

# Cargo.toml
[dependencies]
polydup-core = "0.1"
anyhow = "1"  # provides the anyhow::Result used in main below

// src/main.rs
use polydup_core::{Scanner, find_duplicates};
use std::path::PathBuf;

fn main() -> anyhow::Result<()> {
    let scanner = Scanner::with_config(10, 0.85)?;
    let report = scanner.scan(vec![PathBuf::from("src")])?;

    println!("Found {} duplicates", report.duplicates.len());
    Ok(())
}

CLI Usage

Basic Commands

# Scan a directory
polydup scan ./src

# Scan multiple directories
polydup scan ./src ./tests ./lib

# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.85

# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 50

# JSON output for scripting
polydup scan ./src --format json > duplicates.json

Examples

Quick scan for severe duplicates:

polydup scan ./src --threshold 0.95 --min-block-size 20

Deep scan for similar code:

polydup scan ./src --threshold 0.70 --min-block-size 5

Scan all supported file types (detected automatically by extension):

# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src  # Scans all supported languages

CI/CD integration:

# Exit with error if duplicates found
polydup scan ./src --threshold 0.90 || exit 1
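
The shell one-liner above relies on the CLI's exit status. Where an explicit duplicate budget is preferred, a small gate over the JSON report is one option. This is a sketch that assumes the report has a top-level duplicates array, as the jq example under Output Formats suggests; the function name gate and the script name check_dups.py are hypothetical.

```python
import json
import sys


def gate(report_text: str, max_duplicates: int = 0) -> int:
    """Return a process exit code: 0 if within the budget, 1 otherwise."""
    # Assumption: the report is a JSON object with a "duplicates" array.
    count = len(json.loads(report_text).get("duplicates", []))
    print(f"{count} duplicate(s) found, budget is {max_duplicates}", file=sys.stderr)
    return 0 if count <= max_duplicates else 1


# Possible wiring in a CI step (check_dups.py is a hypothetical file name):
#   polydup scan ./src --format json > duplicates.json
#   python check_dups.py duplicates.json
# where the script ends with:
#   sys.exit(gate(open(sys.argv[1], encoding="utf-8").read(), max_duplicates=0))

print(gate('{"duplicates": []}'))
```

Raising max_duplicates lets a team adopt the check on an existing codebase and tighten the budget over time.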

Output Formats

Text (default): Human-readable colored output with file paths, line numbers, and similarity scores

JSON: Machine-readable format for scripting and tooling integration

polydup scan ./src --format json | jq '.duplicates | length'

CLI Options

Option            Type       Default  Description
--threshold       float      0.9      Similarity threshold (0.0-1.0)
--min-block-size  int        10       Minimum lines per code block
--format          text|json  text     Output format
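
To make the effect of --threshold concrete, here is a sketch of how a similarity score between two token blocks can be compared against a threshold. It uses Jaccard similarity over token sets purely as an illustrative assumption; this README does not specify which metric PolyDup computes, and the function names are hypothetical.

```python
from typing import List


def jaccard(a: List[str], b: List[str]) -> float:
    """Share of distinct tokens common to both blocks (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty blocks are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def is_duplicate(a: List[str], b: List[str], threshold: float = 0.9) -> bool:
    """A pair is reported only when its score reaches the threshold."""
    return jaccard(a, b) >= threshold


block_a = "def add ( a , b ) : return a + b".split()
block_b = "def add ( x , y ) : return x + y".split()

score = jaccard(block_a, block_b)
print(f"similarity = {score:.2f}")
print("reported at 0.90:", is_duplicate(block_a, block_b, 0.90))
print("reported at 0.60:", is_duplicate(block_a, block_b, 0.60))
```

The two blocks differ only in variable names, so they pass a loose threshold but not a strict one, which matches the "higher = stricter" note in Basic Commands.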

Supported Languages

  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx
  • Python: .py
  • Rust: .rs
  • Vue: .vue
  • Svelte: .svelte

More languages are planned (Java, Go, C/C++, Ruby, PHP).
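
The extension list above can be expressed as a lookup table, which is handy when pre-filtering files before a scan. This is a sketch, not PolyDup's detection code: the split of .jsx to JavaScript and .tsx to TypeScript, the case-insensitive match, and the function name detect_language are assumptions.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the Supported Languages list.
LANGUAGE_BY_EXTENSION = {
    ".js": "JavaScript",
    ".jsx": "JavaScript",   # assumed grouping
    ".ts": "TypeScript",
    ".tsx": "TypeScript",   # assumed grouping
    ".py": "Python",
    ".rs": "Rust",
    ".vue": "Vue",
    ".svelte": "Svelte",
}


def detect_language(path: str) -> Optional[str]:
    """Return the language for a path, or None if the extension is unsupported."""
    # Lowercasing the suffix is an assumption about case handling.
    return LANGUAGE_BY_EXTENSION.get(Path(path).suffix.lower())


for candidate in ["src/App.vue", "lib/util.ts", "main.go"]:
    print(candidate, "->", detect_language(candidate))
```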

Development

Building from Source

Prerequisites:

  • Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
  • Node.js 16+ (for Node.js bindings)
  • Python 3.8-3.12 (for Python bindings)

CLI:

git clone https://github.com/wiesnerbernard/polydup.git
cd polydup
cargo build --release -p polydup-cli
./target/release/polydup scan ./src

Node.js bindings:

cd crates/polydup-node
npm install
npm run build
npm test

Python bindings:

cd crates/polydup-py
pip install maturin
maturin develop
python -c "import polydup; print(polydup.version())"

Run tests:

# All tests
cargo test --workspace

# Specific crate
cargo test -p polydup-core

# With coverage
cargo install cargo-tarpaulin
cargo tarpaulin --workspace

Pre-commit Hooks

Install pre-commit hooks to automatically run linting and tests:

# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install
pre-commit install -t pre-push

# Run manually on all files
pre-commit run --all-files

The hooks will automatically run:

  • On commit: cargo fmt, cargo clippy, file checks (trailing whitespace, YAML/TOML validation)
  • On push: Full test suite with cargo test

To skip hooks temporarily:

git commit --no-verify

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Install pre-commit hooks (pre-commit install)
  4. Make your changes and ensure tests pass (cargo test --workspace)
  5. Run clippy (cargo clippy --workspace --all-targets -- -D warnings)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

License

MIT OR Apache-2.0