polydup-core 0.8.1

Cross-language duplicate code detection library using Tree-sitter and Rabin-Karp

PolyDup

Cross-language duplicate code detector powered by Tree-sitter and Rust.

Features

  • Blazing Fast: Parallel processing with Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
  • Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
  • Accurate: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
  • Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
  • Configurable: Adjust thresholds and block sizes for your needs
  • Efficient: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
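
To make the Rabin-Karp rolling hash mentioned above more concrete, here is a minimal Python sketch of the idea. It is not PolyDup's Rust implementation: the function names, the BASE and MOD constants, and the CRC32-based token codes are all assumptions chosen for illustration. Each window of consecutive tokens gets a hash that is updated in constant time as the window slides, and hash matches are verified token by token to rule out collisions.

```python
import zlib
from typing import Dict, List, Sequence, Tuple

BASE = 257             # illustrative polynomial base (assumption)
MOD = (1 << 61) - 1    # large prime modulus (assumption)


def token_code(token: str) -> int:
    """Stable integer code for a token (str hash() is salted per process)."""
    return zlib.crc32(token.encode("utf-8")) + 1


def rolling_hashes(tokens: Sequence[str], window: int) -> List[int]:
    """Return one polynomial hash per window of `window` consecutive tokens."""
    if window <= 0 or len(tokens) < window:
        return []
    codes = [token_code(t) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    hashes = [h]
    for i in range(window, len(codes)):
        h = (h - codes[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + codes[i]) % MOD           # append the newest token
        hashes.append(h)
    return hashes


def candidate_pairs(a: Sequence[str], b: Sequence[str], window: int) -> List[Tuple[int, int]]:
    """Window start indices (i, j) where a and b share `window` identical tokens."""
    index: Dict[int, List[int]] = {}
    for i, h in enumerate(rolling_hashes(a, window)):
        index.setdefault(h, []).append(i)
    pairs = []
    for j, h in enumerate(rolling_hashes(b, window)):
        for i in index.get(h, []):
            # Verify the tokens themselves so a hash collision is never reported.
            if list(a[i:i + window]) == list(b[j:j + window]):
                pairs.append((i, j))
    return pairs
```

The window size plays the role of a minimum block size: smaller windows find more (and noisier) matches, larger windows find fewer.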

Architecture

Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.

┌─────────────────────────────────────────────┐
│           polydup-core (Rust)               │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
          ▲          ▲          ▲
          │          │          │
    ┌─────┴───┐  ┌───┴─────┐  ┌─┴─────┐
    │ CLI     │  │ Node.js │  │ Python│
    │ (Rust)  │  │(napi-rs)│  │(PyO3) │
    └─────────┘  └─────────┘  └───────┘

Crates:

  • polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
  • polydup (CLI): Standalone CLI tool (cargo install polydup)
  • polydup-node: Node.js library bindings via napi-rs (npm install @polydup/core)
  • polydup-py: Python library bindings via PyO3 (pip install polydup)

Installation

Important: PolyDup is available in multiple forms for different use cases:

  • CLI Tool: cargo install polydup - Command-line scanning
  • Python Library: pip install polydup - Python API bindings (NOT a CLI)
  • Node.js Library: npm install @polydup/core - Node.js API bindings (NOT a CLI)

If you want to run polydup from the command line, use cargo install polydup.

GitHub Action (Easiest for CI/CD) 🚀

The fastest way to add duplicate detection to your workflow:

name: Code Quality

on:
  pull_request:
    branches: [ main ]

permissions:
  contents: read
  pull-requests: write  # Required for PR comments

jobs:
  duplicate-detection:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for git-diff mode

      - uses: wiesnerbernard/polydup-action@v0.2.1
        with:
          threshold: 50
          similarity: '0.85'
          fail-on-duplicates: true
          github-token: ${{ secrets.GITHUB_TOKEN }}

Benefits:

  • 🚀 10-100x faster than a full scan on large codebases (only files changed in the PR are scanned)
  • 💬 Automatic PR comments with duplicate reports
  • ✅ Zero configuration needed
  • 🔒 Secure (no data leaves your repository)

Action Inputs

Input               Default  Description
threshold           50       Minimum code block size in tokens
similarity          0.85     Similarity threshold (0.0-1.0)
fail-on-duplicates  true     Fail the check if duplicates found
format              text     Output format: text or json
base-ref            auto     Base git reference (auto-detects from PR)
github-token        -        Token for PR comments
comment-on-pr       true     Post results as PR comment

Action Outputs

Output            Description
duplicates-found  Number of duplicate code blocks found
files-scanned     Number of files scanned
exit-code         Exit code (0 = no duplicates, 1 = duplicates)

Using Outputs in Workflows

- uses: wiesnerbernard/polydup-action@v0.2.1
  id: polydup
  with:
    fail-on-duplicates: false
    github-token: ${{ secrets.GITHUB_TOKEN }}

- name: Check results
  run: |
    echo "Files scanned: ${{ steps.polydup.outputs.files-scanned }}"
    echo "Duplicates found: ${{ steps.polydup.outputs.duplicates-found }}"
    if [ "${{ steps.polydup.outputs.duplicates-found }}" -gt 10 ]; then
      echo "Too many duplicates!"
      exit 1
    fi

Example PR Comment

When duplicates are found, the action posts a comment like:

## PolyDup Duplicate Code Report

**Found 3 duplicate code block(s)**

- Files scanned: 12
- Threshold: 50 tokens
- Similarity: 0.85

<details>
<summary>View Details</summary>
[Detailed scan output...]
</details>

**Tip**: Consider refactoring duplicated code to improve maintainability.

See polydup-action for full documentation.


Rust CLI (Recommended for Local Development)

The fastest way to use PolyDup locally is via the CLI tool:

# Install from crates.io
cargo install polydup

# Verify installation
polydup --version

# Scan for duplicates
polydup scan ./src

System Requirements:

  • Rust 1.70+ (if building from source)
  • macOS, Linux, or Windows

Note: Homebrew tap coming soon! (brew install polydup)

Pre-built Binaries:

Download pre-compiled binaries from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-aarch64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-macos-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64 -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Linux (x86_64 static - musl)
curl -L https://github.com/wiesnerbernard/polydup/releases/latest/download/polydup-linux-x86_64-musl -o polydup
chmod +x polydup
sudo mv polydup /usr/local/bin/

# Windows (x86_64)
# Download polydup-windows-x86_64.exe from releases page and add to PATH

Node.js/npm (Library Only)

Note: This is a library package for integrating duplicate detection into Node.js applications. It does NOT provide a CLI. For command-line usage, use cargo install polydup.

Install as a project dependency:

npm install @polydup/core

Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

const { findDuplicates } = require('@polydup/core');

const duplicates = findDuplicates(
  ['src/', 'tests/'],  // Paths to scan
  10,                  // Minimum block size (lines)
  0.85                 // Similarity threshold (0.0-1.0)
);

console.log(`Found ${duplicates.length} duplicates`);
duplicates.forEach(dup => {
  console.log(`${dup.file1}:${dup.start_line1} ↔ ${dup.file2}:${dup.start_line2}`);
  console.log(`Similarity: ${(dup.similarity * 100).toFixed(1)}%`);
});

Python/pip (Library Only)

Note: This is a library package for integrating duplicate detection into Python applications. It does NOT provide a CLI. For command-line usage, use cargo install polydup.

Running python -m polydup will display installation guidance.

Install from PyPI:

# Using pip
pip install polydup

# Using uv (recommended for faster installs)
uv pip install polydup

Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage:

import polydup

# Scan for duplicates
duplicates = polydup.find_duplicates(
    paths=['src/', 'tests/'],
    min_block_size=10,
    similarity_threshold=0.85
)

print(f"Found {len(duplicates)} duplicates")
for dup in duplicates:
    print(f"{dup['file1']}:{dup['start_line1']} ↔ {dup['file2']}:{dup['start_line2']}")
    print(f"Similarity: {dup['similarity']*100:.1f}%")

Rust Library

Use the core library in your Rust project:

[dependencies]
polydup-core = "0.8"
anyhow = "1"  # main() below returns anyhow::Result

use polydup_core::Scanner;
use std::path::PathBuf;

fn main() -> anyhow::Result<()> {
    let scanner = Scanner::with_config(10, 0.85)?;
    let report = scanner.scan(vec![PathBuf::from("src")])?;

    println!("Found {} duplicates", report.duplicates.len());
    Ok(())
}

Building from Source

CLI

cargo build --release -p polydup
./target/release/polydup scan ./src

Node.js

cd crates/polydup-node
npm install
npm run build

Python

cd crates/polydup-py
maturin develop
python -c "import polydup; print(polydup.version())"

CLI Usage

Quick Start with polydup init

The fastest way to get started is with the interactive initialization wizard:

# Run the initialization wizard
polydup init

# Non-interactive mode (use defaults)
polydup init --yes

# Force overwrite existing configuration
polydup init --force

The wizard will:

  • Auto-detect your project environment (Node.js, Rust, Python, etc.)
  • Generate .polyduprc.toml with environment-specific defaults
  • Create GitHub Actions workflow (optional)
  • Show install instructions tailored to your environment
  • Provide next steps for local usage

Example workflow:

$ polydup init

PolyDup Initialization Wizard
=============================

Detected environments:
  - Node.js
  - Python

 Select similarity threshold: Standard (0.85)
 Select minimum block size: Medium (50 lines)
 Add custom exclude patterns? · No
 Would you like to create a GitHub Actions workflow? · Yes

Configuration saved to: .polyduprc.toml
GitHub Actions workflow created: .github/workflows/polydup.yml

Next Steps:
  1. Install: npm install -g @polydup/core
  2. Scan: polydup scan ./src

Configuration File (.polyduprc.toml)

After running polydup init, you'll have a .polyduprc.toml file:

[scan]
min_block_size = 50
similarity_threshold = 0.85

[scan.exclude]
patterns = [
    "**/node_modules/**",
    "**/__pycache__/**",
    "**/*.test.js",
    "**/*.test.py",
]

[output]
format = "text"
verbose = false

[ci]
enabled = false
fail_on_duplicates = true

Configuration Discovery:

  • PolyDup searches for .polyduprc.toml in current directory and parent directories
  • CLI arguments override config file settings
  • Perfect for monorepos with shared configuration at root
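
The discovery and override rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not PolyDup's actual code: find_config and resolve_setting are hypothetical helper names, and "nearest file wins" is an assumption about how the parent-directory search resolves multiple matches.

```python
from pathlib import Path
from typing import Optional, TypeVar

T = TypeVar("T")

CONFIG_NAME = ".polyduprc.toml"


def find_config(start: Path) -> Optional[Path]:
    """Walk from `start` up through its parents; return the nearest config file."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def resolve_setting(cli_value: Optional[T], config_value: Optional[T], default: T) -> T:
    """CLI arguments override the config file, which overrides the built-in default."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return default
```

In a monorepo, running a scan from packages/app would pick up a .polyduprc.toml at the repository root under this scheme, while an explicit --threshold on the command line would still take precedence.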

Basic Commands

# Scan a directory
polydup scan ./src

# Scan multiple directories
polydup scan ./src ./tests ./lib

# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.85

# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 50

# JSON output for scripting
polydup scan ./src --format json > duplicates.json

Examples

Quick scan for severe duplicates:

polydup scan ./src --threshold 0.95 --min-block-size 20

Deep scan for similar code:

polydup scan ./src --threshold 0.70 --min-block-size 5

Scan specific file types:

# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src  # Scans all supported languages

CI/CD integration:

# Exit with error if duplicates found
polydup scan ./src --threshold 0.90 || exit 1

Output Formats

Text (default): Human-readable colored output with file paths, line numbers, and similarity scores

JSON: Machine-readable format for scripting and tooling integration

polydup scan ./src --format json | jq '.duplicates | length'
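
For tooling that does not use jq, the same count can be read in Python. This is a small sketch that relies only on the top-level duplicates array implied by the jq filter above; the helper names and the idea of a duplicate "budget" are assumptions for illustration, and the fields inside each entry are not assumed.

```python
import json
from typing import Any, Dict, List


def load_duplicates(report_text: str) -> List[Dict[str, Any]]:
    """Parse a JSON report and return its top-level `duplicates` array."""
    report = json.loads(report_text)
    duplicates = report.get("duplicates", [])
    if not isinstance(duplicates, list):
        raise ValueError("expected 'duplicates' to be a JSON array")
    return duplicates


def over_budget(report_text: str, max_duplicates: int) -> bool:
    """True when the report lists more duplicates than the allowed budget."""
    return len(load_duplicates(report_text)) > max_duplicates
```

A script could read duplicates.json, call over_budget(text, 10), and fail the build only when the count exceeds the agreed limit.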

Commands

PolyDup supports the following subcommands:

Command  Description                                Example
scan     Scan for duplicate code (default command)  polydup scan ./src
init     Interactive setup wizard                   polydup init

Scan Command Options:

The scan command accepts all options listed below. When no subcommand is specified, scan is assumed for backward compatibility.

# These are equivalent:
polydup scan ./src --threshold 0.95
polydup ./src --threshold 0.95

Init Command Options:

Option     Description
--yes, -y  Skip interactive prompts, use defaults
--force    Overwrite existing .polyduprc.toml

CLI Options

Option            Type       Default  Description
--threshold       float      0.9      Similarity threshold (0.0-1.0)
--min-block-size  int        10       Minimum lines per code block
--format          text|json  text     Output format
--output          path       -        Write report to file
--only-type       types      -        Filter by clone type (type-1, type-2, type-3)
--exclude-type    types      -        Exclude clone types
--group-by        criterion  -        Group results (file, similarity, type, size)
--verbose         flag       false    Show performance statistics
--no-color        flag       false    Disable colored output
--debug           flag       false    Enable debug mode with detailed traces
--enable-type3    flag       false    Enable Type-3 gap-tolerant detection
--save-baseline   path       -        Save scan results as baseline for future comparisons
--compare-to      path       -        Compare against baseline (show only new duplicates)
--git-diff        range      -        Only scan files changed in git diff range (e.g., origin/main..HEAD) ⚡ Recommended for CI

Performance Tip: For large codebases (>50K LOC), increase --min-block-size to 20-50 for faster scans with less noise.

Baseline/Snapshot Mode

The most powerful feature for CI/CD: Block new duplicates without failing on legacy code.

Use Case: "We have existing duplicates, but block any NEW ones"

Many codebases have legacy duplication that's not worth fixing immediately. Baseline mode lets you:

  • ✅ Accept existing duplicates as-is
  • ✅ Fail CI/CD only when new duplicates are introduced
  • ✅ Gradually reduce technical debt without blocking development

Quick Start

Step 1: Create baseline from your main branch

# On main/master branch: capture current state
polydup scan ./src --save-baseline .polydup-baseline.json
git add .polydup-baseline.json
git commit -m "chore: add duplication baseline"

Step 2: Use in CI/CD to block new duplicates

# .github/workflows/polydup.yml
- name: Check for new duplicates
  run: |
    polydup scan ./src --compare-to .polydup-baseline.json
    # Exits with code 1 if NEW duplicates found
    # Exits with code 0 if no new duplicates (CI passes)

Step 3: See it in action on a PR

# Developer adds duplicate code in feature branch
polydup scan ./src --compare-to .polydup-baseline.json

Output:

ℹ Comparing against baseline: .polydup-baseline.json
  11 total duplicates, 3 new since baseline

Duplicates
═══════════════════════════════════════════

1. Type-2 (renamed) | Similarity: 100.0% | Length: 59 tokens
   ├─ src/new-feature.ts:12
   └─ src/utils.ts:45

❌ 3 new duplicates found since baseline

Exit code: 1 (CI fails, PR blocked)
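
Conceptually, baseline comparison is a set difference between the current scan and the saved one. The Python sketch below shows that idea only. The baseline file format is not documented here, so the entry fields (file1, start_line1, file2, start_line2, borrowed from the library examples) and the location-based fingerprint are assumptions, not PolyDup's actual matching logic.

```python
from typing import Any, Dict, FrozenSet, Iterable, List, Tuple

Duplicate = Dict[str, Any]


def fingerprint(dup: Duplicate) -> FrozenSet[Tuple[str, int]]:
    """Order-independent key: the unordered pair of (file, start line) locations."""
    return frozenset({
        (dup["file1"], dup["start_line1"]),
        (dup["file2"], dup["start_line2"]),
    })


def new_since_baseline(current: Iterable[Duplicate], baseline: Iterable[Duplicate]) -> List[Duplicate]:
    """Return the duplicates in `current` whose fingerprint is absent from `baseline`."""
    known = {fingerprint(d) for d in baseline}
    return [d for d in current if fingerprint(d) not in known]
```

A location-based key is fragile when code moves within a file; a real implementation would more plausibly key on content hashes, which is one reason to treat this as a sketch of the concept rather than of the tool.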

Advanced Baseline Workflows

Incremental improvement: Update baseline after cleanup

# Team cleans up 10 duplicates
polydup scan ./src --save-baseline .polydup-baseline.json
git add .polydup-baseline.json
git commit -m "chore: update baseline after duplication cleanup"

Combining with filters

# Save baseline excluding Type-3 (noisy matches)
polydup scan ./src --exclude-type type-3 --save-baseline baseline.json

# Only block new Type-1 and Type-2 duplicates
polydup scan ./src --only-type type-1,type-2 --compare-to baseline.json

Manual review mode

# See what duplicates are NEW (no CI failure, just info)
polydup scan ./src --compare-to baseline.json --format json \
  | jq '.duplicates | length'

Real-world Example: PR Comments

Use with GitHub Actions to comment on PRs:

- name: Check duplicates
  id: polydup
  run: |
    OUTPUT=$(polydup scan ./src --compare-to .polydup-baseline.json --format json || true)
    NEW_COUNT=$(echo "$OUTPUT" | jq '.duplicates | length')
    echo "new_duplicates=$NEW_COUNT" >> "$GITHUB_OUTPUT"

- name: Comment on PR
  if: steps.polydup.outputs.new_duplicates > 0
  uses: actions/github-script@v7
  with:
    script: |
      await github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: '⚠️ This PR introduces ${{ steps.polydup.outputs.new_duplicates }} new code duplicates. Please refactor before merging.'
      })

Git-Diff Mode 🚀 RECOMMENDED FOR CI/CD

The fastest, simplest way to check for duplicates in Pull Requests.

Why Git-Diff Mode?

Advantages over Baseline Mode:

  • 10-100x faster - Only scans files changed in the diff
  • No file management - No baseline file to commit/sync
  • Universal - Works on all CI platforms (GitHub, GitLab, Jenkins, etc.)
  • Simpler - Just specify a git range, no baseline setup needed
  • CI-friendly - Runs in CI checkouts once both ends of the range are fetched (fetch-depth: 0 with actions/checkout)

Quick Start

Single command to check duplicates in a PR:

# Scan only files changed between main and current branch
polydup scan . --git-diff origin/main..HEAD

CI/CD Integration (GitHub Actions):

# .github/workflows/polydup.yml
name: Check for Duplicates

on:
  pull_request:
    branches: [main]

jobs:
  duplicate-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for git diff

      - name: Install PolyDup
        run: cargo install polydup

      - name: Check for duplicates in PR
        run: |
          polydup scan . --git-diff origin/main..HEAD
          # Exits with code 1 if duplicates found in the diff
          # Exits with code 0 if no duplicates (CI passes)

Common Usage Patterns

1. Check uncommitted changes:

polydup scan . --git-diff HEAD

2. Compare branches:

polydup scan . --git-diff main..feature-branch

3. Check last N commits:

polydup scan . --git-diff HEAD~3..HEAD

4. JSON output for tooling:

polydup scan . --git-diff origin/main..HEAD --format json

How It Works

  1. Runs git diff --name-only --diff-filter=ACMR <range>
  2. Gets list of Added, Copied, Modified, Renamed files
  3. Filters out deleted files (can't scan what doesn't exist)
  4. Scans only those files for duplicates
  5. Exits with code 1 if duplicates found, 0 otherwise
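
The file-selection part of these steps can be sketched in Python as follows. The git invocation is the one listed in step 1 and the extension set comes from the auto-detected file types listed earlier; the function names are hypothetical, and this is an illustration rather than PolyDup's internal code.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence

# Extensions PolyDup auto-detects, per the list in this README.
SUPPORTED = {".rs", ".js", ".ts", ".jsx", ".tsx", ".py", ".vue", ".svelte"}


def git_diff_command(diff_range: str) -> List[str]:
    """Build the git command that lists Added, Copied, Modified and Renamed files."""
    return ["git", "diff", "--name-only", "--diff-filter=ACMR", diff_range]


def select_scannable(paths: Sequence[str]) -> List[str]:
    """Keep only paths whose extension belongs to a supported language."""
    return [p for p in paths if Path(p).suffix in SUPPORTED]


def changed_files(diff_range: str, repo: str = ".") -> List[str]:
    """Run git in `repo` and return the changed files worth scanning."""
    result = subprocess.run(
        git_diff_command(diff_range),
        cwd=repo,
        check=True,           # raise if git rejects the range
        capture_output=True,
        text=True,
    )
    return select_scannable(result.stdout.splitlines())
```

Because --diff-filter=ACMR omits the D (deleted) status, deleted files never appear in the list, which is why no separate existence check is shown.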

Real-World CI Example

# Before: Scanning entire codebase (50K LOC, 500 files)
polydup scan ./src  # 🐢 Takes 15-20 seconds

# After: Git-diff mode (PR with 5 changed files)
polydup scan . --git-diff origin/main..HEAD  # ⚡ Takes 0.5-1 second

10-100x speedup on large codebases with focused PRs!

Edge Cases Handled

  • Deleted files - Automatically filtered out (can't scan deleted code)
  • Renamed files - Detected via --diff-filter=R, scanned correctly
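  • Shallow clones - CI checkouts are often shallow (actions/checkout fetches a single commit by default); set fetch-depth: 0 so both ends of the range are available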
  • Invalid ranges - Clear error message with suggestions

When to Use Git-Diff vs Baseline

Use Git-Diff Mode (recommended):

  • ✅ Pull Request checks in CI/CD
  • ✅ Fast feedback on code changes
  • ✅ Git-based workflows

Use Baseline Mode when:

  • ✅ Non-git workflows (Perforce, SVN, etc.)
  • ✅ Tracking historical debt reduction
  • ✅ Explicit acceptance of legacy duplicates

Advanced Features

Filtering by Clone Type

Focus on specific types of duplicates for targeted refactoring:

# Show only exact duplicates (highest priority)
polydup scan ./src --only-type type-1

# Show only renamed duplicates
polydup scan ./src --only-type type-2

# Show both Type-1 and Type-2
polydup scan ./src --only-type type-1,type-2

# Exclude noisy Type-3 matches
polydup scan ./src --exclude-type type-3

Use cases:

  • --only-type type-1: Quick wins for immediate refactoring
  • --only-type type-2: Identify abstraction opportunities
  • --exclude-type type-3: Reduce false positives in large codebases

Grouping Results

Organize duplicates for different workflows:

# Group by file (refactoring prioritization)
polydup scan ./src --group-by file

# Group by similarity (quality triage)
polydup scan ./src --group-by similarity

# Group by clone type (targeted cleanup)
polydup scan ./src --group-by type

# Group by size (impact analysis)
polydup scan ./src --group-by size

Grouping strategies:

  • file: See which files need refactoring most
  • similarity: Prioritize high-confidence matches
  • type: Handle Type-1 separately from Type-2
  • size: Focus on large duplicates for maximum impact
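
As an illustration of what grouping by file amounts to, the Python sketch below counts how many duplicate pairs touch each file. It assumes entries shaped like the library results (file1 and file2 keys); top_offenders is a hypothetical helper, not part of any PolyDup API.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_offenders(duplicates: Iterable[Dict[str, Any]], limit: int = 3) -> List[Tuple[str, int]]:
    """Rank files by the number of duplicate pairs they take part in."""
    counts: Counter = Counter()
    for dup in duplicates:
        # A set avoids double-counting a pair that lives entirely inside one file.
        for path in {dup["file1"], dup["file2"]}:
            counts[path] += 1
    # Sort by count (descending), then by path so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The same pattern, with a different key function, covers grouping by similarity bucket, clone type, or block size.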

Output Options

# Save report to file
polydup scan ./src --output duplicates.txt

# JSON for CI/CD pipelines
polydup scan ./src --format json --output report.json

# Disable colors for logs
polydup scan ./src --no-color

# Or use NO_COLOR environment variable
NO_COLOR=1 polydup scan ./src

# Verbose mode with performance stats
polydup scan ./src --verbose

Debug Mode

Enhanced error messages with actionable suggestions:

# Enable debug mode for troubleshooting
polydup scan ./src --debug

# Debug mode shows:
# - Current working directory
# - File access permissions
# - Parser errors with context
# - Configuration validation details

Example error output:

Error: Path does not exist: /nonexistent/path

Suggestion: Check the path spelling and ensure it exists
  Example: polydup scan ./src
           polydup scan /absolute/path/to/project

Debug Info: Current directory: /Users/you/project

Combining Features

Mix and match for powerful workflows:

# High-priority refactoring targets
polydup scan ./src \
  --only-type type-1 \
  --group-by file \
  --min-block-size 50 \
  --output refactor-priorities.txt

# CI/CD duplicate gate
polydup scan ./src \
  --threshold 0.95 \
  --exclude-type type-3 \
  --format json \
  --output duplicates.json

# Deep analysis with verbose stats
polydup scan ./src \
  --enable-type3 \
  --group-by similarity \
  --verbose

# Quick triage without noise
polydup scan ./src \
  --only-type type-1,type-2 \
  --group-by type \
  --no-color

Dashboard Output

PolyDup provides a professional dashboard with actionable insights:

╔═══════════════════════════════════════════════════════════╗
║                      Scan Results                         ║
╠═══════════════════════════════════════════════════════════╣
║ Files scanned:       142                                  ║
║ Functions analyzed:  287                                  ║
║ Duplicates found:    15                                   ║
║ Estimated savings:   ~450 lines                           ║
╠═══════════════════════════════════════════════════════════╣
║ Clone Type Breakdown:                                     ║
║   Type-1 (exact):    5 groups  │ Critical priority       ║
║   Type-2 (renamed):  8 groups  │ High priority           ║
║   Type-3 (modified): 2 groups  │ Medium priority         ║
╠═══════════════════════════════════════════════════════════╣
║ Top Offenders:                                            ║
║   1. src/handlers.ts      8 duplicates                    ║
║   2. lib/utils.ts         5 duplicates                    ║
║   3. components/Form.tsx  3 duplicates                    ║
╚═══════════════════════════════════════════════════════════╝

Duplicate #1 (Type-2: Renamed identifiers)
  Location: src/auth.ts:45-68 ↔ src/admin.ts:120-143
  Similarity: 94.2% | Length: 24 lines
  ...

Dashboard features:

  • Lines saved estimation: Potential code reduction
  • Top offenders: Files needing most attention
  • Similarity range: Quality distribution (min-max)
  • Priority labels: Critical (Type-1), High (Type-2), Medium (Type-3)

Exit Codes

PolyDup uses semantic exit codes for CI/CD integration:

Exit Code  Meaning              Use Case
0          No duplicates found  Clean codebase ✓
1          Duplicates detected  Quality gate (expected)
2          Error occurred       Configuration/runtime issue

CI/CD examples:

# Fail build if duplicates found
polydup scan ./src || exit 1

# Warning only (report but don't fail)
polydup scan ./src || true

# Strict quality gate (fail on any duplicates)
if polydup scan ./src --threshold 0.95; then
  echo "No duplicates found"
else
  echo "⚠️ Duplicates detected - please refactor"
  exit 1
fi
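
Note that shell shortcuts such as `|| true` and `|| exit 1` treat exit code 2 (an error) the same as exit code 1 (duplicates found). A wrapper that keeps the three cases apart might look like the following Python sketch; ScanOutcome, classify_exit_code, and should_fail_build are hypothetical names, and treating any unlisted code as an error is an assumption.

```python
from enum import Enum


class ScanOutcome(Enum):
    CLEAN = "clean"
    DUPLICATES = "duplicates"
    ERROR = "error"


def classify_exit_code(code: int) -> ScanOutcome:
    """Map a polydup exit code to its documented meaning."""
    if code == 0:
        return ScanOutcome.CLEAN
    if code == 1:
        return ScanOutcome.DUPLICATES
    return ScanOutcome.ERROR  # 2, or anything unexpected, is treated as an error


def should_fail_build(code: int, warn_only: bool = False) -> bool:
    """Errors always fail; duplicates fail unless the job runs in warn-only mode."""
    outcome = classify_exit_code(code)
    if outcome is ScanOutcome.ERROR:
        return True
    return outcome is ScanOutcome.DUPLICATES and not warn_only
```

With this split, a warn-only job still fails when the scan itself breaks (for example on a bad path), instead of silently reporting success.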

Supported Languages

  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx
  • Python: .py
  • Rust: .rs
  • Vue: .vue
  • Svelte: .svelte

More languages coming soon (Java, Go, C/C++, Ruby, PHP)

Clone Types

PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:

Type-1: Exact Clones

Identical code fragments except for whitespace, comments, and formatting.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-1 clone - only formatting differs)
function calculateTotal(items) {
  let sum = 0;
  for (let i = 0; i < items.length; i++) { sum += items[i].price; }
  return sum;
}

Why they exist: Direct copy-paste without any modifications.

Type-2: Renamed/Parameterized Clones

Structurally identical code with renamed identifiers, changed literals, or different types.

Example:

// File 1
function calculateTotal(items) {
    let sum = 0;
    for (let i = 0; i < items.length; i++) {
        sum += items[i].price;
    }
    return sum;
}

// File 2 (Type-2 clone - renamed variables, same logic)
function computeSum(products) {
    let total = 0;
    for (let j = 0; j < products.length; j++) {
        total += products[j].cost;
    }
    return total;
}

Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.

Detection: PolyDup normalizes identifiers and literals to placeholder tokens (e.g., sum becomes @@ID and 0 becomes @@NUM), so structural similarity can be detected regardless of naming.
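
The effect of this normalization can be shown with a small Python sketch. It uses a regular-expression tokenizer as a stand-in for Tree-sitter, so it is only an approximation of the idea: the KEYWORDS set, the TOKEN_RE pattern, and the normalize function are assumptions for illustration, while the @@ID and @@NUM placeholders come from the description above.

```python
import re
from typing import List

# A few JavaScript keywords that must survive normalization (illustrative subset).
KEYWORDS = {"function", "let", "const", "var", "for", "while", "if", "else", "return"}

# Identifiers, numeric literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+(?:\.\d+)?|\S")


def normalize(source: str) -> List[str]:
    """Replace identifiers with @@ID and numbers with @@NUM; keep everything else."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isdigit():
            tokens.append("@@NUM")
        elif tok[0].isalpha() or tok[0] in "_$":
            tokens.append("@@ID")
        else:
            tokens.append(tok)  # punctuation and operators are structural
    return tokens
```

Under this scheme the calculateTotal and computeSum functions from the example normalize to the same token sequence, which is what lets a hash-based comparison treat them as a Type-2 clone.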

Type-3: Near-Miss Clones (Opt-in via --enable-type3)

Similar code with minor modifications like inserted/deleted statements or changed expressions.

Example:

// File 1
function processOrder(order) {
    validateOrder(order);
    let total = calculateTotal(order.items);
    applyDiscount(total, order.coupon);
    return total;
}

// File 2 (Type-3 clone - added logging, changed discount logic)
function processOrder(order) {
    validateOrder(order);
    console.log("Processing order:", order.id);  // ADDED
    let total = calculateTotal(order.items);
    let discount = order.coupon ? 0.1 : 0;      // MODIFIED
    total *= (1 - discount);                     // MODIFIED
    return total;
}

Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
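
One way to picture gap-tolerant matching is a similarity ratio over statement sequences. The Python sketch below uses the standard library's difflib.SequenceMatcher as a stand-in technique; it is not PolyDup's algorithm, and statement_similarity is a hypothetical helper that treats each non-empty line as one statement.

```python
from difflib import SequenceMatcher
from typing import List


def statements(source: str) -> List[str]:
    """Split source into trimmed, non-empty lines (a crude notion of statements)."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def statement_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]: twice the matched statements over the total statement count."""
    matcher = SequenceMatcher(None, statements(a), statements(b), autojunk=False)
    return matcher.ratio()
```

Inserted or modified statements lower the ratio without destroying the match, so a fragment pair like the one above can still clear a lenient similarity threshold while failing a strict one.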

Type-4: Semantic Clones (Not Yet Implemented)

Functionally equivalent code with different implementations.

Example:

// File 1 - Imperative loop
function sum(arr) {
    let total = 0;
    for (let i = 0; i < arr.length; i++) {
        total += arr[i];
    }
    return total;
}

// File 2 - Functional approach
function sum(arr) {
    return arr.reduce((acc, val) => acc + val, 0);
}

// File 3 - Recursive
function sum(arr, i = 0) {
    if (i >= arr.length) return 0;
    return arr[i] + sum(arr, i + 1);
}

Why they exist: Different programming paradigms or styles achieving the same result.

Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.

Understanding Your Results

When PolyDup reports duplicates, the clone type indicates:

  • Type-1: Exact copy-paste → Quick win for extraction into shared utilities
  • Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
  • Type-3: Modified duplicates → May require refactoring with strategy patterns
  • Type-4: Semantic equivalence → Consider standardizing on one implementation

Typical Real-World Distribution:

  • Type-1: 5-10% (rare in mature codebases)
  • Type-2: 60-70% (most common - copy-paste-modify)
  • Type-3: 20-30% (evolved duplicates)
  • Type-4: <5% (requires specialized detection)

Performance Note: PolyDup efficiently handles codebases up to 100K LOC. Tested on real-world projects with detection times under 5 seconds for most repos.

Troubleshooting

Common Issues

"No duplicates found" but you expect some

Possible causes:

  • Threshold too high: Try lowering --threshold to 0.70-0.80
  • Block size too large: Reduce --min-block-size to 5-10 lines
  • Type-3 not enabled: Add --enable-type3 for gap-tolerant matching

# More sensitive scan
polydup scan ./src --threshold 0.70 --min-block-size 5 --enable-type3

"Too many false positives"

Solutions:

  • Increase threshold: Use --threshold 0.95 for high-confidence matches
  • Exclude Type-3: Add --exclude-type type-3 to remove noisy matches
  • Increase block size: Use --min-block-size 50 for substantial duplicates only

# Strict, high-quality scan
polydup scan ./src --threshold 0.95 --exclude-type type-3 --min-block-size 50

"Permission denied" errors

Fix:

# Check file permissions
ls -la /path/to/scan

# Run with proper permissions
chmod +r /path/to/files

# Use --debug to see detailed error info
polydup scan ./src --debug

"Unsupported file type" warnings

Explanation: PolyDup currently supports JavaScript, TypeScript, Python, Rust, Vue, and Svelte. Other file types are skipped automatically.

Colors not working in CI/CD

Solution:

# Disable colors explicitly
polydup scan ./src --no-color

# Or use environment variable
NO_COLOR=1 polydup scan ./src

"Out of memory" on large codebases

Solutions:

# Increase minimum block size to reduce memory usage
polydup scan ./src --min-block-size 100

# Scan directories separately
polydup scan ./src
polydup scan ./tests
polydup scan ./lib

# Exclude generated/vendor code
# Create .polyduprc.toml with exclude patterns

Performance Tips

For large codebases (>50K LOC):

  • Use --min-block-size 50-100 to focus on substantial duplicates
  • Disable Type-3 detection (it's more computationally expensive)
  • Use --exclude-type type-3 to skip gap-tolerant matching
  • Increase --threshold to 0.95 to reduce candidate matches

For monorepos:

  • Create .polyduprc.toml at root with shared configuration
  • Use --group-by file to organize results by module
  • Exclude node_modules, dist, target, etc. in config

For CI/CD:

  • Cache the polydup binary to speed up pipeline
  • Use --format json for machine-readable output
  • Set appropriate exit code handling (0=clean, 1=duplicates, 2=error)

Getting Help

Debug Mode:

# Enable detailed error traces
polydup scan ./src --debug

Verbose Output:

# Show performance statistics
polydup scan ./src --verbose

Report an Issue:

  1. Check existing issues
  2. Include:
    • PolyDup version (polydup --version)
    • Operating system and architecture
    • Command that failed
    • Error message with --debug flag
    • Sample code if applicable (anonymized)

Development

Building from Source

Prerequisites:

  • Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
  • Node.js 16+ (for Node.js bindings)
  • Python 3.8-3.12 (for Python bindings)

CLI:

git clone https://github.com/wiesnerbernard/polydup.git
cd polydup
cargo build --release -p polydup
./target/release/polydup scan ./src

Node.js bindings:

cd crates/polydup-node
npm install
npm run build
npm test

Python bindings:

cd crates/polydup-py
pip install maturin
maturin develop
python -c "import polydup; print(polydup.version())"

Run tests:

# All tests
cargo test --workspace

# Specific crate
cargo test -p polydup-core

# With coverage
cargo install cargo-tarpaulin
cargo tarpaulin --workspace

Creating a Release

Recommended: Create releases directly from GitHub UI - fully automated, no local tools required!

  1. Go to Releases → New Release
  2. Create a new tag (e.g., v0.2.7)
  3. Click "Publish release"
  4. Everything happens automatically (~5-7 minutes):
    • Syncs version files (Cargo.toml, package.json, pyproject.toml)
    • Updates CHANGELOG.md with release entry
    • Moves tag to version-synced commit (if needed)
    • Builds binaries for all 5 platforms (macOS/Linux/Windows)
    • Publishes to crates.io, npm, and PyPI
    • Creates release with binary assets
    • Zero manual steps required - truly one-click releases!

Alternative: Use the release script locally:

./scripts/release.sh 0.2.5

See docs/RELEASE.md for detailed instructions.

Pre-commit Hooks

Install pre-commit hooks to automatically run linting and tests:

# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install
pre-commit install -t pre-push

# Run manually on all files
pre-commit run --all-files

The hooks will automatically run:

  • On commit: cargo fmt, cargo clippy, file checks (trailing whitespace, YAML/TOML validation)
  • On push: Full test suite with cargo test

To skip hooks temporarily:

git commit --no-verify

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Install pre-commit hooks (pre-commit install)
  4. Make your changes and ensure tests pass (cargo test --workspace)
  5. Run clippy (cargo clippy --workspace --all-targets -- -D warnings)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

License

MIT OR Apache-2.0