PolyDup
Cross-language duplicate code detector powered by Tree-sitter and Rust.
Features
- Blazing Fast: Parallel processing with Rabin-Karp rolling hash algorithm (up to 10x faster than regex-based detectors)
- Cross-Language: JavaScript, TypeScript, Python, Rust, Vue, Svelte (more coming)
- Accurate: Tree-sitter AST parsing for semantic-aware detection (eliminates false positives from comments/strings)
- Multi-Platform: CLI, Node.js npm package, Python pip package, Rust library
- Configurable: Adjust thresholds and block sizes for your needs
- Efficient: Zero-copy FFI bindings for minimal overhead (passes file paths, not contents)
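The Rabin-Karp rolling hash mentioned above is what makes window-by-window comparison cheap: each window's hash is derived from the previous one in O(1) instead of rehashing from scratch. A minimal Python sketch of the idea (constants and names here are illustrative, not polydup-core's actual implementation):

```python
# Illustrative sketch of a Rabin-Karp rolling hash over a token stream.
# polydup-core applies this to normalized Tree-sitter tokens; the constants
# and helper names below are assumptions for demonstration only.

BASE = 257
MOD = (1 << 61) - 1  # large Mersenne prime keeps accidental collisions rare

def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for each window-sized span in O(n) total."""
    if len(tokens) < window:
        return
    h = 0
    for tok in tokens[:window]:
        h = (h * BASE + hash(tok) % MOD) % MOD
    yield 0, h
    power = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    for i in range(window, len(tokens)):
        h = (h - (hash(tokens[i - window]) % MOD) * power) % MOD  # drop oldest
        h = (h * BASE + hash(tokens[i]) % MOD) % MOD              # add newest
        yield i - window + 1, h

def find_duplicate_spans(tokens, window):
    """Group window positions that share a hash: candidate duplicate blocks."""
    by_hash = {}
    for start, h in rolling_hashes(tokens, window):
        by_hash.setdefault(h, []).append(start)
    return [starts for starts in by_hash.values() if len(starts) > 1]
```

Spans that share a hash are only candidates; a real detector verifies the underlying tokens afterward, so hash collisions cannot produce false positives.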
Architecture
Shared Core Architecture: All duplicate detection logic lives in Rust, exposed via FFI bindings.
```
┌─────────────────────────────────────────────┐
│              polydup-core (Rust)            │
│  • Tree-sitter parsing                      │
│  • Rabin-Karp hashing                       │
│  • Parallel file scanning                   │
│  • Duplicate detection                      │
└─────────────────────────────────────────────┘
      ▲           ▲            ▲
      │           │            │
┌─────┴───┐  ┌────┴────┐  ┌────┴───┐
│   CLI   │  │ Node.js │  │ Python │
│ (Rust)  │  │(napi-rs)│  │ (PyO3) │
└─────────┘  └─────────┘  └────────┘
```
Crates:
- polydup-core: Pure Rust library with Tree-sitter parsing, hashing, and reporting
- polydup (CLI): Standalone CLI tool (`cargo install polydup`)
- polydup-node: Node.js library bindings via napi-rs (`npm install @polydup/core`)
- polydup-py: Python library bindings via PyO3 (`pip install polydup`)
Installation
Important: PolyDup is available in multiple forms for different use cases:
- CLI Tool: `cargo install polydup` (command-line scanning)
- Python Library: `pip install polydup` (Python API bindings, NOT a CLI)
- Node.js Library: `npm install @polydup/core` (Node.js API bindings, NOT a CLI)

If you want to run `polydup` from the command line, use `cargo install polydup`.
GitHub Action (Easiest for CI/CD) 🚀
The fastest way to add duplicate detection to your workflow:
```yaml
name: Code Quality
on:
  pull_request:
    branches:

permissions:
  contents: read
  pull-requests: write # Required for PR comments

jobs:
  duplicate-detection:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Required for git-diff mode
      - uses: wiesnerbernard/polydup-action@v0.2.1
        with:
          threshold: 50
          similarity: '0.85'
          fail-on-duplicates: true
          github-token: ${{ secrets.GITHUB_TOKEN }}
```
Benefits:
- 🚀 10-100x faster (only scans changed files in PR)
- 💬 Automatic PR comments with duplicate reports
- ✅ Zero configuration needed
- 🔒 Secure (no data leaves your repository)
Action Inputs
| Input | Default | Description |
|---|---|---|
| `threshold` | `50` | Minimum code block size in tokens |
| `similarity` | `0.85` | Similarity threshold (0.0-1.0) |
| `fail-on-duplicates` | `true` | Fail the check if duplicates are found |
| `format` | `text` | Output format: `text` or `json` |
| `base-ref` | auto | Base git reference (auto-detected from the PR) |
| `github-token` | - | Token for PR comments |
| `comment-on-pr` | `true` | Post results as a PR comment |
Action Outputs
| Output | Description |
|---|---|
| `duplicates-found` | Number of duplicate code blocks found |
| `files-scanned` | Number of files scanned |
| `exit-code` | Exit code (0 = no duplicates, 1 = duplicates) |
Using Outputs in Workflows
```yaml
- uses: wiesnerbernard/polydup-action@v0.2.1
  id: polydup
  with:
    fail-on-duplicates: false
    github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Check results
  run: |
    echo "Files scanned: ${{ steps.polydup.outputs.files-scanned }}"
    echo "Duplicates found: ${{ steps.polydup.outputs.duplicates-found }}"
    if [ "${{ steps.polydup.outputs.duplicates-found }}" -gt 10 ]; then
      echo "Too many duplicates!"
      exit 1
    fi
```
Example PR Comment
When duplicates are found, the action posts a comment like:
```markdown
## PolyDup Duplicate Code Report

**Found 3 duplicate code block(s)**

- Files scanned: 12
- Threshold: 50 tokens
- Similarity: 0.85

<details>
<summary>View Details</summary>

[Detailed scan output...]

</details>

**Tip**: Consider refactoring duplicated code to improve maintainability.
```
See polydup-action for full documentation.
Rust CLI (Recommended for Local Development)
The fastest way to use PolyDup locally is via the CLI tool:
```sh
# Install from crates.io
cargo install polydup

# Verify installation
polydup --version

# Scan for duplicates
polydup scan ./src
```
System Requirements:
- Rust 1.70+ (if building from source)
- macOS, Linux, or Windows
Note: Homebrew tap coming soon! (`brew install polydup`)
Pre-built Binaries:
Download pre-compiled binaries from GitHub Releases:
```sh
# macOS (Apple Silicon)
# macOS (Intel)
# Linux (x86_64)
# Linux (x86_64 static - musl)
# Windows (x86_64)
# Download polydup-windows-x86_64.exe from the releases page and add it to PATH
```
Node.js/npm (Library Only)
Note: This is a library package for integrating duplicate detection into Node.js applications. It does NOT provide a CLI. For command-line usage, use `cargo install polydup`.
Install as a project dependency:

```sh
npm install @polydup/core
```

Requirements: Node.js 16+ on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage (the exported function name below is illustrative; check the package's type definitions for the exact API):

```js
// Function and option names are assumptions, not the verified API.
const { scanDuplicates } = require('@polydup/core');

const duplicates = scanDuplicates(['./src'], { threshold: 0.85 });
console.log(duplicates);
```
Python/pip (Library Only)
Note: This is a library package for integrating duplicate detection into Python applications. It does NOT provide a CLI. For command-line usage, use `cargo install polydup`. Running `python -m polydup` will display installation guidance.
Install from PyPI:

```sh
# Using pip
pip install polydup

# Using uv (recommended for faster installs)
uv pip install polydup
```

Requirements: Python 3.8-3.12 on macOS (Intel/ARM), Windows (x64), or Linux (x64)

Usage (the function name below is illustrative; check the package documentation for the exact API):

```python
import polydup

# Scan for duplicates (function name is an assumption, not the verified API)
duplicates = polydup.scan_duplicates(["./src"], threshold=0.85)
```
Rust Library
Use the core library in your Rust project:
```toml
[dependencies]
polydup-core = "0.1"
```

```rust
// Item names below are illustrative; see the polydup-core docs for the exact API.
use polydup_core::detect_duplicates;
use std::path::PathBuf;
```
Building from Source
CLI
Node.js
Python
CLI Usage
Quick Start with polydup init
The fastest way to get started is with the interactive initialization wizard:
```sh
# Run the initialization wizard
polydup init

# Non-interactive mode (use defaults)
polydup init --yes

# Force overwrite existing configuration
polydup init --force
```
The wizard will:
- Auto-detect your project environment (Node.js, Rust, Python, etc.)
- Generate `.polyduprc.toml` with environment-specific defaults
- Create a GitHub Actions workflow (optional)
- Show install instructions tailored to your environment
- Provide next steps for local usage
Configuration File (.polyduprc.toml)
After running `polydup init`, you'll have a `.polyduprc.toml` file. The section and key names below are reconstructed for illustration; the file generated by `polydup init` is the canonical reference:

```toml
[detection]
min_block_size = 50
similarity = 0.85

[ignore]
patterns = [
  "**/node_modules/**",
  "**/__pycache__/**",
  "**/*.test.js",
  "**/*.test.py",
]

[output]
format = "text"
verbose = false

[features]
enable_type3 = false
color = true
```
Configuration Discovery:
- PolyDup searches for `.polyduprc.toml` in the current directory and parent directories
- CLI arguments override config file settings
- Perfect for monorepos with shared configuration at root
Basic Commands
```sh
# Scan a directory
polydup scan ./src

# Scan multiple directories
polydup scan ./src ./lib

# Custom threshold (0.0-1.0, higher = stricter)
polydup scan ./src --threshold 0.95

# Adjust minimum block size (lines)
polydup scan ./src --min-block-size 20

# JSON output for scripting
polydup scan ./src --format json
```
Examples
Quick scan for severe duplicates:

```sh
polydup scan ./src --threshold 0.95
```

Deep scan for similar code:

```sh
polydup scan ./src --threshold 0.75 --enable-type3
```

Scan specific file types:

```sh
# PolyDup auto-detects: .rs, .js, .ts, .jsx, .tsx, .py, .vue, .svelte
polydup scan ./src
```

CI/CD integration:

```sh
# Exit with error if duplicates found
polydup scan ./src || exit 1
```
Output Formats
Text (default): Human-readable colored output with file paths, line numbers, and similarity scores
JSON: Machine-readable format for scripting and tooling integration
Commands
PolyDup supports the following subcommands:
| Command | Description | Example |
|---|---|---|
| `scan` | Scan for duplicate code (default command) | `polydup scan ./src` |
| `init` | Interactive setup wizard | `polydup init` |
Scan Command Options:
The scan command accepts all options listed below. When no subcommand is specified, scan is assumed for backward compatibility.
```sh
# These are equivalent:
polydup scan ./src
polydup ./src
```
Init Command Options:
| Option | Description |
|---|---|
| `--yes`, `-y` | Skip interactive prompts, use defaults |
| `--force` | Overwrite existing `.polyduprc.toml` |
CLI Options
| Option | Type | Default | Description |
|---|---|---|---|
| `--threshold` | float | 0.9 | Similarity threshold (0.0-1.0) |
| `--min-block-size` | int | 10 | Minimum lines per code block |
| `--format` | `text`\|`json` | `text` | Output format |
| `--output` | path | - | Write report to file |
| `--only-type` | types | - | Filter by clone type (type-1, type-2, type-3) |
| `--exclude-type` | types | - | Exclude clone types |
| `--group-by` | criterion | - | Group results (file, similarity, type, size) |
| `--verbose` | flag | false | Show performance statistics |
| `--no-color` | flag | false | Disable colored output |
| `--debug` | flag | false | Enable debug mode with detailed traces |
| `--enable-type3` | flag | false | Enable Type-3 gap-tolerant detection |
| `--save-baseline` | path | - | Save scan results as a baseline for future comparisons |
| `--compare-to` | path | - | Compare against a baseline (show only new duplicates) |
| `--git-diff` | range | - | Only scan files changed in the git diff range (e.g., origin/main..HEAD) ⚡ Recommended for CI |
Performance Tip: For large codebases (>50K LOC), increase --min-block-size to 20-50 for faster scans with less noise.
Baseline/Snapshot Mode
The most powerful feature for CI/CD: Block new duplicates without failing on legacy code.
Use Case: "We have existing duplicates, but block any NEW ones"
Many codebases have legacy duplication that's not worth fixing immediately. Baseline mode lets you:
- ✅ Accept existing duplicates as-is
- ✅ Fail CI/CD only when new duplicates are introduced
- ✅ Gradually reduce technical debt without blocking development
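Conceptually, baseline comparison is a set difference over duplicate fingerprints: anything present in the current scan but not in the baseline counts as new. A Python sketch of that logic (the record shape here is illustrative, not the actual baseline JSON schema):

```python
def fingerprint(dup):
    """Stable identity for a duplicate group (field names are illustrative)."""
    return tuple(sorted((loc["file"], loc["line"]) for loc in dup["locations"]))

def new_duplicates(current, baseline):
    """Return duplicates present in the current scan but not in the baseline."""
    known = {fingerprint(d) for d in baseline}
    return [d for d in current if fingerprint(d) not in known]

# Legacy duplicate captured on main; accepted as-is.
baseline = [{"locations": [{"file": "src/utils.ts", "line": 45},
                           {"file": "src/legacy.ts", "line": 10}]}]
# Feature branch re-reports the legacy one plus a new copy-paste.
current = baseline + [{"locations": [{"file": "src/new-feature.ts", "line": 12},
                                     {"file": "src/utils.ts", "line": 45}]}]

fresh = new_duplicates(current, baseline)
exit_code = 1 if fresh else 0  # CI fails only on duplicates introduced since baseline
```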
Quick Start
Step 1: Create baseline from your main branch
```sh
# On main/master branch: capture current state
polydup scan ./src --save-baseline .polydup-baseline.json
```
Step 2: Use in CI/CD to block new duplicates
```yaml
# .github/workflows/polydup.yml
- name: Check for new duplicates
  run: |
    polydup scan ./src --compare-to .polydup-baseline.json
    # Exits with code 1 if NEW duplicates found
    # Exits with code 0 if no new duplicates (CI passes)
```
Step 3: See it in action on a PR
```sh
# Developer adds duplicate code in feature branch
polydup scan ./src --compare-to .polydup-baseline.json
```
Output:

```
ℹ Comparing against baseline: .polydup-baseline.json
11 total duplicates, 3 new since baseline

Duplicates
═══════════════════════════════════════════
1. Type-2 (renamed) | Similarity: 100.0% | Length: 59 tokens
   ├─ src/new-feature.ts:12
   └─ src/utils.ts:45

❌ 3 new duplicates found since baseline
```

Exit code: 1 (CI fails, PR blocked)
Advanced Baseline Workflows
Incremental improvement: update the baseline after cleanup

```sh
# Team cleans up 10 duplicates, then refreshes the baseline
polydup scan ./src --save-baseline .polydup-baseline.json
```

Combining with filters

```sh
# Save baseline excluding Type-3 (noisy matches)
polydup scan ./src --exclude-type type-3 --save-baseline .polydup-baseline.json

# Only block new Type-1 and Type-2 duplicates
polydup scan ./src --exclude-type type-3 --compare-to .polydup-baseline.json
```

Manual review mode

```sh
# See what duplicates are NEW (no CI failure, just info)
polydup scan ./src --compare-to .polydup-baseline.json || true
```
Real-world Example: PR Comments
Use with GitHub Actions to comment on PRs:
```yaml
- name: Check duplicates
  id: polydup
  run: |
    OUTPUT=$(polydup scan ./src --compare-to .polydup-baseline.json --format json || true)
    NEW_COUNT=$(echo "$OUTPUT" | jq '.duplicates | length')
    echo "new_duplicates=$NEW_COUNT" >> $GITHUB_OUTPUT
- name: Comment on PR
  if: steps.polydup.outputs.new_duplicates > 0
  uses: actions/github-script@v7
  with:
    script: |
      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: '⚠️ This PR introduces ${{ steps.polydup.outputs.new_duplicates }} new code duplicates. Please refactor before merging.'
      })
```
Git-Diff Mode 🚀 RECOMMENDED FOR CI/CD
The fastest, simplest way to check for duplicates in Pull Requests.
Why Git-Diff Mode?
Advantages over Baseline Mode:
- ✅ 10-100x faster - Only scans files changed in the diff
- ✅ No file management - No baseline file to commit/sync
- ✅ Universal - Works on all CI platforms (GitHub, GitLab, Jenkins, etc.)
- ✅ Simpler - Just specify a git range, no baseline setup needed
- ✅ Accurate - Works in shallow clones (common in CI environments)
Quick Start
Single command to check duplicates in a PR:
```sh
# Scan only files changed between main and current branch
polydup scan . --git-diff origin/main..HEAD
```
CI/CD Integration (GitHub Actions):
```yaml
# .github/workflows/polydup.yml
name: Check for Duplicates
on:
  pull_request:
    branches:

jobs:
  duplicate-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Need full history for git diff
      - name: Install PolyDup
        run: cargo install polydup
      - name: Check for duplicates in PR
        run: |
          polydup scan . --git-diff origin/main..HEAD
          # Exits with code 1 if duplicates found in the diff
          # Exits with code 0 if no duplicates (CI passes)
```
Common Usage Patterns
1. Check uncommitted changes:

```sh
polydup scan . --git-diff HEAD
```

2. Compare branches:

```sh
polydup scan . --git-diff main..feature-branch
```

3. Check last N commits:

```sh
polydup scan . --git-diff HEAD~5..HEAD
```

4. JSON output for tooling:

```sh
polydup scan . --git-diff origin/main..HEAD --format json
```
How It Works
- Runs `git diff --name-only --diff-filter=ACMR <range>`
- Gets the list of Added, Copied, Modified, and Renamed files
- Filters out deleted files (can't scan what doesn't exist)
- Scans only those files for duplicates
- Exits with code 1 if duplicates are found, 0 otherwise
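The file-selection steps above can be approximated in a few lines of Python; the function names here are mine, not PolyDup's internals:

```python
import subprocess

# Extensions PolyDup can parse (from the Supported Languages section).
SUPPORTED = (".js", ".jsx", ".ts", ".tsx", ".py", ".rs", ".vue", ".svelte")

def scannable(paths):
    """Keep only files with a supported extension."""
    return [p for p in paths if p.endswith(SUPPORTED)]

def changed_files(range_spec):
    """Replicate the --git-diff selection: Added/Copied/Modified/Renamed files
    in the range. Deleted files never appear because the filter omits D."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACMR", range_spec],
        capture_output=True, text=True, check=True,
    ).stdout
    return scannable(out.splitlines())
```

Only the handful of paths this returns are parsed and hashed, which is where the speedup on large repositories comes from.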
Real-World CI Example
```sh
# Before: scanning the entire codebase (50K LOC, 500 files)
polydup scan .

# After: git-diff mode (PR with 5 changed files)
polydup scan . --git-diff origin/main..HEAD
```
10-100x speedup on large codebases with focused PRs!
Edge Cases Handled
- ✅ Deleted files - Automatically filtered out (can't scan deleted code)
- ✅ Renamed files - Detected via `--diff-filter=R` and scanned correctly
- ✅ Shallow clones - Works in CI environments with `fetch-depth: 0`
- ✅ Invalid ranges - Clear error message with suggestions
When to Use Git-Diff vs Baseline
Use Git-Diff Mode (recommended):
- ✅ Pull Request checks in CI/CD
- ✅ Fast feedback on code changes
- ✅ Git-based workflows
Use Baseline Mode when:
- ✅ Non-git workflows (Perforce, SVN, etc.)
- ✅ Tracking historical debt reduction
- ✅ Explicit acceptance of legacy duplicates
Advanced Features
Filtering by Clone Type
Focus on specific types of duplicates for targeted refactoring:
```sh
# Show only exact duplicates (highest priority)
polydup scan ./src --only-type type-1

# Show only renamed duplicates
polydup scan ./src --only-type type-2

# Show both Type-1 and Type-2
polydup scan ./src --only-type type-1 --only-type type-2

# Exclude noisy Type-3 matches
polydup scan ./src --exclude-type type-3
```
Use cases:
- `--only-type type-1`: Quick wins for immediate refactoring
- `--only-type type-2`: Identify abstraction opportunities
- `--exclude-type type-3`: Reduce false positives in large codebases
Grouping Results
Organize duplicates for different workflows:
```sh
# Group by file (refactoring prioritization)
polydup scan ./src --group-by file

# Group by similarity (quality triage)
polydup scan ./src --group-by similarity

# Group by clone type (targeted cleanup)
polydup scan ./src --group-by type

# Group by size (impact analysis)
polydup scan ./src --group-by size
```
Grouping strategies:
- file: See which files need refactoring most
- similarity: Prioritize high-confidence matches
- type: Handle Type-1 separately from Type-2
- size: Focus on large duplicates for maximum impact
Output Options
```sh
# Save report to file
polydup scan ./src --output report.txt

# JSON for CI/CD pipelines
polydup scan ./src --format json --output report.json

# Disable colors for logs
polydup scan ./src --no-color

# Or use the NO_COLOR environment variable
NO_COLOR=1 polydup scan ./src

# Verbose mode with performance stats
polydup scan ./src --verbose
```
Debug Mode
Enhanced error messages with actionable suggestions:
```sh
# Enable debug mode for troubleshooting
polydup scan ./src --debug

# Debug mode shows:
# - Current working directory
# - File access permissions
# - Parser errors with context
# - Configuration validation details
```
Example error output:
```
Error: Path does not exist: /nonexistent/path

Suggestion: Check the path spelling and ensure it exists
Example: polydup scan ./src
         polydup scan /absolute/path/to/project

Debug Info: Current directory: /Users/you/project
```
Combining Features
Mix and match for powerful workflows:
```sh
# High-priority refactoring targets
polydup scan ./src --only-type type-1 --group-by size

# CI/CD duplicate gate
polydup scan ./src --format json --output report.json || exit 1

# Deep analysis with verbose stats
polydup scan ./src --enable-type3 --verbose

# Quick triage without noise
polydup scan ./src --exclude-type type-3 --group-by similarity
```
Dashboard Output
PolyDup provides a professional dashboard with actionable insights:
```
╔═══════════════════════════════════════════════════════════╗
║                      Scan Results                         ║
╠═══════════════════════════════════════════════════════════╣
║ Files scanned:       142                                  ║
║ Functions analyzed:  287                                  ║
║ Duplicates found:    15                                   ║
║ Estimated savings:   ~450 lines                           ║
╠═══════════════════════════════════════════════════════════╣
║ Clone Type Breakdown:                                     ║
║   Type-1 (exact):    5 groups │ Critical priority         ║
║   Type-2 (renamed):  8 groups │ High priority             ║
║   Type-3 (modified): 2 groups │ Medium priority           ║
╠═══════════════════════════════════════════════════════════╣
║ Top Offenders:                                            ║
║   1. src/handlers.ts        8 duplicates                  ║
║   2. lib/utils.ts           5 duplicates                  ║
║   3. components/Form.tsx    3 duplicates                  ║
╚═══════════════════════════════════════════════════════════╝

Duplicate #1 (Type-2: Renamed identifiers)
Location: src/auth.ts:45-68 ↔ src/admin.ts:120-143
Similarity: 94.2% | Length: 24 lines
...
```
Dashboard features:
- Lines saved estimation: Potential code reduction
- Top offenders: Files needing most attention
- Similarity range: Quality distribution (min-max)
- Priority labels: Critical (Type-1), High (Type-2), Medium (Type-3)
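The estimated-savings figure is simple arithmetic: every copy beyond the first in a duplicate group could be removed after extracting a shared helper. A sketch under an assumed group representation:

```python
def estimated_savings(groups):
    """Lines removable if each duplicate group kept a single copy.

    Each group is (copies, lines_per_copy); this tuple representation is an
    assumption for illustration, not PolyDup's internal data model.
    """
    return sum((copies - 1) * lines for copies, lines in groups)

# Three groups: 2 copies of 24 lines, 3 copies of 10 lines, 2 copies of 8 lines
savings = estimated_savings([(2, 24), (3, 10), (2, 8)])  # 24 + 20 + 8 = 52
```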
Exit Codes
PolyDup uses semantic exit codes for CI/CD integration:
| Exit Code | Meaning | Use Case |
|---|---|---|
| 0 | No duplicates found | Clean codebase ✓ |
| 1 | Duplicates detected | Quality gate (expected) |
| 2 | Error occurred | Configuration/runtime issue |
CI/CD examples:
```sh
# Fail build if duplicates found
polydup scan ./src || exit 1

# Warning only (report but don't fail)
polydup scan ./src || echo "Duplicates found (non-blocking)"

# Strict quality gate (fail on any duplicates)
if polydup scan ./src; then
  echo "No duplicates found"
else
  echo "Duplicates detected; failing build"
  exit 1
fi
```
Supported Languages
- JavaScript/TypeScript: `.js`, `.jsx`, `.ts`, `.tsx`
- Python: `.py`
- Rust: `.rs`
- Vue: `.vue`
- Svelte: `.svelte`
More languages coming soon (Java, Go, C/C++, Ruby, PHP)
Clone Types
PolyDup classifies duplicates into different types based on the International Workshop on Software Clones (IWSC) taxonomy:
Type-1: Exact Clones
Identical code fragments except for whitespace, comments, and formatting.
Example (illustrative):

```js
// File 1
function add(a, b) {
  return a + b;
}

// File 2 (Type-1 clone - only formatting differs)
function add(a,b) { return a + b; }
```
Why they exist: Direct copy-paste without any modifications.
Type-2: Renamed/Parameterized Clones
Structurally identical code with renamed identifiers, changed literals, or different types.
Example (illustrative):

```js
// File 1
function sumPrices(prices) {
  let sum = 0;
  for (const p of prices) sum += p;
  return sum;
}

// File 2 (Type-2 clone - renamed variables, same logic)
function sumWeights(weights) {
  let total = 0;
  for (const w of weights) total += w;
  return total;
}
```
Why they exist: Copy-paste-modify pattern where developers adapt code for different contexts.
Detection: PolyDup normalizes identifiers and literals (e.g., sum → @@ID, 0 → @@NUM) to detect structural similarity.
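That normalization step can be imitated with Python's own tokenizer. PolyDup walks the Tree-sitter AST instead, but the principle is the same: identifiers and literals collapse to placeholders, so renamed copies produce identical token streams and hash to the same value. (This sketch also collapses keywords, which a real implementation would distinguish.)

```python
import io
import tokenize

def normalize(source):
    """Collapse identifiers and literals so Type-2 clones compare equal."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:      # identifiers (and keywords, here)
            out.append("@@ID")
        elif tok.type == tokenize.NUMBER:  # numeric literals
            out.append("@@NUM")
        elif tok.type == tokenize.STRING:  # string literals
            out.append("@@STR")
        elif tok.type == tokenize.OP:      # operators/punctuation kept as-is
            out.append(tok.string)
    return out

a = "total = 0\nfor x in xs:\n    total += x\n"
b = "acc = 0\nfor item in items:\n    acc += item\n"
assert normalize(a) == normalize(b)  # Type-2 clones normalize identically
```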
Type-3: Near-Miss Clones (Not Yet Implemented)
Similar code with minor modifications like inserted/deleted statements or changed expressions.
Example (illustrative):

```js
// File 1
function applyDiscount(price) {
  return price * 0.9;
}

// File 2 (Type-3 clone - added logging, changed discount logic)
function applyDiscount(price) {
  console.log(`Discounting ${price}`);
  return price > 100 ? price * 0.85 : price * 0.9;
}
```
Why they exist: Code evolution, bug fixes, or feature additions that slightly modify duplicated logic.
Type-4: Semantic Clones (Not Yet Implemented)
Functionally equivalent code with different implementations.
Example (illustrative):

```js
// File 1 - Imperative loop
function sum(xs) {
  let total = 0;
  for (const x of xs) total += x;
  return total;
}

// File 2 - Functional approach
const sum = (xs) => xs.reduce((acc, x) => acc + x, 0);

// File 3 - Recursive
function sum(xs) {
  return xs.length === 0 ? 0 : xs[0] + sum(xs.slice(1));
}
```
Why they exist: Different programming paradigms or styles achieving the same result.
Detection Challenge: Requires semantic analysis, control-flow graphs, or ML-based approaches.
Understanding Your Results
When PolyDup reports duplicates, the clone type indicates:
- Type-1: Exact copy-paste → Quick win for extraction into shared utilities
- Type-2: Adapted copy-paste → Candidate for parameterized functions or generics
- Type-3: Modified duplicates → May require refactoring with strategy patterns
- Type-4: Semantic equivalence → Consider standardizing on one implementation
Typical Real-World Distribution:
- Type-1: 5-10% (rare in mature codebases)
- Type-2: 60-70% (most common - copy-paste-modify)
- Type-3: 20-30% (evolved duplicates)
- Type-4: <5% (requires specialized detection)
Performance Note: PolyDup efficiently handles codebases up to 100K LOC. Tested on real-world projects with detection times under 5 seconds for most repos.
Troubleshooting
Common Issues
"No duplicates found" but you expect some
Possible causes:
- Threshold too high: Try lowering `--threshold` to 0.70-0.80
- Block size too large: Reduce `--min-block-size` to 5-10 lines
- Type-3 not enabled: Add `--enable-type3` for gap-tolerant matching
```sh
# More sensitive scan
polydup scan ./src --threshold 0.75 --min-block-size 5 --enable-type3
```
"Too many false positives"
Solutions:
- Increase threshold: Use `--threshold 0.95` for high-confidence matches
- Exclude Type-3: Add `--exclude-type type-3` to remove noisy matches
- Increase block size: Use `--min-block-size 50` for substantial duplicates only
```sh
# Strict, high-quality scan
polydup scan ./src --threshold 0.95 --min-block-size 50 --exclude-type type-3
```
"Permission denied" errors
Fix:
```sh
# Check file permissions
ls -la ./src

# Use --debug to see detailed error info
polydup scan ./src --debug
```
"Unsupported file type" warnings
Explanation: PolyDup currently supports JavaScript, TypeScript, Python, Rust, Vue, and Svelte. Other file types are skipped automatically.
Workaround:
- Wait for language support (check GitHub issues)
- Contribute a parser (see CONTRIBUTING.md)
Colors not working in CI/CD
Solution:
```sh
# Disable colors explicitly
polydup scan ./src --no-color

# Or use the environment variable
NO_COLOR=1 polydup scan ./src
```
"Out of memory" on large codebases
Solutions:
```sh
# Increase minimum block size to reduce memory usage
polydup scan ./src --min-block-size 50

# Scan directories separately
polydup scan ./src
polydup scan ./lib

# Exclude generated/vendor code:
# create .polyduprc.toml with exclude patterns
```
Performance Tips
For large codebases (>50K LOC):
- Use `--min-block-size 50-100` to focus on substantial duplicates
- Disable Type-3 detection (it's more computationally expensive)
- Use `--exclude-type type-3` to skip gap-tolerant matching
- Increase `--threshold` to 0.95 to reduce candidate matches
For monorepos:
- Create `.polyduprc.toml` at the root with shared configuration
- Use `--group-by file` to organize results by module
- Exclude `node_modules`, `dist`, `target`, etc. in the config
For CI/CD:
- Cache the `polydup` binary to speed up the pipeline
- Use `--format json` for machine-readable output
- Set appropriate exit-code handling (0 = clean, 1 = duplicates, 2 = error)
Getting Help
Debug Mode:
```sh
# Enable detailed error traces
polydup scan ./src --debug
```
Verbose Output:
```sh
# Show performance statistics
polydup scan ./src --verbose
```
Report an Issue:
- Check existing issues
- Include:
  - PolyDup version (`polydup --version`)
  - Operating system and architecture
  - Command that failed
  - Error message with the `--debug` flag
  - Sample code if applicable (anonymized)
Community:
- GitHub Discussions: Ask questions
- GitHub Issues: Report bugs
Development
Building from Source
Prerequisites:
- Rust 1.70+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- Node.js 16+ (for Node.js bindings)
- Python 3.8-3.12 (for Python bindings)
CLI:
Node.js bindings:
Python bindings:
Run tests:
```sh
# All tests
cargo test --workspace

# Specific crate
cargo test -p polydup-core

# With coverage
```
Creating a Release
Recommended: Create releases directly from GitHub UI - fully automated, no local tools required!
- Go to Releases → New Release
- Create a new tag (e.g., `v0.2.7`)
- Click "Publish release"
- Everything happens automatically (~5-7 minutes):
- Syncs version files (Cargo.toml, package.json, pyproject.toml)
- Updates CHANGELOG.md with release entry
- Moves tag to version-synced commit (if needed)
- Builds binaries for all 5 platforms (macOS/Linux/Windows)
- Publishes to crates.io, npm, and PyPI
- Creates release with binary assets
- Zero manual steps required - truly one-click releases!
Alternative: Use the release script locally:
See docs/RELEASE.md for detailed instructions.
Pre-commit Hooks
Install pre-commit hooks to automatically run linting and tests:
```sh
# Install pre-commit (if not already installed)
pip install pre-commit

# Install the git hooks
pre-commit install

# Run manually on all files
pre-commit run --all-files
```
The hooks will automatically run:
- On commit: `cargo fmt`, `cargo clippy`, file checks (trailing whitespace, YAML/TOML validation)
- On push: Full test suite with `cargo test`
To skip hooks temporarily:

```sh
git commit --no-verify
```
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Install pre-commit hooks (`pre-commit install`)
- Make your changes and ensure tests pass (`cargo test --workspace`)
- Run clippy (`cargo clippy --workspace --all-targets -- -D warnings`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
License
MIT OR Apache-2.0