Sieve

Intelligent URL deduplication for offensive security workflows. Reduce 10,000 UUID-varied endpoints to 80 unique attack surfaces.

The Problem

Modern crawlers (katana, hakrawler) and fuzzers output massive lists of URLs. Feeding 10,000 variations of the same endpoint (e.g., /api/v1/users/{uuid}/profile) into scanners like nuclei or sqlmap wastes compute time, burns bandwidth, and triggers rate limiting that gets your IP banned by WAFs.

Why Sieve?

Compared to anew and uro, Sieve combines all of the following:

  • Semantic path normalization
  • Streaming (zero-latency)
  • Auto-learned patterns from data
  • Configurable normalization rules

Install

cargo install urlsieve

Or build from source:

cargo build --release
./target/release/urlsieve --help

Quick Start

# Deduplicate a file (reads from stdin by default)
cat urls.txt | urlsieve

# Read from file, write to file
urlsieve -i urls.txt -o deduped.txt

# Show statistics
urlsieve -i urls.txt --stats

Note: Streaming mode (default, no --sort) emits the first seen URL per group. For deterministic output (lexicographically smallest representative), use --sort.

Real-World Results

Dataset Input URLs Unique Fingerprints Reduction
REST API (marketplace) 1,706 81 95%
Mixed crawl (GAU output) 498 319 36%

REST APIs with UUID/numeric IDs compress aggressively. Static asset-heavy sites (WordPress, Next.js) compress less -- Sieve correctly preserves semantically distinct paths.

How It Works

Each URL is parsed and a fingerprint is generated by normalizing dynamic path segments and query parameters:

Segment Example Detected As Fingerprint
df8b8a77-6f3e-4733-978c-f0b8fa28b0a4 UUID {uuid}
a1b2c3d4e5f6a7b8c9d0 Hash {hash}
507f1f77bcf86cd799439011 MongoDB ObjectId {mongo}
01ARZ3NDEKTSV4RRFFQ69G5FAV ULID {ulid}
12345678 Numeric ID {id}
1706832000 Epoch {epoch}
2024-01-15 ISO date {date}
dGhpcyBpcyBhIHRva2Vu Base64 {token}
23c6DSKX Short token {slug}
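
A minimal Rust sketch of this normalization, with deliberately simplified detection rules (the looks_like_uuid/classify/fingerprint helpers below are illustrative stand-ins, not Sieve's actual patterns):

// Simplified stand-in for Sieve's segment normalization.
fn looks_like_uuid(s: &str) -> bool {
    // 36 chars, hyphens at positions 8/13/18/23, hex everywhere else
    s.len() == 36
        && s.char_indices().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        })
}

fn classify(segment: &str) -> Option<&'static str> {
    if looks_like_uuid(segment) {
        Some("{uuid}")
    } else if !segment.is_empty() && segment.bytes().all(|b| b.is_ascii_digit()) {
        Some("{id}")
    } else if segment.len() >= 16 && segment.bytes().all(|b| b.is_ascii_hexdigit()) {
        Some("{hash}")
    } else {
        None // literal segments (api, v1, users, profile, ...) pass through
    }
}

fn fingerprint(path: &str) -> String {
    path.split('/')
        .map(|seg| match classify(seg) {
            Some(placeholder) => placeholder,
            None => seg,
        })
        .collect::<Vec<_>>()
        .join("/")
}

fn main() {
    let path = "/api/v1/users/df8b8a77-6f3e-4733-978c-f0b8fa28b0a4/profile";
    println!("{}", fingerprint(path)); // -> /api/v1/users/{uuid}/profile
}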

Query parameter values are analyzed via Shannon entropy to distinguish dynamic tokens from static values. Common auth/cache-bust keys (token, session, ts, _, cb) are always normalized.
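
A rough sketch of that heuristic, assuming per-character Shannon entropy measured in bits and the default 3.5 threshold; the length guard and sample values are illustrative, not Sieve's exact logic:

use std::collections::HashMap;

// Shannon entropy of the character distribution, in bits per character.
fn shannon_entropy(s: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let n = s.chars().count() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    let threshold = 3.5; // default entropy_threshold
    for value in ["asc", "en-US", "eyJhbGciOiJIUzI1NiJ9"] {
        let h = shannon_entropy(value);
        let dynamic = value.len() >= 8 && h >= threshold; // short values never count as dynamic here
        println!("{value}: entropy = {h:.2}, dynamic = {dynamic}");
    }
}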

Output Formats

# rep (default) -- one representative URL per group
urlsieve -i urls.txt --format rep

# counted -- URL with duplicate count as inline comment
urlsieve -i urls.txt --format counted

# json -- single JSON object with full structure
urlsieve -i urls.txt --format json

# jsonl -- one JSON object per line (stream-friendly)
urlsieve -i urls.txt --format jsonl

Streaming Mode (Zero-Latency Pipeline)

When using --format rep or --format jsonl without --sort, Sieve operates in streaming mode: URLs are emitted to stdout immediately as they are read, with zero buffering of the full result set. Only a u64 hash per unique fingerprint is kept in memory.

This enables real-time pipelines where downstream tools start processing the first URL instantly:

# nuclei begins scanning immediately -- no wait for full dedup
cat 50M_urls.txt | urlsieve | nuclei -l -

# Same with httpx
urlsieve -i urls.txt | httpx -silent -status-code

Streaming mode is automatically disabled when --sort, --format counted, --format json, or --invalid-output is used, as these require the full dataset before output.
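
For intuition, here is a minimal sketch of the streaming dedup core: hash each fingerprint to a u64, keep only those hashes in a set, and emit a URL the first time its hash is new. The lowercasing stand-in for the fingerprint step and the use of std's DefaultHasher (rather than Rapidhash) are simplifications:

use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, Write};

fn hash64(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let mut seen: HashSet<u64> = HashSet::new(); // one u64 per unique fingerprint

    for line in stdin.lock().lines() {
        let url = line?;
        let fp = url.to_ascii_lowercase(); // stand-in for the real fingerprint
        if seen.insert(hash64(&fp)) {
            // First URL of this group: emit immediately, no buffering.
            writeln!(out, "{url}")?;
        }
    }
    Ok(())
}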

Diff Mode

Compare a new URL list against a previously seen baseline, outputting only new (not previously seen) URLs:

# Fingerprint-based diff (new structural paths only)
urlsieve --diff baseline.txt -i new_urls.txt

# Strict diff (exact URL match)
urlsieve --diff baseline.txt --diff-strict -i new_urls.txt

# Strip query params before comparison
urlsieve --diff baseline.txt --strip-query -i new_urls.txt
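
Conceptually, fingerprint-based diff is just a set lookup against the baseline. A minimal sketch, assuming a placeholder fingerprint() that only strips the query string:

use std::collections::HashSet;
use std::fs;
use std::io;

// Placeholder: the real fingerprint also normalizes dynamic path segments.
fn fingerprint(url: &str) -> String {
    url.split('?').next().unwrap_or(url).to_string()
}

fn main() -> io::Result<()> {
    let baseline = fs::read_to_string("baseline.txt")?;
    let seen: HashSet<String> = baseline.lines().map(fingerprint).collect();

    let new_urls = fs::read_to_string("new_urls.txt")?;
    for url in new_urls.lines() {
        if !seen.contains(&fingerprint(url)) {
            println!("{url}"); // structurally new
        }
    }
    Ok(())
}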

Learn Mode

Analyze URL cardinality to determine which path segments and query parameters are dynamic, then automatically generate a config:

# Print cardinality report (goes to stderr)
urlsieve --learn -i urls.txt

# Save learned config to TOML
urlsieve --learn --save-config learned.toml -i urls.txt

# Analyze and immediately apply the learned config
urlsieve --learn --apply -i urls.txt --stats

Note: --learn infers path segment patterns from a sample of up to 10 unique values. Positions with extremely high cardinality (>500 unique values) are skipped to avoid generating overly broad patterns — those segments are handled by the entropy detector instead. Only UUID and numeric patterns are auto-generated; other types fall back to entropy-based detection.
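
The cardinality analysis behind --learn can be pictured with a toy example; the 0.5 ratio and the tiny dataset here are illustrative only (the real thresholds and the >500 cap are described above):

use std::collections::{HashMap, HashSet};

fn main() {
    let urls = [
        "/api/v1/users/111/profile",
        "/api/v1/users/222/profile",
        "/api/v1/users/333/profile",
    ];

    // path position -> set of distinct segment values seen there
    let mut by_pos: HashMap<usize, HashSet<&str>> = HashMap::new();
    for url in urls {
        for (i, seg) in url.split('/').filter(|s| !s.is_empty()).enumerate() {
            by_pos.entry(i).or_default().insert(seg);
        }
    }

    for i in 0..by_pos.len() {
        let cardinality = by_pos[&i].len();
        // A position where (almost) every URL contributes a new value is dynamic.
        let dynamic = cardinality as f64 / urls.len() as f64 > 0.5;
        println!("position {i}: {cardinality} unique values, dynamic = {dynamic}");
    }
}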

Performance

Sieve is engineered to handle pipelines with tens of millions of URLs:

  • Throughput: ~425k URLs/sec on a modern processor (full pipeline: parse + fingerprint + dedup).
  • Memory: ~8 bytes per unique fingerprint in streaming mode (only a u64 hash is stored).
  • Batch mode: Holds full result set in memory; suitable for datasets up to several million URLs.
  • Reusable fingerprint buffer: A single pre-allocated String is reused across all URLs, avoiding heap allocation per URL in the hot path.
  • Single-pass regex matching: All detection patterns (UUID, hash, ULID, epoch, etc.) are compiled into one RegexSet and evaluated simultaneously, not sequentially (see the sketch after this list).
  • Fast hashing: Uses Rapidhash for fingerprint deduplication in streaming mode -- one of the fastest non-cryptographic hash algorithms, passing all SMHasher quality tests.
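
A minimal sketch of that single-pass matching using the regex crate's RegexSet; the pattern list here is abbreviated and not Sieve's own:

use regex::RegexSet;

fn main() {
    // All patterns are compiled together and evaluated in one scan of the input.
    let set = RegexSet::new([
        r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$", // uuid
        r"^[0-9]+$",           // numeric id
        r"^[0-9a-fA-F]{16,}$", // long hex hash
    ])
    .unwrap();

    let segment = "df8b8a77-6f3e-4733-978c-f0b8fa28b0a4";
    let matched: Vec<usize> = set.matches(segment).iter().collect();
    println!("{segment} matched pattern indices {matched:?}"); // -> [0]
}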

Configuration

Sieve uses a TOML config file with sensible defaults. Override with -c:

urlsieve -i urls.txt -c myconfig.toml

Default config:

[general]
min_segment_len = 8
entropy_threshold = 3.5
patterns = ["uuid", "hash", "numid", "timestamp", "epoch", "base64", "mongo", "ulid", "short_token", "entropy"]

[normalize_params]
always_normalize = ["token", "session", "session_id", "user_id", "auth", "api_key", "jwt", "access_token", "refresh_token", "csrf", "signature", "ts", "timestamp", "_", "cb", "cachebust", "rand", "seed"]
never_normalize = ["page", "limit", "offset", "sort", "order", "format", "callback"]

[structural]
literal_segments = ["api", "graphql", "health", "status", "favicon.ico", "robots.txt"]
pattern_segments = ["v\\d+"]
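
As a hypothetical illustration of this layout as typed data (not Sieve's internal code), the file could be parsed with serde and the toml crate; the struct names here are assumptions:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Config {
    general: General,
    normalize_params: NormalizeParams,
    structural: Structural,
}

#[derive(Debug, Deserialize)]
struct General {
    min_segment_len: usize,
    entropy_threshold: f64,
    patterns: Vec<String>,
}

#[derive(Debug, Deserialize)]
struct NormalizeParams {
    always_normalize: Vec<String>,
    never_normalize: Vec<String>,
}

#[derive(Debug, Deserialize)]
struct Structural {
    literal_segments: Vec<String>,
    pattern_segments: Vec<String>,
}

fn main() {
    let raw = std::fs::read_to_string("myconfig.toml").expect("read config");
    let cfg: Config = toml::from_str(&raw).expect("parse config");
    println!("{} patterns enabled", cfg.general.patterns.len());
}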

CLI Reference

Flag Description
-i, --input Input file (reads stdin if omitted)
-o, --output Output file (writes stdout if omitted)
-f, --format Output format: rep, counted, json, jsonl (default: rep)
-c, --config Config file (TOML)
--stats Show deduplication statistics
--patterns Patterns to enable (comma-separated, or all)
--min-segment-len Minimum segment length for entropy check
--entropy-threshold Shannon entropy threshold for dynamic detection
--normalize-param-keys Param keys whose values are always normalized (comma-separated)
--keep-param-keys Param keys whose values are never normalized (comma-separated)
--strip-query Remove query params from fingerprint entirely
--assume-scheme Prepend scheme to scheme-less URLs (default: https)
--invalid-output Write invalid/malformed URLs to file
--learn Analyze cardinality and print report
--apply Apply learned config during analysis (requires --learn)
--save-config Save learned config to TOML file (requires --learn)
--diff Compare against baseline file (fingerprint match)
--diff-strict Use exact URL matching in diff mode
--sort Sort output by fingerprint (deterministic, disables streaming)

Common Workflows

# Basic dedup with stats
urlsieve -i urls.txt --stats

# Dedup, then scan with nuclei
urlsieve -i urls.txt | nuclei -l -

# Diff against yesterday's scan
urlsieve --diff yesterday.txt -i today.txt --stats

# Learn from a large dataset, then apply
urlsieve --learn --save-config recon.toml -i all_urls.txt
urlsieve -i new_urls.txt -c recon.toml --stats

# Analyze and immediately apply learned config in one step
urlsieve --learn --apply -i urls.txt --stats

# Stream JSONL output for downstream processing
urlsieve -i urls.txt --format jsonl | jq -r '.representative'

# Dedup scheme-less URLs
urlsieve -i hosts.txt --assume-scheme https

# Dedup ignoring query params (useful when only params differ)
urlsieve -i urls.txt --strip-query --stats

# Save invalid URLs for inspection
urlsieve -i urls.txt --invalid-output invalid.txt --stats

What Counts as an Invalid URL?

Sieve keeps only URLs that parse as valid HTTP/HTTPS. Edge cases are handled as follows:

  • ftp://, file://, javascript: schemes → rejected
  • Invalid percent-encoding (e.g., %GG) → decoded lossily, still processed
  • Malformed URLs with no host → rejected
  • Protocol-relative URLs (//cdn.example.com/file.js) → accepted, https assumed

Invalid URLs are silently dropped in streaming mode. Use --invalid-output in batch mode to collect them for inspection.
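
A minimal sketch of these rules using the url crate; Sieve's actual handling (lossy percent-decoding, --assume-scheme, --invalid-output) is more involved:

use url::Url;

fn accept(raw: &str) -> Option<Url> {
    // Protocol-relative URLs (//cdn.example.com/...) get https assumed.
    let candidate = if raw.starts_with("//") {
        format!("https:{raw}")
    } else {
        raw.to_string()
    };
    let url = Url::parse(&candidate).ok()?;
    // Only http/https with a host survive; ftp://, file://, javascript: are dropped.
    if !matches!(url.scheme(), "http" | "https") || url.host_str().is_none() {
        return None;
    }
    Some(url)
}

fn main() {
    for raw in ["//cdn.example.com/file.js", "ftp://example.com/x", "javascript:alert(1)"] {
        println!("{raw} -> accepted: {}", accept(raw).is_some());
    }
}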

License

MIT. See LICENSE.