Sieve

Intelligent URL deduplication for offensive security workflows. Reduce 10,000 UUID-varied endpoints to 80 unique attack surfaces.

The Problem

Modern crawlers (katana, hakrawler) and fuzzers output massive lists of URLs. Feeding 10,000 variations of the same endpoint (e.g., /api/v1/users/{uuid}/profile) into scanners like nuclei or sqlmap wastes compute time, burns bandwidth, and triggers rate limiting that gets your IP banned by WAFs.

Why Sieve?

Compared to anew and uro, Sieve combines all of the following:

  • Semantic path normalization
  • Streaming (zero-latency)
  • Auto-learned patterns from data
  • Configurable normalization rules

Install

cargo install urlsieve

Or build from source:

cargo build --release
./target/release/urlsieve --help

Quick Start

# Deduplicate a file (reads from stdin by default)
cat urls.txt | urlsieve

# Read from file, write to file
urlsieve -i urls.txt -o deduped.txt

# Show statistics
urlsieve -i urls.txt --stats

Note: Streaming mode (default, no --sort) emits the first seen URL per group. For deterministic output (lexicographically smallest representative), use --sort.

Real-World Results

Dataset Input URLs Unique Fingerprints Reduction
REST API (marketplace) 1,706 81 95%
Mixed crawl (GAU output) 498 319 36%

REST APIs with UUID/numeric IDs compress aggressively. Static asset-heavy sites (WordPress, Next.js) compress less -- Sieve correctly preserves semantically distinct paths.

How It Works

Each URL is parsed and a fingerprint is generated by normalizing dynamic path segments and query parameters:

Segment Example Detected As Fingerprint
df8b8a77-6f3e-4733-978c-f0b8fa28b0a4 UUID {uuid}
a1b2c3d4e5f6a7b8c9d0 Hash {hash}
507f1f77bcf86cd799439011 MongoDB ObjectId {mongo}
01ARZ3NDEKTSV4RRFFQ69G5FAV ULID {ulid}
12345678 Numeric ID {id}
1706832000 Epoch {epoch}
2024-01-15 ISO date {date}
dGhpcyBpcyBhIHRva2Vu Base64 {token}
23c6DSKX Short token {slug}
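
A minimal Rust sketch of this normalization, with deliberately simplified detection rules (the looks_like_uuid/classify/fingerprint helpers below are illustrative stand-ins, not Sieve's actual patterns):

// Simplified stand-in for Sieve's segment normalization.
fn looks_like_uuid(s: &str) -> bool {
    // 36 chars, hyphens at positions 8/13/18/23, hex everywhere else
    s.len() == 36
        && s.char_indices().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        })
}

fn classify(segment: &str) -> Option<&'static str> {
    if looks_like_uuid(segment) {
        Some("{uuid}")
    } else if !segment.is_empty() && segment.bytes().all(|b| b.is_ascii_digit()) {
        Some("{id}")
    } else if segment.len() >= 16 && segment.bytes().all(|b| b.is_ascii_hexdigit()) {
        Some("{hash}")
    } else {
        None // literal segments (api, v1, users, profile, ...) pass through
    }
}

fn fingerprint(path: &str) -> String {
    path.split('/')
        .map(|seg| match classify(seg) {
            Some(placeholder) => placeholder,
            None => seg,
        })
        .collect::<Vec<_>>()
        .join("/")
}

fn main() {
    let path = "/api/v1/users/df8b8a77-6f3e-4733-978c-f0b8fa28b0a4/profile";
    println!("{}", fingerprint(path)); // -> /api/v1/users/{uuid}/profile
}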

Query parameter values are analyzed via Shannon entropy to distinguish dynamic tokens from static values. Common auth/cache-bust keys (token, session, ts, _, cb) are always normalized.
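
A rough sketch of that heuristic, assuming per-character Shannon entropy measured in bits and the default 3.5 threshold; the length guard and sample values are illustrative, not Sieve's exact logic:

use std::collections::HashMap;

// Shannon entropy of the character distribution, in bits per character.
fn shannon_entropy(s: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let n = s.chars().count() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    let threshold = 3.5; // default entropy_threshold
    for value in ["asc", "en-US", "eyJhbGciOiJIUzI1NiJ9"] {
        let h = shannon_entropy(value);
        let dynamic = value.len() >= 8 && h >= threshold; // short values never count as dynamic here
        println!("{value}: entropy = {h:.2}, dynamic = {dynamic}");
    }
}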

Output Formats

# rep (default) -- one representative URL per group
urlsieve -i urls.txt --format rep

# counted -- URL with duplicate count as inline comment
urlsieve -i urls.txt --format counted

# json -- single JSON object with full structure
urlsieve -i urls.txt --format json

# jsonl -- one JSON object per line (stream-friendly)
urlsieve -i urls.txt --format jsonl

Streaming Mode (Zero-Latency Pipeline)

When using --format rep or --format jsonl without --sort, Sieve operates in streaming mode: URLs are emitted to stdout immediately as they are read, with zero buffering of the full result set. Only a u64 hash per unique fingerprint is kept in memory.

This enables real-time pipelines where downstream tools start processing the first URL instantly:

# nuclei begins scanning immediately -- no wait for full dedup
cat 50M_urls.txt | urlsieve | nuclei -l -

# Same with httpx
urlsieve -i urls.txt | httpx -silent -status-code

Streaming mode is automatically disabled when --sort, --format counted, --format json, or --invalid-output is used, as these require the full dataset before output.
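
For intuition, here is a minimal sketch of the streaming dedup core: hash each fingerprint to a u64, keep only those hashes in a set, and emit a URL the first time its hash is new. The lowercasing stand-in for the fingerprint step and the use of std's DefaultHasher (rather than Rapidhash) are simplifications:

use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, Write};

fn hash64(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let mut seen: HashSet<u64> = HashSet::new(); // one u64 per unique fingerprint

    for line in stdin.lock().lines() {
        let url = line?;
        let fp = url.to_ascii_lowercase(); // stand-in for the real fingerprint
        if seen.insert(hash64(&fp)) {
            // First URL of this group: emit immediately, no buffering.
            writeln!(out, "{url}")?;
        }
    }
    Ok(())
}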

Diff Mode

Compare a new URL list against a previously seen baseline, outputting only new (not previously seen) URLs:

# Fingerprint-based diff (new structural paths only)
urlsieve --diff baseline.txt -i new_urls.txt

# Strict diff (exact URL match)
urlsieve --diff baseline.txt --diff-strict -i new_urls.txt

# Strip query params before comparison
urlsieve --diff baseline.txt --strip-query -i new_urls.txt
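
Conceptually, fingerprint-based diff is just a set lookup against the baseline. A minimal sketch, assuming a placeholder fingerprint() that only strips the query string:

use std::collections::HashSet;
use std::fs;
use std::io;

// Placeholder: the real fingerprint also normalizes dynamic path segments.
fn fingerprint(url: &str) -> String {
    url.split('?').next().unwrap_or(url).to_string()
}

fn main() -> io::Result<()> {
    let baseline = fs::read_to_string("baseline.txt")?;
    let seen: HashSet<String> = baseline.lines().map(fingerprint).collect();

    let new_urls = fs::read_to_string("new_urls.txt")?;
    for url in new_urls.lines() {
        if !seen.contains(&fingerprint(url)) {
            println!("{url}"); // structurally new
        }
    }
    Ok(())
}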

Learn Mode

Analyze URL cardinality to determine which path segments and query parameters are dynamic, then automatically generate a config:

# Print cardinality report (goes to stderr)
urlsieve --learn -i urls.txt

# Save learned config to TOML
urlsieve --learn --save-config learned.toml -i urls.txt

# Analyze and immediately apply the learned config
urlsieve --learn --apply -i urls.txt --stats

Note: --learn infers path segment patterns from a sample of up to 10 unique values. Positions with extremely high cardinality (>500 unique values) are skipped to avoid generating overly broad patterns — those segments are handled by the entropy detector instead. Only UUID and numeric patterns are auto-generated; other types fall back to entropy-based detection.
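
The cardinality analysis behind --learn can be pictured with a toy example; the 0.5 ratio and the tiny dataset here are illustrative only (the real thresholds and the >500 cap are described above):

use std::collections::{HashMap, HashSet};

fn main() {
    let urls = [
        "/api/v1/users/111/profile",
        "/api/v1/users/222/profile",
        "/api/v1/users/333/profile",
    ];

    // path position -> set of distinct segment values seen there
    let mut by_pos: HashMap<usize, HashSet<&str>> = HashMap::new();
    for url in urls {
        for (i, seg) in url.split('/').filter(|s| !s.is_empty()).enumerate() {
            by_pos.entry(i).or_default().insert(seg);
        }
    }

    for i in 0..by_pos.len() {
        let cardinality = by_pos[&i].len();
        // A position where (almost) every URL contributes a new value is dynamic.
        let dynamic = cardinality as f64 / urls.len() as f64 > 0.5;
        println!("position {i}: {cardinality} unique values, dynamic = {dynamic}");
    }
}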

Performance

Sieve is engineered to handle pipelines with tens of millions of URLs:

  • Throughput: ~425k URLs/sec on a modern processor (full pipeline: parse + fingerprint + dedup).
  • Memory: ~8 bytes per unique fingerprint in streaming mode (only a u64 hash is stored).
  • Batch mode: Holds full result set in memory; suitable for datasets up to several million URLs.
  • Reusable fingerprint buffer: A single pre-allocated String is reused across all URLs, avoiding heap allocation per URL in the hot path.
  • Single-pass regex matching: All detection patterns (UUID, hash, ULID, epoch, etc.) are compiled into one RegexSet and evaluated simultaneously, not sequentially (see the sketch after this list).
  • Fast hashing: Uses Rapidhash for fingerprint deduplication in streaming mode -- one of the fastest non-cryptographic hash algorithms, passing all SMHasher quality tests.
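
A minimal sketch of that single-pass matching using the regex crate's RegexSet; the pattern list here is abbreviated and not Sieve's own:

use regex::RegexSet;

fn main() {
    // All patterns are compiled together and evaluated in one scan of the input.
    let set = RegexSet::new([
        r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$", // uuid
        r"^[0-9]+$",           // numeric id
        r"^[0-9a-fA-F]{16,}$", // long hex hash
    ])
    .unwrap();

    let segment = "df8b8a77-6f3e-4733-978c-f0b8fa28b0a4";
    let matched: Vec<usize> = set.matches(segment).iter().collect();
    println!("{segment} matched pattern indices {matched:?}"); // -> [0]
}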

Configuration

Sieve uses a TOML config file with sensible defaults. Override with -c:

urlsieve -i urls.txt -c myconfig.toml

Default config:

[general]
min_segment_len = 8
entropy_threshold = 3.5
patterns = ["uuid", "hash", "numid", "timestamp", "epoch", "base64", "mongo", "ulid", "short_token", "entropy"]

[normalize_params]
always_normalize = ["token", "session", "session_id", "user_id", "auth", "api_key", "jwt", "access_token", "refresh_token", "csrf", "signature", "ts", "timestamp", "_", "cb", "cachebust", "rand", "seed"]
never_normalize = ["page", "limit", "offset", "sort", "order", "format", "callback"]

[structural]
literal_segments = ["api", "graphql", "health", "status", "favicon.ico", "robots.txt"]
pattern_segments = ["v\\d+"]
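
As a hypothetical illustration of this layout as typed data (not Sieve's internal code), the file could be parsed with serde and the toml crate; the struct names here are assumptions:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Config {
    general: General,
    normalize_params: NormalizeParams,
    structural: Structural,
}

#[derive(Debug, Deserialize)]
struct General {
    min_segment_len: usize,
    entropy_threshold: f64,
    patterns: Vec<String>,
}

#[derive(Debug, Deserialize)]
struct NormalizeParams {
    always_normalize: Vec<String>,
    never_normalize: Vec<String>,
}

#[derive(Debug, Deserialize)]
struct Structural {
    literal_segments: Vec<String>,
    pattern_segments: Vec<String>,
}

fn main() {
    let raw = std::fs::read_to_string("myconfig.toml").expect("read config");
    let cfg: Config = toml::from_str(&raw).expect("parse config");
    println!("{} patterns enabled", cfg.general.patterns.len());
}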

CLI Reference

Flag Description
-i, --input Input file (reads stdin if omitted)
-o, --output Output file (writes stdout if omitted)
-f, --format Output format: rep, counted, json, jsonl (default: rep)
-c, --config Config file (TOML)
--stats Show deduplication statistics
--patterns Patterns to enable (comma-separated, or all)
--min-segment-len Minimum segment length for entropy check
--entropy-threshold Shannon entropy threshold for dynamic detection
--normalize-param-keys Param keys whose values are always normalized (comma-separated)
--keep-param-keys Param keys whose values are never normalized (comma-separated)
--strip-query Remove query params from fingerprint entirely
--assume-scheme Prepend scheme to scheme-less URLs (default: https)
--invalid-output Write invalid/malformed URLs to file
--learn Analyze cardinality and print report
--apply Apply learned config during analysis (requires --learn)
--save-config Save learned config to TOML file (requires --learn)
--diff Compare against baseline file (fingerprint match)
--diff-strict Use exact URL matching in diff mode
--sort Sort output by fingerprint (deterministic, disables streaming)

Common Workflows

# Basic dedup with stats
urlsieve -i urls.txt --stats

# Dedup, then scan with nuclei
urlsieve -i urls.txt | nuclei -l -

# Diff against yesterday's scan
urlsieve --diff yesterday.txt -i today.txt --stats

# Learn from a large dataset, then apply
urlsieve --learn --save-config recon.toml -i all_urls.txt
urlsieve -i new_urls.txt -c recon.toml --stats

# Analyze and immediately apply learned config in one step
urlsieve --learn --apply -i urls.txt --stats

# Stream JSONL output for downstream processing
urlsieve -i urls.txt --format jsonl | jq -r '.representative'

# Dedup scheme-less URLs
urlsieve -i hosts.txt --assume-scheme https

# Dedup ignoring query params (useful when only params differ)
urlsieve -i urls.txt --strip-query --stats

# Save invalid URLs for inspection
urlsieve -i urls.txt --invalid-output invalid.txt --stats

What Counts as an Invalid URL?

Sieve keeps only URLs that parse as valid HTTP/HTTPS. Edge cases are handled as follows:

  • ftp://, file://, javascript: schemes → rejected
  • Invalid percent-encoding (e.g., %GG) → decoded lossily, still processed
  • Malformed URLs with no host → rejected
  • Protocol-relative URLs (//cdn.example.com/file.js) → accepted, https assumed

Invalid URLs are silently dropped in streaming mode. Use --invalid-output in batch mode to collect them for inspection.
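
A minimal sketch of these rules using the url crate; Sieve's actual handling (lossy percent-decoding, --assume-scheme, --invalid-output) is more involved:

use url::Url;

fn accept(raw: &str) -> Option<Url> {
    // Protocol-relative URLs (//cdn.example.com/...) get https assumed.
    let candidate = if raw.starts_with("//") {
        format!("https:{raw}")
    } else {
        raw.to_string()
    };
    let url = Url::parse(&candidate).ok()?;
    // Only http/https with a host survive; ftp://, file://, javascript: are dropped.
    if !matches!(url.scheme(), "http" | "https") || url.host_str().is_none() {
        return None;
    }
    Some(url)
}

fn main() {
    for raw in ["//cdn.example.com/file.js", "ftp://example.com/x", "javascript:alert(1)"] {
        println!("{raw} -> accepted: {}", accept(raw).is_some());
    }
}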

License

MIT. See LICENSE.