# Sieve
Intelligent URL deduplication for offensive security workflows. Reduce 10,000 UUID-varied endpoints to 80 unique attack surfaces.
## The Problem
Modern crawlers (katana, hakrawler) and fuzzers output massive lists of URLs. Feeding 10,000 variations of the same endpoint (e.g., `/api/v1/users/{uuid}/profile`) into scanners like nuclei or sqlmap wastes compute, burns bandwidth, and trips WAF rate limits that can get your IP banned.
## Why Sieve?

| Feature | anew | uro | urlsieve |
|---|---|---|---|
| Semantic path normalization | ❌ | ✅ | ✅ |
| Streaming (zero-latency) | ❌ | ❌ | ✅ |
| Auto-learn patterns from data | ❌ | ❌ | ✅ |
| Configurable normalization rules | ❌ | ❌ | ✅ |
## Install
Or build from source:
## Quick Start
```sh
# Deduplicate a file (reads from stdin by default)
cat urls.txt | sieve

# Read from file, write to file
sieve -i urls.txt -o unique.txt

# Show statistics
sieve -i urls.txt --stats
```
> **Note:** Streaming mode (the default, without `--sort`) emits the first URL seen per group. For deterministic output (the lexicographically smallest representative), use `--sort`.
## Real-World Results
| Dataset | Input URLs | Unique Fingerprints | Reduction |
|---|---|---|---|
| REST API (marketplace) | 1,706 | 81 | 95% |
| Mixed crawl (GAU output) | 498 | 319 | 36% |
REST APIs with UUID/numeric IDs compress aggressively. Static asset-heavy sites (WordPress, Next.js) compress less -- Sieve correctly preserves semantically distinct paths.
## How It Works
Each URL is parsed and a fingerprint is generated by normalizing dynamic path segments and query parameters:
| Segment Example | Detected As | Fingerprint |
|---|---|---|
| `df8b8a77-6f3e-4733-978c-f0b8fa28b0a4` | UUID | `{uuid}` |
| `a1b2c3d4e5f6a7b8c9d0` | Hash | `{hash}` |
| `507f1f77bcf86cd799439011` | MongoDB ObjectId | `{mongo}` |
| `01ARZ3NDEKTSV4RRFFQ69G5FAV` | ULID | `{ulid}` |
| `12345678` | Numeric ID | `{id}` |
| `1706832000` | Epoch | `{epoch}` |
| `2024-01-15` | ISO date | `{date}` |
| `dGhpcyBpcyBhIHRva2Vu` | Base64 | `{token}` |
| `23c6DSKX` | Short token | `{slug}` |
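As a rough illustration of the table above (not Sieve's actual code, and covering only a subset of the detectors), path normalization can be modeled as ordered per-segment regex substitution:

```python
import re

# Ordered (pattern, placeholder) pairs -- a simplified subset of the
# detectors in the table above; order matters (specific before generic).
SEGMENT_PATTERNS = [
    (re.compile(r"^[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}$", re.I), "{uuid}"),
    (re.compile(r"^[0-9a-f]{16,64}$", re.I), "{hash}"),
    (re.compile(r"^\d{10}$"), "{epoch}"),            # 10-digit epoch seconds
    (re.compile(r"^\d+$"), "{id}"),
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), "{date}"),  # ISO date
]

def fingerprint_path(path: str) -> str:
    """Replace dynamic-looking path segments with placeholders."""
    out = []
    for seg in path.strip("/").split("/"):
        for pattern, placeholder in SEGMENT_PATTERNS:
            if pattern.match(seg):
                seg = placeholder
                break
        out.append(seg)
    return "/" + "/".join(out)
```

Pattern order matters here: the specific detectors (UUID, epoch) must run before the catch-all numeric rule, or every epoch timestamp would collapse into `{id}`.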
Query parameter values are analyzed via Shannon entropy to distinguish dynamic tokens from static values. Common auth/cache-bust keys (`token`, `session`, `ts`, `_`, `cb`) are always normalized.
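The entropy heuristic can be sketched as follows; the 3.5-bit threshold and 8-character minimum mirror the defaults shown under Configuration, though Sieve's real classifier may weigh additional signals:

```python
import math
from collections import Counter

def shannon_entropy(value: str) -> float:
    """Estimated bits of entropy per character, from character frequencies."""
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in Counter(value).values())

def looks_dynamic(value: str, threshold: float = 3.5, min_len: int = 8) -> bool:
    # Defaults mirror entropy_threshold / min-segment-len from the config
    return len(value) >= min_len and shannon_entropy(value) >= threshold

print(looks_dynamic("dGhpcyBpcyBhIHRva2Vu"))  # high-entropy token
print(looks_dynamic("callback"))              # ordinary static word
```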
## Output Formats
```sh
# rep (default) -- one representative URL per group
sieve -f rep < urls.txt

# counted -- URL with duplicate count as inline comment
sieve -f counted < urls.txt

# json -- single JSON object with full structure
sieve -f json < urls.txt

# jsonl -- one JSON object per line (stream-friendly)
sieve -f jsonl < urls.txt
```
## Streaming Mode (Zero-Latency Pipeline)
When using `--format rep` or `--format jsonl` without `--sort`, Sieve operates in streaming mode: URLs are emitted to stdout immediately as they are read, with zero buffering of the full result set. Only a `u64` hash per unique fingerprint is kept in memory.
This enables real-time pipelines where downstream tools start processing the first URL instantly:
```sh
# nuclei begins scanning immediately -- no wait for full dedup
cat urls.txt | sieve | nuclei

# Same with httpx
cat urls.txt | sieve | httpx
```
Streaming mode is automatically disabled when `--sort`, `--format counted`, `--format json`, or `--invalid-output` is used, as these require the full dataset before output.
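A minimal sketch of the streaming loop (illustrative names, not Sieve's code; a truncated `blake2b` digest stands in for Rapidhash, and `naive_fp` is a toy fingerprint):

```python
import hashlib
import re
from typing import Callable, Iterable, Iterator

def stream_dedup(urls: Iterable[str], fingerprint: Callable[[str], str]) -> Iterator[str]:
    """Yield each URL the first time its fingerprint is seen; only a
    64-bit hash per unique fingerprint stays in memory."""
    seen: set[int] = set()
    for url in urls:
        digest = hashlib.blake2b(fingerprint(url).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")  # stand-in for Rapidhash
        if h not in seen:
            seen.add(h)
            yield url           # emitted immediately -- no buffering

def naive_fp(url: str) -> str:
    # Toy fingerprint: collapse purely numeric path segments to {id}
    return re.sub(r"/\d+(?=/|$)", "/{id}", url)

urls = [
    "https://x.io/users/111/profile",
    "https://x.io/users/222/profile",
    "https://x.io/login",
]
print(list(stream_dedup(urls, naive_fp)))
```

Because each URL is yielded as soon as its fingerprint is judged new, a downstream consumer sees the first result before the input is exhausted.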
## Diff Mode
Compare a new URL list against a previously seen baseline, outputting only new (not previously seen) URLs:
```sh
# Fingerprint-based diff (new structural paths only)
sieve --diff baseline.txt -i new.txt

# Strict diff (exact URL match)
sieve --diff baseline.txt --diff-strict -i new.txt

# Strip query params before comparison
sieve --diff baseline.txt --strip-query -i new.txt
```
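Conceptually, fingerprint diff is a set difference over fingerprints; `diff_new` and `toy_fp` below are illustrative names, not Sieve internals (the identity fingerprint corresponds to `--diff-strict`-style exact matching):

```python
def toy_fp(url: str) -> str:
    # Toy fingerprint: collapse purely numeric path segments to {id}
    return "/".join("{id}" if seg.isdigit() else seg for seg in url.split("/"))

def diff_new(baseline, candidates, fingerprint=lambda u: u):
    """Return candidate URLs whose fingerprint is absent from the baseline.
    With the default identity fingerprint this is exact URL matching."""
    seen = {fingerprint(u) for u in baseline}
    new = []
    for url in candidates:
        fp = fingerprint(url)
        if fp not in seen:
            seen.add(fp)  # also dedup within the candidate list
            new.append(url)
    return new
```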
## Learn Mode
Analyze URL cardinality to determine which path segments and query parameters are dynamic, then automatically generate a config:
```sh
# Print cardinality report (goes to stderr)
sieve --learn -i urls.txt

# Save learned config to TOML
sieve --learn --save-config learned.toml -i urls.txt

# Analyze and immediately apply the learned config
sieve --learn --apply -i urls.txt
```
> **Note:** `--learn` infers path segment patterns from a sample of up to 10 unique values. Positions with extremely high cardinality (>500 unique values) are skipped to avoid generating overly broad patterns -- those segments are handled by the entropy detector instead. Only UUID and numeric patterns are auto-generated; other types fall back to entropy-based detection.
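The cardinality analysis can be sketched as follows (assumed helper names; only the UUID and numeric inference described above is modeled, with the same sampling and cardinality cutoffs):

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

UUID_RE = re.compile(r"^[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}$", re.I)

def learn_positions(urls, sample=10, max_cardinality=500):
    """For each path position, collect unique values and guess a pattern.
    Only uuid/numeric are inferred, mirroring the behavior described above."""
    values = defaultdict(set)
    for u in urls:
        for i, seg in enumerate(urlparse(u).path.strip("/").split("/")):
            values[i].add(seg)
    rules = {}
    for i, vals in values.items():
        if len(vals) <= 1 or len(vals) > max_cardinality:
            continue  # static, or too broad -- left to entropy detection
        samp = list(vals)[:sample]
        if all(UUID_RE.match(v) for v in samp):
            rules[i] = "{uuid}"
        elif all(v.isdigit() for v in samp):
            rules[i] = "{id}"
    return rules
```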
## Performance
Sieve is engineered to handle pipelines with tens of millions of URLs:
- Throughput: ~425k URLs/sec on a modern processor (full pipeline: parse + fingerprint + dedup).
- Memory: ~8 bytes per unique fingerprint in streaming mode (only a `u64` hash is stored).
- Batch mode: Holds the full result set in memory; suitable for datasets up to several million URLs.
- Reusable fingerprint buffer: A single pre-allocated `String` is reused across all URLs, avoiding a heap allocation per URL in the hot path.
- Single-pass regex matching: All detection patterns (UUID, hash, ULID, epoch, etc.) are compiled into one `RegexSet` and evaluated simultaneously, not sequentially.
- Fast hashing: Uses `Rapidhash` for fingerprint deduplication in streaming mode -- one of the fastest non-cryptographic hash algorithms, passing all SMHasher quality tests.
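Python's `re` module has no direct `RegexSet` equivalent, but the single-pass idea can be approximated with one alternation of named groups, so each segment is scanned once rather than once per pattern (a sketch with just three detectors, not Sieve's code):

```python
import re

# One combined alternation stands in for RegexSet: the named group that
# matched tells us which detector fired, in a single scan of the segment.
COMBINED = re.compile(
    r"^(?:"
    r"(?P<uuid>[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12})"
    r"|(?P<epoch>\d{10})"
    r"|(?P<id>\d+)"
    r")$",
    re.I,
)

def classify(segment: str):
    """Return the name of the matching detector, or None."""
    m = COMBINED.match(segment)
    return m.lastgroup if m else None
```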
## Configuration

Sieve uses a TOML config file with sensible defaults. Override with `-c`:
Default config:
```toml
[detection]
min_segment_len = 8
entropy_threshold = 3.5
patterns = ["uuid", "hash", "numid", "timestamp", "epoch", "base64", "mongo", "ulid", "short_token", "entropy"]

[params]
normalize_keys = ["token", "session", "session_id", "user_id", "auth", "api_key", "jwt", "access_token", "refresh_token", "csrf", "signature", "ts", "timestamp", "_", "cb", "cachebust", "rand", "seed"]
keep_keys = ["page", "limit", "offset", "sort", "order", "format", "callback"]

[paths]
keep_segments = ["api", "graphql", "health", "status", "favicon.ico", "robots.txt"]
version_patterns = ["v\\d+"]
```
## CLI Reference

| Flag | Description |
|---|---|
| `-i, --input` | Input file (reads stdin if omitted) |
| `-o, --output` | Output file (writes stdout if omitted) |
| `-f, --format` | Output format: `rep`, `counted`, `json`, `jsonl` (default: `rep`) |
| `-c, --config` | Config file (TOML) |
| `--stats` | Show deduplication statistics |
| `--patterns` | Patterns to enable (comma-separated, or `all`) |
| `--min-segment-len` | Minimum segment length for entropy check |
| `--entropy-threshold` | Shannon entropy threshold for dynamic detection |
| `--normalize-param-keys` | Param keys whose values are always normalized (comma-separated) |
| `--keep-param-keys` | Param keys whose values are never normalized (comma-separated) |
| `--strip-query` | Remove query params from fingerprint entirely |
| `--assume-scheme` | Prepend scheme to scheme-less URLs (default: `https`) |
| `--invalid-output` | Write invalid/malformed URLs to file |
| `--learn` | Analyze cardinality and print report |
| `--apply` | Apply learned config during analysis (requires `--learn`) |
| `--save-config` | Save learned config to TOML file (requires `--learn`) |
| `--diff` | Compare against baseline file (fingerprint match) |
| `--diff-strict` | Use exact URL matching in diff mode |
| `--sort` | Sort output by fingerprint (deterministic, disables streaming) |
## Common Workflows
```sh
# Basic dedup with stats
sieve -i urls.txt --stats

# Dedup, then scan with nuclei
cat urls.txt | sieve | nuclei

# Diff against yesterday's scan
sieve --diff yesterday.txt -i today.txt

# Learn from a large dataset, then apply
sieve --learn --save-config learned.toml -i urls.txt
sieve -c learned.toml -i urls.txt

# Analyze and immediately apply learned config in one step
sieve --learn --apply -i urls.txt

# Stream JSONL output for downstream processing
cat urls.txt | sieve -f jsonl | jq .

# Dedup scheme-less URLs
sieve --assume-scheme https -i urls.txt

# Dedup ignoring query params (useful when only params differ)
sieve --strip-query -i urls.txt

# Save invalid URLs for inspection
sieve --invalid-output invalid.txt -i urls.txt
```
## What Counts as an Invalid URL?
Sieve rejects URLs that cannot be parsed as valid HTTP/HTTPS:
- `ftp://`, `file://`, `javascript:` schemes → rejected
- Invalid percent-encoding (e.g., `%GG`) → decoded lossily, still processed
- Malformed URLs with no host → rejected
- Protocol-relative URLs (`//cdn.example.com/file.js`) → accepted, `https` assumed
Invalid URLs are silently dropped in streaming mode. Use `--invalid-output` in batch mode to collect them for inspection.
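The rules above can be sketched as a small validator (a hypothetical `accept` helper, not Sieve's parser; lossy percent-decoding and bare scheme-less input handled by `--assume-scheme` are not modeled beyond the protocol-relative case):

```python
from urllib.parse import urlsplit

def accept(raw: str, assume_scheme: str = "https"):
    """Return a parsed-and-rebuilt URL, or None when it should be rejected."""
    if raw.startswith("//"):               # protocol-relative: assume a scheme
        raw = f"{assume_scheme}:{raw}"
    parts = urlsplit(raw)
    if parts.scheme not in ("http", "https"):  # ftp://, file://, javascript: ...
        return None
    if not parts.netloc:                       # malformed, no host
        return None
    return parts.geturl()
```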
## License
MIT. See LICENSE.