# Sieve
[crates.io](https://crates.io/crates/urlsieve)
[MIT License](LICENSE)
> Intelligent URL deduplication for offensive security workflows.
> Reduce 10,000 UUID-varied endpoints to 80 unique attack surfaces.
## The Problem
Modern crawlers (katana, hakrawler) and fuzzers output massive lists of URLs. Feeding 10,000 variations of the same endpoint (e.g., `/api/v1/users/{uuid}/profile`) into scanners like `nuclei` or `sqlmap` wastes compute time, burns bandwidth, and triggers rate limiting that gets your IP banned by WAFs.
## Why Sieve?
| Capability | Plain `sort -u` | Typical URL dedupers | Sieve |
|---|---|---|---|
| Semantic path normalization | ❌ | ✅ | ✅ |
| Streaming (zero-latency) | ❌ | ❌ | ✅ |
| Auto-learn patterns from data | ❌ | ❌ | ✅ |
| Configurable normalization rules | ❌ | ❌ | ✅ |
## Install
```bash
cargo install urlsieve
```
Or build from source:
```bash
cargo build --release
./target/release/urlsieve --help
```
## Quick Start
```bash
# Deduplicate a file (reads from stdin by default)
cat urls.txt | urlsieve
# Read from file, write to file
urlsieve -i urls.txt -o deduped.txt
# Show statistics
urlsieve -i urls.txt --stats
```
> **Note:** Streaming mode (default, no `--sort`) emits the first seen URL per group. For deterministic output (lexicographically smallest representative), use `--sort`.
## Real-World Results
| Dataset | Input URLs | Unique groups | Reduction |
|---|---|---|---|
| REST API (marketplace) | 1,706 | 81 | 95% |
| Mixed crawl (GAU output) | 498 | 319 | 36% |
REST APIs with UUID/numeric IDs compress aggressively. Static asset-heavy sites (WordPress, Next.js) compress less -- Sieve correctly preserves semantically distinct paths.
## How It Works
Each URL is parsed and a **fingerprint** is generated by normalizing dynamic path segments and query parameters:
| Example value | Detected as | Normalized to |
|---|---|---|
| `df8b8a77-6f3e-4733-978c-f0b8fa28b0a4` | UUID | `{uuid}` |
| `a1b2c3d4e5f6a7b8c9d0` | Hash | `{hash}` |
| `507f1f77bcf86cd799439011` | MongoDB ObjectId | `{mongo}` |
| `01ARZ3NDEKTSV4RRFFQ69G5FAV` | ULID | `{ulid}` |
| `12345678` | Numeric ID | `{id}` |
| `1706832000` | Epoch | `{epoch}` |
| `2024-01-15` | ISO date | `{date}` |
| `dGhpcyBpcyBhIHRva2Vu` | Base64 | `{token}` |
| `23c6DSKX` | Short token | `{slug}` |
Query parameter values are analyzed via Shannon entropy to distinguish dynamic tokens from static values. Common auth/cache-bust keys (`token`, `session`, `ts`, `_`, `cb`) are always normalized.
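The entropy check can be pictured with a small Rust sketch (illustrative only, not the crate's internal code). It computes Shannon entropy over a value's character distribution and treats the value as dynamic when it is at least `min_segment_len` characters long and its entropy reaches `entropy_threshold` (8 and 3.5 in the default config shown later):

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per character, over the value's character distribution.
fn shannon_entropy(value: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in value.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let len = value.chars().count() as f64;
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// A value is flagged as dynamic when it is long enough and sufficiently random.
fn looks_dynamic(value: &str, min_segment_len: usize, entropy_threshold: f64) -> bool {
    value.chars().count() >= min_segment_len && shannon_entropy(value) >= entropy_threshold
}

fn main() {
    // Defaults from the config: min_segment_len = 8, entropy_threshold = 3.5
    // Short static values like "json" stay; the 32-char hex token is flagged.
    for value in ["json", "page", "df8b8a776f3e4733978cf0b8fa28b0a4"] {
        println!("{value}: dynamic = {}", looks_dynamic(value, 8, 3.5));
    }
}
```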
## Output Formats
```bash
# rep (default) -- one representative URL per group
urlsieve -i urls.txt --format rep
# counted -- URL with duplicate count as inline comment
urlsieve -i urls.txt --format counted
# json -- single JSON object with full structure
urlsieve -i urls.txt --format json
# jsonl -- one JSON object per line (stream-friendly)
urlsieve -i urls.txt --format jsonl
```
## Streaming Mode (Zero-Latency Pipeline)
When using `--format rep` or `--format jsonl` without `--sort`, Sieve operates in streaming mode: URLs are emitted to stdout immediately as they are read, with zero buffering of the full result set. Only a `u64` hash per unique fingerprint is kept in memory.
This enables real-time pipelines where downstream tools start processing the first URL instantly:
```bash
# nuclei begins scanning immediately -- no wait for full dedup
cat 50M_urls.txt | urlsieve | nuclei -l -
# Same with httpx
cat 50M_urls.txt | urlsieve | httpx
```
Streaming mode is automatically disabled when `--sort`, `--format counted`, `--format json`, or `--invalid-output` is used, as these require the full dataset before output.
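Conceptually, streaming dedup is just a set of fingerprint hashes. The following Rust sketch is a simplified stand-in, not Sieve's actual code: the `fingerprint` function here only collapses numeric segments, and the standard library hasher stands in for Rapidhash.

```rust
use std::collections::{hash_map::DefaultHasher, HashSet};
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, Write};

// Stand-in for Sieve's normalizer: collapse purely numeric path segments to "{id}".
// (Illustrative only; the real fingerprint handles UUIDs, hashes, query params, ...)
fn fingerprint(url: &str) -> String {
    url.split('/')
        .map(|seg| {
            if !seg.is_empty() && seg.chars().all(|c| c.is_ascii_digit()) {
                "{id}"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut out = io::stdout().lock();
    // Only one u64 per unique fingerprint is retained; URLs are never buffered.
    let mut seen: HashSet<u64> = HashSet::new();
    for line in stdin.lock().lines() {
        let url = line?;
        let mut hasher = DefaultHasher::new(); // Sieve uses Rapidhash here
        fingerprint(&url).hash(&mut hasher);
        if seen.insert(hasher.finish()) {
            writeln!(out, "{url}")?; // first URL of each group is emitted immediately
        }
    }
    Ok(())
}
```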
## Diff Mode
Compare a new URL list against a previously seen baseline and output only the URLs that do not appear in the baseline:
```bash
# Fingerprint-based diff (new structural paths only)
urlsieve --diff baseline.txt -i new_urls.txt
# Strict diff (exact URL match)
urlsieve --diff baseline.txt --diff-strict -i new_urls.txt
# Strip query params before comparison
urlsieve --diff baseline.txt --strip-query -i new_urls.txt
```
## Learn Mode
Analyze URL cardinality to determine which path segments and query parameters are dynamic, then automatically generate a config:
```bash
# Print cardinality report (goes to stderr)
urlsieve --learn -i urls.txt
# Save learned config to TOML
urlsieve --learn --save-config learned.toml -i urls.txt
# Analyze and immediately apply the learned config
urlsieve --learn --apply -i urls.txt --stats
```
> **Note:** `--learn` infers path segment patterns from a sample of up to 10 unique values. Positions with extremely high cardinality (>500 unique values) are skipped to avoid generating overly broad patterns — those segments are handled by the entropy detector instead. Only UUID and numeric patterns are auto-generated; other types fall back to entropy-based detection.
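The cardinality idea can be sketched in a few lines of Rust (simplified: it groups segments by absolute path position, whereas the real analysis is more nuanced and also covers query parameters):

```rust
use std::collections::{HashMap, HashSet};

/// Count distinct values seen at each path position; positions with many
/// distinct values are likely dynamic (IDs, tokens), low-cardinality ones static.
fn cardinality_report<'a>(paths: impl Iterator<Item = &'a str>) -> Vec<(usize, usize)> {
    let mut per_position: HashMap<usize, HashSet<&'a str>> = HashMap::new();
    for path in paths {
        for (i, seg) in path.trim_matches('/').split('/').enumerate() {
            per_position.entry(i).or_default().insert(seg);
        }
    }
    let mut report: Vec<(usize, usize)> = per_position
        .into_iter()
        .map(|(pos, vals)| (pos, vals.len()))
        .collect();
    report.sort();
    report
}

fn main() {
    let paths = ["/users/17/profile", "/users/42/profile", "/users/99/settings"];
    for (pos, distinct) in cardinality_report(paths.into_iter()) {
        // Position 1 varies across inputs, so it would be treated as dynamic.
        println!("position {pos}: {distinct} distinct values");
    }
}
```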
## Performance
Sieve is engineered to handle pipelines with tens of millions of URLs:
- **Throughput:** ~425k URLs/sec on a modern processor (full pipeline: parse + fingerprint + dedup).
- **Memory:** ~8 bytes per unique fingerprint in streaming mode (only a `u64` hash is stored).
- **Batch mode:** Holds full result set in memory; suitable for datasets up to several million URLs.
- **Reusable fingerprint buffer:** A single pre-allocated `String` is reused across all URLs, avoiding heap allocation per URL in the hot path.
- **Single-pass regex matching:** All detection patterns (UUID, hash, ULID, epoch, etc.) are compiled into one `RegexSet` and evaluated simultaneously, not sequentially (sketched after this list).
- **Fast hashing:** Uses `Rapidhash` for fingerprint deduplication in streaming mode -- one of the fastest non-cryptographic hash algorithms, passing all SMHasher quality tests.
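As a rough illustration of the `RegexSet` approach (using the `regex` crate; the patterns below are simplified stand-ins, not Sieve's actual rules), every pattern is evaluated against a segment in a single pass:

```rust
use regex::RegexSet;

fn main() {
    // Illustrative subset of the detection patterns; the real rules correspond
    // to the config names ("uuid", "hash", "numid", "epoch", ...).
    let set = RegexSet::new([
        r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$", // uuid
        r"^[0-9a-fA-F]{16,64}$", // hash
        r"^\d{4}-\d{2}-\d{2}$",  // ISO date
        r"^\d+$",                // numeric id / epoch
    ])
    .expect("patterns are valid");

    // One call checks all patterns simultaneously instead of looping over them.
    for seg in ["df8b8a77-6f3e-4733-978c-f0b8fa28b0a4", "2024-01-15", "12345678", "profile"] {
        let matched: Vec<usize> = set.matches(seg).into_iter().collect();
        println!("{seg}: matching pattern indices {matched:?}");
    }
}
```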
## Configuration
Sieve uses a TOML config file with sensible defaults. Override with `-c`:
```bash
urlsieve -i urls.txt -c myconfig.toml
```
Default config:
```toml
[general]
min_segment_len = 8
entropy_threshold = 3.5
patterns = ["uuid", "hash", "numid", "timestamp", "epoch", "base64", "mongo", "ulid", "short_token", "entropy"]
[normalize_params]
always_normalize = ["token", "session", "session_id", "user_id", "auth", "api_key", "jwt", "access_token", "refresh_token", "csrf", "signature", "ts", "timestamp", "_", "cb", "cachebust", "rand", "seed"]
never_normalize = ["page", "limit", "offset", "sort", "order", "format", "callback"]
[structural]
literal_segments = ["api", "graphql", "health", "status", "favicon.ico", "robots.txt"]
pattern_segments = ["v\\d+"]
```
## CLI Reference
| Flag | Description |
|---|---|
| `-i, --input` | Input file (reads stdin if omitted) |
| `-o, --output` | Output file (writes stdout if omitted) |
| `-f, --format` | Output format: `rep`, `counted`, `json`, `jsonl` (default: `rep`) |
| `-c, --config` | Config file (TOML) |
| `--stats` | Show deduplication statistics |
| `--patterns` | Patterns to enable (comma-separated, or `all`) |
| `--min-segment-len` | Minimum segment length for entropy check |
| `--entropy-threshold` | Shannon entropy threshold for dynamic detection |
| `--normalize-param-keys` | Param keys whose values are always normalized (comma-separated) |
| `--keep-param-keys` | Param keys whose values are never normalized (comma-separated) |
| `--strip-query` | Remove query params from fingerprint entirely |
| `--assume-scheme` | Prepend scheme to scheme-less URLs (default: `https`) |
| `--invalid-output` | Write invalid/malformed URLs to file |
| `--learn` | Analyze cardinality and print report |
| `--apply` | Apply learned config during analysis (requires `--learn`) |
| `--save-config` | Save learned config to TOML file (requires `--learn`) |
| `--diff` | Compare against baseline file (fingerprint match) |
| `--diff-strict` | Use exact URL matching in diff mode |
| `--sort` | Sort output by fingerprint (deterministic, disables streaming) |
## Common Workflows
```bash
# Basic dedup with stats
urlsieve -i urls.txt --stats
# Dedup, then scan with nuclei
urlsieve -i urls.txt | nuclei -l -
# Diff against yesterday's scan
urlsieve --diff yesterday.txt -i today.txt --stats
# Learn from a large dataset, then apply
urlsieve --learn --save-config recon.toml -i all_urls.txt
urlsieve -i new_urls.txt -c recon.toml --stats
# Analyze and immediately apply learned config in one step
urlsieve --learn --apply -i urls.txt --stats
# Stream JSONL output for downstream processing
urlsieve -i urls.txt --format jsonl
# Dedup scheme-less URLs
urlsieve -i hosts.txt --assume-scheme https
# Dedup ignoring query params (useful when only params differ)
urlsieve -i urls.txt --strip-query --stats
# Save invalid URLs for inspection
urlsieve -i urls.txt --invalid-output invalid.txt --stats
```
## What Counts as an Invalid URL?
Sieve only keeps URLs that parse as valid HTTP/HTTPS. Edge cases are handled as follows:
- `ftp://`, `file://`, `javascript:` schemes → rejected
- Invalid percent-encoding (e.g., `%GG`) → decoded lossily, still processed
- Malformed URLs with no host → rejected
- Protocol-relative URLs (`//cdn.example.com/file.js`) → accepted, `https` assumed
Invalid URLs are silently dropped in streaming mode. Use `--invalid-output` in batch mode to collect them for inspection.
## License
MIT. See [LICENSE](LICENSE).