# Sieve
[crates.io](https://crates.io/crates/urlsieve)
[MIT License](LICENSE)
> Intelligent URL deduplication for offensive security workflows.
> Reduce 10,000 UUID-varied endpoints to 80 unique attack surfaces.
## The Problem
Modern crawlers (katana, hakrawler) and fuzzers output massive lists of URLs. Feeding 10,000 variations of the same endpoint (e.g., `/api/v1/users/{uuid}/profile`) into scanners like `nuclei` or `sqlmap` wastes compute time, burns bandwidth, and triggers rate limiting that gets your IP banned by WAFs.
## Why Sieve?
| Capability | Plain `sort -u` | Typical URL dedupers | Sieve |
|---|---|---|---|
| Semantic path normalization | ❌ | ✅ | ✅ |
| Streaming (zero-latency) | ❌ | ❌ | ✅ |
| Auto-learn patterns from data | ❌ | ❌ | ✅ |
| Configurable normalization rules | ❌ | ❌ | ✅ |
## Install
```bash
cargo install urlsieve
```
Or build from source:
```bash
cargo build --release
./target/release/urlsieve --help
```
## Quick Start
```bash
# Deduplicate a file (reads from stdin by default)
cat urls.txt | urlsieve
# Read from file, write to file
urlsieve -i urls.txt -o deduped.txt
# Show statistics
urlsieve -i urls.txt --stats
```
> **Note:** Streaming mode (default, no `--sort`) emits the first seen URL per group. For deterministic output (lexicographically smallest representative), use `--sort`.
## Real-World Results
| Dataset | Input URLs | Unique groups | Reduction |
|---|---|---|---|
| REST API (marketplace) | 1,706 | 81 | 95% |
| Mixed crawl (GAU output) | 498 | 319 | 36% |
REST APIs with UUID/numeric IDs compress aggressively. Static asset-heavy sites (WordPress, Next.js) compress less -- Sieve correctly preserves semantically distinct paths.
## How It Works
Each URL is parsed and a **fingerprint** is generated by normalizing dynamic path segments and query parameters:
| Example value | Detected as | Normalized to |
|---|---|---|
| `df8b8a77-6f3e-4733-978c-f0b8fa28b0a4` | UUID | `{uuid}` |
| `a1b2c3d4e5f6a7b8c9d0` | Hash | `{hash}` |
| `507f1f77bcf86cd799439011` | MongoDB ObjectId | `{mongo}` |
| `01ARZ3NDEKTSV4RRFFQ69G5FAV` | ULID | `{ulid}` |
| `12345678` | Numeric ID | `{id}` |
| `1706832000` | Epoch | `{epoch}` |
| `2024-01-15` | ISO date | `{date}` |
| `dGhpcyBpcyBhIHRva2Vu` | Base64 | `{token}` |
| `23c6DSKX` | Short token | `{slug}` |
Query parameter values are analyzed via Shannon entropy to distinguish dynamic tokens from static values. Common auth/cache-bust keys (`token`, `session`, `ts`, `_`, `cb`) are always normalized.
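The entropy check can be pictured with a small Rust sketch (illustrative only, not the crate's internal code). It computes Shannon entropy over a value's character distribution and treats the value as dynamic when it is at least `min_segment_len` characters long and its entropy reaches `entropy_threshold` (8 and 3.5 in the default config shown later):

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per character, over the value's character distribution.
fn shannon_entropy(value: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in value.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let len = value.chars().count() as f64;
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// A value is flagged as dynamic when it is long enough and sufficiently random.
fn looks_dynamic(value: &str, min_segment_len: usize, entropy_threshold: f64) -> bool {
    value.chars().count() >= min_segment_len && shannon_entropy(value) >= entropy_threshold
}

fn main() {
    // Defaults from the config: min_segment_len = 8, entropy_threshold = 3.5
    // Short static values like "json" stay; the 32-char hex token is flagged.
    for value in ["json", "page", "df8b8a776f3e4733978cf0b8fa28b0a4"] {
        println!("{value}: dynamic = {}", looks_dynamic(value, 8, 3.5));
    }
}
```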
## Output Formats
```bash
# rep (default) -- one representative URL per group
urlsieve -i urls.txt --format rep
# counted -- URL with duplicate count as inline comment
urlsieve -i urls.txt --format counted
# json -- single JSON object with full structure
urlsieve -i urls.txt --format json
# jsonl -- one JSON object per line (stream-friendly)
urlsieve -i urls.txt --format jsonl
```
## Streaming Mode (Zero-Latency Pipeline)
When using `--format rep` or `--format jsonl` without `--sort`, Sieve operates in streaming mode: URLs are emitted to stdout immediately as they are read, with zero buffering of the full result set. Only a `u64` hash per unique fingerprint is kept in memory.
This enables real-time pipelines where downstream tools start processing the first URL instantly:
```bash
# nuclei begins scanning immediately -- no wait for full dedup
cat 50M_urls.txt | urlsieve | nuclei -l -
# Same with httpx
cat 50M_urls.txt | urlsieve | httpx
```
Streaming mode is automatically disabled when `--sort`, `--format counted`, `--format json`, or `--invalid-output` is used, as these require the full dataset before output.
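Conceptually, streaming dedup is just a set of fingerprint hashes. The following Rust sketch is a simplified stand-in, not Sieve's actual code: the `fingerprint` function here only collapses numeric segments, and the standard library hasher stands in for Rapidhash.

```rust
use std::collections::{hash_map::DefaultHasher, HashSet};
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, Write};

// Stand-in for Sieve's normalizer: collapse purely numeric path segments to "{id}".
// (Illustrative only; the real fingerprint handles UUIDs, hashes, query params, ...)
fn fingerprint(url: &str) -> String {
    url.split('/')
        .map(|seg| {
            if !seg.is_empty() && seg.chars().all(|c| c.is_ascii_digit()) {
                "{id}"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut out = io::stdout().lock();
    // Only one u64 per unique fingerprint is retained; URLs are never buffered.
    let mut seen: HashSet<u64> = HashSet::new();
    for line in stdin.lock().lines() {
        let url = line?;
        let mut hasher = DefaultHasher::new(); // Sieve uses Rapidhash here
        fingerprint(&url).hash(&mut hasher);
        if seen.insert(hasher.finish()) {
            writeln!(out, "{url}")?; // first URL of each group is emitted immediately
        }
    }
    Ok(())
}
```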
## Diff Mode
Compare a new URL list against a previously seen baseline and output only the URLs that do not appear in the baseline:
```bash
# Fingerprint-based diff (new structural paths only)
urlsieve --diff baseline.txt -i new_urls.txt
# Strict diff (exact URL match)
urlsieve --diff baseline.txt --diff-strict -i new_urls.txt
# Strip query params before comparison
urlsieve --diff baseline.txt --strip-query -i new_urls.txt
```
## Learn Mode
Analyze URL cardinality to determine which path segments and query parameters are dynamic, then automatically generate a config:
```bash
# Print cardinality report (goes to stderr)
urlsieve --learn -i urls.txt
# Save learned config to TOML
urlsieve --learn --save-config learned.toml -i urls.txt
# Analyze and immediately apply the learned config
urlsieve --learn --apply -i urls.txt --stats
```
> **Note:** `--learn` infers path segment patterns from a sample of up to 10 unique values. Positions with extremely high cardinality (>500 unique values) are skipped to avoid generating overly broad patterns — those segments are handled by the entropy detector instead. Only UUID and numeric patterns are auto-generated; other types fall back to entropy-based detection.
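The cardinality idea can be sketched in a few lines of Rust (simplified: it groups segments by absolute path position, whereas the real analysis is more nuanced and also covers query parameters):

```rust
use std::collections::{HashMap, HashSet};

/// Count distinct values seen at each path position; positions with many
/// distinct values are likely dynamic (IDs, tokens), low-cardinality ones static.
fn cardinality_report<'a>(paths: impl Iterator<Item = &'a str>) -> Vec<(usize, usize)> {
    let mut per_position: HashMap<usize, HashSet<&'a str>> = HashMap::new();
    for path in paths {
        for (i, seg) in path.trim_matches('/').split('/').enumerate() {
            per_position.entry(i).or_default().insert(seg);
        }
    }
    let mut report: Vec<(usize, usize)> = per_position
        .into_iter()
        .map(|(pos, vals)| (pos, vals.len()))
        .collect();
    report.sort();
    report
}

fn main() {
    let paths = ["/users/17/profile", "/users/42/profile", "/users/99/settings"];
    for (pos, distinct) in cardinality_report(paths.into_iter()) {
        // Position 1 varies across inputs, so it would be treated as dynamic.
        println!("position {pos}: {distinct} distinct values");
    }
}
```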
## Performance
Sieve is engineered to handle pipelines with tens of millions of URLs:
- **Throughput:** ~425k URLs/sec on a modern processor (full pipeline: parse + fingerprint + dedup).
- **Memory:** ~8 bytes per unique fingerprint in streaming mode (only a `u64` hash is stored).
- **Batch mode:** Holds full result set in memory; suitable for datasets up to several million URLs.
- **Reusable fingerprint buffer:** A single pre-allocated `String` is reused across all URLs, avoiding heap allocation per URL in the hot path.
- **Single-pass regex matching:** All detection patterns (UUID, hash, ULID, epoch, etc.) are compiled into one `RegexSet` and evaluated simultaneously, not sequentially (sketched after this list).
- **Fast hashing:** Uses `Rapidhash` for fingerprint deduplication in streaming mode -- one of the fastest non-cryptographic hash algorithms, passing all SMHasher quality tests.
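As a rough illustration of the `RegexSet` approach (using the `regex` crate; the patterns below are simplified stand-ins, not Sieve's actual rules), every pattern is evaluated against a segment in a single pass:

```rust
use regex::RegexSet;

fn main() {
    // Illustrative subset of the detection patterns; the real rules correspond
    // to the config names ("uuid", "hash", "numid", "epoch", ...).
    let set = RegexSet::new([
        r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$", // uuid
        r"^[0-9a-fA-F]{16,64}$", // hash
        r"^\d{4}-\d{2}-\d{2}$",  // ISO date
        r"^\d+$",                // numeric id / epoch
    ])
    .expect("patterns are valid");

    // One call checks all patterns simultaneously instead of looping over them.
    for seg in ["df8b8a77-6f3e-4733-978c-f0b8fa28b0a4", "2024-01-15", "12345678", "profile"] {
        let matched: Vec<usize> = set.matches(seg).into_iter().collect();
        println!("{seg}: matching pattern indices {matched:?}");
    }
}
```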
## Configuration
Sieve uses a TOML config file with sensible defaults. Override with `-c`:
```bash
urlsieve -i urls.txt -c myconfig.toml
```
Default config:
```toml
[general]
min_segment_len = 8
entropy_threshold = 3.5
patterns = ["uuid", "hash", "numid", "timestamp", "epoch", "base64", "mongo", "ulid", "short_token", "entropy"]
[normalize_params]
always_normalize = ["token", "session", "session_id", "user_id", "auth", "api_key", "jwt", "access_token", "refresh_token", "csrf", "signature", "ts", "timestamp", "_", "cb", "cachebust", "rand", "seed"]
never_normalize = ["page", "limit", "offset", "sort", "order", "format", "callback"]
[structural]
literal_segments = ["api", "graphql", "health", "status", "favicon.ico", "robots.txt"]
pattern_segments = ["v\\d+"]
```
## CLI Reference
| Flag | Description |
|---|---|
| `-i, --input` | Input file (reads stdin if omitted) |
| `-o, --output` | Output file (writes stdout if omitted) |
| `-f, --format` | Output format: `rep`, `counted`, `json`, `jsonl` (default: `rep`) |
| `-c, --config` | Config file (TOML) |
| `--stats` | Show deduplication statistics |
| `--patterns` | Patterns to enable (comma-separated, or `all`) |
| `--min-segment-len` | Minimum segment length for entropy check |
| `--entropy-threshold` | Shannon entropy threshold for dynamic detection |
| `--normalize-param-keys` | Param keys whose values are always normalized (comma-separated) |
| `--keep-param-keys` | Param keys whose values are never normalized (comma-separated) |
| `--strip-query` | Remove query params from fingerprint entirely |
| `--assume-scheme` | Prepend scheme to scheme-less URLs (default: `https`) |
| `--invalid-output` | Write invalid/malformed URLs to file |
| `--learn` | Analyze cardinality and print report |
| `--apply` | Apply learned config during analysis (requires `--learn`) |
| `--save-config` | Save learned config to TOML file (requires `--learn`) |
| `--diff` | Compare against baseline file (fingerprint match) |
| `--diff-strict` | Use exact URL matching in diff mode |
| `--sort` | Sort output by fingerprint (deterministic, disables streaming) |
## Common Workflows
```bash
# Basic dedup with stats
urlsieve -i urls.txt --stats
# Dedup, then scan with nuclei
urlsieve -i urls.txt | nuclei -l -
# Diff against yesterday's scan
urlsieve --diff yesterday.txt -i today.txt --stats
# Learn from a large dataset, then apply
urlsieve --learn --save-config recon.toml -i all_urls.txt
urlsieve -i new_urls.txt -c recon.toml --stats
# Analyze and immediately apply learned config in one step
urlsieve --learn --apply -i urls.txt --stats
# Stream JSONL output for downstream processing
urlsieve -i urls.txt --format jsonl
# Dedup scheme-less URLs
urlsieve -i hosts.txt --assume-scheme https
# Dedup ignoring query params (useful when only params differ)
urlsieve -i urls.txt --strip-query --stats
# Save invalid URLs for inspection
urlsieve -i urls.txt --invalid-output invalid.txt --stats
```
## What Counts as an Invalid URL?
Sieve only keeps URLs that parse as valid HTTP/HTTPS. Edge cases are handled as follows:
- `ftp://`, `file://`, `javascript:` schemes → rejected
- Invalid percent-encoding (e.g., `%GG`) → decoded lossily, still processed
- Malformed URLs with no host → rejected
- Protocol-relative URLs (`//cdn.example.com/file.js`) → accepted, `https` assumed
Invalid URLs are silently dropped in streaming mode. Use `--invalid-output` in batch mode to collect them for inspection.
## License
MIT. See [LICENSE](LICENSE).