iriq — IRI/URL extraction, normalization, shape clustering
iriq finds the shape of a URL — the route template hiding behind it.
Erase the parts that vary, keep the parts that don't: /users/123 and
/users/999 are both /users/{user_id}. Point it at a pile of messy URLs and
it collapses them into a small set of stable, deterministic templates.
(An IRI is just a URL — the internationalized superset of URI/URL that also allows non-ASCII characters. If you know URLs, you know IRIs. The name is IRI Query: iriq queries an IRI for its structure.)
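To make the "erase what varies" idea concrete, here is a minimal sketch in plain Rust. It is an illustration only, not iriq's implementation or API: the `template` function is a hypothetical helper that assumes the only varying parts are all-digit segments and that a placeholder can be named by dropping a trailing `s` from the preceding segment.

```rust
/// Turn a concrete path into a route template by replacing all-digit
/// segments with a placeholder named after the preceding literal segment.
/// Hypothetical helper for illustration; not part of iriq.
fn template(path: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    // The most recent literal (non-numeric) segment, if any.
    let mut prev: Option<&str> = None;

    for seg in path.split('/') {
        let is_numeric = !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit());
        if is_numeric {
            // "users" -> "{user_id}"; with no usable parent, fall back to "{id}".
            let name = match prev {
                Some(p) if !p.is_empty() => {
                    format!("{{{}_id}}", p.strip_suffix('s').unwrap_or(p))
                }
                _ => "{id}".to_string(),
            };
            out.push(name);
            prev = None;
        } else {
            out.push(seg.to_string());
            prev = Some(seg);
        }
    }
    out.join("/")
}

fn main() {
    // Two different concrete URLs collapse to the same template.
    println!("{}", template("/users/123"));
    println!("{}", template("/users/999"));
    println!("{}", template("/users/123/posts/7"));
}
```

The real crate recognizes many more segment types than integers; the sketch only shows why two different URLs can share one stable template.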
[dependencies]
iriq = "0.29"
Corpora persist to SQLite (bundled rusqlite, WAL, concurrent observers) — no
system dependency, nothing to set up.
What it does
use ;
// Parse and normalize a single URL.
let iri = parse?;
assert_eq!;
assert_eq!; // default port dropped
assert_eq!;
// Pull URLs out of free text.
let urls = new.extract_strings;
assert_eq!;
// Annotated trace (what the CLI shows under `-e`).
let tr = trace?;
assert_eq!;
# Ok::
Streaming clustering with a persistent corpus:
use Corpus;
// Persisted to SQLite (.db / .sqlite / .sqlite3).
let mut corpus = open?;
for url in &
corpus.save?;
# Ok::
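The idea behind streaming shape clustering can be sketched without the crate: reduce each observed URL to a shape, then count observations per shape. The sketch below is an in-memory stand-in under stated assumptions. `ShapeCounts`, `observe`, and `shape` are hypothetical names, the shape function only handles all-digit segments, and nothing here reflects how iriq's `Corpus` is actually implemented or persisted.

```rust
use std::collections::BTreeMap;

/// Crude shape function for the sketch: all-digit segments become "{id}".
fn shape(path: &str) -> String {
    path.split('/')
        .map(|seg| {
            if !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit()) {
                "{id}"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}

/// Hypothetical in-memory stand-in for a corpus: shape -> observation count.
/// A BTreeMap keeps iteration order deterministic.
#[derive(Default)]
struct ShapeCounts {
    counts: BTreeMap<String, u64>,
}

impl ShapeCounts {
    /// Observe one path and return the shape it was filed under.
    fn observe(&mut self, path: &str) -> String {
        let s = shape(path);
        *self.counts.entry(s.clone()).or_insert(0) += 1;
        s
    }
}

fn main() {
    let mut corpus = ShapeCounts::default();
    for path in ["/users/123", "/users/999", "/orders/42/items/7", "/about"] {
        corpus.observe(path);
    }
    // Print each shape with the number of URLs that collapsed into it.
    for (shape, n) in &corpus.counts {
        println!("{n:>3}  {shape}");
    }
}
```

Because URLs are observed one at a time, the map can grow incrementally as new input arrives, which is the property a streaming corpus needs.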
What's covered
- Segment classification: ~25 typed shapes (UUID, ISO date, file, email, IPv4/6, color, coordinate, country, base64, JWT, MIME, phone, and more).
- Shape normalization: route templates with canonical date and currency rendering, and RESTful hints (`{user_id}` from `/users/123`).
- Trace: per-segment annotations for any URL (`trace`).
- Corpus: streaming shape clustering, param-type inference, and learning (`--stats`, `--reinfer`, `--propose-recognizers`, `--cross-host-shapes`).
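As a rough illustration of what segment classification involves, the sketch below recognizes four of the shapes named above (UUID, ISO date, IPv4, integer) using only the standard library. Everything in it is an assumption made for illustration: `SegmentKind`, `classify`, and the matching rules are hypothetical and far simpler than whatever the crate does.

```rust
use std::net::Ipv4Addr;

/// Hypothetical segment types for the sketch; not iriq's own enum.
#[derive(Debug, PartialEq)]
enum SegmentKind {
    Uuid,
    IsoDate,
    Ipv4,
    Integer,
    Literal,
}

/// 8-4-4-4-12 hex digits separated by hyphens.
fn is_uuid(s: &str) -> bool {
    let groups: Vec<&str> = s.split('-').collect();
    let lens = [8, 4, 4, 4, 12];
    groups.len() == 5
        && groups
            .iter()
            .zip(lens)
            .all(|(g, n)| g.len() == n && g.bytes().all(|b| b.is_ascii_hexdigit()))
}

/// YYYY-MM-DD with a range check on month and day (no calendar validation).
fn is_iso_date(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 10 || b[4] != b'-' || b[7] != b'-' {
        return false;
    }
    let digits = |r: std::ops::Range<usize>| b[r].iter().all(|c| c.is_ascii_digit());
    if !(digits(0..4) && digits(5..7) && digits(8..10)) {
        return false;
    }
    // Every byte is ASCII at this point, so slicing the str is safe.
    let month: u32 = s[5..7].parse().unwrap_or(0);
    let day: u32 = s[8..10].parse().unwrap_or(0);
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Try the most specific shapes first and fall back to a literal.
fn classify(seg: &str) -> SegmentKind {
    if is_uuid(seg) {
        SegmentKind::Uuid
    } else if is_iso_date(seg) {
        SegmentKind::IsoDate
    } else if seg.parse::<Ipv4Addr>().is_ok() {
        SegmentKind::Ipv4
    } else if !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit()) {
        SegmentKind::Integer
    } else {
        SegmentKind::Literal
    }
}

fn main() {
    for seg in ["550e8400-e29b-41d4-a716-446655440000", "2024-01-15", "10.0.0.1", "123", "users"] {
        println!("{seg} -> {:?}", classify(seg));
    }
}
```

Ordering matters in a classifier like this: specific shapes are tested before general ones so that a date or UUID is not swallowed by a looser rule.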
License
MIT. See the crate docs for the full API and the project README for the conceptual overview.