iriq 0.30.1

IRI/URL extraction, normalization, and shape clustering.

iriq — IRI/URL extraction, normalization, shape clustering

iriq finds the shape of a URL — the route template hiding behind it. Erase the parts that vary, keep the parts that don't: /users/123 and /users/999 are both /users/{user_id}. Point it at a pile of messy URLs and it collapses them into a small set of stable, deterministic templates.

(For practical purposes an IRI is a URL: it is the internationalized superset of URI/URL that also allows non-ASCII characters. If you know URLs, you know IRIs. The name stands for IRI Query: iriq queries an IRI for its structure.)
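To make the "erase what varies" idea concrete, here is a minimal std-only sketch. It is not iriq's implementation: `template_path` is a hypothetical helper, it only recognizes all-digit segments, and its rule for naming the placeholder (strip one trailing `s` from the parent segment) is an assumption made for illustration.

```rust
/// Hypothetical helper (not part of iriq): replace every all-digit path
/// segment with a placeholder named after the nearest preceding literal
/// segment, so `users` followed by `123` becomes `users/{user_id}`.
fn template_path(path: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut parent = ""; // last non-numeric segment seen so far
    for seg in path.split('/') {
        let is_numeric = !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit());
        if is_numeric {
            // Naive singularization: drop one trailing `s`, if present.
            let name = parent.strip_suffix('s').unwrap_or(parent);
            if name.is_empty() {
                out.push("{id}".to_string());
            } else {
                out.push(format!("{{{}_id}}", name));
            }
        } else {
            out.push(seg.to_string());
            parent = seg;
        }
    }
    out.join("/")
}

fn main() {
    // Two different URLs collapse into the same template.
    assert_eq!(template_path("/users/123"), "/users/{user_id}");
    assert_eq!(template_path("/users/999"), "/users/{user_id}");
}
```

The sketch is deterministic in the same sense as the description above: the output depends only on the input path, so repeated runs over the same pile of URLs yield the same templates.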

[dependencies]
iriq = "0.30"

Corpora persist to SQLite (bundled rusqlite, WAL, concurrent observers) — no system dependency, nothing to set up.

What it does

use iriq::{parse, normalize, Extractor, trace};

// Parse and normalize a single URL.
let iri = parse("https://Foo.com:443/users/123")?;
assert_eq!(iri.host, "foo.com");
assert_eq!(iri.port, 0);            // default port dropped
assert_eq!(normalize("https://foo.com/users/123")?,
           "https://foo.com/users/{user_id}");

// Pull URLs out of free text.
let urls = Extractor::new().extract_strings(
    "Visit https://foo.com today, also hit foo.com/users."
);
assert_eq!(urls.len(), 2);

// Annotated trace (what the CLI shows under `-e`).
let tr = trace("https://shop.com/pricing/usd?currency=eur")?;
assert_eq!(tr.normalized, "https://shop.com/pricing/USD?currency=EUR");
# Ok::<(), Box<dyn std::error::Error>>(())
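The first two assertions above rely on host lowercasing and default-port dropping. As a rough illustration of those two steps only, here is a std-only sketch. `canonical_origin` is a hypothetical name, not an iriq function; it returns a string rather than a parsed struct, and it assumes a plain `scheme://host[:port][/path]` input with no userinfo, no IPv6 literal, and no query directly after the host.

```rust
/// Hypothetical helper (not iriq's API): lowercase the scheme and host, and
/// drop the port when it equals the scheme's default (80 for http, 443 for https).
fn canonical_origin(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let scheme = scheme.to_ascii_lowercase();
    // Split the authority from the path; the path keeps its leading `/`.
    let (authority, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };
    let default_port = match scheme.as_str() {
        "http" => Some("80"),
        "https" => Some("443"),
        _ => None,
    };
    // Assumption: the last `:` in the authority, if any, introduces the port.
    let (host, port) = match authority.rsplit_once(':') {
        Some((h, p)) => (h, Some(p)),
        None => (authority, None),
    };
    let host = host.to_ascii_lowercase();
    let authority = match port {
        Some(p) if Some(p) != default_port => format!("{host}:{p}"),
        _ => host,
    };
    Some(format!("{scheme}://{authority}{path}"))
}

fn main() {
    assert_eq!(
        canonical_origin("https://Foo.com:443/users/123").as_deref(),
        Some("https://foo.com/users/123")
    );
    // A non-default port is kept.
    assert_eq!(
        canonical_origin("http://foo.com:8080").as_deref(),
        Some("http://foo.com:8080")
    );
}
```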

Streaming clustering with a persistent corpus:

use iriq::Corpus;

// Persisted to SQLite (.db / .sqlite / .sqlite3).
let mut corpus = Corpus::open("c.db")?;
for url in &["https://foo.com/users/1",
             "https://foo.com/users/2",
             "https://foo.com/users/3"] {
    corpus.observe(url)?;
}
corpus.save("c.db")?;
# Ok::<(), Box<dyn std::error::Error>>(())
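For intuition about what observing a stream of URLs accumulates, the following std-only sketch counts observations per shape in memory. It is a stand-in under stated assumptions, not how `Corpus` works internally: `ShapeCounts` and `shape_of` are hypothetical names, the shape function only replaces all-digit segments with `{n}`, and nothing is persisted.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for a corpus: one counter per shape.
#[derive(Default)]
struct ShapeCounts {
    counts: BTreeMap<String, usize>,
}

impl ShapeCounts {
    /// Record one URL under its shape.
    fn observe(&mut self, url: &str) {
        *self.counts.entry(shape_of(url)).or_insert(0) += 1;
    }
}

/// Naive shape function (an assumption for this sketch): every all-digit
/// `/`-separated segment becomes `{n}`; everything else is kept verbatim.
fn shape_of(url: &str) -> String {
    url.split('/')
        .map(|seg| {
            if !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit()) {
                "{n}"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}

fn main() {
    let mut corpus = ShapeCounts::default();
    for url in [
        "https://foo.com/users/1",
        "https://foo.com/users/2",
        "https://foo.com/users/3",
    ] {
        corpus.observe(url);
    }
    // Three URLs, one cluster.
    assert_eq!(corpus.counts.len(), 1);
    assert_eq!(corpus.counts["https://foo.com/users/{n}"], 3);
}
```

A `BTreeMap` is used here so that iterating over the clusters yields a stable order, which keeps any report built on top of it reproducible.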

What's covered

  • Segment classification — ~25 typed shapes (UUID, ISO date, file, email, IPv4/6, color, coordinate, country, base64, JWT, MIME, phone, and more).
  • Shape normalization — route templates with canonical date and currency rendering, and RESTful hints ({user_id} from /users/123).
  • Trace — per-segment annotations for any URL (trace).
  • Corpus — streaming shape clustering, param-type inference, and learning: --stats, --reinfer, --propose-recognizers, --cross-host-shapes.
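
Segment classification, the first item above, can be pictured as a function from one path segment to a typed shape. The sketch below covers three of the listed shapes with purely syntactic checks. `SegmentKind` and `classify` are hypothetical names chosen for illustration; iriq's actual types, rules, and precedence order may differ.

```rust
/// Hypothetical shape tags for this sketch (iriq recognizes many more).
#[derive(Debug, PartialEq)]
enum SegmentKind {
    Integer,
    Uuid,
    IsoDate,
    Literal,
}

/// Classify one path segment by syntax alone.
fn classify(seg: &str) -> SegmentKind {
    let b = seg.as_bytes();
    if !b.is_empty() && b.iter().all(|c| c.is_ascii_digit()) {
        SegmentKind::Integer
    } else if is_uuid(b) {
        SegmentKind::Uuid
    } else if is_iso_date(b) {
        SegmentKind::IsoDate
    } else {
        SegmentKind::Literal
    }
}

/// 8-4-4-4-12 hex digits separated by hyphens.
fn is_uuid(b: &[u8]) -> bool {
    b.len() == 36
        && b.iter().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => *c == b'-',
            _ => c.is_ascii_hexdigit(),
        })
}

/// `YYYY-MM-DD` by digit layout only; month and day ranges are not checked.
fn is_iso_date(b: &[u8]) -> bool {
    b.len() == 10
        && b.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            _ => c.is_ascii_digit(),
        })
}

fn main() {
    assert_eq!(classify("123"), SegmentKind::Integer);
    assert_eq!(
        classify("550e8400-e29b-41d4-a716-446655440000"),
        SegmentKind::Uuid
    );
    assert_eq!(classify("2024-01-31"), SegmentKind::IsoDate);
    assert_eq!(classify("users"), SegmentKind::Literal);
}
```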

License

MIT. See the crate docs for the full API and the project README for the conceptual overview.