iriq 0.30.0

# iriq — IRI/URL extraction, normalization, shape clustering

**iriq finds the *shape* of a URL** — the route template hiding behind it.
Erase the parts that vary, keep the parts that don't: `/users/123` and
`/users/999` are both `/users/{user_id}`. Point it at a pile of messy URLs and
it collapses them into a small set of stable, deterministic templates.

(An **IRI** is just a URL — the internationalized superset of URI/URL that also
allows non-ASCII characters. If you know URLs, you know IRIs. The name is *IRI
Query*: iriq queries an IRI for its structure.)
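
The "erase the parts that vary" idea can be sketched in plain Rust with no dependencies. This is only an illustration, not iriq's implementation: `template_path` is a hypothetical helper, it treats only all-digit segments as variable, and it names the placeholder by naively dropping a trailing `s` from the preceding segment.

```rust
/// Hypothetical helper (not part of iriq): turn a concrete path into a route
/// template by replacing all-digit segments with a named placeholder.
fn template_path(path: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    // Most recent non-empty literal segment, used to name the next placeholder.
    let mut prev: Option<&str> = None;

    for seg in path.split('/') {
        let is_id = !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit());
        if is_id {
            // Assumption: singularize by dropping one trailing "s" ("users" -> "user").
            let name = match prev {
                Some(p) => format!("{}_id", p.strip_suffix('s').unwrap_or(p)),
                None => "id".to_string(),
            };
            out.push(format!("{{{}}}", name));
        } else {
            out.push(seg.to_string());
            if !seg.is_empty() {
                prev = Some(seg);
            }
        }
    }
    out.join("/")
}

fn main() {
    // Two different concrete paths collapse to the same template.
    assert_eq!(template_path("/users/123"), "/users/{user_id}");
    assert_eq!(template_path("/users/999"), "/users/{user_id}");
    println!("{}", template_path("/users/123/posts/7"));
}
```

A real normalizer has to recognize many more variable shapes than integers (UUIDs, dates, and so on); the sketch only shows why two different URLs can end up with an identical template.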

```toml
[dependencies]
iriq = "0.30"
```

Corpora persist to SQLite (bundled `rusqlite`, WAL, concurrent observers) — no
system dependency, nothing to set up.

## What it does

```rust
use iriq::{parse, normalize, Extractor, trace};

// Parse and normalize a single URL.
let iri = parse("https://Foo.com:443/users/123")?;
assert_eq!(iri.host, "foo.com");
assert_eq!(iri.port, 0);            // default port dropped
assert_eq!(normalize("https://foo.com/users/123")?,
           "https://foo.com/users/{user_id}");

// Pull URLs out of free text.
let urls = Extractor::new().extract_strings(
    "Visit https://foo.com today, also hit foo.com/users."
);
assert_eq!(urls.len(), 2);

// Annotated trace (what the CLI shows under `-e`).
let tr = trace("https://shop.com/pricing/usd?currency=eur")?;
assert_eq!(tr.normalized, "https://shop.com/pricing/USD?currency=EUR");
# Ok::<(), Box<dyn std::error::Error>>(())
```
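
To give a feel for what pulling URLs out of free text involves, here is a rough, dependency-free sketch. It is not how `Extractor` works internally; `looks_like_url` and `extract_urls` are hypothetical helpers, and the heuristic (whitespace tokens, trimmed sentence punctuation, a host ending in an alphabetic TLD) is an assumption chosen for brevity.

```rust
/// Hypothetical helper: does this token look like a URL?
fn looks_like_url(token: &str) -> bool {
    // Strip an http(s) scheme if present; otherwise treat the token as a bare host.
    let rest = token
        .strip_prefix("https://")
        .or_else(|| token.strip_prefix("http://"))
        .unwrap_or(token);
    // The host is everything up to the first '/', '?' or '#'.
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Require "name.tld" where the TLD is at least two ASCII letters.
    match host.rsplit_once('.') {
        Some((name, tld)) => {
            !name.is_empty() && tld.len() >= 2 && tld.bytes().all(|b| b.is_ascii_alphabetic())
        }
        None => false,
    }
}

/// Hypothetical helper: split on whitespace, trim surrounding sentence
/// punctuation, and keep the tokens that look like URLs.
fn extract_urls(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|t| t.trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | ':' | '!' | '?' | ')')))
        .map(|t| t.trim_start_matches('('))
        .filter(|t| looks_like_url(t))
        .map(str::to_string)
        .collect()
}

fn main() {
    let urls = extract_urls("Visit https://foo.com today, also hit foo.com/users.");
    // The trailing sentence period is not part of the second URL.
    assert_eq!(urls, vec!["https://foo.com", "foo.com/users"]);
    println!("{:?}", urls);
}
```

The heuristic is deliberately crude: it accepts filenames such as `notes.txt` and rejects hosts with an explicit port or an IP address, which is the kind of ambiguity a dedicated extractor exists to handle.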

Streaming clustering with a persistent corpus:

```rust,no_run
use iriq::Corpus;

// Persisted to SQLite (.db / .sqlite / .sqlite3).
let mut corpus = Corpus::open("c.db")?;
for url in &["https://foo.com/users/1",
             "https://foo.com/users/2",
             "https://foo.com/users/3"] {
    corpus.observe(url)?;
}
corpus.save("c.db")?;
# Ok::<(), Box<dyn std::error::Error>>(())
```
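
Conceptually, streaming clustering means folding each URL into a running tally keyed by its shape, so the corpus grows with the number of distinct shapes rather than the number of URLs. The sketch below is an in-memory illustration under that assumption; `ShapeCounts` and `shape_of` are hypothetical names, and nothing here reflects iriq's actual storage or inference.

```rust
use std::collections::BTreeMap;

/// Hypothetical shape function: replace all-digit segments with `{id}`.
fn shape_of(url: &str) -> String {
    url.split('/')
        .map(|seg| {
            if !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit()) {
                "{id}"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}

/// Hypothetical in-memory stand-in for a corpus: shape -> observation count.
/// A BTreeMap keeps iteration order deterministic.
#[derive(Default)]
struct ShapeCounts {
    counts: BTreeMap<String, usize>,
}

impl ShapeCounts {
    /// Fold one URL into the running counts; the URL itself is not retained.
    fn observe(&mut self, url: &str) {
        *self.counts.entry(shape_of(url)).or_insert(0) += 1;
    }
}

fn main() {
    let mut corpus = ShapeCounts::default();
    for url in &[
        "https://foo.com/users/1",
        "https://foo.com/users/2",
        "https://foo.com/users/3",
        "https://foo.com/about",
    ] {
        corpus.observe(url);
    }
    // Four URLs, two shapes.
    assert_eq!(corpus.counts.len(), 2);
    for (shape, n) in &corpus.counts {
        println!("{:>3}  {}", n, shape);
    }
}
```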

## What's covered

- **Segment classification** into ~25 typed shapes (UUID, ISO date, file, email,
  IPv4/6, color, coordinate, country, base64, JWT, MIME, phone, and more).
- **Shape normalization** — route templates with canonical date and currency
  rendering, and RESTful hints (`{user_id}` from `/users/123`).
- **Trace** — per-segment annotations for any URL (`trace`).
- **Corpus** — streaming shape clustering, param-type inference, and learning:
  `--stats`, `--reinfer`, `--propose-recognizers`, `--cross-host-shapes`.
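
As a rough picture of segment classification, the sketch below recognizes three of the shapes listed above (UUID, ISO date, integer) and falls back to a literal. The `Shape` enum, the function names, and the order of the checks are assumptions made for illustration; iriq's own types and rules may differ.

```rust
/// Hypothetical subset of segment shapes, for illustration only.
#[derive(Debug, PartialEq)]
enum Shape {
    Uuid,
    IsoDate,
    Integer,
    Literal,
}

/// 8-4-4-4-12 hex digits separated by hyphens.
fn is_uuid(s: &str) -> bool {
    s.len() == 36
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// YYYY-MM-DD with a plausible month and day (no per-month day check).
fn is_iso_date(s: &str) -> bool {
    let bytes = s.as_bytes();
    let shaped = bytes.len() == 10
        && bytes.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            _ => c.is_ascii_digit(),
        });
    if !shaped {
        return false;
    }
    // Slicing is safe here: every byte has been checked to be ASCII.
    let month: u32 = s[5..7].parse().unwrap_or(0);
    let day: u32 = s[8..10].parse().unwrap_or(0);
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Try the most specific shapes first, then fall back to a literal.
fn classify(seg: &str) -> Shape {
    if is_uuid(seg) {
        Shape::Uuid
    } else if is_iso_date(seg) {
        Shape::IsoDate
    } else if !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit()) {
        Shape::Integer
    } else {
        Shape::Literal
    }
}

fn main() {
    for seg in &["123e4567-e89b-12d3-a456-426614174000", "2024-02-29", "123", "users"] {
        println!("{:<40} {:?}", seg, classify(seg));
    }
}
```

Checking specific shapes before general ones matters once shapes overlap: an eight-digit segment, for instance, could be read as an integer or as a compact date, so the order of the checks becomes part of the classifier's definition.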

## License

[MIT](https://github.com/dpep/iriq). See the [crate docs](https://docs.rs/iriq)
for the full API and the [project README](https://github.com/dpep/iriq) for the
conceptual overview.