iriq 0.29.1

IRI extraction, normalization, and clustering — peer of the Ruby gem + Go module.

iriq — IRI extraction, normalization, clustering

A Rust port of the iriq Ruby gem and Go module. Same behavior across all three runtimes — enforced by golden JSON fixtures and a CLI parity harness in CI.

[dependencies]
iriq = "0.29"

For SQLite-backed corpora (the on-disk store with concurrent observers):

[dependencies]
iriq = { version = "0.29", features = ["sqlite"] }

What it does

use iriq::{parse, normalize, Extractor, Corpus, trace};

// Parse and normalize a single URL.
let iri = parse("https://Foo.com:443/users/123")?;
assert_eq!(iri.host, "foo.com");
assert_eq!(iri.port, 0);            // default port dropped
assert_eq!(normalize("https://foo.com/users/123")?,
           "https://foo.com/users/{user_id}");

// Pull IRIs out of free text.
let urls = Extractor::new().extract_strings(
    "Visit https://foo.com today, also hit foo.com/users."
);
assert_eq!(urls.len(), 2);

// Annotated trace (what the CLI shows under `-e`).
let tr = trace("https://shop.com/pricing/usd?currency=eur")?;
assert_eq!(tr.normalized, "https://shop.com/pricing/USD?currency=EUR");

// Streaming clustering with a persistent corpus.
// A .db/.sqlite/.sqlite3 path selects the SQLite backend (needs the `sqlite` feature).
let mut corpus = Corpus::open("c.db")?;
for url in &["https://foo.com/users/1",
             "https://foo.com/users/2",
             "https://foo.com/users/3"] {
    corpus.observe(url)?;
}
corpus.save("c.db")?;
# Ok::<(), Box<dyn std::error::Error>>(())
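
The host and port assertions in the example above rest on two common normalization rules: hosts compare case-insensitively, and a port equal to the scheme's default carries no information. The block below is a minimal, std-only sketch of those two rules. It is not iriq's parser: `normalize_authority` and `default_port` are hypothetical names, and the sketch ignores IPv6 literals, userinfo, and URLs without a path.

```rust
// Illustrative only: a toy authority normalizer, NOT iriq's implementation.

/// Default port for the two schemes this sketch knows about.
fn default_port(scheme: &str) -> Option<u16> {
    match scheme {
        "http" => Some(80),
        "https" => Some(443),
        _ => None,
    }
}

/// Lowercases scheme and host, and drops the port when it equals the
/// scheme default. Returns None when there is no "://" separator or the
/// port is not a valid u16.
fn normalize_authority(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let scheme = scheme.to_ascii_lowercase();

    // Everything from the first '/' onward is kept verbatim.
    let (authority, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };

    // Split an optional ":port" off the end of the authority.
    let (host, port) = match authority.rsplit_once(':') {
        Some((h, p)) => (h, Some(p.parse::<u16>().ok()?)),
        None => (authority, None),
    };
    let host = host.to_ascii_lowercase();

    let mut out = format!("{scheme}://{host}");
    if let Some(p) = port {
        // Keep the port only when it differs from the scheme default.
        if Some(p) != default_port(&scheme) {
            out.push_str(&format!(":{p}"));
        }
    }
    out.push_str(path);
    Some(out)
}

fn main() {
    println!("{:?}", normalize_authority("https://Foo.com:443/users/123"));
    println!("{:?}", normalize_authority("http://Example.org:8080/a"));
}
```

The sketch returns a string, whereas the crate's `parse` returns a structured value with `host` and `port` fields, as the assertions above show.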

See the crate docs for the full API and the main project README for the conceptual overview shared with the Ruby + Go siblings.

Features

Feature    | What it does
-----------+--------------------------------------------------------------
(default)  | Memory + JSON corpus backends. Pure Rust, no system deps.
sqlite     | Adds the SQLite corpus backend via bundled rusqlite. Concurrent writers, incremental UPSERTs.

Parity guarantees

This crate's output is byte-identical to that of the Ruby gem and Go module for:

  • All segment classification decisions (~25 typed shapes — UUID, ISO date, file, email, IPv4/6, color, coordinate, country, base64, JWT, MIME, phone, etc.).
  • Iriq::Normalizer.normalize / iriq.Normalize outputs, including hint suppression for semantic types and canonical date / currency rendering.
  • Iriq::Trace.for / iriq.Trace JSON structure for -e output.
  • Corpus shape clustering, param-type inference, --stats / --reinfer / --propose-recognizers / --cross-host-shapes output.
  • Cross-runtime SQLite corpus files (schema v4 is shared — a .db created by the Go CLI opens cleanly under the Rust CLI and vice versa).
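
To make "segment classification" and "shape clustering" from the list above concrete, here is a toy, std-only sketch: each path segment is assigned a kind, typed segments become placeholders, and paths with the same resulting shape are grouped. It illustrates the general idea only. The `Kind` enum, the three recognizers, and the `{int}`/`{uuid}`/`{date}` placeholders are assumptions made for this example; the crate recognizes many more shapes and emits context-aware names such as `{user_id}`.

```rust
// Illustrative only: a toy classifier and clusterer, NOT iriq's algorithm.
use std::collections::BTreeMap;

/// A handful of segment kinds (hypothetical names).
#[derive(Debug, PartialEq)]
enum Kind {
    Integer,
    Uuid,
    IsoDate,
    Literal,
}

/// True for five hex groups of lengths 8-4-4-4-12 separated by '-'.
fn is_uuid(s: &str) -> bool {
    let groups: Vec<&str> = s.split('-').collect();
    let lens = [8usize, 4, 4, 4, 12];
    groups.len() == 5
        && groups
            .iter()
            .zip(lens.iter())
            .all(|(g, n)| g.len() == *n && g.bytes().all(|c| c.is_ascii_hexdigit()))
}

/// True for the YYYY-MM-DD pattern; month and day ranges are not validated.
fn is_iso_date(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 10
        && b.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            _ => c.is_ascii_digit(),
        })
}

fn classify(seg: &str) -> Kind {
    if !seg.is_empty() && seg.bytes().all(|c| c.is_ascii_digit()) {
        Kind::Integer
    } else if is_uuid(seg) {
        Kind::Uuid
    } else if is_iso_date(seg) {
        Kind::IsoDate
    } else {
        Kind::Literal
    }
}

/// Replaces typed segments with placeholders, keeping literals as-is.
fn shape(path: &str) -> String {
    path.split('/')
        .map(|seg| match classify(seg) {
            Kind::Integer => "{int}".to_string(),
            Kind::Uuid => "{uuid}".to_string(),
            Kind::IsoDate => "{date}".to_string(),
            Kind::Literal => seg.to_string(),
        })
        .collect::<Vec<_>>()
        .join("/")
}

/// Groups paths by shape; BTreeMap keeps the shapes in sorted order.
fn cluster<'a>(paths: &[&'a str]) -> BTreeMap<String, Vec<&'a str>> {
    let mut out: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &p in paths {
        out.entry(shape(p)).or_default().push(p);
    }
    out
}

fn main() {
    let paths = ["/users/1", "/users/2", "/users/3", "/orders/2024-01-31"];
    for (shape, members) in cluster(&paths) {
        println!("{shape}: {} path(s)", members.len());
    }
}
```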

Any divergence between the runtimes is a bug; please file an issue with the diff.
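
When reporting a divergence, a small golden-case comparison gives a clear diff. The sketch below shows the general pattern under stated assumptions: `Golden`, `diff_against_golden`, and `stand_in` are hypothetical names, the cases are made up, and the stand-in function only lowercases ASCII. In a real report you would substitute the crate function you are checking and the output observed from the Ruby or Go runtime. This is not the project's CI harness.

```rust
// Illustrative only: a generic golden-case check, NOT the project's CI harness.

/// One golden case: an input and the output every runtime should produce.
struct Golden {
    input: &'static str,
    expected: &'static str,
}

/// Runs `f` over each case and returns one readable line per mismatch.
fn diff_against_golden(cases: &[Golden], f: impl Fn(&str) -> String) -> Vec<String> {
    cases
        .iter()
        .filter_map(|c| {
            let actual = f(c.input);
            if actual == c.expected {
                None
            } else {
                Some(format!(
                    "input={:?} expected={:?} actual={:?}",
                    c.input, c.expected, actual
                ))
            }
        })
        .collect()
}

/// Stand-in for the function under test: lowercases the whole string.
fn stand_in(input: &str) -> String {
    input.to_ascii_lowercase()
}

fn main() {
    let cases = [
        Golden { input: "HTTPS://FOO.COM", expected: "https://foo.com" },
        // The stand-in lowercases the path too, so this case is reported.
        Golden { input: "https://Foo.com/A", expected: "https://foo.com/A" },
    ];
    for line in diff_against_golden(&cases, stand_in) {
        println!("{line}");
    }
}
```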

License

MIT, same as the Ruby gem and Go module.