tokmat 0.2.0

Standalone high-performance Canadian address parsing engine core

tokmat is a standalone Rust crate for metadata-driven tokenization and TEL-based extraction of Canadian-style address strings.

It is the low-level parsing core: other crates can build strategies, pipelines, analytics, or language bindings on top of it without pulling in broader workspace assumptions.

tokmat now uses PCRE2 as its runtime regex engine across tokenization, TEL compilation, and extractor execution.

Highlights

  • Standalone core crate with no sibling-workspace runtime assumptions
  • PCRE2-only runtime regex path across tokenization and extraction
  • Metadata-driven TEL extraction over token classes instead of raw-text-only matching
  • File-backed token models plus inline/in-memory model support
  • Reference corpus tests, doctests, linting, and publish dry-run validation

Why this crate exists

tokmat separates address parsing into two explicit phases:

  1. Tokenization and classification
  2. TEL-driven extraction over token classes

That split keeps the parser predictable.

  • Tokenization decides where boundaries are.
  • Classification decides what each token is.
  • TEL decides which token-class sequence to match and what to capture.

This is a better fit for messy address data than pushing everything into one monolithic regex.

Parsing model

Raw input
  |
  v
+---------------------------+
| normalize / clean input   |
+---------------------------+
  |
  v
+---------------------------+
| tokenize into boundaries  |
| ex: ["123", " ", "MAIN"]  |
+---------------------------+
  |
  v
+---------------------------+
| classify each token       |
| ex: ["NUM", " ", "ALPHA"] |
+---------------------------+
  |
  v
+---------------------------+
| compile TEL pattern       |
| ex: <<NUM#>> <<NAME@+>>   |
+---------------------------+
  |
  v
+---------------------------+
| match on class stream     |
| capture named fields      |
+---------------------------+

The important design point is that TEL operates over token metadata, not only raw characters.
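As a self-contained illustration of that point (a sketch, not the crate's actual classifier), class membership can take precedence over shape-based token types, so a known street type wins over a generic alpha token:

```rust
use std::collections::HashSet;

// Sketch: a class lookup runs first, then shape-based fallbacks.
// Separator tokens are omitted here for brevity.
fn classify(token: &str, streettypes: &HashSet<&str>) -> &'static str {
    if streettypes.contains(token) {
        "STREETTYPE"
    } else if token.chars().all(|c| c.is_ascii_digit()) {
        "NUM"
    } else if token.chars().all(|c| c.is_ascii_alphabetic()) {
        "ALPHA"
    } else {
        "OTHER"
    }
}

fn main() {
    let streettypes: HashSet<&str> = ["ST", "AVE"].into_iter().collect();
    let tokens = ["123", "MAIN", "ST"];
    let classes: Vec<&str> = tokens.iter().map(|t| classify(t, &streettypes)).collect();
    // A TEL pattern can now target this class stream instead of raw text.
    assert_eq!(classes, ["NUM", "ALPHA", "STREETTYPE"]);
}
```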

Extractor entry modes

The extractor exposes two ways to run TEL:

  • parse_tokens(...)
  • compile_pattern(...) + parse_compiled_tokens(...)

They are not two different extractors. They are two entry points into the same extractor runtime.

Compat path
pattern string
  -> compile or fetch compiled TEL pattern
  -> build/fetch object plan
  -> run extractor

Precompiled path
compiled pattern
  -> build/fetch object plan
  -> run extractor

When to use each

Use parse_tokens(...) when:

  • you want the simplest API
  • patterns are dynamic or user-supplied
  • you are fine relying on the internal compiled-pattern cache

Use compile_pattern(...) + parse_compiled_tokens(...) when:

  • you load a fixed TEL set once and reuse it many times
  • you want TEL validation to happen up front
  • you expect high pattern churn or a tiny compiled-pattern cache

Which API should I call?

Use this rule of thumb:

Do you already have a compiled TEL set that will be reused?
  |
  +-- no  -> use parse_tokens(...)
  |
  +-- yes -> use parse_compiled_tokens(...)

Another way to say it:

  • application code and ad hoc parsing usually want parse_tokens(...)
  • long-lived workers, services, and batch pipelines usually want precompiled TEL patterns

Why they can benchmark the same

On the reference corpus used by this crate:

  • 695 extractor cases
  • 344 unique TEL patterns
  • default compiled-pattern cache capacity: 512

That means the compat path quickly warms the cache and then behaves almost like the precompiled path. In the 10MM volume benchmark the two extractor modes were effectively identical:

10MM operations, default cache sizes

extractor-compat      30,407 ops/s   16.8 MB RSS
extractor-precompiled 30,127 ops/s   16.2 MB RSS

That result does not mean precompiled mode is useless. It means the current corpus is cache-friendly.

When precompiled actually matters

Under cache pressure, precompiled mode separates clearly. With the compiled-pattern cache forced to capacity 1:

1MM operations, compiled-pattern cache = 1

extractor-compat      12,828 ops/s    7.2 MB RSS
extractor-precompiled 30,609 ops/s   12.5 MB RSS

precompiled vs compat: 2.386x faster

Interpretation:

  • compat is the convenience API
  • precompiled is the explicit reuse API
  • on cache-friendly workloads they converge
  • on churn-heavy workloads precompiled mode avoids repeated TEL compilation cost
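Why the two modes converge or diverge can be sketched with a toy bounded cache that counts compiles. This is a stand-in, not the crate's real compiled-pattern cache; the eviction policy here is a deliberate simplification:

```rust
use std::collections::HashMap;

// Toy model of the compat path: compile on cache miss, reuse on hit.
// Counting compiles makes the cache-pressure effect visible.
struct PatternCache {
    capacity: usize,
    cache: HashMap<String, usize>, // pattern -> stub "compiled" handle
    compiles: usize,
}

impl PatternCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, cache: HashMap::new(), compiles: 0 }
    }

    fn get_or_compile(&mut self, pattern: &str) -> usize {
        if let Some(&handle) = self.cache.get(pattern) {
            return handle; // hit: no compile cost
        }
        self.compiles += 1;
        if self.cache.len() >= self.capacity {
            // crude eviction stand-in: drop an arbitrary entry
            if let Some(k) = self.cache.keys().next().cloned() {
                self.cache.remove(&k);
            }
        }
        let handle = self.compiles;
        self.cache.insert(pattern.to_string(), handle);
        handle
    }
}

fn main() {
    // Two patterns alternating, capacity 1: every call is a miss.
    let mut tiny = PatternCache::new(1);
    for _ in 0..5 {
        tiny.get_or_compile("<<CIVIC#>> <<NAME@+>>");
        tiny.get_or_compile("{{PO BOX}} <<BOXNUM#>>");
    }
    assert_eq!(tiny.compiles, 10);

    // Same workload, capacity 2: only the first two calls compile.
    let mut roomy = PatternCache::new(2);
    for _ in 0..5 {
        roomy.get_or_compile("<<CIVIC#>> <<NAME@+>>");
        roomy.get_or_compile("{{PO BOX}} <<BOXNUM#>>");
    }
    assert_eq!(roomy.compiles, 2);
}
```

Precompiling is the equivalent of paying each compile exactly once, regardless of cache capacity, which is why it pulls ahead under churn.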

TEL in one page

TEL stands for Token Extraction Language.

A TEL pattern is made of typed segments:

  • Captures: <<FIELD>>
  • Captures with type modifiers: <<STREET@+>>
  • Explicit class constraints: <<TYPE::STREETTYPE>>
  • Vanishing groups: <!PROV!>
  • Literal blocks: {{PO BOX}}

Common modifiers:

  • @ alpha-like token matching
  • # numeric token matching
  • % extended token matching
  • + one or more
  • ? optional
  • $ greedy matching
  • ::CLASSNAME explicit class assignment

Examples:

  • <<CIVIC#>> <<STREET@+>> <<TYPE::STREETTYPE>>
  • {{PO BOX}} <<BOXNUM#>>
  • <<CITY@+$>> <<PROV::PROV>> <<PC::PCODE>>

See docs/TEL_SPEC.md for a cleaner language reference.
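For intuition only, the three delimiter shapes above can be recognized with a tiny hand-rolled splitter. This is a hypothetical sketch, not the crate's real TEL parser, and it ignores modifier semantics entirely:

```rust
// Hypothetical sketch: split an ASCII TEL pattern into segments by their
// paired delimiters. The authoritative grammar lives in grammar/tel.ebnf.
#[derive(Debug, PartialEq)]
enum Segment {
    Capture(String),   // <<FIELD...>>
    Vanishing(String), // <!...!>
    Literal(String),   // {{...}}
}

fn segments(pattern: &str) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut rest = pattern.trim();
    while !rest.is_empty() {
        let (make, close): (fn(String) -> Segment, &str) = if rest.starts_with("<<") {
            (Segment::Capture, ">>")
        } else if rest.starts_with("<!") {
            (Segment::Vanishing, "!>")
        } else if rest.starts_with("{{") {
            (Segment::Literal, "}}")
        } else {
            // skip separator bytes between segments (ASCII patterns assumed)
            rest = rest[1..].trim_start();
            continue;
        };
        let end = rest.find(close).expect("unterminated TEL segment");
        out.push(make(rest[2..end].to_string()));
        rest = rest[end + close.len()..].trim_start();
    }
    out
}

fn main() {
    let segs = segments("<<CIVIC#>> <!PROV!> {{PO BOX}} <<PC::PCODE>>");
    assert_eq!(segs.len(), 4);
    assert_eq!(segs[2], Segment::Literal("PO BOX".to_string()));
}
```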

Quick start

In-memory token model

This example keeps the model inline so it is easy to understand and compiles without external files.

use std::collections::HashSet;

use tokmat::extractor::Extractor;
use tokmat::tokenizer::{tokenize_and_classify, TokenClassList, TokenDefinition};

let token_definitions: TokenDefinition = vec![
    ("NUM".into(), r"\d+".into()),
    ("ALPHA".into(), r"[A-Z]+".into()),
    ("ALPHA_EXTENDED".into(), r"[A-Z][A-Z'\-]*".into()),
];

let token_class_list: TokenClassList = vec![
    ("STREETTYPE".into(), HashSet::from(["ST".to_string(), "AVE".to_string()])),
];

let tokenized = tokenize_and_classify(
    "123 MAIN ST",
    &token_definitions,
    Some(&token_class_list),
);

assert_eq!(tokenized.tokens, vec!["123", " ", "MAIN", " ", "ST"]);
assert_eq!(tokenized.types[0], "NUM");

let extractor = Extractor::new(token_definitions, token_class_list);
let (_, fields, complement) =
    extractor.parse_string("123 MAIN ST", "<<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>")?;

assert_eq!(fields.get("CIVIC").map(String::as_str), Some("123"));
assert_eq!(fields.get("NAME").map(String::as_str), Some("MAIN"));
assert_eq!(fields.get("TYPE").map(String::as_str), Some("ST"));
assert_eq!(complement, "");
# Ok::<(), tokmat::error::ParseError>(())

File-backed token model

If you already have a model directory in the wanParser-style layout:

model/
  TOKENDEFINITION/TOKENDEFINITONS.param2
  TOKENCLASS/*.param

you can load it directly:

use tokmat::extractor::Extractor;
use tokmat::token_model::TokenModel;
use tokmat::tokenizer::tokenize_with_model;

let model = TokenModel::load("tests/fixtures/model_1")?;
let tokenized = tokenize_with_model("123 MAIN ST", &model);

let extractor = Extractor::new(
    model.token_definitions().clone(),
    model.token_class_list().clone(),
);

let (_, fields, _) =
    extractor.parse_string("123 MAIN ST", "<<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>")?;

assert_eq!(tokenized.tokens[0], "123");
assert_eq!(fields.get("CIVIC").map(String::as_str), Some("123"));
# Ok::<(), Box<dyn std::error::Error>>(())

Two-phase extraction

The crate is easiest to reason about when you think in phases.

Phase 1: tokenization

Input:

APT-210 O'CONNOR ST

Boundary handling preserves address-relevant shapes:

["APT-210", " ", "O'CONNOR", " ", "ST"]

This matters because APT-210 and O'CONNOR should not be destroyed by a simplistic whitespace-only split.
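A minimal sketch of the idea follows. The real tokenizer is PCRE2-driven and configured by the token model; this version hard-codes just one boundary rule (hyphens and apostrophes between word characters stay inside the token) to show why APT-210 and O'CONNOR survive:

```rust
// Sketch of a boundary-preserving tokenizer: joining punctuation between
// word characters stays in the token; everything else becomes its own
// separator token, so nothing in the input is lost.
fn tokenize(input: &str) -> Vec<String> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        let joins = (c == '-' || c == '\'')
            && i > 0
            && chars[i - 1].is_alphanumeric()
            && chars.get(i + 1).map_or(false, |n| n.is_alphanumeric());
        if c.is_alphanumeric() || joins {
            current.push(c);
        } else {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string()); // keep the separator as a token
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    assert_eq!(
        tokenize("APT-210 O'CONNOR ST"),
        vec!["APT-210", " ", "O'CONNOR", " ", "ST"]
    );
}
```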

Phase 2: metadata-driven extraction

Once each token has a type or class, TEL matches over the class sequence rather than blindly over raw characters.

Example:

Tokens : ["123", " ", "MAIN", " ", "ST"]
Types  : ["NUM", " ", "ALPHA", " ", "ALPHA"]
Class  : ["NUM", " ", "ALPHA", " ", "STREETTYPE"]
TEL    : <<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>

The TYPE field is extracted because ST is known to belong to the STREETTYPE class.

That is the metadata-driven part of the design: the extraction rule is not just matching the text "ST", it is matching the semantic class attached to that token.
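The table above can be turned into a stripped-down matcher to make the mechanics concrete. This sketch ignores modifiers such as + and ?, which the real TEL engine handles; it only checks that each non-separator token carries the required class and captures its text:

```rust
use std::collections::HashMap;

// Sketch: match a plan of (field, required class) pairs against the
// classified token stream, skipping whitespace separator tokens.
fn match_classes(
    tokens: &[&str],
    classes: &[&str],
    plan: &[(&str, &str)],
) -> Option<HashMap<String, String>> {
    let word_positions: Vec<usize> = classes
        .iter()
        .enumerate()
        .filter(|(_, c)| !c.trim().is_empty())
        .map(|(i, _)| i)
        .collect();
    if word_positions.len() != plan.len() {
        return None;
    }
    let mut fields = HashMap::new();
    for (&pos, &(field, class)) in word_positions.iter().zip(plan) {
        if classes[pos] != class {
            return None; // class constraint failed
        }
        fields.insert(field.to_string(), tokens[pos].to_string());
    }
    Some(fields)
}

fn main() {
    let tokens = ["123", " ", "MAIN", " ", "ST"];
    let classes = ["NUM", " ", "ALPHA", " ", "STREETTYPE"];
    let plan = [("CIVIC", "NUM"), ("NAME", "ALPHA"), ("TYPE", "STREETTYPE")];
    let fields = match_classes(&tokens, &classes, &plan).unwrap();
    // TYPE is captured because ST carries the STREETTYPE class.
    assert_eq!(fields["TYPE"], "ST");
}
```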

Benchmarks

The benchmark scripts and JSON artifacts used during crate extraction live in the parent repo:

  • scripts/benchmark_tokmat_variants.py
  • scripts/benchmark_extractor_mode_tradeoffs.py

Two benchmark snapshots are especially useful:

PCRE2-only crate vs earlier mixed-engine crate

10MM operations

tokenizer
  mixed engines : 354,382 ops/s   6.1 MB RSS
  pcre2 only    : 564,171 ops/s   3.6 MB RSS

extractor-compat
  mixed engines : 30,407 ops/s   16.8 MB RSS
  pcre2 only    : 30,435 ops/s   12.6 MB RSS

extractor-precompiled
  mixed engines : 30,127 ops/s   16.2 MB RSS
  pcre2 only    : 30,168 ops/s   12.6 MB RSS

Takeaway:

  • PCRE2-only materially improves tokenizer throughput
  • extractor throughput stays essentially flat
  • RSS drops across the measured workloads

Extractor mode trade-off under cache pressure

1MM operations, compiled-pattern cache = 1

extractor-compat      12,828 ops/s    7.2 MB RSS
extractor-precompiled 30,609 ops/s   12.5 MB RSS

Takeaway:

  • default corpus + default cache sizes make compat and precompiled look similar
  • precompiled mode matters when many pattern compiles would otherwise be repeated
  • if you do not know yet, start with parse_tokens(...) and only move to precompiled patterns when you need explicit reuse or validation

What makes the crate polished for publication

  • Standalone fixture corpus under tests/
  • Strict linting through Clippy
  • Complexity gate validated during development
  • Formal TEL grammar in grammar/tel.ebnf
  • Public docs suitable for crates.io and docs.rs

Release workflow

tokmat can be published from GitHub Actions on tag pushes that match v*. The CI workflow already validates formatting, Clippy, tests, docs, and a publish dry-run. The release workflow should remain limited to crates.io publication because this repository is the parser kernel, not the Python/Polars distribution surface.

Release steps:

  1. Set the version field under [package] in Cargo.toml (for example, 0.2.0).
  2. Commit the version bump and push a tag matching the version:

VERSION=0.2.0
git add -A
git commit -m "Release ${VERSION}"
git tag "v${VERSION}"
git push origin "v${VERSION}"

Before the first release, add a crates.io API token to the repository secrets as CARGO_REGISTRY_TOKEN.

Limitations

  • The crate is intentionally low-level. It does not try to solve full multi-strategy address interpretation by itself.
  • TEL is powerful, but it assumes you have a reasonable token model.
  • The API focuses on extraction primitives; higher-level strategy orchestration belongs in layers above this crate.

License

MIT. See the LICENSE file in the crate root.