§tokmat
tokmat is a standalone Rust crate for metadata-driven tokenization and TEL-based extraction of
Canadian-style address strings.
It is the low-level parsing core: other crates can build strategies, pipelines, analytics, or language bindings on top of it without pulling in broader workspace assumptions.
tokmat now uses PCRE2 as its runtime regex engine across tokenization, TEL compilation, and
extractor execution.
§Highlights
- Standalone core crate with no sibling-workspace runtime assumptions
- PCRE2-only runtime regex path across tokenization and extraction
- Metadata-driven TEL extraction over token classes instead of raw-text-only matching
- File-backed token models plus inline/in-memory model support
- Reference corpus tests, doctests, linting, and publish dry-run validation
§Why this crate exists
tokmat separates address parsing into two explicit phases:
- Tokenization and classification
- TEL-driven extraction over token classes
That split keeps the parser predictable.
- Tokenization decides where boundaries are.
- Classification decides what each token is.
- TEL decides which token-class sequence to match and what to capture.
This is a better fit for messy address data than pushing everything into one monolithic regex.
§Parsing model
Raw input
|
v
+---------------------------+
| normalize / clean input |
+---------------------------+
|
v
+---------------------------+
| tokenize into boundaries |
| ex: ["123", " ", "MAIN"] |
+---------------------------+
|
v
+---------------------------+
| classify each token |
| ex: ["NUM", " ", "ALPHA"] |
+---------------------------+
|
v
+---------------------------+
| compile TEL pattern |
| ex: <<NUM#>> <<NAME@+>> |
+---------------------------+
|
v
+---------------------------+
| match on class stream |
| capture named fields |
+---------------------------+

The important design point is that TEL operates over token metadata, not only raw characters.
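To make the split concrete, here is a minimal, dependency-free Rust sketch of the first two phases. The rules are hardcoded for illustration only; the crate drives tokenization and classification from its token model, not from logic like this.

```rust
use std::collections::HashSet;

// Phase 1: split on spaces but keep the separators, mirroring the
// boundary-preserving example in the diagram above.
fn tokenize(input: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for ch in input.chars() {
        if ch == ' ' {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(" ".to_string());
        } else {
            current.push(ch);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

// Phase 2 (first half): attach a type/class to each token. Class
// membership (STREETTYPE) takes priority over the generic type.
fn classify(token: &str, streettypes: &HashSet<&str>) -> &'static str {
    if token == " " {
        " "
    } else if streettypes.contains(token) {
        "STREETTYPE"
    } else if token.chars().all(|c| c.is_ascii_digit()) {
        "NUM"
    } else {
        "ALPHA"
    }
}

fn main() {
    let streettypes: HashSet<&str> = ["ST", "AVE"].into();
    let tokens = tokenize("123 MAIN ST");
    let classes: Vec<&str> = tokens.iter().map(|t| classify(t, &streettypes)).collect();
    assert_eq!(tokens, vec!["123", " ", "MAIN", " ", "ST"]);
    assert_eq!(classes, vec!["NUM", " ", "ALPHA", " ", "STREETTYPE"]);
    // A compiled TEL pattern then matches over this class stream,
    // not over the raw characters.
}
```

The real extractor performs the final step, matching a compiled TEL pattern against the class stream and capturing named fields.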
§Extractor entry modes
The extractor exposes two ways to run TEL:
- parse_tokens(...)
- compile_pattern(...) + parse_compiled_tokens(...)
They are not two different extractors. They are two entry points into the same extractor runtime.
Compat path
pattern string
-> compile or fetch compiled TEL pattern
-> build/fetch object plan
-> run extractor
Precompiled path
compiled pattern
-> build/fetch object plan
-> run extractor
§When to use each
Use parse_tokens(...) when:
- you want the simplest API
- patterns are dynamic or user-supplied
- you are fine relying on the internal compiled-pattern cache
Use compile_pattern(...) + parse_compiled_tokens(...) when:
- you load a fixed TEL set once and reuse it many times
- you want TEL validation to happen up front
- you expect high pattern churn or a tiny compiled-pattern cache
§Which API should I call?
Use this rule of thumb:
Do you already have a compiled TEL set that will be reused?
|
+-- no -> use parse_tokens(...)
|
+-- yes -> use parse_compiled_tokens(...)

Another way to say it:
- application code and ad hoc parsing usually want parse_tokens(...)
- long-lived workers, services, and batch pipelines usually want precompiled TEL patterns
§Why they can benchmark the same
On the reference corpus used by this crate:
- 695 extractor cases
- 344 unique TEL patterns
- default compiled-pattern cache capacity: 512
That means the compat path quickly warms the cache and then behaves almost like the precompiled path. In the 10MM volume benchmark the two extractor modes were effectively identical:
10MM operations, default cache sizes
extractor-compat 30,407 ops/s 16.8 MB RSS
extractor-precompiled 30,127 ops/s 16.2 MB RSS

That result does not mean precompiled mode is useless. It means the current corpus is cache-friendly.
§When precompiled actually matters
Under cache pressure, precompiled mode separates clearly. With the compiled-pattern cache forced to
capacity 1:
1MM operations, compiled-pattern cache = 1
extractor-compat 12,828 ops/s 7.2 MB RSS
extractor-precompiled 30,609 ops/s 12.5 MB RSS
precompiled vs compat: 2.386x faster

Interpretation:
- compat is the convenience API
- precompiled is the explicit reuse API
- on cache-friendly workloads they converge
- on churn-heavy workloads precompiled mode avoids repeated TEL compilation cost
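The effect is easy to model without the crate. The sketch below simulates a capacity-1 compiled-pattern cache (a deliberately crude stand-in, not the crate's actual cache or eviction policy) and counts how often "compilation" happens when two patterns alternate:

```rust
use std::collections::HashMap;

// Toy model of the compat path: a tiny cache in front of TEL compilation.
struct CompatRunner {
    cache: HashMap<String, ()>, // pattern -> "compiled" artifact
    capacity: usize,
    compiles: usize,
}

impl CompatRunner {
    fn run(&mut self, pattern: &str) {
        if !self.cache.contains_key(pattern) {
            self.compiles += 1; // pay the TEL compilation cost
            if self.cache.len() >= self.capacity {
                self.cache.clear(); // crude eviction, for the sketch only
            }
            self.cache.insert(pattern.to_string(), ());
        }
        // ...then run the extractor with the cached artifact
    }
}

fn main() {
    let mut compat = CompatRunner {
        cache: HashMap::new(),
        capacity: 1,
        compiles: 0,
    };
    // Alternate two patterns: each call evicts the other pattern.
    for _ in 0..1_000 {
        compat.run("<<CIVIC#>> <<NAME@+>>");
        compat.run("{{PO BOX}} <<BOXNUM#>>");
    }
    assert_eq!(compat.compiles, 2_000);
    // Precompiled mode would compile each pattern exactly once: 2 compiles.
}
```

With 344 unique patterns against a default capacity of 512, the real corpus never hits this pathology, which is why the two modes converge in the default benchmark.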
§TEL in one page
TEL stands for Token Extraction Language.
A TEL pattern is made of typed segments:
- Captures: <<FIELD>>
- Captures with type modifiers: <<STREET@+>>
- Explicit class constraints: <<TYPE::STREETTYPE>>
- Vanishing groups: <!PROV!>
- Literal blocks: {{PO BOX}}
Common modifiers:
- @ alpha-like token matching
- # numeric token matching
- % extended token matching
- + one or more
- ? optional
- $ greedy matching
- ::CLASSNAME explicit class assignment
Examples:
<<CIVIC#>> <<STREET@+>> <<TYPE::STREETTYPE>>
{{PO BOX}} <<BOXNUM#>>
<<CITY@+$>> <<PROV::PROV>> <<PC::PCODE>>
See docs/TEL_SPEC.md for a cleaner language reference.
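As a rough illustration of the segment shapes listed above, this self-contained scanner splits a TEL string into its segment kinds. It is not the crate's parser (the authoritative grammar is grammar/tel.ebnf), and it leaves modifiers attached to the field name rather than parsing them out.

```rust
// Illustrative TEL segment scanner: recognizes captures, vanishing
// groups, and literal blocks by their delimiters.
#[derive(Debug, PartialEq)]
enum Segment {
    Capture(String),   // <<FIELD>> (modifiers left attached, e.g. "BOXNUM#")
    Vanishing(String), // <!PROV!>
    Literal(String),   // {{PO BOX}}
}

fn scan_tel(pattern: &str) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut rest = pattern;
    loop {
        rest = rest.trim_start();
        if rest.is_empty() {
            break;
        }
        if let Some(r) = rest.strip_prefix("<<") {
            let end = r.find(">>").expect("unterminated capture");
            out.push(Segment::Capture(r[..end].to_string()));
            rest = &r[end + 2..];
        } else if let Some(r) = rest.strip_prefix("<!") {
            let end = r.find("!>").expect("unterminated vanishing group");
            out.push(Segment::Vanishing(r[..end].to_string()));
            rest = &r[end + 2..];
        } else if let Some(r) = rest.strip_prefix("{{") {
            let end = r.find("}}").expect("unterminated literal");
            out.push(Segment::Literal(r[..end].to_string()));
            rest = &r[end + 2..];
        } else {
            panic!("unexpected TEL input: {rest}");
        }
    }
    out
}

fn main() {
    let segs = scan_tel("{{PO BOX}} <<BOXNUM#>> <!PROV!>");
    assert_eq!(
        segs,
        vec![
            Segment::Literal("PO BOX".into()),
            Segment::Capture("BOXNUM#".into()),
            Segment::Vanishing("PROV".into()),
        ]
    );
}
```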
§Quick start
§In-memory token model
This example keeps the model inline so it is easy to understand and compiles without external files.
use std::collections::HashSet;
use tokmat::extractor::Extractor;
use tokmat::tokenizer::{tokenize_and_classify, TokenClassList, TokenDefinition};
let token_definitions: TokenDefinition = vec![
("NUM".into(), r"\d+".into()),
("ALPHA".into(), r"[A-Z]+".into()),
("ALPHA_EXTENDED".into(), r"[A-Z][A-Z'\-]*".into()),
];
let token_class_list: TokenClassList = vec![
("STREETTYPE".into(), HashSet::from(["ST".to_string(), "AVE".to_string()])),
];
let tokenized = tokenize_and_classify(
"123 MAIN ST",
&token_definitions,
Some(&token_class_list),
);
assert_eq!(tokenized.tokens, vec!["123", " ", "MAIN", " ", "ST"]);
assert_eq!(tokenized.types[0], "NUM");
let extractor = Extractor::new(token_definitions, token_class_list);
let (_, fields, complement) =
extractor.parse_string("123 MAIN ST", "<<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>")?;
assert_eq!(fields.get("CIVIC").map(String::as_str), Some("123"));
assert_eq!(fields.get("NAME").map(String::as_str), Some("MAIN"));
assert_eq!(fields.get("TYPE").map(String::as_str), Some("ST"));
assert_eq!(complement, "");
§File-backed token model
If you already have a model directory in the wanParser-style layout:
model/
TOKENDEFINITION/TOKENDEFINITONS.param2
TOKENCLASS/*.param
you can load it directly:
use tokmat::extractor::Extractor;
use tokmat::token_model::TokenModel;
use tokmat::tokenizer::tokenize_with_model;
let model = TokenModel::load("tests/fixtures/model_1")?;
let tokenized = tokenize_with_model("123 MAIN ST", &model);
let extractor = Extractor::new(
model.token_definitions().clone(),
model.token_class_list().clone(),
);
let (_, fields, _) =
extractor.parse_string("123 MAIN ST", "<<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>")?;
assert_eq!(tokenized.tokens[0], "123");
assert_eq!(fields.get("CIVIC").map(String::as_str), Some("123"));
§Two-phase extraction
The crate is easiest to reason about when you think in phases.
§Phase 1: tokenization
Input:
APT-210 O'CONNOR ST
Boundary handling preserves address-relevant shapes:
["APT-210", " ", "O'CONNOR", " ", "ST"]
This matters because APT-210 and O'CONNOR should not be destroyed by a simplistic whitespace-only split.
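A minimal sketch of the idea (the kept-character rules here are hardcoded for illustration; the crate derives boundaries from its regex-based token definitions instead):

```rust
// Boundary-preserving tokenization: keep '-' and '\'' glued to the
// surrounding word so APT-210 and O'CONNOR survive as single tokens,
// while still emitting separators as their own tokens.
fn tokenize_preserving(input: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for ch in input.chars() {
        if ch.is_alphanumeric() || ch == '-' || ch == '\'' {
            current.push(ch);
        } else {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(ch.to_string());
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    assert_eq!(
        tokenize_preserving("APT-210 O'CONNOR ST"),
        vec!["APT-210", " ", "O'CONNOR", " ", "ST"]
    );
}
```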
§Phase 2: metadata-driven extraction
Once each token has a type or class, TEL matches over the class sequence rather than blindly over raw characters.
Example:
Tokens : ["123", " ", "MAIN", " ", "ST"]
Types : ["NUM", " ", "ALPHA", " ", "ALPHA"]
Class : ["NUM", " ", "ALPHA", " ", "STREETTYPE"]
TEL : <<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>

The TYPE field is extracted because ST is known to belong to the STREETTYPE class.
That is the metadata-driven part of the design: the extraction rule is not just matching the text
"ST", it is matching the semantic class attached to that token.
§Benchmarks
The benchmark scripts and JSON artifacts used during crate extraction live in the parent repo:
- scripts/benchmark_tokmat_variants.py
- scripts/benchmark_extractor_mode_tradeoffs.py
Two benchmark snapshots are especially useful:
§PCRE2-only crate vs earlier mixed-engine crate
10MM operations
tokenizer
mixed engines : 354,382 ops/s 6.1 MB RSS
pcre2 only : 564,171 ops/s 3.6 MB RSS
extractor-compat
mixed engines : 30,407 ops/s 16.8 MB RSS
pcre2 only : 30,435 ops/s 12.6 MB RSS
extractor-precompiled
mixed engines : 30,127 ops/s 16.2 MB RSS
pcre2 only : 30,168 ops/s 12.6 MB RSS

Takeaway:
- PCRE2-only materially improves tokenizer throughput
- extractor throughput stays essentially flat
- RSS drops across the measured workloads
§Extractor mode trade-off under cache pressure
1MM operations, compiled-pattern cache = 1
extractor-compat 12,828 ops/s 7.2 MB RSS
extractor-precompiled 30,609 ops/s 12.5 MB RSS

Takeaway:
- default corpus + default cache sizes make compat and precompiled look similar
- precompiled mode matters when many pattern compiles would otherwise be repeated
- if you do not know yet, start with parse_tokens(...) and only move to precompiled patterns when you need explicit reuse or validation
§What makes the crate polished for publication
- Standalone fixture corpus under tests/
- Strict linting through Clippy
- Complexity gate validated during development
- Formal TEL grammar in grammar/tel.ebnf
- Public docs suitable for crates.io and docs.rs
§Release workflow
tokmat can be published from GitHub Actions on tag pushes that match v*.
The CI workflow already validates formatting, Clippy, tests, docs, and a
publish dry-run. The release workflow should remain limited to crates.io
publication because this repository is the parser kernel, not the Python/Polars
distribution surface.
Release steps:
- Update the version under [package] in Cargo.toml (for example, to 0.2.0).
- Commit the version bump.
- Create and push a tag matching the version:
VERSION=0.2.0
git add -A
git commit -m "Release ${VERSION}"
git tag "v${VERSION}"
git push origin "v${VERSION}"

Before the first release, add a crates.io API token to the repository secrets
as CARGO_REGISTRY_TOKEN.
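A tag-triggered publish job might look like the following sketch (the file name, action versions, and job layout are assumptions for illustration, not the repository's actual workflow):

```yaml
# .github/workflows/release.yml -- illustrative sketch only
name: release
on:
  push:
    tags:
      - "v*"
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      # cargo reads the crates.io token from CARGO_REGISTRY_TOKEN.
      - run: cargo publish
        env:
          CARGO_REGISTRY_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
```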
§Limitations
- The crate is intentionally low-level. It does not try to solve full multi-strategy address interpretation by itself.
- TEL is powerful, but it assumes you have a reasonable token model.
- The API focuses on extraction primitives; higher-level strategy orchestration belongs in layers above this crate.
§License
MIT. See the LICENSE file in the crate root.