# tokmat

[![CI](https://github.com/Jrakru/tokmat/actions/workflows/ci.yml/badge.svg)](https://github.com/Jrakru/tokmat/actions/workflows/ci.yml)
[![docs.rs](https://docs.rs/tokmat/badge.svg)](https://docs.rs/tokmat)
[![crates.io](https://img.shields.io/crates/v/tokmat.svg)](https://crates.io/crates/tokmat)

`tokmat` is a standalone Rust crate for metadata-driven tokenization and TEL-based extraction of
Canadian-style address strings.

It is the low-level parsing core: other crates can build strategies, pipelines, analytics, or
language bindings on top of it without pulling in broader workspace assumptions.

`tokmat` now uses PCRE2 as its runtime regex engine across tokenization, TEL compilation, and
extractor execution.

## Highlights

- Standalone core crate with no sibling-workspace runtime assumptions
- PCRE2-only runtime regex path across tokenization and extraction
- Metadata-driven TEL extraction over token classes instead of raw-text-only matching
- File-backed token models plus inline/in-memory model support
- Reference corpus tests, doctests, linting, and publish dry-run validation

## Why this crate exists

`tokmat` separates address parsing into two explicit phases:

1. Tokenization and classification
2. TEL-driven extraction over token classes

That split keeps the parser predictable.

- Tokenization decides where boundaries are.
- Classification decides what each token is.
- TEL decides which token-class sequence to match and what to capture.

This is a better fit for messy address data than pushing everything into one monolithic regex.

## Parsing model

```text
Raw input
  |
  v
+---------------------------+
| normalize / clean input   |
+---------------------------+
  |
  v
+---------------------------+
| tokenize into boundaries  |
| ex: ["123", " ", "MAIN"]  |
+---------------------------+
  |
  v
+---------------------------+
| classify each token       |
| ex: ["NUM", " ", "ALPHA"] |
+---------------------------+
  |
  v
+---------------------------+
| compile TEL pattern       |
| ex: <<NUM#>> <<NAME@+>>   |
+---------------------------+
  |
  v
+---------------------------+
| match on class stream     |
| capture named fields      |
+---------------------------+
```

The important design point is that TEL operates over token metadata, not only raw characters.
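The staged flow above can be sketched with a std-only simplification. This is a hypothetical toy, not the crate's code: the real tokenizer is driven by regex token definitions through PCRE2, which is what lets shapes like `APT-210` survive as single tokens, whereas this sketch only groups runs of same-kind characters.

```rust
use std::collections::HashSet;

// Kind of a character: 0 = digit, 1 = whitespace, 2 = everything else.
fn kind(c: char) -> u8 {
    if c.is_ascii_digit() { 0 } else if c.is_whitespace() { 1 } else { 2 }
}

// Stage 2: split the input into maximal runs of same-kind characters,
// keeping separators as their own tokens.
fn tokenize(input: &str) -> Vec<String> {
    let mut tokens: Vec<String> = Vec::new();
    for ch in input.chars() {
        match tokens.last_mut() {
            Some(last) if kind(last.chars().next().unwrap()) == kind(ch) => last.push(ch),
            _ => tokens.push(ch.to_string()),
        }
    }
    tokens
}

// Stage 3: assign a class to each token; lexicon membership (STREETTYPE)
// takes priority over the raw shape (NUM / ALPHA).
fn classify(tokens: &[String], streettypes: &HashSet<&str>) -> Vec<String> {
    tokens
        .iter()
        .map(|t| {
            let class = if streettypes.contains(t.as_str()) {
                "STREETTYPE"
            } else if t.trim().is_empty() {
                " "
            } else if t.chars().all(|c| c.is_ascii_digit()) {
                "NUM"
            } else {
                "ALPHA"
            };
            class.to_string()
        })
        .collect()
}

fn main() {
    let streettypes: HashSet<&str> = ["ST", "AVE"].into_iter().collect();
    let tokens = tokenize("123 MAIN ST");
    assert_eq!(tokens, vec!["123", " ", "MAIN", " ", "ST"]);
    // TEL then matches over this class stream, not the raw characters.
    let classes = classify(&tokens, &streettypes);
    assert_eq!(classes, vec!["NUM", " ", "ALPHA", " ", "STREETTYPE"]);
}
```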

## Extractor entry modes

The extractor exposes two ways to run TEL:

- `parse_tokens(...)`
- `compile_pattern(...)` + `parse_compiled_tokens(...)`

They are not two different extractors. They are two entry points into the same extractor runtime.

```text
Compat path
pattern string
  -> compile or fetch compiled TEL pattern
  -> build/fetch object plan
  -> run extractor

Precompiled path
compiled pattern
  -> build/fetch object plan
  -> run extractor
```

### When to use each

Use `parse_tokens(...)` when:

- you want the simplest API
- patterns are dynamic or user-supplied
- you are fine relying on the internal compiled-pattern cache

Use `compile_pattern(...)` + `parse_compiled_tokens(...)` when:

- you load a fixed TEL set once and reuse it many times
- you want TEL validation to happen up front
- you expect high pattern churn or a tiny compiled-pattern cache

### Which API should I call?

Use this rule of thumb:

```text
Do you already have a compiled TEL set that will be reused?
  |
  +-- no  -> use parse_tokens(...)
  |
  +-- yes -> use parse_compiled_tokens(...)
```

Another way to say it:

- application code and ad hoc parsing usually want `parse_tokens(...)`
- long-lived workers, services, and batch pipelines usually want precompiled TEL patterns

### Why the two modes can benchmark the same

On the reference corpus used by this crate:

- `695` extractor cases
- `344` unique TEL patterns
- default compiled-pattern cache capacity: `512`

That means the compat path quickly warms the cache and then behaves almost like the precompiled
path. In the 10MM volume benchmark the two extractor modes were effectively identical:

```text
10MM operations, default cache sizes

extractor-compat      30,407 ops/s   16.8 MB RSS
extractor-precompiled 30,127 ops/s   16.2 MB RSS
```

That result does not mean precompiled mode is useless. It means the current corpus is
cache-friendly.
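
The cache effect is easy to simulate. A minimal std-only LRU sketch (a hypothetical model, not the crate's actual cache implementation) counts how many TEL compilations each capacity would trigger for a corpus-shaped request stream:

```rust
use std::collections::VecDeque;

// Count cache misses (each miss stands in for one TEL compilation) for a
// stream of pattern ids hitting an LRU cache of the given capacity.
fn compilations(requests: &[usize], capacity: usize) -> usize {
    let mut cache: VecDeque<usize> = VecDeque::new();
    let mut misses = 0;
    for &p in requests {
        if let Some(pos) = cache.iter().position(|&q| q == p) {
            let hit = cache.remove(pos).unwrap();
            cache.push_front(hit); // refresh recency on a hit
        } else {
            misses += 1;
            if cache.len() == capacity {
                cache.pop_back(); // evict the least-recently-used entry
            }
            cache.push_front(p);
        }
    }
    misses
}

fn main() {
    // 344 unique patterns cycled repeatedly, mirroring the corpus shape above.
    let requests: Vec<usize> = (0..10).flat_map(|_| 0usize..344).collect();
    // Capacity 512 holds every pattern: each compiles exactly once.
    assert_eq!(compilations(&requests, 512), 344);
    // Capacity 1 thrashes: every request triggers a recompile.
    assert_eq!(compilations(&requests, 1), requests.len());
}
```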

### When precompiled actually matters

Under cache pressure, the two modes separate clearly and precompiled mode pulls ahead. With the
compiled-pattern cache forced to capacity `1`:

```text
1MM operations, compiled-pattern cache = 1

extractor-compat      12,828 ops/s    7.2 MB RSS
extractor-precompiled 30,609 ops/s   12.5 MB RSS

precompiled vs compat: 2.386x faster
```
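
The `2.386x` figure is just the throughput ratio from the run above, which is easy to check:

```rust
fn main() {
    // precompiled throughput divided by compat throughput
    let speedup = 30_609.0_f64 / 12_828.0;
    assert!((speedup - 2.386).abs() < 1e-3);
}
```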

Interpretation:

- `compat` is the convenience API
- `precompiled` is the explicit reuse API
- on cache-friendly workloads they converge
- on churn-heavy workloads precompiled mode avoids repeated TEL compilation cost

## TEL in one page

TEL stands for Token Extraction Language.

A TEL pattern is made of typed segments:

- Captures: `<<FIELD>>`
- Captures with type modifiers: `<<STREET@+>>`
- Explicit class constraints: `<<TYPE::STREETTYPE>>`
- Vanishing groups: `<!PROV!>`
- Literal blocks: `{{PO BOX}}`

Common modifiers:

- `@` alpha-like token matching
- `#` numeric token matching
- `%` extended token matching
- `+` one or more
- `?` optional
- `$` greedy matching
- `::CLASSNAME` explicit class assignment

Examples:

- `<<CIVIC#>> <<STREET@+>> <<TYPE::STREETTYPE>>`
- `{{PO BOX}} <<BOXNUM#>>`
- `<<CITY@+$>> <<PROV::PROV>> <<PC::PCODE>>`
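
As a rough illustration of class-constrained matching, here is a hypothetical toy that binds each capture to the token whose class satisfies its constraint. It ignores repetition, optionality, and capture modifiers; the real crate compiles TEL to a PCRE2 regex over encoded token metadata.

```rust
// Toy class-stream matcher: each pattern entry is (capture name, required class).
fn match_classes(
    tokens: &[&str],
    classes: &[&str],
    pattern: &[(&str, &str)],
) -> Option<Vec<(String, String)>> {
    // Ignore separator tokens so the pattern aligns with significant ones.
    let significant: Vec<usize> = (0..tokens.len())
        .filter(|&i| !classes[i].trim().is_empty())
        .collect();
    if significant.len() != pattern.len() {
        return None;
    }
    let mut fields = Vec::new();
    for (&i, &(name, class)) in significant.iter().zip(pattern) {
        if classes[i] != class {
            return None; // a class constraint failed, so the whole match fails
        }
        fields.push((name.to_string(), tokens[i].to_string()));
    }
    Some(fields)
}

fn main() {
    let tokens = ["123", " ", "MAIN", " ", "ST"];
    let classes = ["NUM", " ", "ALPHA", " ", "STREETTYPE"];
    // Rough analogue of: <<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>
    let pattern = [("CIVIC", "NUM"), ("NAME", "ALPHA"), ("TYPE", "STREETTYPE")];
    let fields = match_classes(&tokens, &classes, &pattern).unwrap();
    assert_eq!(fields[0], ("CIVIC".to_string(), "123".to_string()));
    assert_eq!(fields[2], ("TYPE".to_string(), "ST".to_string()));
}
```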

See [`docs/TEL_SPEC.md`](docs/TEL_SPEC.md) for the full language reference.

## Quick start

### In-memory token model

This example keeps the model inline so it is easy to understand and compiles without external
files.

```rust
use std::collections::HashSet;

use tokmat::extractor::Extractor;
use tokmat::tokenizer::{tokenize_and_classify, TokenClassList, TokenDefinition};

let token_definitions: TokenDefinition = vec![
    ("NUM".into(), r"\d+".into()),
    ("ALPHA".into(), r"[A-Z]+".into()),
    ("ALPHA_EXTENDED".into(), r"[A-Z][A-Z'\\-]*".into()),
];

let token_class_list: TokenClassList = vec![
    ("STREETTYPE".into(), HashSet::from(["ST".to_string(), "AVE".to_string()])),
];

let tokenized = tokenize_and_classify(
    "123 MAIN ST",
    &token_definitions,
    Some(&token_class_list),
);

assert_eq!(tokenized.tokens, vec!["123", " ", "MAIN", " ", "ST"]);
assert_eq!(tokenized.types[0], "NUM");

let extractor = Extractor::new(token_definitions, token_class_list);
let (_, fields, complement) =
    extractor.parse_string("123 MAIN ST", "<<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>")?;

assert_eq!(fields.get("CIVIC").map(String::as_str), Some("123"));
assert_eq!(fields.get("NAME").map(String::as_str), Some("MAIN"));
assert_eq!(fields.get("TYPE").map(String::as_str), Some("ST"));
assert_eq!(complement, "");
# Ok::<(), tokmat::error::ParseError>(())
```

### File-backed token model

If you already have a model directory in the wanParser-style layout:

```text
model/
  TOKENDEFINITION/TOKENDEFINITONS.param2
  TOKENCLASS/*.param
```

you can load it directly:

```rust,no_run
use tokmat::extractor::Extractor;
use tokmat::token_model::TokenModel;
use tokmat::tokenizer::tokenize_with_model;

let model = TokenModel::load("tests/fixtures/model_1")?;
let tokenized = tokenize_with_model("123 MAIN ST", &model);

let extractor = Extractor::new(
    model.token_definitions().clone(),
    model.token_class_list().clone(),
);

let (_, fields, _) =
    extractor.parse_string("123 MAIN ST", "<<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>")?;

assert_eq!(tokenized.tokens[0], "123");
assert_eq!(fields.get("CIVIC").map(String::as_str), Some("123"));
# Ok::<(), Box<dyn std::error::Error>>(())
```

## Two-phase extraction

The crate is easiest to reason about when you think in phases.

### Phase 1: tokenization

Input:

```text
APT-210 O'CONNOR ST
```

Boundary handling preserves address-relevant shapes:

```text
["APT-210", " ", "O'CONNOR", " ", "ST"]
```

This matters because `APT-210` and `O'CONNOR` should not be destroyed by a simplistic
whitespace-only split.

### Phase 2: metadata-driven extraction

Once each token has a type or class, TEL matches over the class sequence rather than blindly over
raw characters.

Example:

```text
Tokens : ["123", " ", "MAIN", " ", "ST"]
Types  : ["NUM", " ", "ALPHA", " ", "ALPHA"]
Class  : ["NUM", " ", "ALPHA", " ", "STREETTYPE"]
TEL    : <<CIVIC#>> <<NAME@+>> <<TYPE::STREETTYPE>>
```

The `TYPE` field is extracted because `ST` is known to belong to the `STREETTYPE` class.

That is the metadata-driven part of the design: the extraction rule is not just matching the text
`"ST"`, it is matching the semantic class attached to that token.

## Benchmarks

The benchmark scripts and JSON artifacts used during crate extraction live in the parent repo:

- `scripts/benchmark_tokmat_variants.py`
- `scripts/benchmark_extractor_mode_tradeoffs.py`

Two benchmark snapshots are especially useful:

### PCRE2-only crate vs earlier mixed-engine crate

```text
10MM operations

tokenizer
  mixed engines : 354,382 ops/s   6.1 MB RSS
  pcre2 only    : 564,171 ops/s   3.6 MB RSS

extractor-compat
  mixed engines : 30,407 ops/s   16.8 MB RSS
  pcre2 only    : 30,435 ops/s   12.6 MB RSS

extractor-precompiled
  mixed engines : 30,127 ops/s   16.2 MB RSS
  pcre2 only    : 30,168 ops/s   12.6 MB RSS
```

Takeaway:

- PCRE2-only materially improves tokenizer throughput
- extractor throughput stays essentially flat
- RSS drops across the measured workloads

### Extractor mode trade-off under cache pressure

```text
1MM operations, compiled-pattern cache = 1

extractor-compat      12,828 ops/s    7.2 MB RSS
extractor-precompiled 30,609 ops/s   12.5 MB RSS
```

Takeaway:

- default corpus + default cache sizes make compat and precompiled look similar
- precompiled mode matters when many pattern compiles would otherwise be repeated
- if you do not know yet, start with `parse_tokens(...)` and only move to precompiled patterns
  when you need explicit reuse or validation

## What makes the crate polished for publication

- Standalone fixture corpus under `tests/`
- Strict linting through Clippy
- Complexity gate validated during development
- Formal TEL grammar in `grammar/tel.ebnf`
- Public docs suitable for crates.io and docs.rs

## Release workflow

`tokmat` can be published from GitHub Actions on tag pushes that match `v*`.
The CI workflow already validates formatting, Clippy, tests, docs, and a
publish dry-run. The release workflow should remain limited to crates.io
publication because this repository is the parser kernel, not the Python/Polars
distribution surface.

Release steps:

1. Update the `version` field under `[package]` in `Cargo.toml` to the release version (for example `0.2.0`).
2. Commit the version bump.
3. Create and push a tag matching the version:

```bash
VERSION=0.2.0
git add -A
git commit -m "Release ${VERSION}"
git tag "v${VERSION}"
git push origin "v${VERSION}"
```

Before the first release, add a crates.io API token to the repository secrets
as `CARGO_REGISTRY_TOKEN`.

## Limitations

- The crate is intentionally low-level. It does not try to solve full multi-strategy address
  interpretation by itself.
- TEL is powerful, but it assumes you have a reasonable token model.
- The API focuses on extraction primitives; higher-level strategy orchestration belongs in layers
  above this crate.

## License

MIT. See the `LICENSE` file in the crate root.