tokmat-polars
Standalone Polars integration crate for tokmat.
This crate depends on the published tokmat package from crates.io and is
intended to provide the dataframe-facing plugin layer around the core parser.
Rust usage
tokmat-polars can also be used directly from Rust as a normal library crate.
That path is useful when you want to build Series values in Rust and reuse the
same tokenization and extraction logic without going through Python.
use *;
use TokmatPolars;
let plugin = from_model_path?;
let input = new;
let tokenized = plugin.tokenize_series?;
let extracted = plugin.extract_series?;
# let _ = extracted;
# Ok::
Python packaging
tokmat-polars can also be built and published as a Python package via
maturin. The Rust crate exposes a PyO3 extension module named
tokmat_polars, and Polars can load the compiled plugin functions from that
module path. Python support starts at 3.12.
Typical local workflow:
Release workflow
PyPI releases are published from GitHub Actions using Trusted Publishing. The
release workflow builds wheels and an sdist on tag pushes that match v*, then
uploads them through pypa/gh-action-pypi-publish. The same tag also publishes
the Rust crate to crates.io using a CARGO_REGISTRY_TOKEN GitHub secret.
Release steps:
Before the first release, configure this repository as a Trusted Publisher in
the PyPI project settings and set the workflow environment to pypi if PyPI
prompts for it. For crates.io, create an API token and store it in GitHub as
CARGO_REGISTRY_TOKEN.
Minimal Python usage:
= .
=
Plugin API
tokmat-polars exposes two Polars plugin functions:
tokenize_exprextract_expr
tokenize_expr
Required kwargs:
model_path
Returns a struct column with:
raw_valuetokenstypesclasses
extract_expr
Required kwargs:
model_pathpattern
Optional kwargs:
mode
Supported mode values:
whole(default)startendany
When extract_expr receives a tokenized struct column, it uses the embedded
raw_value field when computing complements. This preserves any-mode
behavior and keeps complement output aligned with the original text rather than
with a placeholder reconstruction.
Example:
=
This returns capture fields plus a complement column. In the example above,
the complement contains "ATTN " because the TEL pattern matches only the
embedded address portion.