burn_dragon_tokenizer
Tokenizer primitives for the burn_dragon workspace.
This crate currently provides a compact GPT-style byte-pair tokenizer used by Dragon language experiments. It is a Rust library crate in this workspace; the optional PyO3 bindings in the code are for local experimentation and are not the published package contract.
features
- GPT-4 style regex pre-tokenization by default
- parallel BPE training with `rayon`
- encode/decode helpers over byte-level token ids
- optional `python-bindings` and `extension-module` features for local PyO3 experiments
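To make the byte-level BPE idea above concrete, here is a minimal single-threaded sketch using only the standard library. It is not this crate's implementation: the names `Merge`, `train`, `encode`, and `decode` are hypothetical, and the regex pre-tokenization and `rayon` parallelism are deliberately left out. It assumes token ids 0..=255 are raw bytes and that merged tokens are numbered from 256 in the order they are learned.

```rust
use std::collections::HashMap;

/// A learned merge: every adjacent `pair` of ids is replaced by `new_id`.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Merge {
    pub pair: (u32, u32),
    pub new_id: u32,
}

/// Replace non-overlapping occurrences of `pair` with `new_id`, left to right.
fn merge_pair(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges from `text`, stopping early once no
/// adjacent pair occurs at least twice.
pub fn train(text: &str, num_merges: usize) -> Vec<Merge> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for step in 0..num_merges {
        // Count every adjacent pair in the current sequence.
        let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
        for w in ids.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
        // Pick the most frequent pair; ties go to the smallest pair so the
        // result does not depend on hash-map iteration order.
        let best = counts
            .into_iter()
            .filter(|&(_, count)| count >= 2)
            .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)));
        let (pair, _) = match best {
            Some(found) => found,
            None => break,
        };
        let new_id = 256 + step as u32;
        ids = merge_pair(&ids, pair, new_id);
        merges.push(Merge { pair, new_id });
    }
    merges
}

/// Encode `text` by applying the merges in the order they were learned.
pub fn encode(text: &str, merges: &[Merge]) -> Vec<u32> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    for m in merges {
        ids = merge_pair(&ids, m.pair, m.new_id);
    }
    ids
}

/// Decode ids back to raw bytes. Panics on ids outside the vocabulary and
/// relies on merges being numbered consecutively from 256, as `train` does.
pub fn decode(ids: &[u32], merges: &[Merge]) -> Vec<u8> {
    // Byte expansion of each token: 256 single bytes, then one per merge.
    let mut vocab: Vec<Vec<u8>> = (0u8..=255).map(|b| vec![b]).collect();
    for m in merges {
        let mut bytes = vocab[m.pair.0 as usize].clone();
        bytes.extend_from_slice(&vocab[m.pair.1 as usize]);
        vocab.push(bytes);
    }
    ids.iter()
        .flat_map(|&id| vocab[id as usize].iter().copied())
        .collect()
}
```

With `let merges = train("aaabdaaabac", 3);`, calling `encode("aaabdaaabac", &merges)` yields five ids, and `decode` on those ids returns the original bytes. A production tokenizer would typically pre-split the text first and count pairs per chunk, which is also what makes the counting step easy to parallelize.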
rust use
use burn_dragon_tokenizer::Tokenizer;

let mut tokenizer = Tokenizer::new(...);
tokenizer.train_from_texts(...)?;
let ids = tokenizer.encode(...);
let text = tokenizer.decode_to_string(...)?;
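The `?` on `decode_to_string` shows that decoding to text can fail. One likely reason, stated here as an assumption about byte-level tokenizers in general rather than something taken from this crate's source, is that an arbitrary sequence of byte-level token ids need not decode to valid UTF-8. The sketch below uses only standard-library calls; the helper names are hypothetical.

```rust
use std::string::FromUtf8Error;

/// Strict conversion: returns an error if the decoded bytes are not valid UTF-8.
/// (Hypothetical helper, not part of the crate's API.)
pub fn bytes_to_string(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Lossy conversion: invalid sequences are replaced with U+FFFD.
pub fn bytes_to_string_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, `é` is the two bytes `0xC3 0xA9`; a token sequence that ends after `0xC3` fails the strict conversion, while the lossy one returns a single replacement character.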
local checks
The workspace-level smoke and CI commands remain the source of truth for release readiness.