panko
A small, zero-copy text tokenizer for Rust.
Why "panko"?
Panko (パン粉) is the Japanese word for breadcrumbs — light, airy crumbs made by breaking bread into small, uniform pieces. This crate does the same thing to text: it crumbles a string into a flat stream of small, uniform tokens — words, single-character symbols, and newlines — that you can sift, sort, and reassemble.
Like real panko, the pieces are kept light: every token borrows directly from
the input slice, so tokenizing allocates only the output Vec.
What it does
panko segments arbitrary text — including Unicode and mixed scripts
(Cyrillic, emoji, …) — into a Vec<Token> where every Token is one of:
- `Word(&str)`: a contiguous run of `is_alphanumeric()` characters. Words compare case-insensitively (via `unicase`), so `Word("Строка")` equals `Word("строкА")`.
- `Symbol(&str)`: any single non-alphanumeric character (punctuation, spaces, emoji, …). Symbols compare byte-exact.
- `Newline`: emitted between lines, and trailing if the input ends with `\n`.
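The segmentation rules above can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual source: the `Token` enum here is a hypothetical stand-in, `tokenize_sketch` is an invented name, and the derived equality is byte-exact (the case-insensitive comparison is left out to keep the focus on borrowing).

```rust
/// Hypothetical stand-in for the crate's `Token`; the real definition may differ.
/// Equality here is derived (byte-exact), unlike the crate's case-insensitive words.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Token<'a> {
    Word(&'a str),
    Symbol(&'a str),
    Newline,
}

/// Splits `input` into tokens that borrow from it; only the returned Vec is allocated.
pub fn tokenize_sketch(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;

    for (idx, ch) in input.char_indices() {
        if ch.is_alphanumeric() {
            // Remember where the current run of word characters began.
            if word_start.is_none() {
                word_start = Some(idx);
            }
            continue;
        }
        // Any non-alphanumeric character ends the word in progress.
        if let Some(start) = word_start.take() {
            tokens.push(Token::Word(&input[start..idx]));
        }
        if ch == '\n' {
            tokens.push(Token::Newline);
        } else {
            // Slice exactly one character, however many bytes it occupies.
            tokens.push(Token::Symbol(&input[idx..idx + ch.len_utf8()]));
        }
    }
    // Flush a word that runs to the end of the input.
    if let Some(start) = word_start {
        tokens.push(Token::Word(&input[start..]));
    }
    tokens
}
```

Because every `Word` and `Symbol` is a sub-slice of `input`, the borrow checker ties the tokens' lifetime to the input string. How the real crate treats `\r` is not stated above; this sketch simply emits it as a `Symbol`.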
On top of the tokenizer, panko ships a few small helpers:
| Function | Purpose |
|---|---|
| `tokenize` | `&str` → `Vec<Token<'_>>` |
| `tokens_to_string` | Reassemble tokens back into a `String` |
| `split_tokens` | Split a token stream by any of N separator tokens (case-insensitive) |
| `normalize_text` | Lowercase + collapse Russian ё to е |
| `fit_message` | Slice a string into ≤ N-character, UTF-8-safe chunks (caller picks N) |
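As a rough illustration of what the `normalize_text` row describes, the following sketch lowercases a string and then folds ё into е. The name `normalize_text_sketch` and its signature are assumptions made for this example; the crate's real function may differ in both.

```rust
/// Hypothetical sketch: lowercase the text, then replace ё (U+0451) with е (U+0435).
/// Uppercase Ё (U+0401) is covered because lowercasing runs first.
pub fn normalize_text_sketch(text: &str) -> String {
    text.to_lowercase().replace('\u{0451}', "\u{0435}")
}
```

Folding ё into е is useful for matching Russian text, where the two letters are often used interchangeably in writing.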
Quick start
1. Add the dependency
[dependencies]
panko = "0.1"
2. Tokenize some text
use ;
let tokens = tokenize;
assert_eq!;
3. Case-insensitive word comparisons
use Word;
assert_eq!;
assert_eq!;
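To show how case-insensitive word equality can sit alongside byte-exact symbol equality, here is a small sketch with a hand-written `PartialEq`. It is an approximation under stated assumptions: the `Token` enum is a hypothetical stand-in, and the comparison uses the standard library's per-character lowercase mapping rather than `unicase`, whose case folding may give different results for some scripts.

```rust
/// Hypothetical stand-in for the crate's `Token`.
#[derive(Debug, Clone, Copy)]
pub enum Token<'a> {
    Word(&'a str),
    Symbol(&'a str),
    Newline,
}

impl PartialEq for Token<'_> {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            // Words: compare lowercase mappings character by character, no allocation.
            (Token::Word(a), Token::Word(b)) => a
                .chars()
                .flat_map(char::to_lowercase)
                .eq(b.chars().flat_map(char::to_lowercase)),
            // Symbols: plain byte-exact string comparison.
            (Token::Symbol(a), Token::Symbol(b)) => a == b,
            (Token::Newline, Token::Newline) => true,
            // Different variants never compare equal.
            _ => false,
        }
    }
}
```

With this impl, `Token::Word("Строка") == Token::Word("строкА")` holds, while a `Word` never equals a `Symbol` even when the text matches.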
4. Split a stream by multiple separators
split_tokens cuts a token stream every time it sees a token that matches one
of the separators (using Token's case-insensitive equality), and concatenates
the rest into one String per chunk. A leading space after a separator is
dropped, so chunks read cleanly.
use ;
let tokens = ;
let separators = ;
assert_eq!;
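The splitting behaviour described in this step can be sketched as follows. Everything here is illustrative: `Token` is a hypothetical stand-in, `split_tokens_sketch` and its slice-based signature are assumptions, and word equality is approximated with `to_lowercase` instead of `unicase`. The sketch also assumes that a trailing separator produces an empty final chunk, which the text above does not specify.

```rust
/// Hypothetical stand-in for the crate's `Token`.
#[derive(Debug, Clone, Copy)]
pub enum Token<'a> {
    Word(&'a str),
    Symbol(&'a str),
    Newline,
}

impl PartialEq for Token<'_> {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            // Approximates case-insensitive equality with std lowercase mapping.
            (Token::Word(a), Token::Word(b)) => a.to_lowercase() == b.to_lowercase(),
            (Token::Symbol(a), Token::Symbol(b)) => a == b,
            (Token::Newline, Token::Newline) => true,
            _ => false,
        }
    }
}

impl Token<'_> {
    /// Text this token contributes when chunks are reassembled.
    fn as_str(&self) -> &str {
        match *self {
            Token::Word(s) | Token::Symbol(s) => s,
            Token::Newline => "\n",
        }
    }
}

/// Cuts `tokens` at every token equal to one of `separators`,
/// returning the text between cuts as one String per chunk.
pub fn split_tokens_sketch<'a>(tokens: &[Token<'a>], separators: &[Token<'a>]) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut after_separator = false;

    for token in tokens {
        if separators.iter().any(|sep| sep == token) {
            // Close the current chunk; the separator itself is discarded.
            chunks.push(std::mem::take(&mut current));
            after_separator = true;
            continue;
        }
        // Drop a single space that directly follows a separator.
        let leading_space = after_separator && *token == Token::Symbol(" ");
        after_separator = false;
        if !leading_space {
            current.push_str(token.as_str());
        }
    }
    chunks.push(current);
    chunks
}
```

Splitting the tokens of `red, green AND blue` on `Symbol(",")` and `Word("and")` gives three chunks: the `AND` token matches the lowercase separator, and the space after each separator is dropped.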
5. Fit long messages into fixed-size chunks
`fit_message` slices a string into pieces of at most N characters each, where
N is supplied by the caller as a `NonZeroUsize`, so a chunk size of zero cannot
be represented and is ruled out before `fit_message` is ever called. Useful for
chat APIs with per-message character limits (Telegram = 4096, Discord = 2000, etc.).
use std::num::NonZeroUsize;
use panko::fit_message;
let size = new.unwrap_or;
let long = "x".repeat;
let chunks = fit_message;
assert_eq!;
assert_eq!;
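For readers curious how character-bounded, UTF-8-safe chunking can work, here is a minimal sketch. The function name `fit_message_sketch`, the `Vec<&str>` return type, and the choice to return no chunks for empty input are all assumptions for illustration; the crate's real `fit_message` may differ.

```rust
use std::num::NonZeroUsize;

/// Hypothetical sketch: split `text` into borrowed chunks of at most
/// `size` characters, always cutting on a character boundary.
pub fn fit_message_sketch(text: &str, size: NonZeroUsize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut rest = text;

    while !rest.is_empty() {
        // Byte offset where the first `size` characters end; if fewer
        // characters remain, take everything that is left.
        let split_at = rest
            .char_indices()
            .nth(size.get())
            .map_or(rest.len(), |(idx, _)| idx);
        let (head, tail) = rest.split_at(split_at);
        chunks.push(head);
        rest = tail;
    }
    chunks
}
```

Counting with `char_indices` rather than byte offsets keeps each cut on a character boundary, so multi-byte text such as Cyrillic is never split mid-character. Because `size` is at least 1, every iteration consumes at least one character and the loop always terminates.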
Development