# Sentience tokenizer
Tiny zero‑dependency tokenizer for simple DSLs and config/query languages in Rust.
Generic: drop it into parsers, rule engines, interpreters, or build tooling.
Supports identifiers, numbers, strings, operators, and a small set of keywords.
Clear spans, a streaming iterator, and a zero‑copy mode when you want pure speed.
Designed for speed, clarity, and easy embedding.
## Quick start
Install:

```toml
[dependencies]
# package name assumed from the crate title
sentience-tokenizer = "0.2.0"
```
Basic usage:

```rust
// crate path assumed from the crate name
use sentience_tokenizer::tokenize;

let tokens = tokenize(r#"let greeting = "hi""#).unwrap();
println!("{tokens:?}");
```
Streaming iterator (no allocation of a full token vec):

```rust
use sentience_tokenizer::tokenize_iter;

// Each item is a Result<Token, LexError>; field names assumed.
for item in tokenize_iter("let x = 1 + 2") {
    let token = item.unwrap();
    println!("{:?} @{}..{}", token.kind, token.span.start, token.span.end);
}
```
Zero-copy tokens (borrow `&str` slices from the source):

```rust
use sentience_tokenizer::tokenize_borrowed;

let src = r#"let name = "hi""#;
let toks = tokenize_borrowed(src).unwrap();
assert!(!toks.is_empty());
// Tokens borrow &str slices from `src`; span field names assumed.
assert!(toks.iter().all(|t| t.span.end <= src.len()));
```
## Features
- Zero dependencies (only `std`).
- Token kinds: identifiers, numbers, strings, parens/brackets/braces, `=` `+` `-` `*` `/` `->`.
- Keywords: `true` `false` `if` `then` `else` `let` `rule` `and` `or`.
- Spans included for each token (see the sketch after this list).
- Iterator API: `tokenize_iter` yields `Result<Token, LexError>`.
- Zero-copy API: `tokenize_borrowed` returns `BorrowedToken<'_>`/`BorrowedTokenKind<'_>` with `&str` slices.
- Whitespace & `//` comments skipped.
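For instance, a minimal sketch of iterating tokens and their spans (crate path and token field names assumed):

```rust
use sentience_tokenizer::tokenize;

let toks = tokenize("greet(name)").unwrap();
for t in &toks {
    // Each token records its kind plus a byte span into the source:
    // Ident("greet") @0..5, LParen @5..6, Ident("name") @6..10, RParen @10..11
    println!("{:?} @{}..{}", t.kind, t.span.start, t.span.end);
}
```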
Optional features
serde: deriveSerialize/Deserializefor tokens and errors- zero-copy API:
tokenize_borrowedreturnsBorrowedTokenKind<'a>/BorrowedToken<'a>with&strslices (strings keep raw escapes)
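A minimal sketch of the `serde` feature (feature name from the list above; the package name and `serde_json`, used purely for illustration, are assumptions):

```rust
// In Cargo.toml:
//   sentience-tokenizer = { version = "0.2.0", features = ["serde"] }
//   serde_json = "1"
use sentience_tokenizer::tokenize;

let tokens = tokenize("let x = 1").unwrap();
// With `serde` enabled, tokens and errors derive Serialize/Deserialize,
// so a token stream can round-trip through any serde format:
let json = serde_json::to_string_pretty(&tokens).unwrap();
println!("{json}");
```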
## Spec
| Aspect | Rules |
|---|---|
| Identifiers | ASCII: `[A-Za-z_][A-Za-z0-9_]*` |
| Numbers | Decimal integers/decimals; optional exponent (`e` or `E`, optional sign, digits). A single dot is allowed once; `..` is not consumed by numbers. |
| Strings | Double-quoted. Escapes: `\n`, `\t`, `\r`, `\"`, `\\`. Unknown escapes are an error. Raw newlines are accepted. |
| Comments | `//` to end of line. |
| Delimiters | `(` `)` `{` `}` `[` `]` `,` `:` `;` |
| Operators | `=`, `+`, `-`, `*`, `/`, `->` |
| Keywords | `true`, `false`, `if`, `then`, `else`, `let`, `rule`, `and`, `or` |
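A short sketch of the trickier number and string rules above (crate path assumed):

```rust
use sentience_tokenizer::tokenize;

// The exponent belongs to the number, and the trailing comment
// produces no tokens at all.
let toks = tokenize("1.5e+3 // ignored").unwrap();
println!("{toks:?}");

// Unknown escapes such as \x are a lexing error, not a pass-through.
assert!(tokenize(r#""bad \x""#).is_err());
```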
The enum `TokenKind`, the types `Token`/`Span`, the functions `tokenize`/`tokenize_iter`, `LineMap`, and the error types `LexError`/`LexErrorKind` are part of the stable API.

Note: new `TokenKind` variants may be added in minor releases; avoid exhaustive `match` without a `_` catch-all, as sketched below.
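For example (a sketch; the `Let` and `Ident` variant names follow the output shown later in this README):

```rust
use sentience_tokenizer::{tokenize, TokenKind};

for t in tokenize("let x = 1").unwrap() {
    match t.kind {
        TokenKind::Let => println!("keyword `let`"),
        TokenKind::Ident(name) => println!("identifier {name}"),
        // New TokenKind variants may appear in minor releases,
        // so keep a catch-all arm:
        _ => {}
    }
}
```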
## Error Reporting

Lexing errors return a `LexError` with a kind and a span. Example with `LineMap`:

```rust
use sentience_tokenizer::{tokenize, LineMap};

let src = "\"abc\\x\"";            // source contains the unknown escape \x
let map = LineMap::new(src);
let err = tokenize(src).unwrap_err();
// `err.span` field and 1-based (line, col) convention assumed.
let (line, col) = map.to_line_col(err.span.start);
println!("unknown escape at {line}:{col}");
```

Output:

```text
unknown escape at 1:5
```
## Stable API surface

- Types: `TokenKind`, `Token`, `Span`, `BorrowedTokenKind<'a>`, `BorrowedToken<'a>`
- Functions: `tokenize(&str) -> Result<Vec<Token>, LexError>`, `tokenize_iter(&str)`, `tokenize_borrowed(&str) -> Result<Vec<BorrowedToken<'_>>, LexError>`
- Utilities: `LineMap` for byte → (line, col) mapping
- Errors: `LexError`, `LexErrorKind`
## Iterator API example

```rust
use sentience_tokenizer::{tokenize_iter, TokenKind};

// Lex lazily and stop at the first `=` without scanning the rest.
for item in tokenize_iter("let x = 1 + 2") {
    let token = item.unwrap(); // or propagate the LexError
    if matches!(token.kind, TokenKind::Eq) {
        break;
    }
    println!("{:?}", token.kind);
}
```
## Zero-copy example

```rust
use sentience_tokenizer::{tokenize_borrowed, BorrowedTokenKind};

let src = r#"greet = "hello""#;
// Borrowed tokens reference `src` directly instead of allocating Strings.
for t in tokenize_borrowed(src).unwrap() {
    if let BorrowedTokenKind::Ident(name) = t.kind {
        assert_eq!(name, "greet"); // `name` is a &str slice of `src`
    }
}
```
## Install

Add to `Cargo.toml`:

```toml
[dependencies]
# package name assumed from the crate title
sentience-tokenizer = "0.2.0"
```
## Example

```rust
use sentience_tokenizer::tokenize;

// An 18-byte leading comment is assumed so the spans below line up.
let src = "// simple example\nlet rule greet(name) = \"hi, \" + name";
for t in tokenize(src).unwrap() {
    println!("{:?} @{}..{}", t.kind, t.span.start, t.span.end);
}
```
Output (truncated):

```text
Let @18..21
Rule @22..26
Ident("greet") @27..32
LParen @32..33
Ident("name") @33..37
RParen @37..38
Eq @39..40
String("hi, ") @41..47
Plus @48..49
Ident("name") @50..54
...
```
## Run tests

```sh
cargo test
```

## Example binary

```sh
cargo run --example <name>   # use an example name from examples/
```

## Dev

### Benchmark

```sh
cargo bench
```

### Fuzzing

Fuzzing is supported via [`cargo-fuzz`](https://github.com/rust-fuzz/cargo-fuzz) (optional), as sketched below.
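A minimal fuzz target sketch (the `tokenize` target name and file path are assumptions), asserting that arbitrary input can never panic the lexer:

```rust
// fuzz/fuzz_targets/tokenize.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    if let Ok(src) = std::str::from_utf8(data) {
        // Malformed input must surface as Err, never as a panic.
        let _ = sentience_tokenizer::tokenize(src);
    }
});
```

Run it with `cargo fuzz run tokenize`.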
## Why?

- Small, standalone lexer: no macros, no regexes.
- Useful as a foundation for parsers, DSLs, or interpreters.
- Explicit spans for better error reporting.
## Background

For more context and design motivation, see my blog post: *Designing a zero-dependency tokenizer in Rust*.
## License

MIT © 2025 Nenad Bursać