Minimal Parsing Language (MPL)
A small parsing language for Rust built around a single rule form A = B C / D,
with a static-codegen backend (FastParse) that beats pest, peg, nom,
winnow, chumsky, and even serde_json on equal-work recognition benchmarks.
MPL is deliberately minimal: TDPL/GTDPL is equivalent to PEG in recognition power,
and one rule shape is enough to encode sequence, choice, recursion, repetition,
and negative predicates. The compile-time analyzer (mpl-macro) detects char-class
cascades, multi-prefix dispatch, tail loops, and direct left recursion in your
grammar, then generates one inlined Rust function per rule with the appropriate
fast path automatically.
Highlights
- Static code generation via
#[derive(FastParse)]reads your.mplgand emits one function per rule — noHashMaprule lookup, no trait dispatch. - Six automatic optimizations at codegen time:
- char-class cascade →
[bool; 256]byte table - multi-prefix cascade →
match input[pos]first-byte dispatch R = X R / ()→loop {}(LLVM TCO is unreliable forResult<u32,()>)- direct left recursion → Squirrel-style seed-growing, position-only memo
- Mizushima cut analysis: rules with disjoint FIRST sets are exempt from memoisation
- thin
CheckStatefor recognition-only paths (peg-compatible calling convention)
- char-class cascade →
- Squirrel left recursion:
R = R x / Yworks directly — no grammar rewriting, noprecedence!macro, no manual seed handling. - Beats every comparator measured (criterion, sample size 100,
RUSTFLAGS=-C target-cpu=native).
Performance
All numbers below are recognition-only (bool return), apples-to-apples.
Lower is better; first-place per row in bold.
Parentheses (no backtracking, depth N)
| Size | mpl-fast-check | peg | winnow | nom | pest | chumsky |
|---|---|---|---|---|---|---|
| nested-1024 | 1.96 µs | 7.06 µs | 7.17 µs | 6.91 µs | 85.6 µs | 20.3 µs |
| nested-4096 | 8.12 µs | 29.3 µs | 27.5 µs | 26.7 µs | 333 µs | 77.4 µs |
| flat-1024 | 1.80 µs | 3.86 µs | 3.76 µs | 3.99 µs | 74.4 µs | 14.2 µs |
JSON (full subset with escapes)
mpl/json.mplg mirrors the JSON spec for whitespace, string escapes
(\\n, \\t, \\\\, \\", etc.), objects, arrays, integers, and the
literals. peg and pest grammars match the same subset.
Each parser library produces a different default output shape, so we stratify the comparison into three apples-to-apples tiers.
Tier 1: Recognition only (no parse tree)
| Shape (~ size) | mpl-fast-gen | peg () |
mpl/peg |
|---|---|---|---|
| json-1024 (~85 KB) | 130 µs | 241 µs | mpl 1.86× faster |
| canada-1024 | 17.8 µs | 41.0 µs | mpl 2.31× faster |
| twitter-256 | 31.5 µs | 56.5 µs | mpl 1.79× faster |
| citm-1024 | 82.3 µs | 159 µs | mpl 1.94× faster |
Tier 2: Flat token / Pair queue construction
mpl-fast-gen-tokens is fast_parse (Vec<Token>); pest always
emits its Pair queue (Vec<QueueableToken>). Both produce navigable
token streams that the caller walks — same kind of work.
| Shape | mpl-fast-gen-tokens | pest | mpl/pest |
|---|---|---|---|
| json-1024 | 1.34 ms | 1.57 ms | mpl 1.17× faster |
| canada-1024 | 378 µs | 229 µs | pest 1.65× faster |
| twitter-256 | 334 µs | 368 µs | mpl 1.10× faster |
| citm-1024 | 970 µs | 1.17 ms | mpl 1.21× faster |
pest wins on canada-1024 because mpl's single-rule TDPL form emits
a Start/End pair per variable per match — finer-grained than pest's
non-silent-rule queue. For the deeply-nested coordinate arrays in
canada this multiplies the token count. Every other shape mpl wins.
Tier 3: Typed AST construction
Only serde_json builds a typed Value tree directly — that is much
more work than recognition or flat tokens, so a fair direct comparison
would need an mpl::Token → Value walker (not yet shipped). For
reference, here is what each tier looks like on the same input:
| Shape | mpl-fast-gen-recognize | mpl-fast-gen-tokens | serde_json (Value) |
|---|---|---|---|
| json-1024 | 130 µs | 1.34 ms | 572 µs |
| canada-1024 | 17.8 µs | 378 µs | 85.2 µs |
| twitter-256 | 31.5 µs | 334 µs | 111 µs |
| citm-1024 | 82.3 µs | 970 µs | 399 µs |
serde_json is the de-facto reference and is between the recognition
tier and the token tier — it builds nested Vec<Value> / BTreeMap
allocations directly, which is faster than mpl's per-variable Start/End
emission for these shapes but slower than pure recognition.
What "AST output" means in mpl
mpl-fast-gen exposes three output tiers; the row labels in the bench tables map to these directly.
| Method | Output | Tier |
|---|---|---|
fast_recognize_only(&[u8]) -> bool |
nothing — just a yes/no | Tier 1 |
fast_recognize(&[u8]) -> bool |
builds tokens internally; returns bool | Tier 2 (close to pest's parse(...).is_ok()) |
fast_parse(&[u8]) -> Result<Vec<Token>, ParseError> |
flat token buffer | Tier 2 (apples-to-apples with pest's Pairs) |
The flat token buffer is navigable as a typed AST view via
[mpl::fast::ParseTree] — Pair/Leaf/NodeView/Children provide
zero-copy iteration, span ranges, and parent/child traversal. This
matches pest's Pairs API.
Building a typed Value-style enum from the token buffer (e.g. for a
JSON deserializer) is a small additional walker that the caller writes;
the cost is added on top of fast_parse.
Left-recursive (LSum = LSum N / N)
peg here uses #[cache_left_rec] (van Rossum seed-growing, the same family
as MPL's Squirrel pass).
| Size | mpl-fast-gen | peg cache-lr | mpl-squirrel-recognize (trait) |
|---|---|---|---|
| lsum-64 | 2.25 µs | 2.54 µs | 30.1 µs |
| lsum-256 | 8.42 µs | 10.7 µs | 119 µs |
| lsum-1024 | 24.5 µs | linear | 397 µs |
Across lsum-4 → lsum-1024, mpl-fast-gen scales 4×-per-step (linear) just
like peg, and beats it by 1.04×–1.27× on every size measured.
Bench harness and grammars are in examples/paren_bench/.
Quick Start
Cargo.toml:
[]
= "0.3"
= "0.2"
The fast path is #[derive(FastParse)]. Write a grammar file and derive:
use FastParse;
;
Where parens.mplg:
Open = { Char('(') } Parentheses / ()
Parentheses = Open Close / f
Close = { Char(')') } Open / f
The derive emits four entry points on the parser struct:
| Method | What it returns | When to use |
|---|---|---|
fast_recognize_only(&[u8]) -> bool |
bool, no AST | hot recognition; thin CheckState, peg-compatible calling convention |
fast_recognize(&[u8]) -> bool |
bool, builds tokens internally | parity with pest (always materialises a token queue) |
fast_parse(&[u8]) -> Result<Vec<Token>, ()> |
flat token buffer | walk the parse tree |
fast_recognize_in(&[u8], &Bump) -> bool |
bool, externally-supplied arena | high-throughput pipelines that reset Bump between parses |
For an AST view over the flat token buffer, use [mpl::fast::ParseTree] —
it provides Pair / Leaf / Children iterators without any extra allocation.
Three API tiers
| Tier | API | Codegen | Tradeoff |
|---|---|---|---|
| 1. Hand-rolled | Parser::parse (trait) |
none — interprets RightRule data |
full control over rules, slowest |
| 2. Legacy derive | #[derive(Parse)] |
generates Rules HashMap impl |
API stable since 0.2; same speed as tier 1 |
| 3. Fast derive | #[derive(FastParse)] |
one static fn per rule + auto-optimisations |
fastest path; recommended for new code |
Tiers 1 and 2 stay supported for back-compat. New code should use tier 3.
Migration: typically a one-line change from #[derive(Parse)] to #[derive(FastParse)],
plus replacing parser.parse(...) with parser.fast_parse(...).
How the codegen earns its speed
The FastParse derive runs four compile-time analyses against your grammar
and picks the cheapest valid lowering for each rule. Concretely, for the
included BubFns (99-rule arithmetic) grammar:
OrOrExpr = AndAndExpr OrOrExpr1 / AndAndExprandPower = Atom PowerAndFactor / Atom→ flagged as needing memoisation by Mizushima cut analysis (FIRST sets overlap). Other 92 rules are LL(1)-disjoint and skip the cache entirely.Variable = UppercaseX () / Variable1andDecDigit = '0' () / DecDigit1cascades → collapsed into[bool; 256]byte tables.Atom = ExprInParens / FloatLiteral / IntegerLiteral / Function / Variable / Constant→ first-byte dispatch (singlematch input[pos]).ZeroOrMoreDecDigits = DecDigit ZeroOrMoreDecDigits / ()→while let Ok(p) = parse_dec_digit::<false>(state, p) { ... }(no recursion).
Recognition-only callers go through a thin CheckState (input + furthest pos
only); emit-mode callers go through ParserState and produce flat Token
buffers (Sampson-style — no nested boxes).
Left recursion: Squirrel algorithm
When a rule has the shape R = R X / Y, FastParse automatically routes
recognition through Squirrel-style seed-growing memoisation. No grammar
rewriting needed — the algorithm is invoked transparently.
LSum = LSum N / N
N = { Char('1') } () / N1
...
Generated code holds a position-only (rule_id, pos) -> Option<u32> memo
per parse and iterates seed expansion until convergence. This matches peg's
#[cache_left_rec] family and was the basis on which we beat peg on
lsum-256 and lsum-1024.
Legacy APIs (tier 1 & 2)
The original 0.2 APIs remain supported. Pick one only if you need
features FastParse does not yet cover (custom Output type integration,
runtime-built grammars, or non-byte input).
Tier 2: legacy #[derive(Parse)]
Generates a Rules HashMap impl plus the Parser trait method calls.
Identical perf characteristics to tier 1; identical AST shape.
parentheses.mplg:
Open = { Char('(') } Parentheses / ()
Parentheses = Open Close / f
Close = { Char(')') } Open / f
main.rs:
use Parser;
use StartAndLenSpan;
use AST;
use Parse;
;
The derive strips a trailing Parser from the struct name and uses the
remaining stem (here Paren) to generate ParenVariable and ParenRules.
Tier 1: build a parser in Rust
If you want the same one-rule model with complete control over rules and outputs, define them directly in Rust.
use crate*;
use Parser;
use ;
use ;
use Output;
use ;
use AST;
use HashMap;
;
/// ```
/// Open = '(' Parentheses / ()
/// Parentheses = Open Close / f
/// Close = ")" Open / f
/// ```
Examples
MPL
Definition of MPL grammar
A MPL grammar G is a tuple G = (V, Σ, R, S) in which:
Vis a finite set of variables.Σis a finite set of original terminal symbols.Tis an union ofΣorM(Σ ∪ M) (M(= {(), f}) is a finite set of metasymbols).Ris a finite set of rules of the formA = B C / D
A in V (A ∈ V),
B, C, D in E (E = T ∪ V) (T ∩ V = ∅) (B, C, D ∈ E).
For any variable A there is exactly one rule with A to the left of=.
- S in V (S ∈ V) is the start variable.
Empty
() is a metasymbol that always succeeds without consuming input.
Empty = /
Failure
f is a metasymbol that always fails without consuming input.
Failure = f f / f
Extended MPL
Since one of the goals of MPL is to create an AST, it also supports two features in terms of ease of use and speed.
Any
? is a metasymbol representing any single input like wildcard character. This succeeds if there is any input left, and fails if there is no input left.
Any = ? / f
To extend the difinition of MPL grammar, let ? ∈ M.
All
* is a metasymbol representing All remaining input like wildcard character. This will succeed even if the remaining inputs are zero.
All = * / f
Same as All = ? All / ().
To extend the difinition of MPL grammar, let * ∈ M.
Difference between TDPL and MPL
The core design choice in MPL is to reduce the grammar to one rule form. TDPL has two rule forms, and PEGs were later shown to be reducible to TDPL/GTDPL with equivalent effective recognition power.
A..BC/D, A,B,C,D in V.
A..a, a in ∑ ∪ {ε, f}, f is a metasymbol not in ∑ and ε is the null string.
MPL, on the other hand, keeps everything in a single form:
A = B C / D
That is the main simplification and the main selling point of the language: MPL aims to keep TDPL/PEG-class expressive power in a much smaller surface syntax.
MPLG (MPL Grammar) syntax
In MPL grammar
// Hierarchical syntax
Mplg = ZeroOrMoreLines / f
ZeroOrMoreLines = Line ZeroOrMoreLines /
Line = Line1 EndOfLine / f
Line1 = LineComment / Line2
Line2 = Rule /
Rule = Variable Rule1 / f
Rule1 = " = " Rule2 / f
Rule2 = E Rule3 / f
Rule3 = Space Rule4 / f
Rule4 = E Rule5 / f
Rule5 = " / " Rule6 / f
Rule6 = E / f
E = TerminalSymbol / Variable
// Lexical syntax
// Variable
Variable = Identifier / f
// Terminal symbol
TerminalSymbol = MetasymbolLiteral / OriginalSymbolExpr
// Expr
Expr = ExprWithoutBlock / f
// Without Block
ExprWithoutBlock = LiteralExpr / ExprWithoutBlock1
ExprWithoutBlock1 = StructExpr / f
// Struct
StructExpr = StructExprStruct / StructExpr1
StructExpr1 = StructExprTuple / StructExprUnit
StructExprStruct = f f / f
StructExprTuple = PathInExpr StructExprTuple1 / f
StructExprTuple1 = '(' StructExprTuple2 / f
StructExprTuple2 = ZeroOrMoreExpr ')' / f
ZeroOrMoreExpr = Expr / f
StructExprUnit = PathInExpr / f
// PathInExpr
PathInExpr = ZeroOrOneDoubleColon OneOrMorePathExprSegment / f
ZeroOrOneDoubleColon = "::" /
OneOrMorePathExprSegment = PathExprSegment / f
PathExprSegment = PathIdentSegment PathExprSegment1 / f
PathExprSegment1 = "::" GenericArgs /
PathIdentSegment = Identifier / f
GenericArgs = f f / f
// Literal
LiteralExpr = CharLiteral / LiteralExpr1
LiteralExpr1 = StringLiteral / LiteralExpr2
LiteralExpr2 = IntegerLiteral / f
// Metasymbol
MetasymbolLiteral = EmptyLiteral / MetasymbolLiteral1
MetasymbolLiteral1 = FailureLiteral / MetasymbolLiteral2
MetasymbolLiteral2 = AnyLiteral / MetasymbolLiteral3
MetasymbolLiteral3 = AllLiteral / f
EmptyLiteral = "()" / f
FailureLiteral = 'f' / f
AnyLiteral = '?' ZeroOrMoreAny / f
ZeroOrMoreAny = '?' ZeroOrMoreAny /
AllLiteral = '*' / f
// Original symbol
OriginalSymbolExpr = "{ " OriginalSymbolExpr1 / f
OriginalSymbolExpr1 = ExprWithoutBlock " }" / f
// Char
CharLiteral = '\'' CharLiteral1 / f
CharLiteral1 = InnerCharLiteral '\'' / f
InnerCharLiteral = NotCharLetter InnerCharLiteral1 / f
NotCharLetter = '\'' * /
InnerCharLiteral1 = QuoteEscape / ?
// String
StringLiteral = '"' StringLiteral1 / f
StringLiteral1 = InnerStringLiteral '"' / f
InnerStringLiteral = InnerStringLiteralLetter InnerStringLiteral /
// InnerStringLiteralLetter
InnerStringLiteralLetter = NotStringLetter InnerStringLiteralLetter1 / f
NotStringLetter = '"' * /
InnerStringLiteralLetter1 = QuoteEscape / ?
// Integer
IntegerLiteral = IntegerLiterals / f
IntegerLiterals = DecLiteral / f
DecLiteral = DecDigit ZeroOrMoreDecDigit / f
ZeroOrMoreDecDigit = DecDigitOrUnderscore ZeroOrMoreDecDigit /
DecDigitOrUnderscore = DecDigit / '_'
// IDENTIFIER
Identifier = Uppercase ZeroOrMoreIdentifierContinue / f
ZeroOrMoreIdentifierContinue = IdentifierContinue ZeroOrMoreIdentifierContinue /
IdentifierContinue = Alphabet / DecDigit
// Letters
Alphabet = Lowercase / Uppercase
// Lowercase
LowercaseAToF = LowercaseAToF1 / f
LowercaseAToF1 = 'a' / LowercaseAToF2
LowercaseAToF2 = 'b' / LowercaseAToF3
LowercaseAToF3 = 'c' / LowercaseAToF4
LowercaseAToF4 = 'd' / LowercaseAToF5
LowercaseAToF5 = 'e' / LowercaseAToF6
LowercaseAToF6 = 'f' / f
Lowercase = LowercaseAToF / Lowercase1
Lowercase1 = 'g' / Lowercase2
Lowercase2 = 'h' / Lowercase3
Lowercase3 = 'i' / Lowercase4
Lowercase4 = 'j' / Lowercase5
Lowercase5 = 'k' / Lowercase6
Lowercase6 = 'l' / Lowercase7
Lowercase7 = 'm' / Lowercase8
Lowercase8 = 'n' / Lowercase9
Lowercase9 = 'o' / Lowercase10
Lowercase10 = 'p' / Lowercase11
Lowercase11 = 'q' / Lowercase12
Lowercase12 = 'r' / Lowercase13
Lowercase13 = 's' / Lowercase14
Lowercase14 = 't' / Lowercase15
Lowercase15 = 'u' / Lowercase16
Lowercase16 = 'v' / Lowercase17
Lowercase17 = 'w' / Lowercase18
Lowercase18 = 'x' / Lowercase19
Lowercase19 = 'y' / Lowercase20
Lowercase20 = 'z' / f
// Uppercase
UppercaseAToF = UppercaseAToF1 / f
UppercaseAToF1 = 'A' / UppercaseAToF2
UppercaseAToF2 = 'B' / UppercaseAToF3
UppercaseAToF3 = 'C' / UppercaseAToF4
UppercaseAToF4 = 'D' / UppercaseAToF5
UppercaseAToF5 = 'E' / UppercaseAToF6
UppercaseAToF6 = 'F' / f
Uppercase = UppercaseAToF / Uppercase1
Uppercase1 = 'G' / Uppercase2
Uppercase2 = 'H' / Uppercase3
Uppercase3 = 'I' / Uppercase4
Uppercase4 = 'J' / Uppercase5
Uppercase5 = 'K' / Uppercase6
Uppercase6 = 'L' / Uppercase7
Uppercase7 = 'M' / Uppercase8
Uppercase8 = 'N' / Uppercase9
Uppercase9 = 'O' / Uppercase10
Uppercase10 = 'P' / Uppercase11
Uppercase11 = 'Q' / Uppercase12
Uppercase12 = 'R' / Uppercase13
Uppercase13 = 'S' / Uppercase14
Uppercase14 = 'T' / Uppercase15
Uppercase15 = 'U' / Uppercase16
Uppercase16 = 'V' / Uppercase17
Uppercase17 = 'W' / Uppercase18
Uppercase18 = 'X' / Uppercase19
Uppercase19 = 'Y' / Uppercase20
Uppercase20 = 'Z' / f
QuoteEscape = "\\'" / "\\\""
EndOfLine = "\r\n" / '\n'
Space = ' ' / f
// Digits
BinDigit = "0" / "1"
OctDigit = BinDigit / OctDigit1
OctDigit1 = "2" / OctDigit2
OctDigit2 = "3" / OctDigit3
OctDigit3 = "4" / OctDigit4
OctDigit4 = "5" / OctDigit5
OctDigit5 = "6" / OctDigit6
OctDigit6 = "7" / f
DecDigit = OctDigit / DecDigit1
DecDigit1 = "8" / DecDigit2
DecDigit2 = "9" / f
// Comment
LineComment = "//" InnerLineComment / f
InnerLineComment = AnyExceptLF InnerLineComment /
AnyExceptLF = AnyExceptLF1 ? / f
AnyExceptLF1 = EndOfLine * /
TODO
Tasks
- into_first() in CST
- Add { Original } in mplg
- Add functions that easy to get Variable from AST
- Add RowColSpan
- Create parser from MPLG file.
- Error Handling
- Packrat Parsing
- Left Recursion
Next implementation
- Add functions that easy to get Variable from AST
- Can be Variable in Leaf Node
- Error Handling
Practice
Sequence
A <- e1 e2
A = e1 e2 / f
Choice
A <- e1 / e2
A = e1 / e2
Zero or more
A <- e*
A = e A /
Not predicate
A <- !e ?
A = B ? / f
B = e * /
References
Foundations of TDPL / PEG / packrat parsing
- Alexander Birman. The TMG Recognition Schema. PhD thesis, Princeton University, February 1970. (TDPL — the "one rule" formulation that MPL specialises.)
- Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation and Compiling — Vol. I: Parsing. Prentice Hall, 1972.
- Bryan Ford. 2002. Packrat parsing: a practical linear-time algorithm with backtracking. PhD dissertation, MIT.
- Bryan Ford. 2004. Parsing expression grammars: a recognition-based syntactic foundation. POPL 2004, pp. 111–122.
Algorithms used by mpl-macro's optimisations
- Kota Mizushima, Atusi Maeda, and Yoshinori Yamaguchi. 2010.
Packrat parsers can handle practical grammars in mostly constant
space. PASTE 2010. — FIRST-set / cut analysis used by
cascade::analyze_char_classesandcut::compute_first_setsto exclude LL(1)-disjoint rules from memoisation. - Alessandro Warth, James R. Douglass, and Todd Millstein. 2008.
Packrat parsers can support left recursion. PEPM 2008.
— seed-growing iteration used in
Parser::squirrel_parseand theFastParse-generated LR codegen. - Sérgio Medeiros, Fabio Mascarenhas, and Roberto Ierusalimschy. 2014. Left recursion in Parsing Expression Grammars. Science of Computer Programming 96, pp. 177–190.
- Luke A. D. Hutchison. 2020. Pika parsing: reformulating packrat parsing as a dynamic programming algorithm solves the left recursion and error recovery problems. arXiv preprint arXiv:2005.06444.
- Luke A. D. Hutchison. 2026. Squirrel: a unified packrat parsing
algorithm with linear-time guarantees, optimal error recovery, and
full left-recursion support. arXiv preprint
arXiv:2601.05012. — the
family of position-only seed-growing memoisation MPL's
squirrel_recognizeand theFastParseLR path implement.
Parser data-layout influences
- Adrian Sampson. Flattening ASTs (and other compiler data
structures). 2023.
cs.cornell.edu/~asampson/blog/flattening.html
— the flat
Vec<Token>representation used inmpl::fast. - Aleksey Kladov ("matklad"). Resilient LL parsing tutorial. 2023. Background on lossy / lossless syntax tree design.
- Joshua Haberman ("Reverberate"). Parsing Protobuf at 2+ GB/s: how I
learned to love tail calls in C. blog.reverberate.org, 2021.
— motivation for the explicit
loop {}codegen incascade::is_tail_loop(Rust's TCO is unreliable forResult<u32, ()>-shaped recursive returns).
All algorithms above are implemented from the published descriptions; no source code is copied. PEG / TDPL / packrat parsing and their extensions are not patented to our knowledge.
License
This project is licensed under either of:
- Apache License 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option. All runtime dependencies (bumpalo, proc-macro2,
quote, syn) are dual-licensed MIT OR Apache-2.0, compatible with
both.