// pred_recdec/lib.rs
//! # Predicated Recursive Descent
//!
//! aka pred recdec
//!
//! It is **VERY HIGHLY RECOMMENDED** that you use the mimalloc crate (or some other high-performance memory allocator) if you use this library:
//!
//! ```rust
//! use mimalloc::MiMalloc;
//! #[global_allocator]
//! static GLOBAL: MiMalloc = MiMalloc;
//! ```
//!
//! You want:
//! - [bnf::bnf_to_grammar]
//! - [bnf::Grammar]
//! - [bnf::tokenize]
//! - [bnf::Token]
//! - [ast::ASTNode]
//! - [ast::parse]
//!
//! This library provides a way to write and run [BNF](https://en.wikipedia.org/wiki/Backus–Naur_form) grammars with annotations that make them behave like a handwritten recursive descent parser.
//!
//! The structure of the BNF corresponds to the structure of the resulting ASTs. Tokenization is handled automatically; you do not need to write a lexical grammar. The tokenizer handles comments, whitespace, maximal munch, and even regex tests. Unless you're trying to do something silly like how Lua uses context-sensitive long brackets for comments, that's all you need.
//!
//! The resulting parser is scannerful but tolerant of soft keywords. It performs no memoization or backtracking. Impure hooks are safe. There's no lookahead generation or guessing: the parser only does exactly what you specify in the BNF, in order.
//!
//! This is strong enough to parse C. Yes, the language with the infamously horrible grammar that needs arbitrary lookahead and state munging to parse correctly.
//!
//! This is *not* a PEG or packrat thing.
//!
//! Performance: between 50% and 400% slower than a handwritten native-machine-code parser, with the gap shrinking for more complex grammars. Totally usable!
//!
//! Simple example (totally not lisp), capable of parsing the input `(a b (q x)kfwaiei i 9 (af0f1a) () () )`:
//!
//! ```text
//! S ::= @peek(0, "(") parenexpr
//! parenexpr ::=
//!     @peek(1, ")") "(" ")"
//!     | "(" itemlist ")"
//! itemlist ::=
//!     @peek(0, ")") #end of list
//!     | item $become itemlist
//! item ::=
//!     @peek(0, "(") parenexpr
//!     | @auto r`[a-zA-Z_][a-zA-Z_0-9]*|[0-9\.]*`r
//! ```
//!
//! ## Example usage
//!
//! ```rust,ignore
//! // `grammar_source` and `test_source` hold the BNF grammar and the input shown above.
//! use pred_recdec::{bnf::*, ast::*};
//! use std::rc::Rc;
//!
//! let mut g = bnf_to_grammar(&grammar_source).unwrap();
//! let tokens = tokenize(&mut g, &test_source).unwrap();
//!
//! let ast = parse(&g,
//!     "S",                     // name of your root-most rule
//!     &tokens[..],
//!     Rc::new(<_>::default()), // guards (see the test in `src/c.rs` for a detailed usage example)
//!     Rc::new(<_>::default()), // hooks (same)
//! );
//!
//! if let Ok(ast) = &ast
//! {
//!     print_ast_pred_recdec(ast, &g.string_cache_inv, 0);
//! }
//! drop(ast.unwrap());
//! ```
//!
//! ## BNF Extensions
//!
//! Mini glossary: nonterminal = "call of another rule", terminal = "immediate match of a token's contents".
//!
//! Common EBNF syntax like `[]` or `()?` is not supported. Don't worry, you can make flat lists just fine.
//!
//! Extensions from pure BNF are:
//!
//! `\` - End-of-line escaping, so you can write multiline regexes.
//!
//! `#` - Comments until the end of the line, outside of strings/regexes.
//!
//! `"...` - Terms starting with `"` are inline strings. They register with the tokenizer and check the entire body of a token. (Strings are interned, so this is O(1).)
//!
//! Terms starting with ```r`...```, ```R`...```, or ```A`...``` are inline regexes:
//! - ```r`...`r``` registers with the tokenizer and does a full token match (i.e. it's given an implicit trailing `\z` during token matching).
//! - ```R`...`r``` DOES NOT register with the tokenizer, but does a full token match (i.e. it's given an implicit trailing `\z`). This is provided as an optimization: sometimes you want to manually combine tokenizer regexes into a single regex for performance reasons.
//! - ```A`...`r``` DOES NOT register with the tokenizer, and only checks against the start of the token. This too is only provided as an optimization: ```R`...`r``` is strictly more powerful.
//!
//! Regex results are cached, so checking them is amortized O(1).
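//!
//! As an illustrative sketch (the rule shape here is hypothetical, not taken from a real grammar), the three regex forms might be mixed like so:
//!
//! ```text
//! item ::=
//!     @auto r`[a-zA-Z_][a-zA-Z_0-9]*`r    # registered with the tokenizer; full-token match
//!     | R`[0-9]+`r                        # not registered; full-token match against an already-made token
//!     | A`0x`r                            # not registered; checks only the start of the token
//! ```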
//!
//! Terms beginning with `!` currently only have one kind:
//! - `!hook`, e.g. `!hook(fix_infix_expr)`, is a call to user-provided code. Hooks are **allowed to be impure**, e.g. for managing **typedef symbol tables**.
//!
//! Terms beginning with `@` are guards/predicates. They determine, at the start of a given alternation/production, whether it is valid. If true, that production is taken, and none of the others will be attempted (at this location). Remember, there's no backtracking or magic.
//! - `@peek(N, STRING)` - Checks if, relative to the parse position, the given token contains the given text.
//! - `@peekr(N, REGEX)` - Same, but for regexes.
//! - `@auto item` - Desugars to `@peek(0, item) item` or `@peekr(0, item) item` automatically.
//! - `@guard(name)` - Calls user-provided code to determine if the production is accepted.
//! - `@recover` - Pseudo-guard, never attempted as a predicate. Instead, it tells the associated grammar rule that if it fails, it's allowed to recover (into a poisoned state) by seeking for a particular set of tokens.
//! - `@recover_before` - Same, but stops right before any matching seek token instead of consuming it.
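//!
//! As a sketch of how guards and recovery might combine (the guard name `is_typename` and the exact recovery spelling here are illustrative assumptions, not taken from a real grammar in this crate):
//!
//! ```text
//! decl ::=
//!     @guard(is_typename) item ";"    # taken only if the user-provided guard approves
//!     | @recover ";"                  # on failure, resynchronize by seeking a ";"
//! ```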
//!
//! Terms starting with `$` are directives:
//! - `$become nonterminal` performs a tail call, keeping the current AST node name. This can be used to build lists without smashing the stack.
//! - `$become_as nonterminal` performs a tail call, replacing the current AST node's name with that of the target.
//! - `$any` matches and includes any one token as a terminal.
//! - `$pruned` specifies that this particular production doesn't generate AST nodes for bare terminals. This is useful for reducing AST bloat. For example, `@peek(1, ")") "(" ")" $pruned` is a non-empty production but produces zero AST nodes.
//! - TODO: `$hoist`, `$drop`, `$dropifempty`, `$rename`
//!
//! You'll note that there's no "negative rule-match-check predicate" extension (e.g. no "parse A, but only if it doesn't also parse B"). This is by design. Rule-level negation is far too powerful, and requires an extremely sophisticated parser generator (e.g. packrat) to handle properly. Any reasonably simple implementation would be incompatible with impure hooks. `__RESERVED_WORDS`, described below, is the only exception, because it's easy to define in a sane way.
//!
//! For negative token predicates (peeks), you can refactor the grammar slightly, or, if it's a particularly complicated peek, write a custom guard. So this isn't a limitation.
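//!
//! For example, instead of a negative peek like "any token that is not `(`", you can order the productions so the positive case claims its tokens first (a sketch; the rule names are illustrative):
//!
//! ```text
//! item ::=
//!     @peek(0, "(") parenexpr    # the "(" case is taken here...
//!     | $any                     # ...so this production only ever sees non-"(" tokens
//! ```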
//!
//! ## Magic pseudo-rules
//!
//! The following magic pseudo-rule names are available (e.g. `__COMMENTS ::= //`):
//!
//! - `__BRACKET_PAIRS` e.g. `::= ( ) | { }` - Tells the tokenizer to pair up these tokens with each other so that hooks can skip over their contents in O(1) time.
//! - `__COMMENTS` e.g. `::= "//" | "#"` - Tells the tokenizer that this is a kind of single-line comment.
//! - `__COMMENT_PAIRS` e.g. `::= /* */` - Tells the tokenizer that this is a kind of pair-based comment.
//! - `__COMMENT_PAIRS_NESTED` - Same, but nesting, like in Rust.
//! - `__COMMENT_REGEXES` - Same, but formed as a regex. These are slower than the above, because the Rust `regex` crate doesn't have a JIT.
//! - `__RESERVED_WORDS` e.g. `::= auto break case` - Specifies a list of token contents that are not allowed to be "accepted" by regex terminals like ```r`[a-zA-Z_]+`r```.
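//!
//! Putting these together, the pseudo-rules for a C-like language might look like this (an illustrative sketch, not a complete setup):
//!
//! ```text
//! __BRACKET_PAIRS ::= ( ) | { } | [ ]
//! __COMMENTS ::= "//"
//! __COMMENT_PAIRS ::= /* */
//! __RESERVED_WORDS ::= auto break case
//! ```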

pub mod bnf;
pub mod ast;
mod json;
mod c;

// Fast, non-cryptographic hasher used for this crate's internal hash maps.
pub(crate) use rustc_hash::FxBuildHasher as HashBuilder;

#[cfg(test)]
mod test {
    #[test]
    fn test_lib() {
        use crate::*;
        use bnf::*;
        use ast::*;

        let grammar_source = r#"
        S ::= @peek(0, "(") parenexpr
        parenexpr ::=
            @peek(1, ")") "(" ")" $pruned
            | "(" $become itemlist $pruned
        itemlist ::=
            @peek(0, ")") $pruned ")"
            | @peek(0, "(") parenexpr $become itemlist
            | @auto r`[a-zA-Z_][a-zA-Z_0-9]*|[0-9\.]*`r $become itemlist
        "#;

        let test_source = r#"
        (a b (q x)kfwaiei i 9 (af0f1a) () () )
        "#;

        let mut g = bnf_to_grammar(&grammar_source).unwrap();
        let tokens = tokenize(&mut g, &test_source).unwrap();

        use std::rc::Rc;
        let ast = parse(&g, "S", &tokens[..], Rc::new(<_>::default()), Rc::new(<_>::default()));

        if let Ok(ast) = &ast
        {
            print_ast_pred_recdec(ast, &g.string_cache_inv, 0);
        }
        drop(ast.unwrap());
    }
}