Parsel, the Zero-Code Parser Generator
Parsel is a library for generating parsers directly from syntax tree node types.
The main entry point is the `#[derive(Parse)]` custom derive proc-macro, which generates an implementation of the `syn::parse::Parse` trait for the annotated AST node type. Adding `#[derive(FromStr)]` also implements the standard `FromStr` trait for the type, by simply forwarding to its `Parse` impl.
In addition, a `#[derive(ToTokens)]` macro is provided, for easily obtaining the source representation of a specific AST node via the `quote` crate. This in turn helps with getting its `Span` due to the blanket `impl<T: ToTokens> Spanned for T`. Adding `#[derive(Display)]` also implements the standard `Display` trait for the type, by simply forwarding to its `ToTokens` impl.
Furthermore, the `ast` module provides a number of helper types for common needs, such as optional productions, repetition, parenthesization, and grouping. These are mostly lightweight wrappers around parsing collections and parsing logic already provided by `syn`. However, some very useful `syn` types, such as `Option<T: Parse>` and `Punctuated`, have multiple, equally valid parses, so they don't implement `Parse` in order to avoid ambiguity. Parsel handles this ambiguity at the type level, by splitting the set of valid parses into multiple, unambiguously parseable types.
Examples and How It Works
The fundamental idea behind Parsel is the observation that `struct`s and `enum`s directly correspond to sequences and alternation in grammars, and that they are composable: one does not need to know the exact implementation of sub-expressions in order to produce a parser for the current rule.

AST nodes that have a `struct` type correspond to sequences: every field (whether named or numbered) will be parsed and populated one after another, in the order specified in the source.

AST nodes having an `enum` type correspond to alternation: their variants will be tried in order, and the first one that succeeds will be returned. Fields of tuple and struct variants are treated in the same sequential manner as `struct` fields.
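This correspondence can be sketched by hand in plain Rust before looking at the derive-based version. The snippet below is a std-only illustration, not part of parsel's API: the toy grammar `pair = ident ':' ident`, and all function names in it, are hypothetical. A struct-like rule parses its fields one after another; an enum-like rule tries its alternatives in order and keeps the first success.

```rust
// Std-only sketch of sequence vs. alternation parsing; every name here
// is a hypothetical illustration, not a parsel API.
type Input<'a> = &'a [char];

// Parse one lowercase identifier, returning it and the remaining input.
fn ident(input: Input<'_>) -> Option<(String, Input<'_>)> {
    let n = input.iter().take_while(|c| c.is_ascii_lowercase()).count();
    if n == 0 { return None; }
    Some((input[..n].iter().collect(), &input[n..]))
}

// Parse one expected punctuation character.
fn token(input: Input<'_>, t: char) -> Option<((), Input<'_>)> {
    match input.first() {
        Some(&c) if c == t => Some(((), &input[1..])),
        _ => None,
    }
}

// A struct-like rule `key ':' value` parses its "fields" sequentially,
// in source order -- just like a derived parser for a struct.
fn key_value(input: Input<'_>) -> Option<((String, String), Input<'_>)> {
    let (key, rest) = ident(input)?;
    let (_, rest) = token(rest, ':')?;
    let (value, rest) = ident(rest)?;
    Some(((key, value), rest))
}

// An enum-like rule tries each variant in order; the first success wins.
fn item(input: Input<'_>) -> Option<(String, Input<'_>)> {
    if let Some(((k, v), rest)) = key_value(input) {
        return Some((format!("{k}={v}"), rest));
    }
    let (id, rest) = ident(input)?; // fall back to the next variant
    Some((id, rest))
}

fn main() {
    let src: Vec<char> = "key:val".chars().collect();
    let (parsed, rest) = item(&src).unwrap();
    assert_eq!(parsed, "key=val");
    assert!(rest.is_empty());
}
```

Parsel generates exactly this kind of sequential/alternating logic for you from the shape of the type declarations, which is what the derive-based example below shows.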
Accordingly, you define your grammar by specifying the fields and variants of AST nodes, and Parsel will generate a parser from them. Let’s see what this looks like in the context of the parser and the printer for a simple, JSON-like language:
use core::iter::FromIterator;
use core::convert::TryFrom;

use parsel::{Parse, ToTokens};
use parsel::ast::{Bracket, Brace, Punctuated, LitBool, LitInt, LitFloat, LitStr};
use parsel::ast::token::{Comma, Colon};

mod kw {
    parsel::custom_keyword!(null);
}

#[derive(PartialEq, Eq, Debug, Parse, ToTokens)]
enum Value {
    Null(kw::null),
    Bool(LitBool),
    Int(LitInt),
    Float(LitFloat),
    Str(LitStr),
    Array(
        #[parsel(recursive)]
        Bracket<Punctuated<Value, Comma>>
    ),
    Object(
        #[parsel(recursive)]
        Brace<Punctuated<KeyValue, Comma>>
    ),
}

#[derive(PartialEq, Eq, Debug, Parse, ToTokens)]
struct KeyValue {
    key: LitStr,
    colon: Colon,
    value: Value,
}

let actual: Value = parsel::parse_quote!({
    "key1": "string value",
    "other key": 318,
    "recursive": [
        1.6180,
        2.7182,
        3.1416,
        null
    ],
    "inner": {
        "nested key": true,
        "hard to write a parser": false
    }
});

let expected = Value::Object(Brace::from(Punctuated::from_iter([
    KeyValue {
        key: LitStr::from("key1"),
        colon: Colon::default(),
        value: Value::Str(LitStr::from("string value")),
    },
    KeyValue {
        key: LitStr::from("other key"),
        colon: Colon::default(),
        value: Value::Int(LitInt::from(318)),
    },
    KeyValue {
        key: LitStr::from("recursive"),
        colon: Colon::default(),
        value: Value::Array(Bracket::from(Punctuated::from_iter([
            Value::Float(LitFloat::try_from(1.6180).unwrap()),
            Value::Float(LitFloat::try_from(2.7182).unwrap()),
            Value::Float(LitFloat::try_from(3.1416).unwrap()),
            Value::Null(kw::null::default()),
        ]))),
    },
    KeyValue {
        key: LitStr::from("inner"),
        colon: Colon::default(),
        value: Value::Object(Brace::from(Punctuated::from_iter([
            KeyValue {
                key: LitStr::from("nested key"),
                colon: Colon::default(),
                value: Value::Bool(LitBool::from(true)),
            },
            KeyValue {
                key: LitStr::from("hard to write a parser"),
                colon: Colon::default(),
                value: Value::Bool(LitBool::from(false)),
            },
        ]))),
    },
])));

assert_eq!(actual, expected);
Recursive AST Nodes and Cyclic Constraints
Most useful real-world grammars are recursive, i.e., they contain productions that refer to themselves directly (direct recursion) or indirectly (mutual recursion). This results in AST node types that contain pointers to the same type. Even more importantly, it leads to cyclic constraints in the implementations of `Parse` and `ToTokens`. These cyclic constraints are trivially satisfied and resolvable, but the constraint solver of the Rust compiler currently struggles with them due to Issue #48214.

Thus, one must break such constraint cycles when deriving the implementations of `Parse` and `ToTokens`. Parsel supports this use case by providing the attribute `#[parsel(recursive)]`, or an equivalent spelling, `#[parsel(recursive = true)]`.

Adding this attribute to a field of a `struct` or a variant of an `enum` has the effect of omitting all `FieldType: Parse` and `FieldType: ToTokens` constraints from the `where` clause of the generated `Parse` and `ToTokens` impls, breaking the constraint cycle, and thus allowing the code to compile.

It is sufficient to break each constraint cycle at one single type (practically the one that requires adding the smallest number of `#[parsel(recursive)]` annotations). However, if the grammar contains several self-referential cycles, it is necessary to break each of them. Furthermore, if breaking a cycle requires omitting a constraint on a type which appears in multiple fields of a `struct` or a variant, then it is necessary to add `#[parsel(recursive)]` to all of those fields.
As an example, consider the following grammar for simple Boolean operations and the accompanying comments:
use parsel::{Parse, ToTokens};
use parsel::ast::{Paren, LitBool};
use parsel::ast::token::{Or, And, Bang};

#[derive(PartialEq, Eq, Debug, Parse, ToTokens)]
enum Expr {
    Or {
        lhs: Conjunction,
        op: Or,
        #[parsel(recursive)] // break direct recursion
        rhs: Box<Expr>,
    },
    Conjunction(Conjunction),
}

#[derive(PartialEq, Eq, Debug, Parse, ToTokens)]
enum Conjunction {
    And {
        lhs: Term,
        op: And,
        #[parsel(recursive)] // break direct recursion
        rhs: Box<Conjunction>,
    },
    Term(Term),
}

#[derive(PartialEq, Eq, Debug, Parse, ToTokens)]
enum Term {
    Literal(LitBool),
    Not(
        Bang,
        #[parsel(recursive)] // break direct recursion
        Box<Term>,
    ),
    Group(
        #[parsel(recursive)] // break mutual recursion
        Paren<Box<Expr>>
    ),
}

let expr: Expr = parsel::parse_str("true & (false | true & true) & !false").unwrap();

assert_eq!(
    expr,
    Expr::Conjunction(Conjunction::And {
        lhs: Term::Literal(LitBool::from(true)),
        op: And::default(),
        rhs: Box::new(Conjunction::And {
            lhs: Term::Group(Paren::from(Box::new(Expr::Or {
                lhs: Conjunction::Term(Term::Literal(LitBool::from(false))),
                op: Or::default(),
                rhs: Box::new(Expr::Conjunction(Conjunction::And {
                    lhs: Term::Literal(LitBool::from(true)),
                    op: And::default(),
                    rhs: Box::new(Conjunction::Term(Term::Literal(LitBool::from(true)))),
                }))
            }))),
            op: And::default(),
            rhs: Box::new(Conjunction::Term(Term::Not(
                Bang::default(),
                Box::new(Term::Literal(LitBool::from(false))),
            ))),
        })
    })
);
Dealing with Left Recursion
If you carefully examine the grammar, you can notice it’s right-recursive, i.e., the subexpression with identical precedence appears on the right-hand side, while the left-hand side descends one level to the next tightest-binding subexpression. This in turn means that consecutive operations of equal precedence will associate to the right. The reason for this is that recursive descent parsers, such as the ones generated by Parsel, fall into infinite recursion if they attempt parsing a left-recursive grammar. For instance, if our top-level expression were defined as
expr = expr '|' conjunction
| conjunction
then the code generated for `expr` would immediately and unconditionally try to call itself again.
While it is fine to rewrite the grammar as right-recursive in the case of simple Boolean expressions (since they are associative), it is generally not possible to just omit left recursion altogether from a grammar. Operations which are not associative care a lot about how they are grouped, and even e.g. basic algebraic operations such as subtraction and division are defined to be left-associative by widespread convention. Thus, it is required that Parsel support associating terms to the left. There are two ways to achieve this goal:
- Side-step the problem by simply not representing associativity in the AST. This is done by using a helper type capable of expressing explicit repetition of arbitrary length (e.g., `Separated`), instead of creating binary AST nodes. The repeated AST nodes will be sub-expressions at the next highest precedence level. This approach puts off the question of associativity until evaluation/codegen, that is, until tree-walking time.
- Use the `LeftAssoc` helper type. This sidesteps the problem of infinite recursion by parsing iteratively (just like `Separated`). It then transforms the resulting linear list of subexpressions into a properly left-associative (left-leaning) tree of AST nodes.

Note that there is an analogous `RightAssoc` type as well. Strictly speaking, this is not necessary, because right recursion makes progress and terminates just fine. However, deriving the parse tree in an iterative manner has the advantage of recursing less, and including the right-leaning counterpart is preferable for reasons of symmetry/consistency.
Span and Error Reporting
Types that implement `ToTokens` get an automatic `impl<T: ToTokens> Spanned for T`. This means that by default, all types deriving `ToTokens` will also report their span correctly, and parse errors will have useful span information.
However, there is an important caveat regarding alternations (`enum`s) in the grammar. The way alternations can be parsed in a fully automatic and deceptively simple way is by attempting to parse each alternative production, one after the other, and picking the first one that parses successfully. However, if none of them parse, then it is not obvious to the parser which of the errors it should report.

The heuristic we use to solve this problem is to use `Span` information to select the variant that got furthest in the token stream before having failed. This works because most "nice" context-free grammars need only constant lookahead, or even better, are LL(1), i.e. need single-token lookahead. This means that if a production involving more than one token fails in the middle, it will have advanced further in the stream than other productions, which failed right at the very first token.
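The "furthest failure wins" idea can be sketched in a few lines of plain Rust. This std-only snippet is a hypothetical stand-in, not parsel's actual error machinery: token positions play the role of `Span` information.

```rust
// Std-only sketch of the furthest-failure heuristic; `ParseError` and
// `best_error` are hypothetical illustrations, not parsel APIs.
struct ParseError {
    pos: usize,        // how far into the token stream the parse got
    msg: &'static str, // the error message of the failing variant
}

// Among the errors of all failed enum variants, report the one whose
// parse consumed the most tokens before failing.
fn best_error(errors: Vec<ParseError>) -> ParseError {
    errors.into_iter().max_by_key(|e| e.pos).unwrap()
}

fn main() {
    // Variant A failed on the very first token, while variant B consumed
    // three tokens first -- so B's error is the more informative one.
    let errs = vec![
        ParseError { pos: 0, msg: "expected `null`" },
        ParseError { pos: 3, msg: "expected `:` after string key" },
    ];
    assert_eq!(best_error(errs).msg, "expected `:` after string key");
}
```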
However, if span information is not available or not useful (i.e., when every generated token is spanned to the same `Span::call_site()` source location), then this heuristic breaks down, and it will select an arbitrary production, resulting in subpar error messages. This means that you should try to preserve spans as much as possible. This in turn implies that using `syn::parse_str()` for parsing code outside procedural macros is preferable to using `syn::parse2()`, because the former will result in a usefully-spanned AST, while the latter will not, at least not when used on e.g. a `TokenStream` obtained via `quote!()` or `parse_quote!()`.
Roadmap, TODOs
- Configure and use Clippy
- Add tests for various scenarios
  - Parsing of simple named-field `struct` succeeds
  - Parsing of simple tuple `struct` succeeds
  - Parsing of `enum` with multiple kinds of variants succeeds
  - Parsing a generic `struct` succeeds (named-field as well as tuple)
  - Parsing a generic `enum` succeeds (with multiple kinds of variants)
  - Parsing of complex tree of mutually recursive types succeeds
  - Printing of simple named-field `struct` succeeds
  - Printing of simple tuple `struct` succeeds
  - Printing of `enum` with multiple kinds of variants succeeds
  - Printing of a generic `struct` succeeds
  - Printing of a generic `enum` with multiple kinds of variants succeeds
  - Printing of complex tree of mutually recursive types succeeds
  - Round-trip smoke tests: `parse(print(ast)) == ast && print(parse(ts)) == ts`
- Handle generic AST node types
- Refactor into modules
- Add `pub extern crate proc_macro2;`, `pub extern crate syn;` as well as `pub extern crate quote;` to the crate root
  - Rewrite derive macros so that they reference `::parsel::proc_macro2`, `::parsel::syn::*` and `::parsel::quote::*` instead of referring to the three crates directly. This allows one to use the `#[derive]` macros on their own, without having to declare and import these dependencies manually.
  - Also rewrite examples and tests with this in mind
- Add helper types for common tasks/production kinds
  - `Punctuated` and `Many`: at least 0 repetitions, trailing punctuation allowed
  - `Separated`: at least 1 repetition, trailing punctuation not allowed
  - `Maybe`: 0 or 1 repetition, checking a prefix for deciding if the suffix exists
  - `Lit`, `Bool`, `Int`, `Float`, `Str`: strongly-typed, pre-parsed counterparts to `syn::Lit`, `syn::LitBool`, etc.
  - `LeftAssoc`: for parsing left-associative binary infix operators without infinite left recursion, and transforming them into a proper, left-leaning AST
  - `RightAssoc`: for parsing right-associative binary infix operators without deep right recursion
  - Rewrite tests using these helper types once they are added
- Implement `#[derive(FromStr)]` (trivial: delegates to `syn::parse_str()`)
- Implement `#[derive(Display)]` (trivial: delegates to `ToTokens::to_token_stream()`)
- Implement `#[derive(ToTokens)]`, get `impl Spanned` for free
- Handle recursive types (which result in unsatisfiable constraints)
- `parsel::ast::Word` type, which is a thin wrapper around `Ident`, except that its default parse uses `Ident::parse_any()`, so it can also parse identifiers that would otherwise fail because they are keywords in Rust.
- `try_parse_quote!{ ... }` and `try_parse_quote_spanned!{ ... }`: non-panicking, `Result`-returning equivalents of `parse_quote!` and `parse_quote_spanned!`.
- `define_keywords!{ ... }` macro, a corresponding `Keywords` trait, and Parsel's own `CustomIdent<Keywords>` type. These work together in order to:
  - Define a module full of custom keywords via `syn::custom_keyword!`
  - Define a unit/empty marker type that implements `Keywords` with its associated `const KEYWORDS: &[&str]` containing a list of all custom keywords so defined;
  - Parse identifiers, automatically rejecting custom keywords, by looking at the associated `KEYWORDS` list.
- `ast::Eof` and `ast::NotEof` types for asserting end-of-input conditions
- Document all of the public API
- Document all of the non-public API as well
- Make the error reporting heuristic for alternation (based on the furthest parsing production) work even when span information is not useful. Nota bene: this absolutely shouldn't be done by just counting the number of tokens/bytes in the remaining input, because that will result in accidentally quadratic parsing performance!
- `#[derive(Grammar)]` proc-macro for generating an EBNF-style grammar from the AST, for those CS people / auditors / etc. who love reading grammars.
- Code examples, tutorial
Re-exports
Modules
This module provides helper types for common grammar idioms/needs, such as optional productions, repetition, parenthesization, and grouping.
Macros
Define a type that supports parsing and printing a given identifier as if it were a keyword.
Define a type that supports parsing and printing a multi-character symbol as if it were a punctuation token.
Declares a module containing custom keywords as defined by `custom_keyword!`, and a `Keywords` marker type for `CustomIdent` implementing `KeywordList`.
Quasi-quotation macro that accepts input like the `quote!` macro but uses type inference to figure out a return type for those tokens.

This macro is `parse_quote!` + `quote_spanned!`.
Similar to `syn::parse_quote!`, but instead of panicking, it returns an `Err` if the inferred type fails to parse from the specified token stream.

Similar to `syn::parse_quote_spanned!`, but instead of panicking, it returns an `Err` if the inferred type fails to parse from the specified token stream.
Structs
Error returned when a Syn parser cannot parse the input tokens.
A wrapper around floats providing an implementation of `Eq`, `Ord` and `Hash`.
A region of source code, along with macro expansion information.
An abstract stream of tokens, or more concretely a sequence of token trees.
Enums
A single token or a delimited sequence of token trees (e.g. `[1, (), ..]`).
Traits
Format trait for an empty format, `{}`.

Parse a value from a string.
Parsing interface implemented by all types that can be parsed in a default way from a token stream.
A trait that can provide the `Span` of the complete contents of a syntax tree node.

Types that can be interpolated inside a `quote!` invocation.
Functions
Parse a proc-macro2 token stream into the chosen syntax tree node.
Parse a string of Rust code into the chosen syntax tree node.
Type Definitions
The result of a Syn parser.
Derive Macros
Implements the `core::fmt::Display` trait for the annotated type by forwarding to its `ToTokens` implementation.

Implements the `core::str::FromStr` trait for the annotated type by forwarding to its `Parse` implementation.

Implements the `syn::parse::Parse` trait for the annotated type.

Implements the `quote::ToTokens` trait for the annotated type.