derive_parser 0.1.0

Derive a parser from your CST
Documentation

This crate provides a derive macro, #[derive(Parse)], that derives a recursive descent parser for a syntax tree node based on its fields' Parse implementations, plus a small set of helper attributes for everything that can't be inferred.

[!IMPORTANT] Disclaimer: I had an interesting idea and sketched out a proof-of-concept. As of right now, that's all this is. It works, mostly, but is in no way feature-complete or efficient. If people end up showing interest, I'll rewrite it from scratch with better design choices. For now, feel free to experiment, but don't expect feature or performance parity with existing parser generators. It's worth noting that the nature of derive macros imposes a limit on how efficient an approach like this can be, since it's not possible to globally collect definitions across structs and build e.g. an LR transition table.

As of v0.x, there are NO semver guarantees; breaking changes may occur in any version.

Example

use derive_parser::{Delimited, Parse, Spanned, Token};

mod lexer;
use lexer::TokenKind::*; // Implemented e.g. with Logos

#[derive(Parse, Spanned)]
#[input(Token)]
struct FunctionCall {
    #[token(Ident)]
    name: Token,
    #[token(LParen)]
    _lparen: Token,
    args: Delimited<Expression, Delim>,
    #[token(RParen)]
    _rparen: Token
}

#[derive(Parse, Spanned)]
#[input(Token)]
enum Expression {
    Call(FunctionCall),
    #[token(Bool)]
    #[token(Int)]
    #[token(String(_))]
    Literal(Token)
}

#[derive(Parse, Spanned)]
#[input(Token)]
struct Delim(#[token(Comma)] Token);

// Support stuff; we could've just implemented `Token` for `TokenKind`:

#[derive(Clone, Debug)]
struct Token {
    pub inner: lexer::TokenKind,
    pub span: Span,
    pub trailing_trivia: Option<String>,
    pub string: String,
}

impl derive_parser::Token for Token {
    type Kind = lexer::TokenKind;
    fn kind(&self) -> Self::Kind {
        self.inner
    }
}
impl Spanned for Token {
    type Span = Span;
    fn span(&self) -> Self::Span {
        self.span
    }
}

By only annotating what can't be inferred from the syntax tree structs and deriving the rest of the parser implementation, derive_parser minimizes the amount of code needed to express your parser. It also keeps your CST and parser in sync, which helps you avoid bugs when updating either. The whole thing can also be made zero-copy by just adding a lifetime parameter to Token and the node structs. For comparison, here is roughly the same parser written by hand with chumsky:

use chumsky::{input::ValueInput, prelude::*, recursive::Recursive};

mod lexer;
use lexer::TokenKind; // Implemented e.g. with logos

struct FunctionCall {
    name: Token,
    _lparen: Token,
    args: Vec<Expression>,
    _rparen: Token
}

enum Expression {
    Call(FunctionCall),
    Literal(Token)
}

fn parser<'src, I>() -> impl Parser<'src, I, FunctionCall>
where
    I: ValueInput<'src, Token = Token, Span = SimpleSpan>,
{
    use TokenKind::*;

    let mut expression = Recursive::declare();

    let function_call = just!(Ident)
        .then(just!(LParen))
        .then(expression.clone().separated_by(just!(Comma)).allow_trailing().collect::<Vec<_>>())
        .then(just!(RParen))
        .map(|(((name, _lparen), args), _rparen)| {
            FunctionCall { name, _lparen, args, _rparen }
        });

    expression.define(
        just!(Bool)
            .map(Expression::Literal)
            .or(just!(Int).map(Expression::Literal))
            .or(just!(String(_)).map(Expression::Literal))
            .or(function_call.clone())
    );

    function_call
}

#[derive(Clone)]
struct Token {
    pub inner: TokenKind,
    pub span: Span,
    pub trailing_trivia: Option<String>,
    pub string: String,
}

// Note: `macro_rules!` definitions are only visible after their definition
// site, so in real code this macro would have to appear before `parser`.
macro_rules! just {
    ($kind:pat) => {
        select! {
            t @ Token { inner: $kind, .. } => t
        }
    };
}

Even with all that, many of the convenience methods (like .span() on nodes) that derive_parser provides are still not implemented here, and I didn't even try to make this zero-copy due to the sheer amount of lifetime-juggling.

Pratt Parsing

Currently, derive_parser generates recursive descent parsers, which makes inherently left-recursive grammars, such as arithmetic expressions, hard to express directly. To solve this, a built-in Pratt parser is provided:

use derive_parser::{Pratt, Precedence, Parse, Spanned};

#[derive(Parse, Spanned)]
#[input(Token)]
enum Atom {
  Paren(
    #[token(LParen)] Token,
    Box<Expression>,
    #[token(RParen)] Token,
  ),
  Number(#[token(Number)] Token)
}

#[derive(Parse, Spanned)]
#[input(Token)]
struct Expression(Pratt<Operator, Atom>);

#[derive(Parse, Spanned, Precedence)]
#[input(Token)]
enum Operator {
  // Infix Operators
  #[pratt(1)]
  Add(#[token(Plus)] Token),
  #[pratt(2)]
  Mul(#[token(Times)] Token),
  
  // Prefix Operators
  #[pratt(prefix(4))]
  Sub(#[token(Minus)] Token),
  
  // Postfix operators
  #[pratt(postfix(3))]
  Fac(#[token(Bang)] Token)
}

This will parse an operator-precedence expression like 1 + 2 * 3 + 4 * -5! as (1 + (2 * 3)) + (4 * (-5)!).

Attributes

If a field's type implements Parse, you don't need any attributes: the parser will automatically try to parse the field with its own Parse implementation.

Otherwise, you can use attributes to specify how to parse the field. Currently, the only such attribute is #[token(...)].

#[token(PATTERN)]

By far the most common attribute is #[token(...)]. It applies a pattern to the next input token, consuming it if it matches:

enum TokenKind {
  LParen,
  RParen
}

#[derive(Parse)]
#[input(Token)]
struct Parens {
  // Match tokens containing a `TokenKind::LParen`
  #[token(TokenKind::LParen)]
  lparen: Token,
  inner: Option<Parens>,
  #[token(TokenKind::RParen)]
  rparen: Token
}

It is common to use TokenKind::* in your parser module to avoid repetition. You may use #[token] multiple times to allow different tokens on the same field:

// -- snip --
use TokenKind::*;

#[derive(Parse)]
#[input(Token)]
struct Literal(
  #[token(Bool)]
  #[token(Int)]
  #[token(Float)]
  Token
);

Planned Features

  • Error recovery (#[required], #[recover])
  • Better Delimited API (#[delimited])