Crate chumsky

A friendly parser combinator crate that makes writing LL(1) parsers with error recovery easy.

Example

Here follows a Brainfuck parser. See examples/ for the full interpreter.

use chumsky::prelude::*;

#[derive(Clone)]
enum Instr {
    Invalid,
    Left, Right,
    Incr, Decr,
    Read, Write,
    Loop(Vec<Self>)
}

fn parser() -> impl Parser<char, Vec<Instr>, Error = Simple<char>> {
    use Instr::*;
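    // A loop is a bracket-delimited list of instructions; a loop body that
    // could not be parsed is mapped to `Invalid`.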
    recursive(|bf| bf.delimited_by('[', ']').map(|xs| xs.map_or(Invalid, Loop))
        .or(just('<').to(Left))
        .or(just('>').to(Right))
        .or(just('+').to(Incr))
        .or(just('-').to(Decr))
        .or(just(',').to(Read))
        .or(just('.').to(Write))
        .repeated())
}
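
The resulting parser can be run with Parser::parse. As a quick sketch building on the definitions above (the program string and the printed messages are illustrative only):

fn main() {
    // Feed a small Brainfuck program to the parser. On success we get the
    // instruction list; on failure we get every error that was encountered.
    match parser().parse("+[->+<]") {
        Ok(instrs) => println!("parsed {} top-level instructions", instrs.len()),
        Err(errors) => errors
            .into_iter()
            .for_each(|e| println!("parse error: {:?}", e)),
    }
}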

Features

  • Generic combinator parsing
  • Error recovery
  • Recursive parsers
  • Text-specific parsers & utilities
  • Custom error types

What is LL(k) parsing?

We can think of parsing like navigating a maze that has no cycles. As we walk towards what we hope is the exit we will, from time to time, come across splits in the path. Our job is to decide which path to take in order to reach the exit. How do we do this?

Thankfully, we have signposts at every turning in the form of upcoming tokens to parse. Unfortunately, these signposts don’t tell us exactly where to go: they just give us an idea of what obstacles might be lying ahead. We also have a way to see into the future using ‘token lookahead’. Lookahead means that at every split in the path, we get a chance to peer into the distance to see what lies ahead on each possible path before deciding to commit to it. The distance that we’re allowed to look is the ‘k’ in LL(k) parsing. Parsers capable of taking a walk into the future, realising they’ve taken a wrong turn, and retracing their steps are known as ‘backtracking’ parsers for this reason.

LL(1) parsing: limitations and advantages

Chumsky is designed for LL(1) parsing. This means that at each split in the path, we only get to see the very next token in advance before making a choice about which path to proceed down. In effect, we need to be able to figure out how to proceed using only the very next token in the input sequence. No backtracking allowed!
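
As a minimal sketch of what this means in practice (the parser below is invented for illustration and is not part of the crate), consider a choice whose branches each begin with a distinct character. The single upcoming token is always enough to commit to a branch, so the parser never needs to retrace its steps:

use chumsky::prelude::*;

// Each alternative starts with a different character, so inspecting the very
// next token is enough to pick a branch; no backtracking is required.
fn sign() -> impl Parser<char, i32, Error = Simple<char>> {
    just('+').to(1)
        .or(just('-').to(-1))
}

If two alternatives began with the same character, the grammar would instead need to be rearranged (left-factored) so that the shared prefix is parsed once before the choice is made.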

This may initially seem like a serious limitation, and in theory it is. In practice, many of these limitations can be sidestepped by splitting parsing into multiple passes. A common example of multi-pass parsing is the way compilers for C-like languages generally perform an initial lexing (also called 'tokenization') pass that groups the source code into logical units known as 'tokens', then feed those tokens into a second pass that builds the program's abstract syntax tree. This isn't just a workaround for the limitations of an LL(1) parser, either: splitting parsing into multiple passes makes each pass simpler to write and validate, and allows tools such as syntax highlighters, which need only the earlier passes, to be built on top of the same codebase.
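
As a sketch of the second pass under this scheme (the Token type and the grammar below are invented for illustration), the parser is written exactly as before, except that its input type is a token produced by the earlier lexing pass rather than a raw character:

use chumsky::prelude::*;

// A hypothetical token type produced by an earlier lexing pass.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Token { Plus, Minus }

// The second pass consumes tokens, not characters, so the 'next token' that
// guides each LL(1) decision is already a logical unit of the language.
fn second_pass() -> impl Parser<Token, Vec<i32>, Error = Simple<Token>> {
    just(Token::Plus).to(1)
        .or(just(Token::Minus).to(-1))
        .repeated()
}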

All of this considered, LL(1) parsers are powerful tools that continue to be a relevant and effective route for parsing even (and, perhaps, in particular) the latest generations of programming languages.

Why use LL(1) parsing when more powerful backtracking alternatives exist? The answer, in my view, comes down to three main points.

  • Simplicity: LL(1) grammars are usually easier to write (and, arguably, read) than more complex grammars.

  • Performance: LL(1) parsing does not involve backtracking (indeed, backtracking is impossible), so parsing only ever proceeds forwards through the input and therefore runs in linear (i.e. O(n)) time. Compare this to backtracking techniques, for which worst-case exponential time is not uncommon.

  • Error recovery: Because LL(1) parsers do not perform backtracking, there are fewer potential parse paths to consider. This makes it more likely that errors can be recovered from without misinterpreting syntax, and that the errors reported relate to the syntax the user was actually attempting to write.

Re-exports

pub use crate::error::Error;
pub use crate::error::Span;

Modules

chain        Traits that allow chaining parser outputs together.
combinator   Combinators that allow combining and extending existing parsers.
error        Error types, traits and utilities.
prelude      Commonly used functions, traits and types.
primitive    Parser primitives that accept specific token patterns.
recursive    Recursive parsers (parsers that include themselves within their patterns).
stream       Token streams and behaviours.
text         Text-specific parsers and utilities.

Traits

Parser       A trait implemented by parsers.