Crate peginator

source ·
Expand description

§Table of contents

§Description

PEG parser generator for creating ASTs in Rust

Peginator is a PEG (Parsing Expression Grammar) parser generator written in Rust. It is specifically made to parse into ASTs (Abstract Syntax Trees), as opposed to most, streaming-style parsers out there.

It generates both the tree structure and the parsing code that can create that tree from any &str. The generated parsing code is deliberately very simple straightforward Rust code, which is usually optimized very well by the compiler.

There is an opt-in memoization feature that makes it a proper packrat parser that can parse any input in linear time and space.

Left-recursion is also supported using said memoization feature (also opt-in).

§Crate structure

The crates in peginator are:

  • peginator: The runtime crate used by generated code and the crates calling it. It should be a “normal” dependency of all peginator-using code.
  • peginator_codegen: Helper crate with reduced dependencies for direct buildscript integration. It should be a build-depenednecy if used.
  • peginator_macro: Macro support for small grammars directly included in code.
  • peginator_cli: A binary crate providing the peginator-cli cli command.
§About PEGs

This documentation describes how peginator implements PEGs. A basic understanding of PEGs are assumed. There are good introductions on wikipedia or in the docs of other parser generators.

Peginator is bootstrapped using its own syntax and grammar file, which is somewhat easy-to-read.

§Quickstart

The grammars for peginator are written in a syntax similar to EBNF (extended Backus-Naur form):

@export
FunctionDef = 'fn' name:Ident '(' param_list:ParamList ')' [ '->' return_value:Type ];

ParamList = self_param:SelfParam {',' params:Param} | params:Param  {',' params:Param} | ;

Param = name:Ident ':' typ: Type;

SelfParam = [ref_type:ReferenceMarker] 'self';

Type = [ref_type:ReferenceMarker] typename:Ident;

ReferenceMarker = @:MutableReference | @:ImmutableReference;

ImmutableReference = '&';
MutableReference = '&' 'mut';

@string
@no_skip_ws
Ident = {'a'..'z' | 'A'..'Z' | '_' | '0'..'9'};

Based on the above grammar, peginator will generate the following types:

pub struct FunctionDef {
    pub name: Ident,
    pub param_list: ParamList,
    pub return_value: Option<Type>,
}
pub struct ParamList {
    pub self_param: Option<SelfParam>,
    pub params: Vec<Param>,
}
pub struct Param {
    pub name: Ident,
    pub typ: Type,
}
pub struct SelfParam {
    pub ref_type: Option<ReferenceMarker>,
}
pub struct Type {
    pub ref_type: Option<ReferenceMarker>,
    pub typename: Ident,
}
pub enum ReferenceMarker {
    ImmutableReference(ImmutableReference),
    MutableReference(MutableReference),
}
pub struct ImmutableReference;
pub struct MutableReference;
pub type Ident = String;

impl PegParser for FunctionDef { /* omitted */ }

Parsing then looks like this:

FunctionDef::parse("fn example(&self, input:&str, rectified:&mut Rect) -> ExampleResult;")

Which results in the folowing structure:

FunctionDef {
    name: "example",
    param_list: ParamList {
        self_param: Some(SelfParam {
            ref_type: Some(ImmutableReference(ImmutableReference)),
        }),
        params: [
            Param {
                name: "input",
                typ: Type {
                    ref_type: Some(ImmutableReference(ImmutableReference)),
                    typename: "str",
                },
            },
            Param {
                name: "rectified",
                typ: Type {
                    ref_type: Some(MutableReference(MutableReference)),
                    typename: "Rect",
                },
            },
        ],
    },
    return_value: Some(Type {
        ref_type: None,
        typename: "ExampleResult",
    }),
}

§Integration

To use peginator grammars in your project, you should write the grammar and save it in the project directory with the .ebnf extension.

Then run the peginator-cli command (from the peginator_cli crate) on it, and (optionally) rustfmt:

peginator-cli your_grammar.ebnf | rustfmt >src/grammar.rs

Make sure you also add the peginator crate as a dependency in your Cargo.toml:

[dependencies]
peginator = "0.6"

Once you have compiled your grammar, you can import the types, and the PegParser trait, and you can start parsing strings:

use crate::grammar::YourRootType;
use peginator::PegParser;

let parse_result = YourRootType::parse("some string (maybe?)");
println!("{:?}", parse_result);

Alternatively, you can use a buildscript using the peginator_codegen crate, or a macro using the peginator_macro crate.

§Features and optional dependencies

The following features can be enabled:

  • colored (on by default): Adds a dependency to the colored crate, and makes error reports and traces colorful.

§Peginator syntax

The syntax is similar to EBNF, except for small additions, the lack of a sequence marker, and | used for ordered choice instead of /.

§Rules

A peginator grammar is a list of rules.

Rules look like this:

RuleName = expression;

Rule names are directly used as Rust type names, so they shouldn’t conflict with standard type names, keywords, or the Peginator built-in names.

Use PascalCase for rule names, if you can. Fully upper-case names are not special.

You can prefix rules with various directives, the most important being @export.

§Expressions

§expression1 expression2

Sequence: To match two or more expression parts one after another. If any of the expressions fail to match, the whole sequence expression will fail.

§expression1 | expression2

Ordered choice: it will try to match expression1, and if it was unsuccessful, it will try to match expression2. If none of them were successful, the choice expression will be unsuccessful too.

Precedence between choice and sequence: Choice has the least precedence of all expressions, i.e. these two rules are the same:

Rule1 =  'a' 'b'  |  'c' 'd'
Rule1 = ('a' 'b') | ('c' 'd')
§[expression]

Optional: Try to match the expression, but not fail if it does not match. If half the expression matches, and there are field matches in there, the fields will not be updated.

§{expression}

Closure: Match an expression zero or more times, greedily.

§{expression}+

Positive closure: Match an expression one or more times, greedily.

§!expression

Negative lookahead: fail if expression matches, succeed and don’t consume input if it doesn’t.

It may not contain any field declarations.

It has higher precedence than sequences and choices, so these two rules are the same:

Rule1 = a !b c;
Rule2 = a (!b) c;

Mainly useful to preserve the last element of a sequentially matched pattern:

Rule = first:Item {!(Item $) middle:Item} last:Item;

Or to match any character but one:

Rule = '(' {!')' body:char} ')';
§&expression

Positive lookahead: succeeds if expression matches, doesn’t consume input.

It may not contain any field declarations.

Precedence is the same as the Negative lookahead.

§(expression)

Group: match the expression. Used to clarify precedences.

§'string' or "string"

String or character literal. There is no difference between the quote styles. If there is only a single character between the quotes, an optimized character match will be generated.

The following escape sequences are supported:

  • \n, \r, \t, \\, \', \"
  • \xXX (converts to unicode between 0x80 and 0xFF)
  • \uXXXX (exactly 4 hex digits)
  • \U00XXXXXX (Exactly 8 hex digits, the first two must be 0)
  • \u{XXXX} (1 to 6 hex digits)
§i'string' or i"string"

Case insensitive string or character literal. Same as a string or character literals, except they match case-insensitively. The casing of the literal in the grammar does not matter.

ASCII only currently. Please open a GitHub issue if you need unicode support.

§'a'..'z'

Character range (inclusive).

The same escape sequences work as for strings.

§$

End of input: fail if there are any unparsed characters left.

§>rule

Include (a.k.a. “inline rule”): Include the rule body at this point. The referred rule cannot be a @char or @extern rule.

The following two grammars will generate equivalent parser codes and types:

ComplexRule = >OtherRule { >OtherRule2 };

OtherRule = x:Number "," y:Number;
OtherRule2 = "(" a:Number "," b:Number ")";
ComplexRule = x:Number "," y:Number { "(" a:Number "," b:Number ")" };

See also Override rules, which overlaps this feature.

§Rule match

Rules can be matched without recording their output, by simply using their name:

RuleName

This will match the rule, but not put it into any field, basically dropping any structure it may contain.

There is no automatic functionality of e.g. putting these results in a tuple in peginator.

A special rule type is char, which matches exactly one utf-8 character.

§Fields

Fields, parts of the generated Rule structs are written as part of any expression:

field_name:RuleName

This will create a field with field_name, and type FieldType. During parsing, the FieldType rule will be matched.

If a field is optional in the grammar (e.g. it’s part of an optional, or it does not exist on all choice alternatives), it will be encapsulated in an Option.

If a field appears multiple times (in the same choice arm) or is part of a closure, it will be encapsulated in a Vec.

If a field appears multiple times with different types, an enum type with the various choices will be created and automatically used. If all choices appear exactly one time, a simple enum is created, if any of them can appear multiple times, a Vec of the enum is created (allowing for heterogeneous collections)

Fields deep in the rule expressions will be flattened into the rule struct itself:

Rule = 'something' [ ':' { element:ElementRule ',' }+ ]
struct Rule {
    element:Vec<ElementRule>,
}

char can be used as a field type, in this case the field will have the type char, and will match exactly one utf-8 character.

§Override

Override declarations are similar to actual field declarations:

@:RuleType

This will match RuleType, and will directly assign the result to the Rule itself.

If there is only one override field, the rule’s type will be a type alias to the referred rule. If you want to force a new type, use the Include operator.

Override declarations can appear with multiple different types on different arms of choices. In this case, the Rule will become an enum declaration:

EnumRule = @:Type1 | @:Type2 | 'prefix' @:Type3;

Overrides must appear exactly once on all arms of choice declarations, and cannot be part of optional or closure construct.

char can be used as the type, in this case the rule will have the type char.

§Boxing

To Box the field, prefix the type name with * (useful if there is a cycle in the types, or the matched field is rare and big):

boxed_field:*BigRule

In case of enum fields and overrides, you can pick and choose which variants to box. The inner part of the choice will be boxed, even if every variant is boxed.

In case of Vec fields, the field will become a vector of boxes (i.e. Vec<Box<...>>).

§Directives

§@export

Exports the rule as a parse root, i.e. implements the PegParser traits on it.

§@no_skip_ws

Disables whitespace skipping for the rule. It does not disable skipping in the internals of matched rules.

§@position

Record the start and end positions (byte indexes) of the rule match in the Rule struct.

The position is recorded in a field named position, but you can also access it with the trait function position if you use the PegPosition trait.

If the rule is marked @string, a struct will be generated with two fields: string and position.

If the rule is has multiple overrides (i.e. it is an enum rule), all variants have to be marked with @position, and a special impl will be generated that returns the parsed variant’s position.

§@string

Force the rule to be a String. All field declarations will be ignored, and the whole match will be recorded.

This is functionality is meant to replace the use of regular expression matches.

Should be used with @no_skip_ws

§@char

Force the rule to be a char. These are special kind of rules, which shall not have any other directives, and can only contain a single choice, where all arms are simple characters, character ranges or char rule matches.

Useful for defining character classes:

@char
Hexadecimal = 'a'..'f' | 'A'..'F' | '0'..'9';

@char
IdentifierChar =  Hexadecimal | '_';

Cannot be combined with other directives except @check.

§@memoize

Enable memoization for the rule. It has a non-trivial performance impact (good or bad), so only use when it’s actually needed. (Mainly when having multiple alternative rules with the same, complex prefix and different postfix)

§@leftrec

Enable left recursion support on the rule. This also enables memoization. This directive will make the rule parsing slower, because it will be evaluated at least twice.

Its main use is building left-associative parse trees.

Keep in mind that left recursion can be trickier than they seem, and the interactions with other called left-recursive rules may be unintuitive (especially when combined with the > operator). In these cases, it might be better to rewrite the left-recursive rules to a list format (e.g. num { '+' num }, and create the associative tree in a post-processing step.

§@check(...)

Add additional checks to a rule. The “parameter” of the @check directive is a function name that will be called with the result of the rule. The function name should be fully qualified (i.e. start with a crate name or crate::).

The function shall return a bool. True means the check was successful.

Multiple checks can be added to a single rule.

For example, the following rule:

@check(crate::checks::check_point)
Point = '(' x:Number ',' y:Number ')'

Will call the following function in the checks module:

pub fn check_point(p: &Point) -> bool {
    ...
}
§@extern(...)

Call an external parsing function. The rule must only have a name and no = and body.

If a result type is not given, String is assumed.

An .into() is called on the result of the function. This is useful if a function returns an &str, but the result type is a String.

The function signature is

fn parse(s: &str) -> Result<(T, usize), &'static str>;

In case of a successful parse, a tuple with the resulting object, and the number of bytes (!) consumed from the input string shall be returned, wrapped in Ok(). It is important that the number of bytes is returned, not the number of characters, they will be different in the presence of utf-8 characters. Not doing so will misalign the parser, and could also panic.

In case of an unsuccessful parse, a static string shall be returned, describing the parse problem (preferably in “expected X” form).

Examples:

# Assumed return (and rule) type is String
@extern(crate::parse_identifier)
Identifier;

# Explicit return type
@extern(crate::Point::parse -> crate::Point)
Point;

The above will call the following functions:

// Note that the result can be both &str or String
fn parse_identifier(s: &str) -> Result<(&str, usize), &'static str>{
    ...
}

struct Point{...}

impl Point {
    pub fn parse(s: &str) -> Result<(Self, usize), &'static str>{
        ...
    }
}

Cannot be combined with other directives. To use other directives anyway, wrap it into a normal rule:

@extern(crate::parse_hungarian_word)
HungarianWordExt;

@position
@memoize
HungarianWord = word:HungarianWordExt;

§Whitespace skipping

By default, peginator will skip ASCII whitespaces before every rule match, field, override, literal, character range or end of input match.

This can be disabled on a rule basis with the @no_skip_ws directive.

Be sure to use @no_skip_ws on matched subrules too.

§The Whitespace rule

The whitespace matching is just a built-in rule named Whitespace. It can be used just like regular rule, which is useful for @no_skip_ws rules:

@no_skip_ws
Something = 'Sum:' Whitespace [prefix:Prefix] num:Number;

It can be overridden too. Comment skipping functionality can be implemented this way:

@no_skip_ws
Whitespace = {Comment |  '\t' | '\n' | '\x0C' | '\r' | ' '};

@no_skip_ws
Comment = '#' {!'\n' char} '\n';

Be sure to use @no_skip_ws on the Whitespace rule, and all rules it calls, or else the code will most likely run into infinite recursion.

Structs§

Enums§

Traits§

  • The main trait for interfacing with peginator. Implemented by @export-ed rules.
  • The trait that is actually implemented by the generated code. Should only be used in very specific cases, and PegParser should be used in all normal cases.
  • Helper trait to get the parse position of the parsed rule