peginator 0.7.0

PEG parser generator for creating ASTs in Rust (runtime)
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
## Peginator syntax

The syntax is similar to [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form),
except for small additions, the lack of a sequence marker, and `|` used for ordered choice
instead of `/`.

### Rules

A peginator grammar is a list of rules.

Rules look like this:

```ebnf
RuleName = expression;
```

Rule names are directly used as Rust type names, so they shouldn't conflict with standard
type names, keywords, or the Peginator built-in names.

Use PascalCase for rule names, if you can. Fully upper-case names are not special.

You can prefix rules with various [directives](#directives), the most important being
[@export](#export).

### Expressions

#### `expression1 expression2`

**Sequence**: To match two or more expression parts one after another. If any of the expressions fail
to match, the whole sequence expression will fail.

#### `expression1 | expression2`

**Ordered choice**: it will try to match expression1, and if it was unsuccessful,
it will try to match expression2. If none of them were successful, the choice expression
will be unsuccessful too.

**Precedence between choice and sequence**: Choice has the least precedence of all expressions, i.e. these two rules are the same:

```ebnf
Rule1 =  'a' 'b'  |  'c' 'd'
Rule1 = ('a' 'b') | ('c' 'd')
```

#### `[expression]`

**Optional**: Try to match the expression, but not fail if it does not match. If half the
expression matches, and there are field matches in there, the fields will not
be updated.

#### `{expression}`

**Closure**: Match an expression zero or more times, greedily.

#### `{expression}+`

**Positive closure**: Match an expression one or more times, greedily.

#### `!expression`

**Negative lookahead**: fail if expression matches, succeed and don't consume input if it doesn't.

It may not contain any field declarations.

It has higher precedence than sequences and choices, so these two rules are the same:

```ebnf
Rule1 = a !b c;
Rule2 = a (!b) c;
```

Mainly useful to preserve the last element of a sequentially matched pattern:

```ebnf
Rule = first:Item {!(Item $) middle:Item} last:Item;
```

Or to match any character but one:
```ebnf
Rule = '(' {!')' body:char} ')';
```

#### `&expression`

**Positive lookahead**: succeeds if expression matches, doesn't consume input.

It may not contain any field declarations.

Precedence is the same as the Negative lookahead.

#### `(expression)`

**Group**: match the expression. Used to clarify precedences.

#### `'string'` or `"string"`

**String or character literal**. There is no difference between the quote styles. If there
is only a single character between the quotes, an optimized character match
will be generated.

The following escape sequences are supported:
* `\n`, `\r`, `\t`, `\\`, `\'`, `\"`
* `\xXX` (converts to unicode between 0x80 and 0xFF)
* `\uXXXX` (exactly 4 hex digits)
* `\U00XXXXXX` (Exactly 8 hex digits, the first two must be `0`)
* `\u{XXXX}` (1 to 6 hex digits)

#### `i'string'` or `i"string"`

**Case insensitive string or character literal**. Same as a string or character literals,
except they match case-insensitively. The casing of the literal in the grammar does not
matter.

ASCII only currently. Please open a GitHub issue if you need unicode support.

#### `'a'..'z'`

**Character range** (inclusive).

The same escape sequences work as for strings.

#### `$`

**End of input**: fail if there are any unparsed characters left.

#### `>rule`

**Include** (a.k.a. "inline rule"): Include the rule body at this point. The referred
rule cannot be a `@char` or `@extern` rule.

The following two grammars will generate equivalent parser codes and types:

```ebnf
ComplexRule = >OtherRule { >OtherRule2 };

OtherRule = x:Number "," y:Number;
OtherRule2 = "(" a:Number "," b:Number ")";
```

```ebnf
ComplexRule = x:Number "," y:Number { "(" a:Number "," b:Number ")" };
```

See also Override rules, which overlaps this feature.

#### Rule match

Rules can be matched without recording their output, by simply using their name:

```ebnf
RuleName
```

This will match the rule, but not put it into any field, basically dropping any structure
it may contain.

There is no automatic functionality of e.g. putting these results in a tuple in peginator.

A special rule type is `char`, which matches exactly one utf-8 character.

### Fields

Fields, parts of the generated Rule `struct`s are written as part of any expression:

```ebnf
field_name:RuleName
```

This will create a field with `field_name`, and type `FieldType`. During parsing, the
`FieldType` rule will be matched.

If a field is optional in the grammar (e.g. it's part of an optional, or it
does not exist on all choice alternatives), it will be encapsulated in an
`Option`.

If a field appears multiple times (in the same choice arm) or is part of a
closure, it will be encapsulated in a `Vec`.

If a field appears multiple times with different types, an `enum` type with the various
choices will be created and automatically used. If all choices appear exactly one time,
a simple enum is created, if any of them can appear multiple times, a `Vec` of the enum
is created (allowing for heterogeneous collections)

Fields deep in the rule expressions will be flattened into the rule `struct` itself:

```ebnf
Rule = 'something' [ ':' { element:ElementRule ',' }+ ]
```

```ignore
struct Rule {
    element:Vec<ElementRule>,
}
```

`char` can be used as a field type, in this case the field will have the type `char`, and
will match exactly one utf-8 character.

#### Override

Override declarations are similar to actual field declarations:
```ebnf
@:RuleType
```

This will match `RuleType`, and will directly assign the result to the Rule itself.

If there is only one override field, the rule's type will be a type alias to the
referred rule. If you want to force a new type, use the Include operator.

Override declarations can appear with multiple different types on different arms of
choices. In this case, the Rule will become an `enum` declaration:

```ebnf
EnumRule = @:Type1 | @:Type2 | 'prefix' @:Type3;
```

Overrides must appear exactly once on all arms of choice declarations, and cannot be
part of optional or closure construct.

`char` can be used as the type, in this case the rule will have the type `char`.

#### Boxing

To `Box` the field, prefix the type name with `*` (useful if there is a cycle in the types,
or the matched field is rare and big):

```ebnf
boxed_field:*BigRule
```

In case of `enum` fields and overrides, you can pick and choose which variants to box.
The inner part of the choice will be boxed, even if every variant is boxed.

In case of `Vec` fields, the field will become a vector of boxes (i.e. `Vec<Box<...>>`).

### Directives

#### `@export`

Exports the rule as a parse root, i.e. implements the [PegParser] traits on it.

#### `@no_skip_ws`

Disables whitespace skipping for the rule. It does not disable skipping in the internals
of matched rules.

#### `@position`

Record the start and end positions (byte indexes) of the rule match in the Rule `struct`.

The position is recorded in a field named `position`, but you can also access it with
the trait function `position` if you use the [PegPosition] trait.

If the rule is marked `@string`, a struct will be generated with two fields: `string` and
`position`.

If the rule is has multiple overrides (i.e. it is an enum rule), all variants have to be
marked with `@position`, and a special `impl` will be generated that returns the parsed
variant's position.

#### `@string`

Force the rule to be a `String`. All field declarations will be ignored, and the whole match
will be recorded.

This is functionality is meant to replace the use of regular expression matches.

Should be used with `@no_skip_ws`

#### `@char`

Force the rule to be a `char`. These are special kind of rules, which shall not have any other
directives, and can only contain a single choice, where all arms are simple characters, character
ranges or char rule matches.

Useful for defining character classes:
```ebnf
@char
Hexadecimal = 'a'..'f' | 'A'..'F' | '0'..'9';

@char
IdentifierChar =  Hexadecimal | '_';
```

Cannot be combined with other directives except `@check`.

#### `@memoize`

Enable memoization for the rule. It has a non-trivial performance impact (good or bad), so only use
when it's actually needed. (Mainly when having multiple alternative rules with the same, complex
prefix and different postfix)

#### `@leftrec`

Enable left recursion support on the rule. This also enables memoization. This directive _will_ make
the rule parsing slower, because it will be evaluated at least twice.

Its main use is building left-associative parse trees.

Keep in mind that left recursion can be trickier than they seem, and the interactions with other
called left-recursive rules may be unintuitive (especially when combined with the `>` operator).
In these cases, it might be better to rewrite the left-recursive rules to a list format (e.g.
`num { '+' num }`, and create the associative tree in a post-processing step.

#### `@check(...)`

Add additional checks to a rule. The "parameter" of the `@check` directive is a function name that
will be called with the result of the rule. The function name should be fully qualified (i.e. start
with a crate name or `crate::`).

The function shall return a bool. True means the check was successful.

Multiple checks can be added to a single rule.

For example, the following rule:

```ebnf
@check(crate::checks::check_point)
Point = '(' x:Number ',' y:Number ')'
```

Will call the following function in the `checks` module:

```ignore
pub fn check_point(p: &Point) -> bool {
    ...
}
```

#### `@extern(...)`

Call an external parsing function. The rule must only have a name and no `=` and body.

If a result type is not given, String is assumed.

An `.into()` is called on the result of the function. This is useful if a function returns an &str,
but the result type is a String.

The function signature is

```ignore
fn parse(s: &str) -> Result<(T, usize), &'static str>;
```

In case of a successful parse, a tuple with the resulting object, and the number of bytes (!) consumed
from the input string shall be returned, wrapped in Ok(). It is important that the number of bytes is
returned, not the number of characters, they will be different in the presence of utf-8 characters.
Not doing so will misalign the parser, and could also panic.

In case of an unsuccessful parse, a static string shall be returned, describing the parse problem
(preferably in "expected X" form).


Examples:

```ebnf
# Assumed return (and rule) type is String
@extern(crate::parse_identifier)
Identifier;

# Explicit return type
@extern(crate::Point::parse -> crate::Point)
Point;
```

The above will call the following functions:

```ignore
// Note that the result can be both &str or String
fn parse_identifier(s: &str) -> Result<(&str, usize), &'static str>{
    ...
}

struct Point{...}

impl Point {
    pub fn parse(s: &str) -> Result<(Self, usize), &'static str>{
        ...
    }
}
```

Cannot be combined with other directives. To use other directives anyway, wrap it into a normal rule:

```ebnf
@extern(crate::parse_hungarian_word)
HungarianWordExt;

@position
@memoize
HungarianWord = word:HungarianWordExt;
```

## Whitespace skipping

By default, peginator will skip ASCII whitespaces before every rule match, field, override, literal,
character range or end of input match.

This can be disabled on a rule basis with the `@no_skip_ws` directive.

Be sure to use `@no_skip_ws` on matched subrules too.

#### The `Whitespace` rule

The whitespace matching is just a built-in rule named `Whitespace`. It can be used just like regular
rule, which is useful for `@no_skip_ws` rules:

```ebnf
@no_skip_ws
Something = 'Sum:' Whitespace [prefix:Prefix] num:Number;
```

It can be overridden too. Comment skipping functionality can be implemented this way:

```ebnf
@no_skip_ws
Whitespace = {Comment |  '\t' | '\n' | '\x0C' | '\r' | ' '};

@no_skip_ws
Comment = '#' {!'\n' char} '\n';
```

Be sure to use `@no_skip_ws` on the `Whitespace` rule, and all rules it calls, or else the code will
most likely run into infinite recursion.

[PegParser]: https://docs.rs/peginator/latest/peginator/trait.PegParser.html
[PegPosition]: https://docs.rs/peginator/latest/peginator/trait.PegPosition.html