#set text(font: "Noto Sans")
#show raw: body => box(stroke: black.transparentize(75%), inset: (left: 2pt, right: 2pt), outset: (top: 2pt, bottom: 3pt), radius: 1pt, body)
#let yes = text(font: "JetBrains Mono", sym.checkmark)
#let no = text(font: "JetBrains Mono", sym.crossmark)
#align(horizon + center)[
#set heading(outlined: false)
= TECTA PEG
The TECTA Parser Expression Grammar Descriptor Language
]
#pagebreak()
#outline()
#pagebreak()
#set heading(numbering: "1.")
= Overview
TECTA PEG is a token-based parser expression grammar descriptor language. It is purpose-built to create syntactical rules for TECTA, and is such opinionated in places --- this reference however exists to describe the parser and its characteristics for those willing to adopt its scheme.
TECTA PEG is designed to be implemented with a push-down automaton.
= Definitions
- Character: A Unicode scalar (non-surrogate codepoint).
- Basic identifier: A sequence of characters that matches the POSIX regex `[a-zA-Z_][0-9a-zA-Z_]*`.
- Punctuation character: One of `~!@$%^&*-=+\|;:,<.>/?`.
- String contents: A sequence of characters that does not contain some contextualised delimiter, and can contain some number of escape sequences (`\` followed by a character sequence).
- Stringed identifier: String contents delimited by backticks (#raw("`")).
- String literal: String contents delimited by quotation marks (`"`).
- Identifier: A basic identifier or stringed identifier.
- Group delimiter: Parentheses (`()`), brackets (`[]`), or braces (`{}`).
- Keyword: A basic identifier with syntactical significance. There are two variants:
- Soft keywords: Keywords can appear in place of an identifier. For example, if `fun` is a keyword, and a rule expects an identifier, that rule will successfully parse the text `fun`.
- Hard keywords: Keywords cannot appear in place of an identifier. Note that they can be escaped with stringed identifiers, as those are never interpreted as keywords; if `fun` is declared as a keyword, #raw("`fun`") can be used as an identifier.
- Token: A group of characters that has semantic value; the leaf of a PEG rule. One of:
- A group of tokens surrounded by a group delimiter.
- An integer literal with an optional suffix.
- A floating-point literal.
- A string literal.
- An identifier.
- A keyword.
- Raw push: A pushing action that does not follow the special behaviour described in @special-stack-behaviour.
- Raw pop: A popping action that does not follow the special behaviour described in @special-stack-behaviour.
= Modules
= Preambles
A preamble is denoted by an octothorpe (`#`) followed by a name. The syntax for a preamble is defined by the preamble itself, but they terminate with semicolons.
= The Grammar Stack
TECTA PEG works on a stack of grammar units. There are two types of grammar: control and rule. A rule grammar is what actually parses things and defines syntax, while control grammars are used for delimiting, such as grouping rules together.
= Rules
A grammar definition consists of a basic identifier, followed by an equal sign, followed by grammar contents. Grammar contents are a whitespace-separated sequence of grammar stack modifiers. An initially empty stack is created before parsing grammar contents, and each modifier is parsed in sequence, modifying the grammar stack. Once a grammar terminator (`;`) is parsed, the stack must consist entirely of rule grammars. The stack of rules now acts akin to a sequence rule, parsed bottom-to-top.
Rules, their modifiers, and preambles, are defined below.
== Punctuation
A punctuation rule is a sequence of punctuation characters surrounded by apostrophes/single quotes. This parses a sequence of punctuation tokens that have no whitespace next to each other.
For example, `'<<'` will push the `<<` punctuation rule onto the stack. It will parse two opening angle brackets next to each other. The text `<<` parses, but `< <` does not, as they are separated by whitespace (they are not *fused*).
== Keyword
A keyword rule is a basic identifier surrounded by quotation marks. This rule parses one keyword.
For example, `"fun"` will push the `fun` keyword rule onto the stack.
=== The `#keywords` preamble
To actually use keywords, they need to be specified beforehand.
The `#keywords` preamble specifies a set of keywords. It is followed by either `hard` or `soft` --- whether or not this keyword list is hard or soft --- a colon (`:`), and a whitespace-separated sequence of basic identifiers --- the list of keywords.
The hard and soft keywords lists are two separate lists, and if both are defined, must be disjoint. By default, no keywords are specified in either list.
== Other
An "other" rule consists of the name (basic identifier) of a different rule. It parses the rule that name refers to.
For example, `expr` will push a reference to the `expr` rule onto the stack, and will parse whatever `expr` parses.
== Builtin
A builtin rule consists of an `@` followed by a name. This denotes a rule that is built-in to the parser, most of which are basic tokens:
- `@name`: A name token.
- `@int_literal`: An integer literal (which can include length or other suffixes).
- `@natural_literal`: A integer literal without a suffix and greater than or equal to zero.
- `@float_literal`: A floating-point literal.
- `@token`: Any token.
- `@punct`: Any sequence of punctuation not separated by whitespace. Greedy; will accept punctuation until failure.
- `@single_punct`: Any one punctuation token.
For example, `@name` will push the `@name` builtin rule onto the stack, and will parse any name.
== Group
A group rule is a rule inside a pair of group delimiters.
More granularly, an opening group delimiter (one of `(`, `[`, or `{`) raw-pushes a _group start_ control, and a closing group delimiter (one of `)`, `]`, or `}`) will raw-pop the stack until it comes across a group start control. Any rules popped are made into a sequence. It is an error for any control `Peg` that is not a group start control to be popped before the group start control, and it is an error if the delimiter of the happened-upon group start control does not match the closing group delimiter. For example, `( < )` is an error, as is `[this)`.
For example, `("foo" "bar")` will parse a group token - which must be delimited by parentheses - and will then match the sequence `<"foo" "bar">` using the tokens inside the group. Once parsing of the inner rule is finished, no extra tokens may appear in the group; the text `(foo bar)` parses successfully, while `(foo bar baz)` does not.
== Repetition
A repetition rule consists of a main element rule and a separator rule.
To construct a group, `*` and `+` are used. Both pop a separator rule, followed by the main element, and construct a repetition rule. The difference is that `+` must have at least one main element, while `*` can have zero.
For example, `@name "and" +` pushes the rule that parses a sequence of names separated by the keyword `and`. At least one name must appear in the sequence; the text `x and y and z "foo"` by the previous definition parses successfully, as does `x "foo"`, however `"foo"` itself does not parse due to the imposed requirement of `+`. `@name "and" *` will successfully parse `"foo"`; it simply parses no elements.
=== Trailing
By default, the separator rule must be followed by a main element; separators are not allowed to trail. To enable trailing, the `~` modifier can be used. This pops a repetition rule, enables trailing, and pushes it back.
For example, `@name ',' +~` parses a sequence of names separated by commas, with trailing enabled; `x, y, z` parses sucessfully, as does `x, y,`.
=== Separatorless
A sequence without a separator element can be created with the composed modifier sequence `.*~`, which in total, pops a single rule and pushes a single rule (`*` can be substituted with `+`):
- `.` pushes an empty sequence; this parses nothing and always succeeds.
- `*` pops the empty sequence and the previous rule, creating a repetition rule.
- As the empty sequence always suceeds, the repetition will *always* trail. However, since trailing is disallowed by default, `~` is used to enable this. Without the `~`, the repetition rule will always expect more and will hence always fail.
== Optional
An optional rule is a rule that is allowed to fail. If it fails, it is simply not parsed and is not present.
An optional rule is created with `?`. This pops a rule, makes it optional, and pushes it back.
== Sequence
A sequence rule is a group of other rules; it parses each rule in *sequence*. Sequences are also used for grouping.
For example, `<foo bar baz>` will first try to parse `foo`, and if it succeeds, it will try to parse `bar` wherever `foo` ended. This is repeated with `baz` at wherever `bar` ended. If at any point one fails, the entire rule fails.
A sequence is started with `<`, and is ended with `>`. Any rules in between are part of the sequence.
More granularly, `<` raw-pushes a _sequence start_ control. `>` raw-pops the stack until it comes across a sequence start control. Any rules popped in between are added to a list of rules, which is the list of rules the sequence consists of. It is an error for any other control `Peg` that is not a sequence start control to be popped before the sequence start control; `< ( >` is an error.
`.` is equivalent to `<>`; pushing an empty sequence.
Sequences can be *weak* or *strong*. Weak sequences are those created implicitly, and strong ones are those created via `<...>` or `.` syntax.
When popping, a weak sequence will give up only its last rule, while a strong sequence will give itself entirely. See @special-stack-behaviour.
=== Grouping
A sequence may have only one rule, and it is useful as a general rule grouper. For example, `<this | that>` is effective.
== Record
A record rule is parsed similarly to a sequence rule, but each piece of the sequence can be given a name.
Record rules start with `&{` and are followed by a sequence of fields terminated by `}`. A field is a basic identifier or a bang (`!`), a colon, and grammar contents terminated by a semicolon. If the field name is a bang, the rule is parsed, but the results are discarded. For example, the rule `&{ name: @name; !: ':'; type: @name; }` parses the text `x: int`, and produces a 2-field record of `name` and `type`. The colon is parsed --- `type` parses the text `int` --- but it is not held, as the field name is a bang.
== Choice
A choice rule is also a group of other rules; it parses the first rule that succeeds.
A choice rule is a list of rules separated by `|`. For example, `foo | bar | baz` will first try to parse `foo`, and if it fails, will try to parse `bar` at the same position it started parsing `foo` with. This is repeated with `baz`. If at any point one succeeds, the entire rule succeeds.
More granularly, `|` first parses a single rule ahead of it, continuing to parse on a new stack until no control rules exist on that stack (ensuring sequences and groups are parsed in full), producing a single parsed rule. Then, it will raw-pop the stack and do one of two things:
- If a choice is popped, the rule that was parsed is added as a new variant and the choice is pushed back.
- If a control is popped or the stack is empty, an error is emitted.
- If any other rule is popped, rules will continually be popped until a control is popped (which is then pushed back) or the stack is empty. The rules popped will form a new sequence, and a choice with two variants - the new sequence, and the parsed-ahead rule - is created and pushed.
== Named Choice
A named choice is functionally identical to a choice, but like a record, each _variant_ can be named. Named choices start with `&[`, and are followed by a sequence of fields terminated by `]`. These fields are the same as record fields, but eliding the name is disallowed (no `!`).
Implementations are allowed to deny named choices if they are not at the top level or if multiple rules in sequence have been given for a rule. For example, the rule definition `a = &{ !: "fun" } | @name;` may be rejected. Definitions such as `a = &{ !: "fun" } @name;` can be rejected as well.
#pagebreak()
== Boxing and Cyclicity
A boxed rule does not functionally affect a rule, but it allows for cyclic rule definitions. While implementations _may_ accept cyclic rules, strongly-typed code generation may have issues.
Take the following TECTA PEG:
```
a = b;
b = a;
```
A code generator may generate items as such:
```rs
struct a(b);
struct b(a);
```
`a` contains a single field, `b`, which contains a single field `a`, completing a cycle.
A boxed rule is created with `^`. This pops a rule, boxes it, and places it back on the stack.
With boxed rules, you can circumvent the above cyclicity issue:
```
a = b;
b = a^;
```
All implementations must accept this.
This may emit code now similar to:
```rs
struct a(b);
struct b(box<a>);
```
Now, although `a` holds `b`, `b`'s `a` is instead heap-allocated or passed through some level of indirection such that `a` does not affect the layout of `b`, destroying the type-level cyclicity but usefully keeping it at the rule level.
== Precedence and Special Stack Behaviour of Choices <special-stack-behaviour>
Choices override the pushing and popping behaviours of the stack:
- When a choice is popped from the stack, it is instead the last variant of that choice that is popped instead of the entire choice. An empty weak sequence takes that variants place if it was not already a weak sequence.
- When the top of the stack is a choice, and a rule is pushed, the last variant is taken, and if it is a weak sequence, the rule is added to it, otherwise it creates a weak sequence of the existing variant and the new rule as the new final variant.
These rules essentially flip the precedence of choice from tight to loose; `foo bar | baz qux` is no longer parsed as `foo <bar | baz> qux`, and is instead parsed similar to `<foo bar> | <baz qux>`.