#set text(font: "Noto Sans")
#show raw: body => box(stroke: black.transparentize(75%), inset: (left: 2pt, right: 2pt), outset: (top: 2pt, bottom: 3pt), radius: 1pt, body)
#let yes = text(font: "JetBrains Mono", sym.checkmark)
#let no = text(font: "JetBrains Mono", sym.crossmark)
#align(horizon + center)[
#set heading(outlined: false)
= TECTA PEG
The TECTA Parser Expression Grammar Descriptor Language
]
#pagebreak()
#outline()
#pagebreak()
#set heading(numbering: "1.")
= Overview
TECTA PEG is a token-based parser expression grammar descriptor language. It is purpose-built to create syntactical rules for TECTA, and is such opinionated in places --- this reference however exists to describe the parser and its characteristics for those willing to adopt its scheme.
TECTA PEG is designed to be implemented with a push-down automaton.
= Definitions
- Character: A Unicode scalar (non-surrogate codepoint).
- Basic identifier: A sequence of characters that matches the POSIX regex `[a-zA-Z_][0-9a-zA-Z_]*`.
- Punctuation character: One of `~!@$%^&*-=+\|;:,<.>/?`.
- String contents: A sequence of characters that does not contain some contextualised delimiter, and can contain some number of escape sequences (`\` followed by a character sequence).
- Stringed identifier: String contents delimited by backticks (#raw("`")).
- String literal: String contents delimited by quotation marks (`"`).
- Identifier: A basic identifier or stringed identifier.
- Group delimiter: Parentheses (`()`), brackets (`[]`), or braces (`{}`).
- Keyword: A basic identifier with syntactical significance. There are two variants:
- Soft keywords: Keywords can appear in place of an identifier. For example, if `fun` is a keyword, and a rule expects an identifier, that rule will successfully parse the text `fun`.
- Hard keywords: Keywords cannot appear in place of an identifier. Note that they can be escaped with stringed identifiers, as those are never interpreted as keywords; if `fun` is declared as a keyword, #raw("`fun`") can be used as an identifier.
- Token: A group of characters that has semantic value; the leaf of a PEG rule. One of:
- A group of tokens surrounded by a group delimiter.
- An integer literal with an optional suffix.
- A floating-point literal.
- A string literal.
- An identifier.
- A keyword.
- Raw push: A pushing action that does not follow the special behaviour described in @special-stack-behaviour.
- Raw pop: A popping action that does not follow the special behaviour described in @special-stack-behaviour.
= Modules
= Preambles
A preamble is denoted by an octothorpe (`#`) followed by a name. The syntax for a preamble is defined by the preamble itself, but they terminate with semicolons.
= The Grammar Stack
TECTA PEG works on a stack of grammar units. There are two types of grammar: control and rule. A rule grammar is what actually parses things and defines syntax, while control grammars are used for delimiting, such as grouping rules together.
= Rules
A grammar definition consists of a basic identifier, followed by an equal sign, followed by grammar contents. Grammar contents are a whitespace-separated sequence of grammar stack modifiers. An initially empty stack is created before parsing grammar contents, and each modifier is parsed in sequence, modifying the grammar stack. Once a grammar terminator (`;`) is parsed, the stack must consist entirely of rule grammars. The stack of rules now acts akin to a sequence rule, parsed bottom-to-top.
Rules, their modifiers, and preambles, are defined below.
== Punctuation
A punctuation rule is a sequence of punctuation characters surrounded by apostrophes/single quotes. This parses a sequence of punctuation tokens that have no whitespace next to each other.
For example, `'<<'` will push the `<<` punctuation rule onto the stack. It will parse two opening angle brackets next to each other. The text `<<` parses, but `< <` does not, as they are separated by whitespace (they are not *fused*).
== Keyword
A keyword rule is a basic identifier surrounded by quotation marks. This rule parses one keyword.
For example, `"fun"` will push the `fun` keyword rule onto the stack.
=== The `#keywords` preamble
To actually use keywords, they need to be specified beforehand.
The `#keywords` preamble specifies a set of keywords. It is followed by either `hard` or `soft` --- whether or not this keyword list is hard or soft --- a colon (`:`), and a whitespace-separated sequence of basic identifiers --- the list of keywords.
The hard and soft keywords lists are two separate lists, and if both are defined, must be disjoint. By default, no keywords are specified in either list.
== Other
An "other" rule consists of the name (basic identifier) of a different rule. It parses the rule that name refers to.
For example, `expr` will push a reference to the `expr` rule onto the stack, and will parse whatever `expr` parses.
== Builtin
A builtin rule consists of an `@` followed by a name. This denotes a rule that is built-in to the parser, most of which are basic tokens:
- `@name`: A name token.
- `@int_literal`: An integer literal (which can include length or other suffixes).
- `@natural_literal`: A integer literal without a suffix and greater than or equal to zero.
- `@float_literal`: A floating-point literal.
- `@token`: Any token.
- `@punct`: Any sequence of punctuation not separated by whitespace. Greedy; will accept punctuation until failure.
- `@single_punct`: Any one punctuation token.
For example, `@name` will push the `@name` builtin rule onto the stack, and will parse any name.
== Group
A group rule is a rule inside a pair of group delimiters.
More granularly, an opening group delimiter (one of `(`, `[`, or `{`) raw-pushes a _group start_ control, and a closing group delimiter (one of `)`, `]`, or `}`) will raw-pop the stack until it comes across a group start control. Any rules popped are made into a sequence. It is an error for any control `Peg` that is not a group start control to be popped before the group start control, and it is an error if the delimiter of the happened-upon group start control does not match the closing group delimiter. For example, `( < )` is an error, as is `[this)`.
For example, `("foo" "bar")` will parse a group token - which must be delimited by parentheses - and will then match the sequence `<"foo" "bar">` using the tokens inside the group. Once parsing of the inner rule is finished, no extra tokens may appear in the group; the text `(foo bar)` parses successfully, while `(foo bar baz)` does not.
== Repetition
A repetition rule consists of a main element rule and a separator rule.
To construct a group, `*` and `+` are used. Both pop a separator rule, followed by the main element, and construct a repetition rule. The difference is that `+` must have at least one main element, while `*` can have zero.
For example, `@name "and" +` pushes the rule that parses a sequence of names separated by the keyword `and`. At least one name must appear in the sequence; the text `x and y and z "foo"` by the previous definition parses successfully, as does `x "foo"`, however `"foo"` itself does not parse due to the imposed requirement of `+`. `@name "and" *` will successfully parse `"foo"`; it simply parses no elements.
=== Trailing
By default, the separator rule must be followed by a main element; separators are not allowed to trail. To enable trailing, the `~` modifier can be used. This pops a repetition rule, enables trailing, and pushes it back.
For example, `@name ',' +~` parses a sequence of names separated by commas, with trailing enabled; `x, y, z` parses sucessfully, as does `x, y,`.
=== Separatorless
A sequence without a separator element can be created with the composed modifier sequence `.*~`, which in total, pops a single rule and pushes a single rule (`*` can be substituted with `+`):
- `.` pushes an empty sequence; this parses nothing and always succeeds.
- `*` pops the empty sequence and the previous rule, creating a repetition rule.
- As the empty sequence always suceeds, the repetition will *always* trail. However, since trailing is disallowed by default, `~` is used to enable this. Without the `~`, the repetition rule will always expect more and will hence always fail.
== Optional
An optional rule is a rule that is allowed to fail. If it fails, it is simply not parsed and is not present.
An optional rule is created with `?`. This pops a rule, makes it optional, and pushes it back.
== Sequence
A sequence rule is a group of other rules; it parses each rule in *sequence*. Sequences are also used for boxing and grouping.
For example, `<foo bar baz>` will first try to parse `foo`, and if it succeeds, it will try to parse `bar` wherever `foo` ended. This is repeated with `baz` at wherever `bar` ended. If at any point one fails, the entire rule fails.
A sequence is started with `<`, and is ended with `>`. Any rules in between are part of the sequence.
More granularly, `<` raw-pushes a _sequence start_ control. `>` raw-pops the stack until it comes across a sequence start control. Any rules popped in between are added to a list of rules, which is the list of rules the sequence consists of. It is an error for any other control `Peg` that is not a sequence start control to be popped before the sequence start control; `< ( >` is an error.
`.` is equivalent to `<>`; pushing an empty sequence.
=== Grouping
A sequence may have only one rule, and it is useful as a general rule grouper. For example, `<this | that>` is effective.
=== Recursion Caveat & Boxing
For recursively-defined rules, there are certain configurations which implementations are allowed to not accept. The strongly-typed reference implementation, for example, does not accept the following:
```rs
a = b c
b = a d
```
For an implementation that generates parsing structures for rules, it may create the `a` rule which contains a field of type `b`. This `b` type may have a field `a`, incurring an infinitely sized type. However, there are ways to circumvent this. Generally, only top-level rules are affected. A rule can be "boxed" by wrapping it with angle brackets. This creates a single-element sequence, which is allowed:
```rs
a = b c
b = <a> d
```
== Choice
A choice rule is also a group of other rules; it parses the first rule that succeeds.
A choice rule is a list of rules separated by `|`. For example, `foo | bar | baz` will first try to parse `foo`, and if it fails, will try to parse `bar` at the same position it started parsing `foo` with. This is repeated with `baz`. If at any point one succeeds, the entire rule succeeds.
More granularly, `|` first parses a single rule ahead of it, continuing to parse on a new stack until no control rules exist on that stack (ensuring sequences and groups are parsed in full), producing a single parsed rule. Then, it will raw-pop the stack and do one of two things:
- If a choice is popped, the rule that was parsed is added as a new variant and the choice is pushed back.
- If a control is popped or the stack is empty, an error is emitted.
- If any other rule is popped, rules will continually be popped until a control is popped (which is then pushed back) or the stack is empty. The rules popped will form a new sequence, and a choice with two variants - the new sequence, and the parsed-ahead rule - is created and pushed.
=== Precedence and Special Stack Behaviour <special-stack-behaviour>
Choices override the pushing and popping behaviours of the stack:
- When a choice is popped from the stack, it is instead the last variant of that choice that is popped instead of the entire choice.
- When the top of the stack is a choice, and a rule is pushed, the last variant is taken, and if it is a sequence, the rule is added to it, otherwise it creates a sequence of the existing variant and the new rule as the new final variant.
These rules essentially flip the precedence of choice from tight to loose; `foo bar | baz qux` is no longer parsed as `foo <bar | baz> qux`, and is instead parsed similar to `<foo bar> | <baz qux>`.