tecta-peg 0.1.1

PDA-based TECTA PEG parser
Documentation
#set text(font: "Noto Sans")

#show raw: body => box(stroke: black.transparentize(75%), inset: (left: 2pt, right: 2pt), outset: (top: 2pt, bottom: 3pt), radius: 1pt, body)

#let yes = text(font: "JetBrains Mono", sym.checkmark)
#let no = text(font: "JetBrains Mono", sym.crossmark)

#align(horizon + center)[
    #set heading(outlined: false)
    = TECTA PEG
    The TECTA Parser Expression Grammar Descriptor Language
]

#pagebreak()
#outline()
#pagebreak()

#set heading(numbering: "1.")

= Overview
TECTA PEG is a token-based parser expression grammar descriptor language. It is purpose-built to create syntactical rules for TECTA, and is such opinionated in places --- this reference however exists to describe the parser and its characteristics for those willing to adopt its scheme.

TECTA PEG is designed to be implemented with a push-down automaton.

= Definitions
- Character: A Unicode scalar (non-surrogate codepoint).
- Basic identifier: A sequence of characters that matches the POSIX regex `[a-zA-Z_][0-9a-zA-Z_]*`.
- Punctuation character: One of `~!@$%^&*-=+\|;:,<.>/?`.
- String contents: A sequence of characters that does not contain some contextualised delimiter, and can contain some number of escape sequences (`\` followed by a character sequence).
- Stringed identifier: String contents delimited by backticks (#raw("`")).
- String literal: String contents delimited by quotation marks (`"`).
- Identifier: A basic identifier or stringed identifier.
- Group delimiter: Parentheses (`()`), brackets (`[]`), or braces (`{}`).
- Keyword: A basic identifier with syntactical significance. There are two variants:
    - Soft keywords: Keywords can appear in place of an identifier. For example, if `fun` is a keyword, and a rule expects an identifier, that rule will successfully parse the text `fun`.
    - Hard keywords: Keywords cannot appear in place of an identifier. Note that they can be escaped with stringed identifiers, as those are never interpreted as keywords; if `fun` is declared as a keyword, #raw("`fun`") can be used as an identifier.
- Token: A group of characters that has semantic value; the leaf of a PEG rule. One of:
    - A group of tokens surrounded by a group delimiter.
    - An integer literal with an optional suffix.
    - A floating-point literal.
    - A string literal.
    - An identifier.
    - A keyword.
- Raw push: A pushing action that does not follow the special behaviour described in @special-stack-behaviour.
- Raw pop: A popping action that does not follow the special behaviour described in @special-stack-behaviour.

= Modules

= Preambles
A preamble is denoted by an octothorpe (`#`) followed by a name. Any syntax

= The Grammar Stack
TECTA PEG works on a stack of grammar units. There are two types of grammar: control and rule. A rule grammar is what actually parses things and defines syntax, while control grammars are used for delimiting, such as grouping rules together.

= Rules
A grammar definition consists of a basic identifier, followed by an equal sign, followed by grammar contents. Grammar contents are a whitespace-separated sequence of grammar stack modifiers. An initially empty stack is created before parsing grammar contents, and each modifier is parsed in sequence, modifying the grammar stack. Once a grammar terminator (`;`) is parsed, the stack must consist entirely of rule grammars. The stack of rules now acts akin to a sequence rule, parsed bottom-to-top.

Rules, their modifiers, and preambles, are defined below.

== Punctuation
A punctuation rule is a sequence of punctuation characters surrounded by apostrophes/single quotes. This parses a sequence of punctuation tokens that have no whitespace next to each other.

For example, `'<<'` will push the `<<` punctuation rule onto the stack. It will parse two opening angle brackets next to each other. The text `<<` parses, but `< <` does not, as they are separated by whitespace (they are not *fused*).

== Keyword
A keyword rule is a basic identifier surrounded by quotation marks. This rule parses one keyword.

For example, `"fun"` will push the `fun` keyword rule onto the stack.

=== The `#keywords` preamble
The `#keywords` preamble specifies the "hardness" of keywords and optionally specifies them. It is followed by either `hard` or `soft`, which dictate whether or not keywords are hard or soft. This is then followed by a keyword list declarator, which comes in one of three forms:

- `auto`: The list of keywords is derived from their usage in rules; any presence of a keyword rule adds it to the keyword list. 
- `: <keywords>`, where `<keywords>` is a whitespace-separated sequence of basic identifiers: The list of keywords is specified beforehand. Any appearance in rules of keywords not in this list constitutes an error.
- `strict: <keywords>`, where `<keywords>` is a whitespace-separated sequence of basic identifiers: Similar to `\#keywords hard: \<keywords>`, but it is an error if any keywords in the list *aren't* used in a rule.

By default, `\#keywords soft auto` is the setting used.

== Other
An "other" rule consists of the name (basic identifier) of a different rule. It parses the rule that name refers to.

For example, `expr` will push a reference to the `expr` rule onto the stack, and will parse whatever `expr` parses.

== Builtin
A builtin rule consists of an `@` followed by a name. This denotes a rule that is built-in to the parser, most of which are basic tokens:
- `@name`: A name token.
- `@int_literal`: An integer literal (which can include length or other suffixes).
- `@natural_literal`: A integer literal without a suffix and greater than or equal to zero.
- `@float_literal`: A floating-point literal.
- `@token`: Any token.
- `@punct`: Any sequence of punctuation not separated by whitespace. Greedy; will accept punctuation until failure.
- `@single_punct`: Any one punctuation token.

For example, `@name` will push the `@name` builtin rule onto the stack, and will parse any name.

== Group
A group rule is a rule inside a pair of group delimiters.

More granularly, an opening group delimiter (one of `(`, `[`, or `{`) raw-pushes a _group start_ control, and a closing group delimiter (one of `)`, `]`, or `}`) will raw-pop the stack until it comes across a group start control. Any rules popped are made into a sequence. It is an error for any control `Peg` that is not a group start control to be popped before the group start control, and it is an error if the delimiter of the happened-upon group start control does not match the closing group delimiter. For example, `( < )` is an error, as is `[this)`.

For example, `("foo" "bar")` will parse a group token - which must be delimited by parentheses - and will then match the sequence `<"foo" "bar">` using the tokens inside the group. Once parsing of the inner rule is finished, no extra tokens may appear in the group; the text `(foo bar)` parses successfully, while `(foo bar baz)` does not.

== Repetition
A repetition rule consists of a main element rule and a separator rule.

To construct a group, `*` and `+` are used. Both pop a separator rule, followed by the main element, and construct a repetition rule. The difference is that `+` must have at least one main element, while `*` can have zero.

For example, `@name "and" +` pushes the rule that parses a sequence of names separated by the keyword `and`. At least one name must appear in the sequence; the text `x and y and z "foo"` by the previous definition parses successfully, as does `x "foo"`, however `"foo"` itself does not parse due to the imposed requirement of `+`. `@name "and" *` will successfully parse `"foo"`; it simply parses no elements. 

=== Trailing
By default, the separator rule must be followed by a main element; separators are not allowed to trail. To enable trailing, the `~` modifier can be used. This pops a repetition rule, enables trailing, and pushes it back.

For example, `@name ',' +~` parses a sequence of names separated by commas, with trailing enabled; `x, y, z` parses sucessfully, as does `x, y,`.

=== Separatorless
A sequence without a separator element can be created with the composed modifier sequence `.*~`, which in total, pops a single rule and pushes a single rule (`*` can be substituted with `+`):
- `.` pushes an empty sequence; this parses nothing and always succeeds.
- `*` pops the empty sequence and the previous rule, creating a repetition rule.
- As the empty sequence always suceeds, the repetition will *always* trail. However, since trailing is disallowed by default, `~` is used to enable this. Without the `~`, the repetition rule will always expect more and will hence always fail.

== Optional
An optional rule is a rule that is allowed to fail. If it fails, it is simply not parsed and is not present.

An optional rule is created with `?`. This pops a rule, makes it optional, and pushes it back.

== Sequence
A sequence rule is a group of other rules; it parses each rule in *sequence*. Sequences are also used for boxing and grouping.

For example, `<foo bar baz>` will first try to parse `foo`, and if it succeeds, it will try to parse `bar` wherever `foo` ended. This is repeated with `baz` at wherever `bar` ended. If at any point one fails, the entire rule fails.

A sequence is started with `<`, and is ended with `>`. Any rules in between are part of the sequence.

More granularly, `<` raw-pushes a _sequence start_ control. `>` raw-pops the stack until it comes across a sequence start control. Any rules popped in between are added to a list of rules, which is the list of rules the sequence consists of. It is an error for any other control `Peg` that is not a sequence start control to be popped before the sequence start control; `< ( >` is an error.

`.` is equivalent to `<>`; pushing an empty sequence.

=== Grouping
A sequence may have only one rule, and it is useful as a general rule grouper. For example, `<this | that>` is effective.

=== Recursion Caveat & Boxing
For recursively-defined rules, there are certain configurations which implementations are allowed to not accept. The strongly-typed reference implementation, for example, does not accept the following:
```rs
a = b c
b = a d
```
For an implementation that generates parsing structures for rules, it may create the `a` rule which contains a field of type `b`. This `b` type may have a field `a`, incurring an infinitely sized type. However, there are ways to circumvent this. Generally, only top-level rules are affected. A rule can be "boxed" by wrapping it with angle brackets. This creates a single-element sequence, which is allowed:
```rs
a = b c
b = <a> d
```

== Choice
A choice rule is also a group of other rules; it parses the first rule that succeeds.

A choice rule is a list of rules separated by `|`. For example, `foo | bar | baz` will first try to parse `foo`, and if it fails, will try to parse `bar` at the same position it started parsing `foo` with. This is repeated with `baz`. If at any point one succeeds, the entire rule succeeds.

More granularly, `|` first parses a single rule ahead of it, continuing to parse on a new stack until no control rules exist on that stack (ensuring sequences and groups are parsed in full), producing a single parsed rule. Then, it will raw-pop the stack and do one of two things:
- If a choice is popped, the rule that was parsed is added as a new variant and the choice is pushed back.
- If a control is popped or the stack is empty, an error is emitted.
- If any other rule is popped, rules will continually be popped until a control is popped (which is then pushed back) or the stack is empty. The rules popped will form a new sequence, and a choice with two variants - the new sequence, and the parsed-ahead rule - is created and pushed.

=== Precedence and Special Stack Behaviour <special-stack-behaviour>
Choices override the pushing and popping behaviours of the stack:
- When a choice is popped from the stack, it is instead the last variant of that choice that is popped instead of the entire choice.
- When the top of the stack is a choice, and a rule is pushed, the last variant is taken, and if it is a sequence, the rule is added to it, otherwise it creates a sequence of the existing variant and the new rule as the new final variant.

These rules essentially flip the precedence of choice from tight to loose; `foo bar | baz qux` is no longer parsed as `foo <bar | baz> qux`, and is instead parsed similar to `<foo bar> | <baz qux>`.