Module lex

Main module for lex library functionality

This module orchestrates the complete lex parsing pipeline. Lex is a simple format,
yet quite hard to parse: it is stateful, recursive, line-based, and indentation-
significant. The combination of these makes it a nightmare for conventional parsers.

While all of that is true, the format is designed with enough constraints that,
if correctly implemented, it is quite easy to parse. It does mean, however, that
reaching for existing libraries simply won't work: parser libraries handle
context-free, token-based, non-indentation-significant grammars. At best they are
flexible enough to handle one of these traits, but never all of them.

The Parser Design

After significant research and experimentation we settled on a design that is a bit
off the beaten path but breaks the complexity down into very simple pieces.

Instead of a straight lexing -> parsing pipeline, lex-parser performs the following steps:

    1. Semantic Indentation: we convert indent tokens into semantic Indent and
       Dedent events. A stateful machine tracks changes in indentation level and
       emits the corresponding events. See
       [semantic_indentation](lexing::transformations::semantic_indentation).

    2. Line Grouping: we group tokens into lines by splitting the token stream at
       line breaks. Each group becomes a Line token whose category is determined by
       the tokens inside. See [line_grouping](lexing::line_grouping).

    3. Tree Building (LineContainer): we build a tree of line groups reflecting the
       nesting structure. This groups line tokens into a hierarchical tree structure
       based on Indent/Dedent markers. See [to_line_container](token::to_line_container).

    4. Context Injection: we inject context information into each group allowing parsing
       to only read each level's lines. For example, sessions require preceding blank
       lines, but for a session that is the first element in its parent, that preceding
       blank line belongs to the parent. A synthetic token is injected to capture this
       context.

    5. Parsing by Level: parsing only needs to read each level's lines, which can
       include a LineContainer (that is, there is child content there), with no tree
       traversal needed. Parsing is done declaratively by processing the grammar patterns
       (regular strings) through Rust's regex engine. See the [parsing](parsing) module.
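To make the first and most stateful of these steps concrete, here is a minimal
sketch of a semantic-indentation machine. This is illustrative only: the real
implementation lives in [semantic_indentation](lexing::transformations::semantic_indentation)
and its token and event types differ; the names `Event` and `semantic_indentation`
here are assumptions, not the crate's API.

```rust
// Hypothetical sketch: track a stack of indentation widths and emit
// Indent/Dedent events whenever the leading-whitespace level changes.

#[derive(Debug, PartialEq)]
enum Event {
    Indent,
    Dedent,
    Line(String),
}

fn semantic_indentation(input: &str) -> Vec<Event> {
    // The stack of currently open indentation widths; level 0 is always open.
    let mut levels: Vec<usize> = vec![0];
    let mut events = Vec::new();
    for line in input.lines() {
        if line.trim().is_empty() {
            // Blank lines never change the indentation level.
            events.push(Event::Line(String::new()));
            continue;
        }
        let width = line.len() - line.trim_start().len();
        let current = *levels.last().unwrap();
        if width > current {
            // Deeper than before: open one new level.
            levels.push(width);
            events.push(Event::Indent);
        } else {
            // Shallower (or equal): close levels until we are back at `width`.
            while *levels.last().unwrap() > width {
                levels.pop();
                events.push(Event::Dedent);
            }
        }
        events.push(Event::Line(line.trim_start().to_string()));
    }
    // Close any levels still open at end of input.
    while levels.len() > 1 {
        levels.pop();
        events.push(Event::Dedent);
    }
    events
}
```

The point of emitting explicit Indent/Dedent events is that every later stage can
treat indentation as ordinary tokens instead of re-counting whitespace.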

On its own, each step is fairly simple, the whole pipeline summing to some 500 lines
of code, and each step is easy to test and verify.

The key here is that once tokens are grouped into a tree of lines, parsing becomes a
regular single pass: each level reads only its own lines, one of which may be a
LineContainer (that is, child content lives there), with no tree traversal needed.
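The tree-building step (and the reason no traversal is needed afterwards) can be
sketched as a small recursive fold over the flat event stream. Again this is a
hedged illustration, not the crate's actual code: `Node`, `Ev`, and `to_tree` are
invented names, and the real [to_line_container](token::to_line_container) operates
on the crate's own token types.

```rust
// Hypothetical sketch: fold a flat Indent/Dedent/Line event stream into a
// tree, so that each level of the tree holds exactly the lines a parser
// needs to read at that level.

#[derive(Debug, PartialEq)]
enum Node {
    Line(String),         // a single grouped line at this level
    Container(Vec<Node>), // nested child content (a "LineContainer")
}

enum Ev {
    Indent,
    Dedent,
    Line(String),
}

/// Consume events recursively: an Indent opens a child container and the
/// matching Dedent closes it, returning control to the parent level.
fn to_tree(events: &mut std::vec::IntoIter<Ev>) -> Vec<Node> {
    let mut level = Vec::new();
    while let Some(ev) = events.next() {
        match ev {
            Ev::Line(s) => level.push(Node::Line(s)),
            Ev::Indent => level.push(Node::Container(to_tree(events))),
            Ev::Dedent => break,
        }
    }
    level
}
```

With this shape in hand, a per-level parser simply iterates over one `Vec<Node>`,
matching its grammar patterns against `Line` entries and treating any `Container`
as opaque child content to be handed to the next level's pass.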

Whether passes 2-4 are really lexing or already parsing is left as a bikeshedding
exercise. The criterion for calling them lexing is that each transformation is
simply a grouping of tokens; no semantics are involved.

Pipeline Separation

In addition to the transformations over tokens, the codebase separates semantic
analysis (in [parsing](parsing)) from AST building (in [building](building)) and
from the final document assembly step (in [assembling](assembling)). These
separations share the same intention: keeping complexity localized and shallow at
each layer and making the system more testable. Line grouping and tree building
happen at the parsing stage, after lexing has already produced indent/dedent-aware
flat tokens.

For the complete end-to-end pipeline documentation, see the [parsing](parsing) module.

Modules

annotation: Annotation-specific helpers shared across lexer, parser, and builders.
assembling: Assembling module
ast: AST definitions and utilities for the lex format
building: AST building utilities for parsers
formats: Output format implementations for AST and token serialization
inlines: Inline parsing primitives
lexing: Lexer
loader: Document loading and transform execution
parsing: Parsing module for the lex format
testing: Testing utilities for AST assertions
token: Core token types and helpers shared across the lexer, parser, and tooling.
transforms: Transform pipeline infrastructure