1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
//! Main module for lex library functionality
//!
//! This module orchestrates the complete lex parsing pipeline. Lex is a simple format,
//! and yet quite hard to parse. Tactically it is stateful, recursive, line based and
//! indentation significant. The combination of these makes it a parsing nightmare.
//!
//! While these are all true, the format is designed with enough constraints so that,
//! if correctly implemented, it's quite easy to parse. However it does mean that using
//! available libraries simply won't work. Libraries can handle context free, token
//! based, non indentation significant grammars. At best, they are flexible enough to
//! handle one of these patterns, but never all of them.
//!
//! The Parser Design
//!
//! After significant research and experimentation we settled on a design that is a bit
//! off-the-beaten-path, but nicely breaks down complexity into very simple chunks.
//!
//! Instead of a straight lexing -> parsing pipeline, lex-parser does the following steps:
//!
//! 1. Semantic Indentation: we convert indent tokens into semantic events as indent
//! and dedent. This is a stateful machine that tracks changes in indentation
//! levels and emits indent and dedent events. See
//! [semantic_indentation](lexing::transformations::semantic_indentation).
//!
//! 2. Line Grouping: we group tokens into lines. Here we split tokens by line breaks
//! into groups of tokens. Each group is a Line token and which category is
//! determined by the tokens inside. See [line_grouping](lexing::line_grouping).
//!
//! 3. Tree Building (LineContainer): we build a tree of line groups reflecting the
//! nesting structure. This groups line tokens into a hierarchical tree structure
//! based on Indent/Dedent markers. See [to_line_container](token::to_line_container).
//!
//! 4. Context Injection: we inject context information into each group allowing parsing
//! to only read each level's lines. For example, sessions require preceding blank
//! lines, but for a session that is the first element in its parent, that preceding
//! blank line belongs to the parent. A synthetic token is injected to capture this
//! context.
//!
//! 5. Parsing by Level: parsing only needs to read each level's lines, which can
//! include a LineContainer (that is, there is child content there), with no tree
//! traversal needed. Parsing is done declaratively by processing the grammar patterns
//! (regular strings) through rust's regex engine. See [parsing](parsing) module.
//!
//! On their own, each step is fairly simple, their total sum being some 500 lines of code.
//! Additionally they are easy to test and verify.
//!
//! The key here is that parsing only needs to read each level's line, which can include
//! a LineContainer (that is, there is child content there), with no tree traversal needed.
//! Parsing is done declaratively by processing the grammar patterns (regular strings)
//! through rust's regex engine. Put another way, once tokens are grouped into a tree of
//! lines, parsing can be done in a regular single pass.
//!
//! Whether passes 2-4 are indeed lexing or actual parsing is left as a bike shedding
//! exercise. The criteria for calling these lexing has been that each transformation is
//! simply a grouping of tokens, there is no semantics.
//!
//! Pipeline Separation
//!
//! In addition to the transformations over tokens, the codebase separates the semantic
//! analysis (in [parsing](parsing)) from the AST building (in [building](building)) and
//! finally the final document assembly step (in [assembling](assembling)). These are done
//! with the same intention: keeping complexity localized and shallow at every one of these
//! layers and making the system more testable. Line grouping and tree building happen at
//! the parsing stage, after lexing has already produced indent/dedent-aware flat tokens.
//!
//! For the complete end-to-end pipeline documentation, see [parsing](parsing) module.