//! Grammar Pattern Definitions
//!
//! This module defines the declarative grammar patterns used by the parser. Patterns
//! are defined as regex rules and are tried in declaration order for correct
//! disambiguation according to the grammar specification.
//!
//! Markers
//!
//! Markers are characters or small character sequences that have meaning in the grammar.
//! There is only one syntax marker, that is, a marker introduced by Lex itself. All others
//! occur naturally in ordinary text and keep the meaning they already convey there.
//!
//! The Lex marker (::):
//! In keeping with Lex's ethos of putting content first, there is only one formal
//! syntax element: the lex-marker, a double colon (::). It is used only in
//! metadata, in Data nodes. See [Data](crate::lex::ast::elements::data::Data).
//!
//! Sequence Markers (Natural):
//! Serial elements in Lex, like lists and sessions, can be decorated with sequence markers.
//! These range from plain formatting (a dash) to explicit sequencing with numbers,
//! letters and roman numerals. Ordered markers are followed by a period or a closing
//! parenthesis and come in short and extended forms:
//! <sequence-marker> = <plain-marker> | (<ordered-marker><separator>)+
//! Examples are -, 1., a., a), 1.b.II. and so on.
//!
//! Subject Markers (Natural):
//! Some elements take the form of subject and content, as in definitions and verbatim
//! blocks. The subject is marked by an ending colon (:).
//!
//! Lines
//!
//! Because Lex is line based, line tokens are all the grammar needs in order to parse
//! elements at any level. Only annotations and the end of verbatim blocks use data nodes,
//! which means that nearly all of Lex has to be parsed from naturally occurring text
//! lines, indentation and blank lines.
//!
//! Since this still happens in the lexing stage, each line must be tokenized into
//! a single category. In practice, a line can fit more than one category: for
//! example, a line might have both a sequence marker and a subject marker (as in
//! "1. Recap:").
//!
//! For this reason, some line tokens are OR tokens, and in other cases the order of line
//! categorization is crucial to getting the right result. While there are only a few
//! consequential marks in lines (blank, data, subject, list), keeping the line types
//! denormalized makes parsing simpler.
//!
//! The definitive set is the LineType enum (blank, data marker, data, subject,
//! list, subject-or-list-item, paragraph, dialog, indent, dedent). Containers are
//! a separate structural node, not a line token.
//!
//! Grammar Parse Order
//!
//! Patterns are matched in declaration order for correct disambiguation:
//! 1. verbatim-block - requires closing annotation, tried first for disambiguation
//! 2. annotation_block - block annotation with indented content
//! 3. annotation_single - single-line annotation only
//! 4. list_no_blank - 2+ list items without preceding blank (anywhere)
//! 5. list - preceding blank line + 2+ list items (blank consumed as node)
//! 6. session - requires subject + blank + indent (with context conditions)
//! 7. definition - requires subject + immediate indent
//! 8. paragraph (imperative) - any content-line or sequence thereof, stopping
//!    before list starts (2+ list-like lines) and definition starts
//!    (subject + container). Matched imperatively, not by regex.
//! 9. blank_line_group - one or more consecutive blank lines
//!
//! This ordering ensures that more specific patterns (like verbatim blocks) are matched
//! before more general ones (like paragraphs).
use once_cell::sync::Lazy;
use regex::Regex;
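
// Illustrative sketch only, not part of the original module. It shows, with std only,
// how the sequence-marker grammar from the module docs,
//   <sequence-marker> = <plain-marker> | (<ordered-marker><separator>)+
// could be recognized, and how a line such as "1. Recap:" ends up as an OR token.
// Assumptions (hypothetical, not taken from the Lex specification): an ordered marker
// is a run of ASCII digits or a run of ASCII letters (which also covers roman numerals),
// a separator is '.' or ')', and a list item needs content after its marker. Data and
// dialog lines are ignored here. All `Sketch*` / `sketch_*` names are invented.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SketchLineKind {
    Blank,
    Subject,
    List,
    SubjectOrListItem,
    Paragraph,
}

/// Returns true if `token` is a plain marker ("-") or one or more
/// ordered-marker + separator pairs, e.g. "1.", "a)", "1.b.II.".
pub fn sketch_is_sequence_marker(token: &str) -> bool {
    if token == "-" {
        return true; // plain marker
    }
    let bytes = token.as_bytes();
    let mut i = 0;
    let mut segments = 0;
    while i < bytes.len() {
        let start = i;
        // Consume one ordered marker: a run of digits or a run of letters.
        if bytes[i].is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
        } else {
            while i < bytes.len() && bytes[i].is_ascii_alphabetic() {
                i += 1;
            }
        }
        if i == start {
            return false; // no ordered marker at this position
        }
        // Each ordered marker must be followed by a separator.
        if i < bytes.len() && (bytes[i] == b'.' || bytes[i] == b')') {
            i += 1;
            segments += 1;
        } else {
            return false;
        }
    }
    segments > 0
}

/// Classifies one line; a line that is both list-like and subject-like
/// becomes the combined `SubjectOrListItem` kind instead of forcing a choice.
pub fn sketch_classify_line(line: &str) -> SketchLineKind {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        return SketchLineKind::Blank;
    }
    let mut words = trimmed.split_whitespace();
    let first = words.next().unwrap_or("");
    let is_list = sketch_is_sequence_marker(first) && words.next().is_some();
    let is_subject = trimmed.ends_with(':');
    match (is_list, is_subject) {
        (true, true) => SketchLineKind::SubjectOrListItem,
        (true, false) => SketchLineKind::List,
        (false, true) => SketchLineKind::Subject,
        (false, false) => SketchLineKind::Paragraph,
    }
}
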
/// Lazy-compiled regex for extracting list items from the list group capture.
///
/// This regex identifies individual list items and their optional nested containers
/// within the matched list pattern.
pub static LIST_ITEM_REGEX: Lazy<Regex> =
    Lazy::new(/* initializer elided */);
/// Grammar patterns as regex rules with names and patterns.
///
/// Order matters: patterns are tried in declaration order for correct disambiguation.
/// Each pattern is a tuple of (pattern_name, regex_pattern_string).
///
/// # Pattern Structure
///
/// - Named capture groups (e.g., `(?P<start>...)`) allow extracting specific parts
/// - Token types in angle brackets (e.g., `<data-marker-line>`) match grammar symbols
/// - `<container>` represents a nested indented block
/// - Quantifiers like `+` (one or more) and `{2,}` (two or more) enforce grammar rules
pub const GRAMMAR_PATTERNS: &[(&str, &str)] = &[/* pattern entries elided */];
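
// Illustrative sketch only, not part of the original module: how "tried in declaration
// order" resolves ambiguity. Rules are (name, pattern) tuples as described above, but to
// stay std-only the sketch swaps regex matching for plain prefix matching on a string of
// grammar symbols. The rule table, the `<subject-line>` symbol, the `subject_only` rule
// and all `SKETCH_*` / `sketch_*` names are hypothetical.
pub const SKETCH_RULES: &[(&str, &str)] = &[
    // More specific rule first: a subject line followed by a container.
    ("definition", "<subject-line><container>"),
    // More general rule second: any subject line on its own.
    ("subject_only", "<subject-line>"),
];

/// Returns the name of the first rule whose pattern is a prefix of `symbols`.
pub fn sketch_first_match<'a>(rules: &[(&'a str, &'a str)], symbols: &str) -> Option<&'a str> {
    for (name, prefix) in rules {
        if symbols.starts_with(*prefix) {
            return Some(*name);
        }
    }
    None
}

// With the table above, `sketch_first_match(SKETCH_RULES, "<subject-line><container>")`
// yields `Some("definition")`. With the two rules swapped, the general rule would shadow
// the specific one and the same input would yield `Some("subject_only")`, which is why
// the more specific patterns have to be declared first.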