Build and manage lexical analyzers.
The first step of the syntax-parsing pipeline is called lexical analysis. During this phase, the input text is separated into consecutive sequences of characters that have some atomic syntactic meaning, known as lexemes (or tokens).
Lexemes are usually classified into categories, or lexeme types, that identify “groups” of lexemes that have similar syntactic meaning: identifier, integer literal, operator, white space, etc. Each category is specified by a LexemeDescriptor, which defines the regex pattern that matches lexemes of that type.
Finally, the component that extracts the lexemes from a given input text is known as a lexical analyzer; it is compiled from a set of LexemeDescriptors.
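To make the idea of splitting text into typed lexemes concrete, here is a minimal, self-contained sketch of a hand-written tokenizer. It does not use this module's API: `Kind`, `Matcher`, the `match_*` functions, and `tokenize` are all hypothetical names invented for illustration. It also assumes a "longest match wins, earlier rule breaks ties" policy, which is a common convention but is not confirmed by this documentation.

```rust
/// Hypothetical lexeme categories, used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Integer,
    Addition,
    NotANumber,
}

/// A matcher returns the length (in bytes) of the longest non-empty prefix
/// of the input that it accepts, or `None` if it accepts no such prefix.
type Matcher = fn(&[u8]) -> Option<usize>;

/// Accepts an optional sign followed by one or more ASCII digits.
fn match_integer(input: &[u8]) -> Option<usize> {
    let sign = usize::from(matches!(input.first(), Some(&b'+') | Some(&b'-')));
    let digits = input[sign..].iter().take_while(|b| b.is_ascii_digit()).count();
    if digits > 0 { Some(sign + digits) } else { None }
}

/// Accepts a single `+` character.
fn match_addition(input: &[u8]) -> Option<usize> {
    if input.first() == Some(&b'+') { Some(1) } else { None }
}

/// Accepts the keyword `NaN`.
fn match_nan(input: &[u8]) -> Option<usize> {
    if input.starts_with(b"NaN") { Some(3) } else { None }
}

/// Splits `input` into `(kind, text)` pairs. At each position the rule with
/// the longest match wins (assumed policy); on a tie the earlier rule is kept.
/// Returns `Err(position)` if no rule matches at some byte offset.
fn tokenize(input: &str) -> Result<Vec<(Kind, &str)>, usize> {
    let rules: [(Kind, Matcher); 3] = [
        (Kind::Integer, match_integer),
        (Kind::Addition, match_addition),
        (Kind::NotANumber, match_nan),
    ];
    let bytes = input.as_bytes();
    let mut pos = 0;
    let mut lexemes = Vec::new();
    while pos < bytes.len() {
        let mut best: Option<(Kind, usize)> = None;
        for (kind, matcher) in rules.iter() {
            if let Some(len) = matcher(&bytes[pos..]) {
                // Only a strictly longer match replaces the current best.
                if best.map_or(true, |(_, best_len)| len > best_len) {
                    best = Some((*kind, len));
                }
            }
        }
        match best {
            Some((kind, len)) => {
                lexemes.push((kind, &input[pos..pos + len]));
                pos += len;
            }
            None => return Err(pos),
        }
    }
    Ok(lexemes)
}
```

In this sketch each rule plays the role that a lexeme descriptor plays in this module, with a plain function standing in for the regex pattern. A real analyzer would typically compile all patterns into a single automaton rather than trying each rule in turn.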
§Example
let lexical_analyzer = LexicalAnalyzer::new(vec![
    // Integer literals
    LexemeDescriptor::new(
        MyLexemeType::Integer,
        Regex::concat(vec![
            Regex::optional(
                Regex::union(vec![Regex::single_char('+'), Regex::single_char('-')])
            ),
            Regex::plus_from(Regex::character_range('0', '9')),
        ])
    ),
    // The addition operator
    LexemeDescriptor::special_char(MyLexemeType::Addition, '+'),
    // Invalid numbers
    LexemeDescriptor::keyword(MyLexemeType::NotANumber, "NaN"),
]);

// Use the lexical analyzer to parse structured input text
let input_text = &mut ByteArrayReader::from_string(String::from("-2+NaN+-45"));
let extracted_lexemes = lexical_analyzer.analyze(input_text);

// Validate the parsed output
let actual_lexemes = vec![
    Lexeme::new(MyLexemeType::Integer, "-2"),
    Lexeme::new(MyLexemeType::Addition, "+"),
    Lexeme::new(MyLexemeType::NotANumber, "NaN"),
    Lexeme::new(MyLexemeType::Addition, "+"),
    Lexeme::new(MyLexemeType::Integer, "-45"),
];
assert_eq!(extracted_lexemes.collect::<Vec<Lexeme<MyLexemeType>>>(), actual_lexemes);
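The example above refers to a user-defined `MyLexemeType` without showing its definition. A minimal sketch of what such a type might look like follows. The variant names are taken from the example, but the set of derived traits is an assumption: this section does not state which trait bounds the module requires, and `Debug` plus `PartialEq` are included only because `assert_eq!` would normally need them.

```rust
/// Hypothetical definition of the lexeme categories used in the example.
/// The derive list is an assumption, not a documented requirement.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MyLexemeType {
    /// An optionally signed run of decimal digits, such as `-45`.
    Integer,
    /// The `+` operator.
    Addition,
    /// The `NaN` keyword.
    NotANumber,
}
```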
Structs§
- Lexeme - A lexeme extracted from input text by a lexical analyzer.
- LexemeDescriptor - Describes a category of lexemes with similar syntactic meanings.
- LexicalAnalyzer - A lexical analyzer.
Enums§
- Regex - A regular-expression pattern over raw bytes.