§lexit
A configurable and robust lexical analyzer (lexer) library for Rust.
This crate provides a powerful macro-based approach to defining programming language lexers with support for keywords, operators, identifiers, literals, whitespace, and comments, including intricate features like paired delimiters and multi-line comments.
§Getting Started
Add lexit to your project with cargo add lexit.
§Defining a Language
Language creation is streamlined with the define_language! macro and the smaller token creation macros, including:
- token!(token_name, regex, priority, store_match) # The final argument controls whether the text that matches the token is stored and returned.
- keyword!(token_name, regex, priority) # Keywords are fixed strings, so storing the matched text would be unnecessary.
- ignore_token!(token_name, regex, priority) # For matches you want to catch and then discard. These are not returned as tokens and are usually used to skip whitespace.
- open_pair!(token_name, regex, counterpart_name, priority) # Defines one half of a token pair that must exist together, like '(' and ')'. The language will return an error if you do not define a closing counterpart for an open pair, and the lexer will error if there are any unclosed open pairs.
- close_pair!(token_name, regex, counterpart_name, priority) # The counterpart to open_pair. Likewise, the language will error if there is no matching open pair, and the lexer will error on a stray closing pair without a matching open pair.
- ignore_until!(token_name, regex, ending_regex, priority) # Generally used for comments: when the lexer matches regex, it ignores all characters until it matches ending_regex, then resumes normal lexing.
Note: Priority is only used when two tokens could match with the same length. The lexer uses maximal munch, so a longer match always wins outright. In the example below, "int" matches both INT_KEYWORD and IDENTIFIER at the same length, and the keyword wins because its priority (100) is higher than the identifier's (60); by contrast, "==" lexes as a single EQUALS_COMP rather than two ASSIGN tokens simply because the longer match wins.
Below is an example of defining a small subset of the C language:
use lexit::{define_language, token, keyword, ignore_token, open_pair, close_pair, ignore_until};
let language_result = define_language! {
ignore_token!("WHITESPACE", r"\s+", 10),
ignore_until!("SINGLE_LINE_COMMENT", r"//", r"\n", 5),
ignore_until!("MULTI_LINE_COMMENT", r"/\*", r"\*/", 5),
keyword!("INT_KEYWORD", r"\bint\b", 100),
keyword!("IF_KEYWORD", r"\bif\b", 100),
keyword!("ELSE_KEYWORD", r"\belse\b", 100),
keyword!("WHILE_KEYWORD", r"\bwhile\b", 100),
keyword!("RETURN_KEYWORD", r"\breturn\b", 100),
keyword!("VOID_KEYWORD", r"\bvoid\b", 100),
keyword!("EQUALS_COMP", r"==", 95),
keyword!("LESS_THAN_COMP", r"<", 90),
keyword!("ASSIGN", r"=", 90),
keyword!("PLUS", r"\+", 90),
keyword!("MINUS", r"-", 90),
keyword!("MULTIPLY", r"\*", 90),
keyword!("DIVIDE", r"/", 90),
open_pair!("LEFT_PAREN", r"\(", "RIGHT_PAREN", 80),
close_pair!("RIGHT_PAREN", r"\)", "LEFT_PAREN", 80),
open_pair!("LEFT_BRACE", r"\{", "RIGHT_BRACE", 80),
close_pair!("RIGHT_BRACE", r"\}", "LEFT_BRACE", 80),
keyword!("SEMICOLON", r";", 70),
keyword!("COMMA", r",", 70),
token!("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", 60, true),
token!("INTEGER_LITERAL", r"\d+", 60, true),
};

The use of macros is optional; they only offer conciseness. Here is an example of defining a simple arithmetic language without them:
use lexit::{Language, TokenDefinition, TokenBehavior, PairDefinition, PairDirection};
let definitions = vec![
TokenDefinition::new(
"WHITESPACE".to_string(),
r"\s+",
TokenBehavior::Ignore,
0,
false,
)
.unwrap(),
TokenDefinition::new("PLUS".to_string(), r"\+", TokenBehavior::None, 50, false).unwrap(),
TokenDefinition::new("MINUS".to_string(), r"-", TokenBehavior::None, 50, false).unwrap(),
TokenDefinition::new(
"MULTIPLY".to_string(),
r"\*",
TokenBehavior::None,
50,
false,
)
.unwrap(),
TokenDefinition::new("DIVIDE".to_string(), r"/", TokenBehavior::None, 50, false).unwrap(),
TokenDefinition::new(
"LEFT_PAREN".to_string(),
r"\(",
TokenBehavior::Pair(PairDefinition::new(
PairDirection::Open,
"RIGHT_PAREN".to_string(),
)),
60,
false,
)
.unwrap(),
TokenDefinition::new(
"RIGHT_PAREN".to_string(),
r"\)",
TokenBehavior::Pair(PairDefinition::new(
PairDirection::Close,
"LEFT_PAREN".to_string(),
)),
60,
false,
)
.unwrap(),
TokenDefinition::new(
"FLOAT_LITERAL".to_string(),
r"\d+\.\d+",
TokenBehavior::None,
70,
true,
)
.unwrap(),
TokenDefinition::new(
"INTEGER_LITERAL".to_string(),
r"\d+",
TokenBehavior::None,
65,
true,
)
.unwrap(),
];
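// Language::new validates the definitions (for example, that every
// open pair has a matching close pair) and returns a Result.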
let language = Language::new(definitions);

§Regular Expressions
To implement the DFA construction used for lexing, this library includes its own regular expression engine. It supports common regex patterns, including:
- Literals: 'a'
- Quantifiers: 'a{2, 3}', 'a*', 'a+'
- Range: '[a-c]'
- Not-Range: '[^a-c]'
- Groups: '(a)'
- Concatenation: 'ab'
- Alternation: 'a|b'
- StartAnchor: '^a' # The match must be at the start of a line
- EndAnchor: 'a$' # The match must be at the end of a line
- Any Character: '.'
- Escape Characters:
  - \b: word boundary
  - \d: ASCII digit
  - \s: ASCII whitespace
  - \w: ASCII word character
Note that while the regular expressions support the full Unicode character set, the escape characters \d, \s, and \w only work for ASCII.
Below are the matching characters for each of the three escape characters mentioned above:
- \d: [0-9]
- \s: [ \t\r\n\u{000C}]
- \w: [a-zA-Z0-9_]
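These features compose inside ordinary token definitions. As an illustrative sketch (the BINARY_LITERAL name, regex, and priority below are assumptions, not taken from the crate's examples), a binary literal can combine concatenation, a group, alternation, and the '+' quantifier:

use lexit::token;

// Matches "0b" followed by one or more binary digits, e.g. "0b1010".
// The name, regex, and priority here are illustrative assumptions.
let binary_literal = token!("BINARY_LITERAL", r"0b(0|1)+", 65, true);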
§Tokens
A lexer is created by calling Lexer::new(language). This will create the DFA for that language, and then you can lex with lexer.lex(text). The lex method returns a Result<Vec<Token>, String>. Tokens have 4 fields:
- name: String # The token's name
- text_match: Option<String> # The matching text, if storing is enabled
- row: usize # The row in the text where the token occurs
- col: usize # The column in the text where the token occurs
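As a sketch of inspecting these fields after a successful lex: get_name appears in the crate's example below, while get_row and get_col are assumed accessors modeled on it, so verify them against the actual API.

// `tokens` is the Vec<Token> returned by a successful lexer.lex call.
for token in &tokens {
    // get_row/get_col are assumed accessors mirroring get_name.
    println!("{} at {}:{}", token.get_name(), token.get_row(), token.get_col());
}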
§Lexing
Below is an example of lexing text:
use lexit::{Lexer, Language, TokenDefinition, TokenBehavior};

// `language` is the Result returned by Language::new(definitions) above.
let lexer = Lexer::new(language.unwrap()).unwrap();
let input_string = "(1 + 2) * 3";
match lexer.lex(input_string) {
    Ok(tokens) => {
        for token in tokens {
            println!("Token: {}", token.get_name());
        }
    }
    Err(e) => eprintln!("Lexing error: {}", e),
}
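Because the parentheses were defined as a pair, unbalanced delimiters surface as lexing errors rather than tokens. A minimal sketch reusing the lexer above:

let unbalanced = lexer.lex("(1 + 2"); // unclosed LEFT_PAREN
assert!(unbalanced.is_err());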
Modules§
- language
- This module provides the Language struct used to represent a lexable language.
- lex
- This module provides the core Lexer functionality, responsible for taking a source text and breaking it down into a stream of Tokens based on a defined Language. It handles token matching, priority resolution, line/column tracking, and paired delimiter validation.
- regex
- This module provides a custom regular expression engine used for lexing. It defines the structure of parsed regular expression patterns and includes the logic for parsing a regex string into this structured representation.
Macros§
- __lexor_create_token_definition
- Internal macro used by other token definition macros to create a TokenDefinition.
- close_pair
- Creates a TokenDefinition for a closing paired token (e.g., ) or }).
- define_language
- Defines a complete Language by providing a list of TokenDefinition results.
- ignore_token
- Creates a TokenDefinition for a token that should be ignored by the lexer.
- ignore_until
- Creates a TokenDefinition for a token that marks the start of an ignored block until a specific end regex is matched.
- keyword
- Creates a TokenDefinition for a keyword.
- open_pair
- Creates a TokenDefinition for an opening paired token (e.g., ( or {).
- token
- Creates a generic TokenDefinition.
Structs§
- Language
- Represents a defined programming language, containing a collection of token definitions and enforcing rules for valid language construction, especially concerning paired tokens.
- Lexer
- The main Lexer struct, responsible for tokenizing input text.
- PairDefinition
- Defines the properties of a paired token, such as parentheses or braces.
- Token
- Represents a single token identified by the lexer.
- TokenDefinition
- Defines a single token type for a programming language, including its name, regular expression pattern, behavior, priority, and whether its matched text should be stored.
Enums§
- PairDirection
- Specifies whether a paired token is an opening or closing delimiter.
- TokenBehavior
- Defines the specific behavior associated with a token after it has been matched by the lexer.