Crate lexit

§lexit

A configurable and robust lexical analyzer (lexer) library for Rust.

This crate provides a powerful macro-based approach to defining programming language lexers with support for keywords, operators, identifiers, literals, whitespace, and comments, including intricate features like paired delimiters and multi-line comments.

§Getting Started

Add lexit to your project by running cargo add lexit.

§Defining a language

Language creation is streamlined with the define_language! macro and the smaller token creation macros, including:

token! (token_name, regular expression match, priority, whether to store and return the text that matches the token)
keyword! (token_name, regular expression, priority) # Keywords match fixed text, so the matched text is not stored.
ignore_token! (token_name, regex, priority) # This is for matches you want to catch and then discard. These are not returned as tokens and are usually used to skip whitespace.
open_pair! (token_name, regex, counterpart_name, priority) # Defines the opening half of a token pair that must occur together, like ‘(’ and ‘)’. Defining the language returns an error if an open pair has no closing counterpart defined, and the lexer returns an error if the input leaves an open pair unclosed.
close_pair! (token_name, regex, counterpart_name, priority) # The counterpart to open_pair!. Likewise, defining the language returns an error if there is no matching open pair defined, and the lexer returns an error if a closing token appears without a matching opening token.
ignore_until! (token_name, regex, ending_regex, priority) # Generally used for comments. Once regex matches, the lexer ignores all characters until ending_regex matches, then resumes normal lexing.
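
The pair rules described for open_pair! and close_pair! can be pictured as a stack check. The sketch below is an illustration only, not lexit's implementation: the names Direction, PairToken, and check_pairs are hypothetical, and it assumes pairs must nest strictly, which the documentation above does not state.

```rust
/// Direction of a paired token (hypothetical type, not lexit's PairDirection).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    Open,
    Close,
}

/// A paired token seen in the input: its name, its counterpart's name, and its direction.
#[derive(Debug, Clone, Copy)]
struct PairToken<'a> {
    name: &'a str,
    counterpart: &'a str,
    direction: Direction,
}

/// Checks that every opening token is closed by its counterpart.
/// Assumption: pairs must nest strictly, so a plain stack is enough.
fn check_pairs(tokens: &[PairToken]) -> Result<(), String> {
    let mut open_stack: Vec<&PairToken> = Vec::new();
    for tok in tokens {
        match tok.direction {
            Direction::Open => open_stack.push(tok),
            Direction::Close => match open_stack.pop() {
                // The most recent opening token expects exactly this closing token.
                Some(open) if open.counterpart == tok.name => {}
                Some(open) => {
                    return Err(format!("expected {} but found {}", open.counterpart, tok.name));
                }
                // A closing token with nothing open before it.
                None => return Err(format!("{} has no matching {}", tok.name, tok.counterpart)),
            },
        }
    }
    // Anything left on the stack was opened but never closed.
    match open_stack.last() {
        Some(open) => Err(format!("{} was never closed", open.name)),
        None => Ok(()),
    }
}
```

With a LEFT_PAREN/RIGHT_PAREN pair, the sequence open, close passes, while a lone open or a lone close produces an error, mirroring the two lexer failure cases described above.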

Note: Priority is only used when two tokens could match with the same length. The lexer uses maximal munch, so a longer length match will always have higher priority.
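
The selection rule in this note can be sketched as a comparison on (length, priority). This is a minimal illustration rather than lexit's code: Candidate and pick_winner are hypothetical names, and the sketch assumes that a larger priority number wins a tie, which matches the values used in the examples in this documentation but is not stated explicitly.

```rust
/// One token definition that matched at the current position (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: &'static str,
    len: usize,    // number of characters matched
    priority: u32, // assumption: a larger number means higher priority
}

/// Maximal munch: the longest match wins; priority is consulted only on equal length.
fn pick_winner(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates.iter().max_by_key(|c| (c.len, c.priority))
}
```

For the input "==", a two-character EQUALS_COMP match beats a one-character ASSIGN match regardless of priority. For the input "if", a keyword and an identifier both match two characters, so priority decides.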

Below is an example of defining a small subset of the C language:

use lexit::{define_language, token, keyword, ignore_token, open_pair, close_pair, ignore_until};

let language_result = define_language! {
    ignore_token!("WHITESPACE", r"\s+", 10),

    ignore_until!("SINGLE_LINE_COMMENT", r"//", r"\n", 5),
    ignore_until!("MULTI_LINE_COMMENT", r"/\*", r"\*/", 5),

    keyword!("INT_KEYWORD", r"\bint\b", 100),
    keyword!("IF_KEYWORD", r"\bif\b", 100),
    keyword!("ELSE_KEYWORD", r"\belse\b", 100),
    keyword!("WHILE_KEYWORD", r"\bwhile\b", 100),
    keyword!("RETURN_KEYWORD", r"\breturn\b", 100),
    keyword!("VOID_KEYWORD", r"\bvoid\b", 100),

    keyword!("EQUALS_COMP", r"==", 95),
    keyword!("LESS_THAN_COMP", r"<", 90),
    keyword!("ASSIGN", r"=", 90),
    keyword!("PLUS", r"\+", 90),
    keyword!("MINUS", r"-", 90),
    keyword!("MULTIPLY", r"\*", 90),
    keyword!("DIVIDE", r"/", 90),

    open_pair!("LEFT_PAREN", r"\(", "RIGHT_PAREN", 80),
    close_pair!("RIGHT_PAREN", r"\)", "LEFT_PAREN", 80),
    open_pair!("LEFT_BRACE", r"\{", "RIGHT_BRACE", 80),
    close_pair!("RIGHT_BRACE", r"\}", "LEFT_BRACE", 80),

    keyword!("SEMICOLON", r";", 70),
    keyword!("COMMA", r",", 70),

    token!("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", 60, true),

    token!("INTEGER_LITERAL", r"\d+", 60, true),
};
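
The effect of the two ignore_until! definitions above can be approximated with a skip loop. The following is a rough sketch under stated assumptions, not lexit's behavior: skip_ignored is a hypothetical helper, plain substring search stands in for regex matching, the ending match is assumed to be consumed along with the ignored text, and a start marker with no ending is assumed to swallow the rest of the input.

```rust
/// Removes every region that begins with `start` and runs through the next `end`.
/// Plain substring search stands in for the regex matching a real lexer would do.
fn skip_ignored(input: &str, start: &str, end: &str) -> String {
    // An empty start marker would match everywhere; treat it as "nothing to ignore".
    if start.is_empty() {
        return input.to_string();
    }
    let mut kept = String::new();
    let mut rest = input;
    while let Some(begin) = rest.find(start) {
        // Keep everything before the start marker.
        kept.push_str(&rest[..begin]);
        let after_start = &rest[begin + start.len()..];
        match after_start.find(end) {
            // Resume right after the end marker (assumption: the end match is consumed).
            Some(stop) => rest = &after_start[stop + end.len()..],
            // No end marker: this sketch drops the remainder of the input.
            None => rest = "",
        }
    }
    kept.push_str(rest);
    kept
}
```

Calling skip_ignored("a /* x */ b", "/*", "*/") keeps only the text outside the comment, which is the portion a lexer would go on to tokenize.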

The use of macros is optional and only offers conciseness. Here is an example of defining a simple arithmetic language without macros:

use lexit::{Language, TokenDefinition, TokenBehavior, PairDefinition, PairDirection};

let definitions = vec![
    TokenDefinition::new(
        "WHITESPACE".to_string(),
        r"\s+",
        TokenBehavior::Ignore,
        0,
        false,
    )
    .unwrap(),
    TokenDefinition::new("PLUS".to_string(), r"\+", TokenBehavior::None, 50, false).unwrap(),
    TokenDefinition::new("MINUS".to_string(), r"-", TokenBehavior::None, 50, false).unwrap(),
    TokenDefinition::new(
        "MULTIPLY".to_string(),
        r"\*",
        TokenBehavior::None,
        50,
        false,
    )
    .unwrap(),
    TokenDefinition::new("DIVIDE".to_string(), r"/", TokenBehavior::None, 50, false).unwrap(),
    TokenDefinition::new(
        "LEFT_PAREN".to_string(),
        r"\(",
        TokenBehavior::Pair(PairDefinition::new(
            PairDirection::Open,
            "RIGHT_PAREN".to_string(),
        )),
        60,
        false,
    )
    .unwrap(),
    TokenDefinition::new(
        "RIGHT_PAREN".to_string(),
        r"\)",
        TokenBehavior::Pair(PairDefinition::new(
            PairDirection::Close,
            "LEFT_PAREN".to_string(),
        )),
        60,
        false,
    )
    .unwrap(),
    TokenDefinition::new(
        "FLOAT_LITERAL".to_string(),
        r"\d+\.\d+",
        TokenBehavior::None,
        70,
        true,
    )
    .unwrap(),
    TokenDefinition::new(
        "INTEGER_LITERAL".to_string(),
        r"\d+",
        TokenBehavior::None,
        65,
        true,
    )
    .unwrap(),
];

let language = Language::new(definitions);

§Regular Expressions

In order to implement the DFA construction for lexing, this library includes its own regular expression logic. It supports common regex patterns including:

Literals: 'a'
Quantifiers: 'a{2, 3}', 'a*', 'a+'
Range: '[a-c]'
Not-Range: '[^a-c]'
Groups: '(a)'
Concatenation: 'ab'
Alternation: 'a|b'
StartAnchor: '^a' # Start anchor means it must be at the start of a line
EndAnchor: 'a$' # End anchor means it must be at the end of a line
Any Character: '.'
Escape Characters:
- \b: word boundary
- \d: ASCII digit
- \s: ASCII whitespace
- \w: ASCII word character

Note that while the regular expressions support the full Unicode character set, the escape characters \d, \s, and \w only work for ASCII.

Below are the matching characters for each of the three escape characters mentioned above:

\d: [0-9]
\s: [ \t\r\n\u{000C}]
\w: [a-zA-Z0-9_]
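
To make the ASCII-only behavior concrete, the three classes listed above can be written as plain Rust predicates. This is a small sketch for illustration: the function names is_digit, is_space, and is_word are hypothetical and are not part of lexit.

```rust
/// \d: [0-9]
fn is_digit(c: char) -> bool {
    c.is_ascii_digit()
}

/// \s: [ \t\r\n\u{000C}]
fn is_space(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\r' | '\n' | '\u{000C}')
}

/// \w: [a-zA-Z0-9_]
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}
```

Under these definitions, non-ASCII characters such as 'é', the Arabic-Indic digit '٣', or the no-break space U+00A0 fall outside all three classes, even though they can still be matched by literals or ranges.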

§Tokens

A lexer is created by calling Lexer::new(language). This will create the DFA for that language, and then you can lex with lexer.lex(text). The lex method returns a Result<Vec<Token>, String>. Tokens have 4 fields:

name: String # The token name
text_match: Option<String> # The matching text if store is enabled
row: usize # The row in the text of the token
col: usize # The column in the text of the token
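
As a rough picture of how these four fields relate to the input, the sketch below fills a token-shaped struct from a byte range in the source text. It is not lexit's Token type: TokenSketch, position_of, and make_token are hypothetical names, and counting rows and columns from 1 is an assumption, since the documentation above does not say whether positions are 0-based or 1-based.

```rust
/// A stand-in with the same four fields as described above (hypothetical type).
#[derive(Debug, PartialEq)]
struct TokenSketch {
    name: String,
    text_match: Option<String>,
    row: usize,
    col: usize,
}

/// Returns the (row, col) of the character at `byte_offset`.
/// Assumption: both row and column count from 1.
fn position_of(text: &str, byte_offset: usize) -> (usize, usize) {
    let mut row = 1;
    let mut col = 1;
    for (i, ch) in text.char_indices() {
        if i >= byte_offset {
            break;
        }
        if ch == '\n' {
            // A newline starts the next row and resets the column.
            row += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (row, col)
}

/// Builds a token for `text[start..end]`, keeping the matched text only when `store` is true.
fn make_token(text: &str, name: &str, start: usize, end: usize, store: bool) -> TokenSketch {
    let (row, col) = position_of(text, start);
    TokenSketch {
        name: name.to_string(),
        text_match: if store { Some(text[start..end].to_string()) } else { None },
        row,
        col,
    }
}
```

A token defined with storing enabled carries Some(text) in text_match, while keywords and other non-storing tokens carry None.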

§Lexing

Below is an example of lexing text:

use lexit::Lexer;

// `language` is the value returned by Language::new(definitions) in the example above.
let lexer = Lexer::new(language.unwrap()).unwrap();
let input_string = "(1 + 2) * 3";

match lexer.lex(input_string) {
    Ok(tokens) => {
        for token in tokens {
            println!("Token: {}", token.get_name());
        }
    }
    Err(e) => eprintln!("Lexing error: {}", e),
}

Modules§

language: This module provides the Language struct used to represent a lexable language
lex: This module provides the core Lexer functionality, responsible for taking a source text and breaking it down into a stream of Tokens based on a defined Language. It handles token matching, priority resolution, line/column tracking, and paired delimiter validation.
regex: This module provides a custom regular expression engine used for lexing. It defines the structure of parsed regular expression patterns and includes the logic for parsing a regex string into this structured representation.

Macros§

__lexor_create_token_definition: Internal macro used by other token definition macros to create a TokenDefinition.
close_pair: Creates a TokenDefinition for a closing paired token (e.g., ) or }).
define_language: Defines a complete Language by providing a list of TokenDefinition results.
ignore_token: Creates a TokenDefinition for a token that should be ignored by the lexer.
ignore_until: Creates a TokenDefinition for a token that marks the start of an ignored block until a specific end regex is matched.
keyword: Creates a TokenDefinition for a keyword.
open_pair: Creates a TokenDefinition for an opening paired token (e.g., ( or {).
token: Creates a generic TokenDefinition.

Structs§

Language: Represents a defined programming language, containing a collection of token definitions and enforcing rules for valid language construction, especially concerning paired tokens.
Lexer: The main Lexer struct, responsible for tokenizing input text.
PairDefinition: Defines the properties of a paired token, such as parentheses or braces.
Token: Represents a single token identified by the lexer.
TokenDefinition: Defines a single token type for a programming language, including its name, regular expression pattern, behavior, priority, and whether its matched text should be stored.

Enums§

PairDirection: Specifies whether a paired token is an opening or closing delimiter.
TokenBehavior: Defines the specific behavior associated with a token after it has been matched by the lexer.
