Crate lexxor

Lexxor is a fast, extensible, greedy, single-pass text tokenizer.

Sample output for the string “This is  \n1.0 thing.” (note the two spaces before the newline):

use lexxor::token::Token;
Token{ token_type: 4, value: "This".to_string(), line: 1, column: 1, len: 4, precedence: 0};
Token{ token_type: 3, value: " ".to_string(), line: 1, column: 5, len: 1, precedence: 0};
Token{ token_type: 4, value: "is".to_string(), line: 1, column: 6, len: 2, precedence: 0};
Token{ token_type: 3, value: "  \n".to_string(), line: 1, column: 8, len: 3, precedence: 0};
Token{ token_type: 2, value: "1.0".to_string(), line: 2, column: 1, len: 3, precedence: 0};
Token{ token_type: 3, value: " ".to_string(), line: 2, column: 4, len: 1, precedence: 0};
Token{ token_type: 4, value: "thing".to_string(), line: 2, column: 5, len: 5, precedence: 0};
Token{ token_type: 5, value: ".".to_string(), line: 2, column: 10, len: 1, precedence: 0};

Lexxor uses a LexxorInput to provide chars that are fed to Matcher instances until the longest match is found, if any. The match will be returned as a Token instance. The Token includes a type and the string matched as well as the line and column where the match was made. A custom LexxorInput can be passed to Lexxor but the library comes with implementations for InputString and InputReader types.
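The loop described above can be modeled in a few lines of standalone Rust. This is only a sketch under stated assumptions, not the library's implementation: `SketchToken`, `MatchFn`, `word_len`, `space_len`, and `tokenize` are hypothetical names, matchers are reduced to plain functions over a char slice, and the token type numbers in the usage note are taken from the sample output above.

```rust
/// Hypothetical token shape, loosely modeled on the fields in the sample output.
#[derive(Debug, PartialEq)]
struct SketchToken {
    token_type: u16,
    value: String,
    line: usize,
    column: usize,
}

/// In this sketch a matcher is a function returning how many leading chars
/// of the input it accepts (0 means no match).
type MatchFn = fn(&[char]) -> usize;

fn word_len(input: &[char]) -> usize {
    input.iter().take_while(|c| c.is_alphabetic()).count()
}

fn space_len(input: &[char]) -> usize {
    input.iter().take_while(|c| c.is_whitespace()).count()
}

/// Repeatedly take the longest match at the current position.
/// Returns an error string when no matcher accepts the next char.
fn tokenize(text: &str, matchers: &[(u16, MatchFn)]) -> Result<Vec<SketchToken>, String> {
    let chars: Vec<char> = text.chars().collect();
    let (mut pos, mut line, mut column) = (0usize, 1usize, 1usize);
    let mut tokens = Vec::new();
    while pos < chars.len() {
        // Keep the longest match; the earliest matcher wins a tie.
        let mut best: Option<(u16, usize)> = None;
        for (token_type, matcher) in matchers {
            let len = matcher(&chars[pos..]);
            if len > 0 && best.map_or(true, |(_, best_len)| len > best_len) {
                best = Some((*token_type, len));
            }
        }
        let (token_type, len) =
            best.ok_or_else(|| format!("no match at {}:{}", line, column))?;
        let matched = &chars[pos..pos + len];
        tokens.push(SketchToken {
            token_type,
            value: matched.iter().collect(),
            line,
            column,
        });
        // Advance the position bookkeeping past the matched text.
        for c in matched {
            if *c == '\n' {
                line += 1;
                column = 1;
            } else {
                column += 1;
            }
        }
        pos += len;
    }
    Ok(tokens)
}
```

Calling `tokenize("The quick\n\nbrown", &[(4, word_len as MatchFn), (3, space_len as MatchFn)])` yields five tokens in this model, with `brown` reported at line 3, column 1.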

Lexxor implements Iterator so it can be used with for loops.

Custom matchers can also be written, though Lexxor comes with:

  • WordMatcher matches alphabetic characters such as ABCdef and word
  • IntegerMatcher matches integers such as 3 or 14537
  • FloatMatcher matches floats such as 434.312 or 0.001
  • ExactMatcher, given a vector of strings, matches exactly those strings. You can initialize it with a token type to return, so several ExactMatchers can be used for different purposes. For example, one ExactMatcher can find operators such as == and + while another finds block delimiters such as ( and ).
  • SymbolMatcher matches all non-alphanumerics, such as *&)_#@ or a single period. This is a good catch-all matcher.
  • KeywordMatcher matches specific passed-in words such as new or specific. It differs from the ExactMatcher in that it will not match substrings, such as the new in renewable or newfangled.
  • WhitespaceMatcher matches whitespace such as spaces, \t, \r, or \n
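The difference between the ExactMatcher and the KeywordMatcher can be illustrated with two small standalone functions. This is a sketch, not the library's code: `exact_len` and `keyword_len` are hypothetical names, and the rule that an alphanumeric character continues a word is an assumption. The sketch only checks the character after the match, since a tokenizer always tries matchers at the start of the remaining input.

```rust
/// Exact-style matching: accept `needle` whenever the input starts with it.
/// Returns the number of chars matched, or 0 for no match.
fn exact_len(input: &str, needle: &str) -> usize {
    if input.starts_with(needle) {
        needle.chars().count()
    } else {
        0
    }
}

/// Keyword-style matching: accept `needle` only when the next char does not
/// continue a word (assumed here to mean "is not alphanumeric").
fn keyword_len(input: &str, needle: &str) -> usize {
    match input.strip_prefix(needle) {
        Some(rest) => {
            let continues_word = rest.chars().next().map_or(false, |c| c.is_alphanumeric());
            if continues_word {
                0
            } else {
                needle.chars().count()
            }
        }
        None => 0,
    }
}
```

In this model `exact_len("newfangled", "new")` reports a match of three chars, while `keyword_len("newfangled", "new")` reports none.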

Matchers can be given a precedence, which lets a matcher return its result even if another matcher has a longer match. This matters, for example, when the WordMatcher and KeywordMatcher are used at the same time and both can match the same text.

Note that matchers cannot find matches that start inside the valid matches of other matchers. When matching renewable, the WordMatcher wins even if an ExactMatcher is looking for new with a higher precedence, because the WordMatcher consumes all of renewable and no other matcher gets a chance to look inside it.

Also, while the ExactMatcher could find the new inside newfangled, the WordMatcher would match newfangled instead since it is longer, unless the ExactMatcher is given a higher precedence in which case it would get to return new and the next match would start at fangled.
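The selection rule described in the last three paragraphs can be sketched as a comparison on (precedence, length). `Candidate` and `pick` are hypothetical names used only for illustration, and the exact ordering (precedence first, then length) is an assumption drawn from the prose rather than from the library source.

```rust
/// Hypothetical result reported by one matcher at the current position.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Candidate {
    name: &'static str,
    len: usize,
    precedence: u8,
}

/// Choose the winning candidate: higher precedence first, then longer match.
/// Candidates with a length of 0 did not match and are ignored.
fn pick(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .copied()
        .filter(|c| c.len > 0)
        .max_by_key(|c| (c.precedence, c.len))
}
```

For newfangled, a word candidate of length 10 beats an exact candidate of length 3 when both have precedence 0. Raising the exact candidate's precedence to 1 makes it win, and the next match would then start at fangled.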

To successfully parse an entire stream, Lexxor must have a matcher that can tokenize every sequence of characters it encounters. If a match fails, Lexxor will return Err(LexxError::TokenNotFound) with the text that could not be matched.

§Panics

For speed, Lexxor does not dynamically allocate buffer space. In Lexxor<CAP>, CAP is the maximum possible token size; if a token exceeds that size, Lexxor panics.
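The trade-off can be seen in a minimal const-generic buffer. This is a sketch of the general technique, assuming nothing about lexxor's internals: `FixedBuf` is a hypothetical type and is not the crate's RollingCharBuffer.

```rust
/// Hypothetical fixed-capacity char buffer. Storage lives inline, so no
/// heap allocation happens while a token is being collected.
struct FixedBuf<const CAP: usize> {
    data: [char; CAP],
    len: usize,
}

impl<const CAP: usize> FixedBuf<CAP> {
    fn new() -> Self {
        FixedBuf { data: ['\0'; CAP], len: 0 }
    }

    /// Append one char. Panics when the buffer is already full, which is
    /// the price of never growing the storage.
    fn push(&mut self, c: char) {
        assert!(self.len < CAP, "token exceeds capacity of {} chars", CAP);
        self.data[self.len] = c;
        self.len += 1;
    }

    /// Copy the collected chars into an owned String.
    fn as_string(&self) -> String {
        self.data[..self.len].iter().collect()
    }
}
```

A `FixedBuf::<4>` holds the word "word" exactly; pushing a fifth char would panic. The same reasoning applies when choosing CAP: pick a value at least as large as the longest token the input can contain.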

§Example

use lexxor::{matcher, Lexxor, Lexxer};
use lexxor::input::InputString;
use lexxor::matcher::exact::ExactMatcher;
use lexxor::matcher::symbol::SymbolMatcher;
use lexxor::matcher::whitespace::WhitespaceMatcher;
use lexxor::matcher::word::WordMatcher;
use lexxor::token::{TOKEN_TYPE_EXACT, TOKEN_TYPE_WORD, TOKEN_TYPE_WHITESPACE, TOKEN_TYPE_SYMBOL};

let lexxor_input = InputString::new(String::from("The quick\n\nbrown fox."));

let mut lexxor: Box<dyn Lexxer> = Box::new(Lexxor::<512>::new(
  Box::new(lexxor_input),
  vec![
    Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
    Box::new(WhitespaceMatcher { index: 0, column: 0, line: 0, precedence: 0, running: true }),
    Box::new(SymbolMatcher { index: 0, precedence: 0, running: true }),
    // with a precedence of 1 this will match "quick" instead of the word matcher
    // We can change the TOKEN_TYPE value returned if we want to have more than one
    // ExactMatcher that return different token types.
    Box::new(ExactMatcher::build_exact_matcher(vec!["quick"], TOKEN_TYPE_EXACT, 1)),
  ]
));

assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "The" && t.token_type == TOKEN_TYPE_WORD && t.line == 1 && t.column == 1));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.token_type == TOKEN_TYPE_WHITESPACE));
// Because the ExactMatcher is looking for `quick` with a precedence higher than
// that of the WordMatcher it will return a match for `quick`.
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "quick" && t.token_type == TOKEN_TYPE_EXACT && t.line == 1 && t.column == 5));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.token_type == TOKEN_TYPE_WHITESPACE));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "brown" && t.token_type == TOKEN_TYPE_WORD && t.line == 3 && t.column == 1));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.token_type == TOKEN_TYPE_WHITESPACE));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "fox" && t.token_type == TOKEN_TYPE_WORD && t.line == 3 && t.column == 7));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "." && t.token_type == TOKEN_TYPE_SYMBOL && t.line == 3 && t.column == 10));
assert!(matches!(lexxor.next_token(), Ok(None)));

lexxor.set_input(Box::new(InputString::new(String::from("Hello world!"))));
for token in lexxor {
    println!("{}", token.value);
}

Modules§

input
The LexxorInput for lexxor
matcher
The Matcher trait for lexxor. The matcher module provides a set of token matchers for the Lexxor lexer.
rolling_char_buffer
RollingCharBuffer is a fast, fixed-size char buffer that can be used as a LIFO stack or a FIFO queue.
token
The results of a match

Structs§

Lexxor
The lexer itself. Implements Lexxer so you can use Box<dyn Lexxer> and don’t have to define the CAP in var declarations.

Enums§

LexxError
Errors Lexxor can return

Traits§

Lexxer
A trait for Lexxor, so you can use Box<dyn Lexxer> and don’t have to define the CAP in var declarations.