Lexxor is a fast, extensible, greedy, single-pass text tokenizer.
Sample output for the string “This is \n1.0 thing.”
use lexxor::token::Token;
Token{ token_type: 4, value: "This".to_string(), line: 1, column: 1, len: 4, precedence: 0};
Token{ token_type: 3, value: " ".to_string(), line: 1, column: 5, len: 1, precedence: 0};
Token{ token_type: 4, value: "is".to_string(), line: 1, column: 6, len: 2, precedence: 0};
Token{ token_type: 3, value: " \n".to_string(), line: 1, column: 8, len: 3, precedence: 0};
Token{ token_type: 2, value: "1.0".to_string(), line: 2, column: 1, len: 3, precedence: 0};
Token{ token_type: 3, value: " ".to_string(), line: 2, column: 4, len: 1, precedence: 0};
Token{ token_type: 4, value: "thing".to_string(), line: 2, column: 5, len: 5, precedence: 0};
Token{ token_type: 5, value: ".".to_string(), line: 2, column: 10, len: 1, precedence: 0};
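The line and column values in the sample follow a simple rule: columns are 1-based and a newline moves the position to column 1 of the next line. The sketch below illustrates that bookkeeping only; advance is a hypothetical helper written for this illustration and is not part of the lexxor API.

```rust
/// Hypothetical helper: given a token's 1-based start position and its text,
/// return the position at which the next token would start.
fn advance(line: usize, column: usize, value: &str) -> (usize, usize) {
    let mut pos = (line, column);
    for c in value.chars() {
        if c == '\n' {
            // A newline moves to the first column of the next line.
            pos = (pos.0 + 1, 1);
        } else {
            // Any other char moves one column to the right.
            pos.1 += 1;
        }
    }
    pos
}
```

For instance, a whitespace token that starts at line 1, column 8 and ends in a newline leaves the next token at line 2, column 1, as in the sample.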
Lexxor uses a LexxorInput to provide chars that are fed to Matcher instances until the longest match is found, if any. The match is returned as a Token instance. The Token includes a type and the matched string, as well as the line and column where the match was made. A custom LexxorInput can be passed to Lexxor, but the library comes with implementations for InputString and InputReader types.
Lexxor implements Iterator so it can be used with for loops.
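The greedy, longest-match loop described above can be pictured with a toy model. This is a minimal sketch under simplifying assumptions, not Lexxor's implementation: ToyMatcher, word_len, digit_len, space_len, and tokenize are hypothetical names, and each matcher is reduced to a plain function that reports how many leading chars it accepts.

```rust
/// Hypothetical stand-in for a matcher: returns how many leading chars it accepts.
type ToyMatcher = fn(&str) -> usize;

fn word_len(s: &str) -> usize {
    s.chars().take_while(|c| c.is_alphabetic()).count()
}

fn digit_len(s: &str) -> usize {
    s.chars().take_while(|c| c.is_ascii_digit()).count()
}

fn space_len(s: &str) -> usize {
    s.chars().take_while(|c| c.is_whitespace()).count()
}

/// Greedy single pass: at each position, keep the longest match any matcher offers.
/// Returns Err with the unmatched remainder when no matcher accepts the next char.
fn tokenize<'a>(input: &'a str, matchers: &[ToyMatcher]) -> Result<Vec<&'a str>, &'a str> {
    let mut rest = input;
    let mut out = Vec::new();
    while !rest.is_empty() {
        // Ask every matcher and keep the largest char count.
        let best = matchers.iter().map(|m| m(rest)).max().unwrap_or(0);
        if best == 0 {
            return Err(rest);
        }
        // Convert the char count into a byte offset so slicing stays on char boundaries.
        let end = rest.char_indices().nth(best).map_or(rest.len(), |(i, _)| i);
        out.push(&rest[..end]);
        rest = &rest[end..];
    }
    Ok(out)
}
```

With the three toy matchers, the input "ab 12" splits into "ab", " ", and "12", while "a!" fails at "!" because nothing in the list accepts a symbol.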
Custom Matchers can also be made, though Lexxor comes with:
- WordMatcher matches alphabetic characters such as ABCdef and word.
- IntegerMatcher matches integers such as 3 or 14537.
- FloatMatcher matches floats such as 434.312 or 0.001.
- ExactMatcher, given a vector of strings, matches exactly those strings. You can initialize it with a token type to return, so you can use several for different purposes. For example, one ExactMatcher can be used to find operators such as == and + while another could be used to find block identifiers such as ( and ).
- SymbolMatcher matches all non-alphanumerics such as *&)_#@ or . and is a good catch-all matcher.
- KeywordMatcher matches specific passed-in words such as new or specific. It differs from the ExactMatcher in that it will not match substrings, such as the new in renewable or newfangled.
- WhitespaceMatcher matches whitespace such as \t\r\n
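The difference between the ExactMatcher and the KeywordMatcher comes down to whether the matched text may be followed by more word characters. The sketch below illustrates that distinction with two hypothetical functions, exact_len and keyword_len; it assumes a keyword must not be followed by an alphanumeric character and does not reproduce either matcher's real code.

```rust
/// Prefix match in the spirit of an exact matcher:
/// byte length of the longest listed string that `s` starts with, or 0.
fn exact_len(s: &str, words: &[&str]) -> usize {
    words
        .iter()
        .filter(|w| s.starts_with(**w))
        .map(|w| w.len())
        .max()
        .unwrap_or(0)
}

/// Whole-word match in the spirit of a keyword matcher (assumed rule):
/// the prefix must not be followed by another alphanumeric character.
fn keyword_len(s: &str, words: &[&str]) -> usize {
    words
        .iter()
        .filter(|w| s.starts_with(**w))
        .filter(|w| !s[w.len()..].chars().next().map_or(false, |c| c.is_alphanumeric()))
        .map(|w| w.len())
        .max()
        .unwrap_or(0)
}
```

Given the word list ["new"], exact_len accepts the first three bytes of "newfangled", whereas keyword_len rejects it and only accepts inputs such as "new thing".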
Matchers can be given a precedence that can make a matcher return its results even if another matcher has a longer match. This matters, for example, when both the WordMatcher and KeywordMatcher are used at the same time.
Note that matchers cannot find matches that start inside the valid matches of other matchers. For matching renewable, the WordMatcher will make the match even if the ExactMatcher is looking for new with a higher precedence, because the WordMatcher will consume all of renewable without giving other matchers the chance to look inside of it.
Also, while the ExactMatcher could find the new inside newfangled, the WordMatcher would match newfangled instead since it is longer, unless the ExactMatcher is given a higher precedence, in which case it would get to return new and the next match would start at fangled.
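The newfangled case can be modelled as a choice between candidate matches that all start at the same position. The following is a minimal sketch of that selection rule as the text describes it (higher precedence first, then the longer match); Candidate and pick are hypothetical names, and how Lexxor resolves exact ties is not assumed here.

```rust
/// A candidate match reported by one hypothetical matcher.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Candidate {
    name: &'static str,
    len: usize,
    precedence: u8,
}

/// Pick the winner among candidates starting at the same position:
/// higher precedence wins; among equal precedence, the longer match wins.
fn pick(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .copied()
        .filter(|c| c.len > 0) // a zero-length candidate is not a match
        .max_by_key(|c| (c.precedence, c.len))
}
```

With equal precedence, a word candidate of length 10 (newfangled) beats an exact candidate of length 3 (new). Raising the exact candidate's precedence to 1 makes it win, so the next match would begin at fangled.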
To successfully parse an entire stream, Lexxor must have a matcher with which to tokenize every encountered collection of characters. If a match fails, Lexxor will return Err(LexxError::TokenNotFound) with the text that could not be matched.
§Panics
For speed, Lexxor does not dynamically allocate buffer space. In Lexxor<CAP>, CAP is the maximum possible token size; if that size is exceeded, Lexxor will panic.
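To see why a compile-time capacity leads to a panic rather than a reallocation, consider a buffer whose storage is an array sized by a const generic. This is a hedged sketch only: FixedBuf is a hypothetical type invented for illustration and is not the crate's RollingCharBuffer.

```rust
/// Hypothetical fixed-capacity char buffer; CAP is chosen at compile time.
struct FixedBuf<const CAP: usize> {
    chars: [char; CAP],
    len: usize,
}

impl<const CAP: usize> FixedBuf<CAP> {
    fn new() -> Self {
        Self { chars: ['\0'; CAP], len: 0 }
    }

    /// Panics when the buffer is already full, because the storage never grows.
    fn push(&mut self, c: char) {
        assert!(self.len < CAP, "token exceeds capacity of {} chars", CAP);
        self.chars[self.len] = c;
        self.len += 1;
    }

    /// Copies the stored chars out as a String.
    fn as_string(&self) -> String {
        self.chars[..self.len].iter().collect()
    }
}
```

Choosing CAP comfortably larger than the longest token expected in the input, as the example's Lexxor::<512> does, avoids the overflow path entirely.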
§Example
use lexxor::{matcher, Lexxor, Lexxer};
use lexxor::input::InputString;
use lexxor::matcher::exact::ExactMatcher;
use lexxor::matcher::symbol::SymbolMatcher;
use lexxor::matcher::whitespace::WhitespaceMatcher;
use lexxor::matcher::word::WordMatcher;
use lexxor::token::{TOKEN_TYPE_EXACT, TOKEN_TYPE_WORD, TOKEN_TYPE_WHITESPACE, TOKEN_TYPE_SYMBOL};
let lexxor_input = InputString::new(String::from("The quick\n\nbrown fox."));
let mut lexxor: Box<dyn Lexxer> = Box::new(Lexxor::<512>::new(
    Box::new(lexxor_input),
    vec![
        Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
        Box::new(WhitespaceMatcher { index: 0, column: 0, line: 0, precedence: 0, running: true }),
        Box::new(SymbolMatcher { index: 0, precedence: 0, running: true }),
        // With a precedence of 1 this will match "quick" instead of the word matcher.
        // We can change the TOKEN_TYPE value returned if we want to have more than one
        // ExactMatcher that returns a different token type.
        Box::new(ExactMatcher::build_exact_matcher(vec!["quick"], TOKEN_TYPE_EXACT, 1)),
    ]
));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "The" && t.token_type == TOKEN_TYPE_WORD && t.line == 1 && t.column == 1));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.token_type == TOKEN_TYPE_WHITESPACE));
// Because the ExactMatcher is looking for `quick` with a precedence higher than
// that of the WordMatcher it will return a match for `quick`.
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "quick" && t.token_type == TOKEN_TYPE_EXACT && t.line == 1 && t.column == 5));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.token_type == TOKEN_TYPE_WHITESPACE));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "brown" && t.token_type == TOKEN_TYPE_WORD && t.line == 3 && t.column == 1));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.token_type == TOKEN_TYPE_WHITESPACE));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "fox" && t.token_type == TOKEN_TYPE_WORD && t.line == 3 && t.column == 7));
assert!(matches!(lexxor.next_token(), Ok(Some(t)) if t.value == "." && t.token_type == TOKEN_TYPE_SYMBOL && t.line == 3 && t.column == 10));
assert!(matches!(lexxor.next_token(), Ok(None)));
lexxor.set_input(Box::new(InputString::new(String::from("Hello world!"))));
for token in lexxor {
    println!("{}", token.value);
}
Modules§
- input
- The LexxorInput for lexxor
- matcher
- The Matcher trait for lexxor. The matcher module provides a set of token matchers for the Lexxor lexer.
- rolling_char_buffer
- RollingCharBuffer is a fast, fixed-size char buffer that can be used as a LIFO or FIFO stack.
- token
- The results of a match
Structs§
- Lexxor
- The lexer itself. Implements Lexxer so you can use Box<dyn Lexxer> and don’t have to define the CAP in var declarations.
Enums§
- LexxError
- Errors Lexxor can return