pub struct Tokenizer<'a> { /* private fields */ }
The Crossandra tokenizer, operating on literals and patterns.
§Literals
Literals are values that must be matched exactly by the tokenizer. They are represented as a slice of (name, value) pairs. For example, a literal map for Brainfuck could be defined like this:
let literals = [
    ("add", "+"),
    ("sub", "-"),
    ("left", "<"),
    ("right", ">"),
    ("read", ","),
    ("write", "."),
    ("begin_loop", "["),
    ("end_loop", "]"),
];
Literals take precedence over patterns.
§Patterns
Patterns are regular expressions that match more complex token structures. They are represented
as pairs of strings (name, pattern) in a Vec to maintain a consistent matching order.
The order of patterns matters as the tokenizer will use the first matching pattern it finds.
Duplicate pattern names are not allowed and will result in an error. This crate also provides a
collection of commonly used patterns in the common module. For example, patterns covering
binary, octal, and hexadecimal literals could be defined like this:
let patterns = vec![
    ("binary".into(), r"0[bB][01]+".into()),
    ("octal".into(), r"0[Oo][0-7]+".into()),
    ("hexadecimal".into(), r"(?i)0x[0-9a-f]+".into()),
];
§Other options
§ignore_whitespace
Whether to ignore the following whitespace characters:
| Code | Character |
|---|---|
| 0x9 | Tab (\t) |
| 0xa | Line feed (\n) |
| 0xb | Vertical tab |
| 0xc | Form feed |
| 0xd | Carriage return (\r) |
| 0x20 | Space ( ) |
Defaults to false.
§ignored_characters
A set of characters to ignore during tokenization. Defaults to an empty set.
§Fast Mode
When all literals are of length 1 and there are no patterns, Crossandra uses a simpler, faster tokenization method.
For instance, tokenizing a 1MB random Brainfuck file with 10% of the file being comments is ~300x faster with Fast Mode (110ms vs 32.5s on an Apple M2).
Do note that this is a rather extreme case; for a 1KB file, the speedup is ~2.3x.
§Implementations
impl<'a> Tokenizer<'a>
pub fn new(
    literals: &[(&'a str, &'a str)],
    patterns: Vec<(String, String)>,
    ignored_characters: FxHashSet<char>,
    ignore_whitespace: bool,
) -> Result<Self, Error>
Creates a new Tokenizer, returning an Error if the configuration is invalid (for example, if pattern names are duplicated).
pub fn tokenize(
    &'a self,
    source: &'a str,
) -> Box<dyn Iterator<Item = Result<Token, Error>> + 'a>
Tokenizes the given source, yielding one Result per token.
pub fn tokenize_lines(
    &'a self,
    source: &'a str,
) -> impl ParallelIterator<Item = Result<Vec<Token>, Error>> + 'a
Tokenizes the source line by line in parallel, yielding one Result per line.
pub fn with_literals(
    self,
    literals: &[(&'a str, &'a str)],
) -> Result<Self, Error>
Sets the literals of this Tokenizer and returns itself, or an Error if the new literals are invalid.
pub fn with_ignored_characters(
    self,
    ignored_characters: FxHashSet<char>,
) -> Self
Sets the ignored characters of this Tokenizer and
returns itself.
pub fn with_ignore_whitespace(self, ignore_whitespace: bool) -> Self
Sets the ignore_whitespace option of this Tokenizer and
returns itself.
pub fn set_ignored_characters(&mut self, ignored_characters: FxHashSet<char>)
Sets the ignored characters of this Tokenizer.
pub fn set_ignore_whitespace(&mut self, ignore_whitespace: bool)
Sets the ignore_whitespace option of this Tokenizer.
§Trait Implementations
impl Eq for Tokenizer<'_>
§Auto Trait Implementations
impl<'a> Freeze for Tokenizer<'a>
impl<'a> RefUnwindSafe for Tokenizer<'a>
impl<'a> Send for Tokenizer<'a>
impl<'a> Sync for Tokenizer<'a>
impl<'a> Unpin for Tokenizer<'a>
impl<'a> UnwindSafe for Tokenizer<'a>
§Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where
    F: FnOnce(&Self) -> bool,
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.