pub struct StrTokenizer<'t> {
    pub keep_whitespace: bool,
    pub keep_newline: bool,
    pub keep_comment: bool,
    pub line_positions: Vec<usize>,
    pub specialeof: &'static str,
    pub tab_spaces: usize,
    pub allow_newline_in_string: bool,
    pub priority_symbols: BTreeMap<&'static str, u32>,
    /* private fields */
}

General-purpose, zero-copy lexical analyzer that produces RawTokens from a str. This tokenizer uses regex, although not for everything: for example, string literals that may contain escaped quotations are recognized by a direct loop. The tokenizer gives the option of returning newlines, whitespaces (with count) and comments as special tokens. It recognizes multi-line string literals, multi-line as well as single-line comments, and returns the starting line and column positions of each token.


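  let mut scanner = StrTokenizer::from_str("while (1) fork();//run at your own risk");
  scanner.keep_comment = true; // comments are dropped by default; keep them as Verbatim tokens
  scanner.add_single(';'); // separates ; from following symbols
  while let Some(token) = scanner.next_token() {
     println!("Token,line,column: {:?}",&token);
  }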

This code produces the output:

  Token,line,column: (Alphanum("while"), 1, 1)
  Token,line,column: (Symbol("("), 1, 7)
  Token,line,column: (Num(1), 1, 8)
  Token,line,column: (Symbol(")"), 1, 9)
  Token,line,column: (Alphanum("fork"), 1, 11)
  Token,line,column: (Symbol("("), 1, 15)
  Token,line,column: (Symbol(")"), 1, 16)
  Token,line,column: (Symbol(";"), 1, 17)
  Token,line,column: (Verbatim("//run at your own risk"), 1, 18)
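
The description above notes that string literals with escaped quotations are recognized by a direct loop rather than by a regex. The following is a minimal sketch of what such a loop can look like. The helper scan_string_literal is hypothetical, written only for illustration; it is not part of StrTokenizer, and the crate's actual loop may differ.

```rust
/// Hypothetical helper (not part of StrTokenizer): given the byte index
/// `start` of an opening double quote in `input`, returns the byte index
/// just past the matching closing quote, or None if the literal is
/// unterminated or `start` is not on a quote.
fn scan_string_literal(input: &str, start: usize) -> Option<usize> {
    let bytes = input.as_bytes();
    if bytes.get(start) != Some(&b'"') {
        return None; // not positioned on an opening quote
    }
    let mut i = start + 1;
    while i < bytes.len() {
        match bytes[i] {
            // skip the backslash and the first byte of whatever it escapes;
            // any remaining UTF-8 continuation bytes can never equal '"' or '\\'
            b'\\' => i += 2,
            // an unescaped quote closes the literal
            b'"' => return Some(i + 1),
            _ => i += 1,
        }
    }
    None // reached end of input without a closing quote
}
```

Because the function only returns an index, the caller can take a slice such as &input[start..end] that borrows from the original input, which is in the spirit of the zero-copy design described above.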


§keep_whitespace: bool

flag to toggle whether whitespaces should be returned as Whitespace tokens, default is false.

§keep_newline: bool

flag to toggle whether newline characters (‘\n’) are returned as Newline tokens. Default is false. Note that if this flag is set to true then newline characters are treated differently from other whitespaces. For example, when parsing languages like Python, both keep_whitespace and keep_newline should be set to true. This option can be changed in the grammar with lexattribute keep_newline=true

§keep_comment: bool

flag to determine if comments are kept and returned as Verbatim tokens, default is false.

§line_positions: Vec<usize>

vector of starting byte position of each line, position 0 not used.
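
As an illustration of how a table of line-start byte positions can be used to recover a line of source text, here is a minimal sketch. The functions line_starts and line_text are hypothetical and not part of StrTokenizer. The sketch also assumes the whole input is scanned up front, whereas the tokenizer may only know the positions of lines it has already scanned.

```rust
/// Hypothetical helper: records the starting byte position of each line.
/// Index 0 is a placeholder so that line numbers are 1-based, mirroring
/// the "position 0 not used" convention.
fn line_starts(input: &str) -> Vec<usize> {
    let mut starts = vec![0, 0]; // starts[0] unused; line 1 begins at byte 0
    for (i, b) in input.bytes().enumerate() {
        if b == b'\n' {
            starts.push(i + 1); // the next line begins right after the newline
        }
    }
    starts
}

/// Hypothetical helper: returns the text of 1-based line `n` without its
/// trailing newline, or None if `n` is out of range.
fn line_text<'a>(input: &'a str, starts: &[usize], n: usize) -> Option<&'a str> {
    if n == 0 || n >= starts.len() {
        return None;
    }
    let begin = starts[n];
    // a line ends just before the newline that precedes the next line start,
    // or at the end of input for the last line
    let end = starts.get(n + 1).map(|e| e - 1).unwrap_or(input.len());
    Some(&input[begin..end])
}
```

A lookup like this takes constant time per line, which is convenient when generating error messages after parsing.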

§specialeof: &'static str

§tab_spaces: usize

number of whitespaces to count for each tab (default 6). This can be changed with a declaration such as lexattribute tab_spaces=8. Do not set this value to zero.
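
To make the effect of this setting concrete, the following sketch computes a 1-based column in which each tab counts as tab_spaces columns. The function column_of is hypothetical and not part of StrTokenizer. It assumes that every other character counts as exactly one column, which may not match the tokenizer's own accounting.

```rust
/// Hypothetical helper: 1-based column of the character starting at
/// `byte_offset` in `line`, counting each tab as `tab_spaces` columns
/// and every other character as one column (an assumption).
/// `byte_offset` must lie on a character boundary.
fn column_of(line: &str, byte_offset: usize, tab_spaces: usize) -> usize {
    let mut col = 1;
    for c in line[..byte_offset].chars() {
        col += if c == '\t' { tab_spaces } else { 1 };
    }
    col
}
```

With the default of 6, a token that follows a single leading tab would be reported at column 7 under these assumptions.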

§allow_newline_in_string: bool

allows string literals to contain non-escaped newline characters. Warning: changing the default (false) may reduce the accuracy of error reporting.

§priority_symbols: BTreeMap<&'static str, u32>

Multiset of verbatim symbols that have priority over other categories; sorted by string order. The multiset is implemented as a map from strings to counts.
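
A multiset represented as a map from strings to counts can be sketched as follows. The functions multiset_add and multiset_remove are hypothetical stand-ins for illustration; in particular, removing the entry once its count reaches zero is an assumption and not confirmed behavior of add_priority_symbol or del_priority_symbol.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: adds one occurrence of `s` to the multiset.
fn multiset_add(set: &mut BTreeMap<&'static str, u32>, s: &'static str) {
    *set.entry(s).or_insert(0) += 1;
}

/// Hypothetical helper: removes one occurrence of `s`. The entry is dropped
/// when its count reaches zero (an assumption); removing an absent symbol
/// does nothing.
fn multiset_remove(set: &mut BTreeMap<&'static str, u32>, s: &'static str) {
    if let Some(count) = set.get_mut(s) {
        *count -= 1;
        if *count == 0 {
            set.remove(s);
        }
    }
}
```

Because BTreeMap keeps its keys ordered, iterating over the map visits the symbols in string order, as the field description states.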



impl<'t> StrTokenizer<'t>


pub fn new() -> StrTokenizer<'t>

creates a new tokenizer with defaults, does not set input.


pub fn map<G, FM: FnOnce(&mut StrTokenizer<'t>) -> G>(&mut self, f: FM) -> G

applies closure to self, can be used together with lexconditional to invoke custom actions


pub fn current_text(&self) -> &'t str

returns text of the current token, untrimmed


pub fn add_double(&mut self, s: &'t str)

adds a symbol of exactly length two. If the length is not two the function has no effect. Note that these symbols override all other types except for leading whitespaces and comment markers, e.g. “//” will have precedence over “/” and “==” will have precedence over “=”.


pub fn add_single(&mut self, c: char)

add a single-character symbol. The type of the symbol overrides other types except for whitespaces, comments and double-character symbols.
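
The precedence of double-character symbols over single-character ones can be pictured as a longest-match rule. The sketch below is an illustration only: match_symbol is a hypothetical function, and the way StrTokenizer stores and matches its symbols internally may be different.

```rust
/// Hypothetical helper: returns the symbol at the start of `rest`, trying
/// the two-character symbols before the single-character ones, so that
/// "==" is preferred over "=".
fn match_symbol<'a>(rest: &'a str, doubles: &[&str], singles: &[char]) -> Option<&'a str> {
    for d in doubles {
        if rest.starts_with(*d) {
            return Some(&rest[..d.len()]);
        }
    }
    // no double symbol matched, so fall back to a single character
    let c = rest.chars().next()?;
    if singles.contains(&c) {
        Some(&rest[..c.len_utf8()])
    } else {
        None
    }
}
```

Checking the longer symbols first is what keeps an input such as "==" from being split into two "=" tokens.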


pub fn add_triple(&mut self, s: &'t str)

add a 3-character symbol


pub fn add_priority_symbol(&mut self, s: &'static str)

multiset-add a verbatim string as a priority symbol: will be returned as Symbol(s)


pub fn del_priority_symbol(&mut self, s: &'static str)

multiset-remove a verbatim string as a priority symbol


pub fn skip_to(&mut self, target: &'static str)

Skips to last occurrence of target string, or to end of input. Returns RawToken::Skipto token.


pub fn skip_reset(&mut self)

cancels recognition of skip_to (called internally)


pub fn skip_match( &mut self, lbr: &'static str, rbr: &'static str, offset: i32, delimit: &'static str, )

StrTokenizer can do a little more than recognize just regular expressions. It can detect matching brackets and return the bracket-enclosed text as a RawToken::Skipto token. An offset of 1 is recommended, as this call is usually made after an instance of the opening left-bracket is seen as lookahead. The operation keeps a counter that starts at the offset, increases every time a left-bracket is seen and decreases with every right-bracket, until counter==0, at which point it returns the skipped text in a RawToken::Skipmatched token. It will stop searching when the delimit string is reached. If delimit is the empty string, then it will search until the end of input.
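
The counter-based bracket matching described above can be sketched as a standalone function. Here skip_matched is a hypothetical helper, not the crate's implementation. It makes several assumptions: the returned text includes the closing bracket, None is returned when the delimiter or the end of input is reached first, and empty bracket strings are rejected.

```rust
/// Hypothetical helper modeled on the description of skip_match. Scans
/// `input` with a counter that starts at `offset`, adds 1 for each `lbr`
/// and subtracts 1 for each `rbr`. Returns the skipped text, up to and
/// including the `rbr` that brings the counter to 0 (an assumption).
fn skip_matched<'a>(
    input: &'a str,
    lbr: &str,
    rbr: &str,
    offset: i32,
    delimit: &str,
) -> Option<&'a str> {
    if lbr.is_empty() || rbr.is_empty() {
        return None; // empty brackets would never advance the scan
    }
    let mut counter = offset;
    let mut i = 0;
    while i < input.len() {
        let rest = &input[i..];
        if !delimit.is_empty() && rest.starts_with(delimit) {
            return None; // delimiter reached before the brackets balanced
        }
        if rest.starts_with(lbr) {
            counter += 1;
            i += lbr.len();
        } else if rest.starts_with(rbr) {
            counter -= 1;
            i += rbr.len();
            if counter == 0 {
                return Some(&input[..i]);
            }
        } else {
            // advance by one whole character to stay on a UTF-8 boundary
            i += rest.chars().next().map_or(1, |c| c.len_utf8());
        }
    }
    None // end of input reached with unbalanced brackets
}
```

Starting the counter at 1 corresponds to the recommended offset: the opening bracket has already been consumed as lookahead, so the scan begins one level deep.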


pub fn add_custom(&mut self, tkind: &'static str, reg_expr: &str)

add custom defined regex, will correspond to RawToken::Custom variant. Custom regular expressions should not start with whitespaces and will override all others. Multiple Custom types will be matched by the order in which they were declared in the grammar file.


pub fn set_input(&mut self, inp: &'t str)

sets the input str to be parsed, resets position information. Note: trailing whitespaces are always trimmed from the input.


pub fn set_line_comment(&mut self, cm: &'t str)

sets the symbol that begins a single-line comment. The default is “//”. If this is set to the empty string then no line-comments are recognized.


pub fn set_multiline_comments(&mut self, cm: &'t str)

sets the symbols used to delineate multi-line comments using a whitespace separated string such as “/* */”. These symbols are also the default. Set this to the empty string to disable multi-line comments.
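
The whitespace-separated format of the argument can be illustrated with a small sketch that splits such a string into its opening and closing markers. The function parse_comment_delims is hypothetical and not part of StrTokenizer. Treating anything other than exactly two parts as "no multi-line comments" is an assumption.

```rust
/// Hypothetical helper: splits a string such as "/* */" into its opening
/// and closing comment markers. Returns None for the empty string or for
/// any input that does not contain exactly two parts (an assumption).
fn parse_comment_delims(cm: &str) -> Option<(&str, &str)> {
    let mut parts = cm.split_whitespace();
    match (parts.next(), parts.next(), parts.next()) {
        (Some(open), Some(close), None) => Some((open, close)),
        _ => None,
    }
}
```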


pub fn line(&self) -> usize

the current line that the tokenizer is on


pub fn column(&self) -> usize

the current column of the tokenizer


pub fn current_position(&self) -> usize

returns the current absolute byte position of the Tokenizer


pub fn previous_position(&self) -> usize

returns the previous absolute byte position of the Tokenizer


pub fn get_source(&self) -> &str

returns the source of the tokenizer such as URL or filename


pub fn set_source<'u: 't>(&mut self, s: &'u str)


pub fn current_line(&self) -> &str

gets the current line of the source input


pub fn get_line(&self, i: usize) -> Option<&str>

Retrieves the ith line of the raw input, if line index i is valid. This function is intended to be called once the tokenizer has completed its task of scanning and tokenizing the entire input. Otherwise, it may return None if the tokenizer has not yet scanned up to the line indicated. That is, it is intended for error message generation when evaluating the AST post-parsing.


pub fn get_slice(&self, start: usize, end: usize) -> &str

Retrieves the source string slice at the indicated indices; returns the empty string if indices are invalid. The default implementation returns the empty string.


pub fn reset(&mut self)

reset tokenizer to parse from beginning of input


pub fn backtrack(&mut self, offset: usize)


pub fn next_token(&mut self) -> Option<(RawToken<'t>, usize, usize)>

returns next token, along with starting line and column numbers. This function will return None at end of stream or LexError along with a message printed to stderr if a tokenizer error occurred.


impl<'t> StrTokenizer<'t>


pub fn from_source(ls: &'t LexSource<'t>) -> StrTokenizer<'t>

creates a StrTokenizer from a LexSource structure that contains a string representing the contents of the source, and calls StrTokenizer::set_input to reference that string. For example, to create a tokenizer that reads from a file:

let source = LexSource::new(source_path).unwrap();
let mut tokenizer = StrTokenizer::from_source(&source);

pub fn from_str(s: &'t str) -> StrTokenizer<'t>

creates a string tokenizer and sets input to the given str.

