
basic_lexer is a basic lexical scanner designed for the first stage of compiler construction; it produces the tokens required by a parser. It was originally intended to support the parallel project rustlr, an LR-style parser generator, although the two projects are independent of each other.

For version 0.2.0, a new “zero-copy” tokenizer has been added, consisting of RawToken, StrTokenizer and LexSource. The most important structure is StrTokenizer. The original tokenizer and its related constructs, which produced tokens containing owned strings, are still present. However, neither tokenizer offers optimal performance, as they are not built from DFAs. The new tokenizing function, StrTokenizer::next_token, uses regex and is now the focus of the crate. It is capable of counting whitespaces (for Python-like languages) and accurately tracks the starting line/column position of each token.

Example: given the Cargo.toml file of this crate,

  let source = LexSource::new("Cargo.toml").unwrap();
  let mut tokenizer = StrTokenizer::from_source(&source);
  tokenizer.set_line_comment("#");
  tokenizer.keep_comment = true;
  tokenizer.keep_newline = false;
  tokenizer.keep_whitespace = false;
  while let Some(token) = tokenizer.next() {
      println!("Token: {:?}", &token);
  }

This code produces the following output:

 Token: (Symbol("["), 1, 1)
 Token: (Alphanum("package"), 1, 2) 
 Token: (Symbol("]"), 1, 9)
 Token: (Alphanum("name"), 2, 1)
 Token: (Symbol("="), 2, 6)
 Token: (Strlit("\"basic_lexer\""), 2, 8)
 Token: (Alphanum("version"), 3, 1)
 Token: (Symbol("="), 3, 9)
 Token: (Strlit("\"0.2.0\""), 3, 11)
 Token: (Alphanum("edition"), 4, 1)
 Token: (Symbol("="), 4, 9)
 Token: (Strlit("\"2018\""), 4, 11)
 ...
 Token: (Symbol("]"), 8, 35)
 Token: (Verbatim("# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html"), 10, 1)

etc. The numbers returned alongside each token are the line and column positions at which the token starts.
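A parser consuming these tokens would typically dispatch on the token variant. The following is a minimal, hedged sketch of that pattern using a local stand-in enum: the variant names (Symbol, Alphanum, Strlit, Verbatim) are taken from the output listing above, but the crate's real RawToken has additional variants and may differ in its exact signatures, so treat the definitions here as illustrative only.

```rust
// Stand-in for basic_lexer's RawToken, for illustration only.
// Variant names come from the output listing above; the real enum
// borrows &str slices from the source and has more variants.
#[derive(Debug)]
enum RawToken<'t> {
    Alphanum(&'t str),
    Symbol(&'t str),
    Strlit(&'t str),
    Verbatim(&'t str),
}

// Describe a token together with its (line, column) start position,
// as a parser's terminal-classification step might.
fn describe(tok: &RawToken, line: usize, col: usize) -> String {
    match tok {
        RawToken::Alphanum(s) => format!("identifier '{}' at {}:{}", s, line, col),
        RawToken::Symbol(s)   => format!("symbol '{}' at {}:{}", s, line, col),
        RawToken::Strlit(s)   => format!("string literal {} at {}:{}", s, line, col),
        RawToken::Verbatim(s) => format!("verbatim at {}:{}: {}", line, col, s),
    }
}

fn main() {
    // The first few tokens from the listing above.
    let toks = [
        (RawToken::Symbol("["), 1, 1),
        (RawToken::Alphanum("package"), 1, 2),
        (RawToken::Symbol("]"), 1, 9),
    ];
    for (t, ln, cl) in &toks {
        println!("{}", describe(t, *ln, *cl));
    }
}
```

With the real crate, the same match would be written against basic_lexer's own RawToken inside the `while let Some(token) = tokenizer.next()` loop shown earlier.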

Structs

File_tokenizer — A Token Iterator on a given file.

LexSource — Structure to hold the contents of a source (such as the contents of a file).

StrTokenizer — Generic str tokenizer that produces RawTokens.

Str_tokenizer — A Token Iterator on a given &str.

Enums

RawToken — Token type produced by StrTokenizer.

Token — Tokens returned by the iterators Str_tokenizer and File_tokenizer.