Crate basic_lexer
basic_lexer is a basic lexical scanner designed for the first stage of compiler construction; it produces the tokens required by a parser. It was originally written to support the parallel project rustlr, an LR-style parser generator, although the two projects are independent of each other.
Version 0.2.0 adds a new “zero-copy” tokenizer, consisting of RawToken, StrTokenizer and LexSource; the most important of these is StrTokenizer. The original tokenizer and related constructs, which produced tokens containing owned strings, are still present. Neither tokenizer offers optimal performance, however, as they are not built from DFAs. The new tokenizing function, StrTokenizer::next_token, uses regex and is now the focus of the crate. It is capable of counting whitespace (for Python-like languages) and accurately tracks the starting line/column position of each token.
Example: given the Cargo.toml file of this crate,
let source = LexSource::new("Cargo.toml").unwrap();
let mut tokenizer = StrTokenizer::from_source(&source);
tokenizer.set_line_comment("#");
tokenizer.keep_comment=true;
tokenizer.keep_newline=false;
tokenizer.keep_whitespace=false;
while let Some(token) = tokenizer.next() {
   println!("Token: {:?}", &token);
}
This code produces the following output:
Token: (Symbol("["), 1, 1)
Token: (Alphanum("package"), 1, 2)
Token: (Symbol("]"), 1, 9)
Token: (Alphanum("name"), 2, 1)
Token: (Symbol("="), 2, 6)
Token: (Strlit("\"basic_lexer\""), 2, 8)
Token: (Alphanum("version"), 3, 1)
Token: (Symbol("="), 3, 9)
Token: (Strlit("\"0.2.0\""), 3, 11)
Token: (Alphanum("edition"), 4, 1)
Token: (Symbol("="), 4, 9)
Token: (Strlit("\"2018\""), 4, 11)
...
Token: (Symbol("]"), 8, 35)
Token: (Verbatim("# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html"), 10, 1)
etc. The numbers returned alongside each token give the line and column position at which the token starts.
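To illustrate how starting line/column positions can be tracked while scanning, here is a minimal stdlib-only sketch. It is not basic_lexer's implementation (which uses regex); the names `Tok` and `scan` are hypothetical, and it splits only on whitespace, but it shows the bookkeeping: record the (line, column) at the first character of each token, advance the column for every character, and reset to column 1 after each newline.

```rust
// Hypothetical sketch of line/column tracking, NOT basic_lexer's code.
#[derive(Debug, PartialEq)]
struct Tok<'a> {
    text: &'a str,
    line: usize,   // 1-based line of the token's first character
    column: usize, // 1-based column of the token's first character
}

fn scan(input: &str) -> Vec<Tok<'_>> {
    let mut toks = Vec::new();
    let (mut line, mut col) = (1usize, 1usize);
    // (byte offset, line, column) of the token currently being built
    let mut start: Option<(usize, usize, usize)> = None;
    for (i, ch) in input.char_indices() {
        if ch.is_whitespace() {
            // whitespace ends the current token, if any
            if let Some((s, l, c)) = start.take() {
                toks.push(Tok { text: &input[s..i], line: l, column: c });
            }
            if ch == '\n' { line += 1; col = 1; } else { col += 1; }
        } else {
            // remember where this token began
            if start.is_none() { start = Some((i, line, col)); }
            col += 1;
        }
    }
    if let Some((s, l, c)) = start {
        toks.push(Tok { text: &input[s..], line: l, column: c });
    }
    toks
}

fn main() {
    let toks = scan("name =\n\"basic_lexer\"");
    // "name" starts at line 1, column 1; the string literal at line 2, column 1
    assert_eq!(toks[0], Tok { text: "name", line: 1, column: 1 });
    assert_eq!(toks[2], Tok { text: "\"basic_lexer\"", line: 2, column: 1 });
    println!("{:?}", toks);
}
```

A real lexer would classify tokens (symbols, identifiers, literals) rather than splitting on whitespace alone, but the position bookkeeping is the same.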
Structs
LexSource — Structure to hold the contents of a source (such as the contents of a file).
StrTokenizer — Generic str tokenizer that produces RawTokens.
Enums
RawToken — Structure produced by StrTokenizer.
Tokens are returned by the iterators Str_tokenizer and File_tokenizer.