Crate pretok
pretok
Pretok is a pre-tokenizer (or pre-lexer) for C-like syntaxes. Pretok simplifies subsequent lexers by handling line and block comments, whitespace and strings. Pretok operates as an iterator over an input string of UTF-8 code points.
Given an input string, pretok does the following.
- Implements the iterator trait, where next() returns a sequence of Option<Pretoken> structures.
- Filters // line comments from the input string.
- Filters /* block comments */ from the input string.
- Returns "quoted strings with \"escapes\"" as a single Pretoken.
- Skips whitespace characters.
- After the above filters, returns Pretokens, usually delineated by whitespace.
- Returns the line number and byte offset of each pretoken.
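The line number and byte offset are what a later stage typically needs for error reporting. The following is a minimal sketch, not part of pretok, that turns an offset into a column and a short diagnostic string. It assumes the offset counts bytes from the start of the whole input (as in the "x\ny//z" example below, where y has offset 2) and that it falls on a character boundary. The function names column_of and describe are hypothetical.

```rust
/// Returns the 0-based byte column of `offset` within its line.
/// Assumes `offset` is a byte index into `input` on a char boundary.
pub fn column_of(input: &str, offset: usize) -> usize {
    // The line starts just after the last newline before `offset`, or at 0.
    let line_start = input[..offset].rfind('\n').map_or(0, |i| i + 1);
    offset - line_start
}

/// Builds a one-line diagnostic showing the source line that contains `offset`.
/// `line` is passed through as reported by the pretoken (1-based in the examples).
pub fn describe(input: &str, line: usize, offset: usize) -> String {
    let column = column_of(input, offset);
    let line_start = offset - column;
    // The line ends at the next newline after `offset`, or at end of input.
    let line_end = input[offset..]
        .find('\n')
        .map_or(input.len(), |i| offset + i);
    format!(
        "line {}, column {}: {}",
        line,
        column + 1,
        &input[line_start..line_end]
    )
}
```

A caller would pass the original input string together with the line and offset fields of the pretoken it wants to report on.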
Motivation
Common computer language features such as comments, line number tracking, whitespace tolerance, etc. introduce corner cases that can make lexing awkward. By imposing a few opinions, the Pretokenizer solves these problems at the earliest stage of processing. This preprocessing normalizes the input stream and simplifies subsequent processing.
Basic Use
The Pretokenizer is an iterator returning a sequence of Pretoken objects from an input string reference. Normally, each returned Pretoken represents at least one actual language token. A subsequent lexing step would split Pretokens into language specific tokens.
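As a rough illustration of that subsequent lexing step, the sketch below splits one pretoken into finer tokens and carries the byte offset forward. It is an assumption-laden example, not pretok's own code: the Pretoken struct here is a local stand-in whose field names follow the examples in this document (the field types are assumed), and the Token enum and split_pretoken function are hypothetical.

```rust
/// Local stand-in for pretok's `Pretoken`. Field names follow the examples
/// in this document; the field types are assumed.
#[derive(Debug, PartialEq)]
pub struct Pretoken<'a> {
    pub s: &'a str,
    pub line: usize,
    pub offset: usize,
}

/// Hypothetical token type for a second-stage lexer.
#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    Ident(&'a str),
    Number(&'a str),
    Str(&'a str),
    Punct(char),
}

/// Splits one pretoken into (token, byte offset in the original input) pairs.
pub fn split_pretoken<'a>(pt: &Pretoken<'a>) -> Vec<(Token<'a>, usize)> {
    let s = pt.s;
    // A quoted string arrives as a single pretoken, so pass it through whole.
    if s.starts_with('"') {
        return vec![(Token::Str(s), pt.offset)];
    }
    let is_word = |ch: char| ch.is_alphanumeric() || ch == '_';
    let mut out = Vec::new();
    let mut chars = s.char_indices().peekable();
    while let Some((start, c)) = chars.next() {
        if is_word(c) {
            // Extend the run while the next character is still a word character.
            let mut end = start + c.len_utf8();
            while let Some(&(i, n)) = chars.peek() {
                if !is_word(n) {
                    break;
                }
                end = i + n.len_utf8();
                chars.next();
            }
            let text = &s[start..end];
            // Simplification: any run starting with a digit is called a number.
            let tok = if c.is_ascii_digit() {
                Token::Number(text)
            } else {
                Token::Ident(text)
            };
            out.push((tok, pt.offset + start));
        } else {
            // Everything else becomes a single-character punctuation token.
            out.push((Token::Punct(c), pt.offset + start));
        }
    }
    out
}
```

Because comments, whitespace and quoted strings have already been handled, this second stage only needs to distinguish word runs from punctuation. A real lexer would add multi-character operators and stricter number parsing.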
Examples
Whitespace typically separates Pretokens and is stripped outside of quoted strings.
use pretok::{Pretokenizer, Pretoken};
let mut pt = Pretokenizer::new("Hello World!");
assert!(pt.next() == Some(Pretoken{s:"Hello", line:1, offset:0}));
assert!(pt.next() == Some(Pretoken{s:"World!", line:1, offset:6}));
assert!(pt.next() == None);
Comments are stripped and may also delineate Pretokens.
use pretok::{Pretokenizer, Pretoken};
let mut pt = Pretokenizer::new("x/*y*/z");
assert!(pt.next() == Some(Pretoken{s:"x", line:1, offset:0}));
assert!(pt.next() == Some(Pretoken{s:"z", line:1, offset:6}));
assert!(pt.next() == None);

let mut pt = Pretokenizer::new("x\ny//z");
assert!(pt.next() == Some(Pretoken{s:"x", line:1, offset:0}));
assert!(pt.next() == Some(Pretoken{s:"y", line:2, offset:2}));
assert!(pt.next() == None);
Quoted strings are a single Pretoken.
use pretok::{Pretokenizer, Pretoken};
let mut pt = Pretokenizer::new("Hello \"W o r l d!\"");
assert!(pt.next() == Some(Pretoken{s:"Hello", line:1, offset:0}));
assert!(pt.next() == Some(Pretoken{s:"\"W o r l d!\"", line:1, offset:6}));
assert!(pt.next() == None);
Quoted strings create a single Pretoken separate from the surrounding pretoken(s).
use pretok::{Pretokenizer, Pretoken};
let mut pt = Pretokenizer::new("x+\"h e l l o\"+z");
assert!(pt.next() == Some(Pretoken{s:"x+", line:1, offset:0}));
assert!(pt.next() == Some(Pretoken{s:"\"h e l l o\"", line:1, offset:2}));
assert!(pt.next() == Some(Pretoken{s:"+z", line:1, offset:13}));
assert!(pt.next() == None);
Unit Testing
Pretok supports unit tests.
cargo test
Fuzz Testing
Pretok supports fuzz tests. Fuzz testing starts from a corpus of random inputs and then further randomizes those inputs to try to cause crashes and hangs. At the time of writing (Rust 1.46.0), fuzz testing required the nightly build.
To run fuzz tests:
cargo +nightly fuzz run fuzz_target_1
Fuzz tests run until stopped with Ctrl-C. In my experience, fuzz tests will catch a problem almost immediately or not at all.
Cargo fuzz uses LLVM's libFuzzer internally, which provides a vast array of runtime options. To see the options using the nightly compiler build:
cargo +nightly fuzz run fuzz_target_1 -- -help=1
For example, to set a shorter 5-second timeout for hangs:
cargo +nightly fuzz run fuzz_target_1 -- -timeout=5
Structs
Pretoken | A pretoken object contains a slice of the |
Pretokenizer | The Pretokenizer is an iterator that produces Option<Pretoken> objects over an input string. |