Crate pretok

pretok

Pretok is a pre-tokenizer (or pre-lexer) for C-like syntaxes. Pretok simplifies subsequent lexers by handling line and block comments, whitespace and strings. Pretok operates as an iterator over an input string of UTF-8 code points.

Given an input string, pretok does the following.

  • Implements the Iterator trait; each call to next() returns an Option<Pretoken>.
  • Filters // line comments from the input string.
  • Filters /* block comments */ from the input string.
  • Returns "quoted strings with \"escapes\"" as a single Pretoken.
  • Skips whitespace characters.
  • After applying the filters above, returns the remaining text as Pretokens, usually delineated by whitespace.
  • Returns the line number and byte offset of each Pretoken.
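The whitespace-skipping and position-tracking steps above can be sketched as a small iterator. This is only an illustration under stated assumptions, not pretok's implementation: the names `Word` and `Words` are hypothetical, the field types are guesses, and comments and quoted strings are deliberately ignored.

```rust
/// Hypothetical stand-in for this sketch; not pretok's Pretoken.
#[derive(Debug, PartialEq)]
struct Word<'a> {
    s: &'a str,    // slice borrowed from the input
    line: usize,   // 1-based line number of the first character
    offset: usize, // byte offset of the first character
}

/// Hypothetical whitespace-only iterator; handles no comments or strings.
struct Words<'a> {
    input: &'a str,
    pos: usize,  // byte offset of the next unread character
    line: usize, // 1-based line number at `pos`
}

impl<'a> Words<'a> {
    fn new(input: &'a str) -> Self {
        Words { input, pos: 0, line: 1 }
    }
}

impl<'a> Iterator for Words<'a> {
    type Item = Word<'a>;

    fn next(&mut self) -> Option<Word<'a>> {
        // Copy the shared reference so slices borrow from the input, not from `self`.
        let input: &'a str = self.input;

        // Skip leading whitespace, counting newlines as we go.
        for c in input[self.pos..].chars() {
            if !c.is_whitespace() {
                break;
            }
            if c == '\n' {
                self.line += 1;
            }
            self.pos += c.len_utf8();
        }

        let rest = &input[self.pos..];
        if rest.is_empty() {
            return None;
        }

        // The word ends at the next whitespace character or at end of input.
        let len = rest.find(char::is_whitespace).unwrap_or(rest.len());
        let word = Word { s: &rest[..len], line: self.line, offset: self.pos };
        self.pos += len;
        Some(word)
    }
}

fn main() {
    for w in Words::new("Hello World!\nbye") {
        println!("{:?}", w);
    }
}
```

Offsets are counted in bytes rather than characters, so a returned offset can be used directly to slice the original input.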

Motivation

Common computer language features such as comments, line number tracking, and whitespace tolerance introduce corner cases that can make lexing awkward. By imposing a few opinions, the Pretokenizer solves these problems at the earliest stage of processing. This preprocessing normalizes the input stream and simplifies subsequent processing.

Basic Use

The Pretokenizer is an iterator returning a sequence of Pretoken objects from an input string reference. Normally, each returned Pretoken represents at least one actual language token. A subsequent lexing step would split Pretokens into language specific tokens.
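As a rough illustration of the subsequent lexing step, the sketch below splits the text of a single Pretoken (for example `x+`) into language tokens. The `Token` enum and the `split_pretoken` function are hypothetical and not part of pretok; a real lexer would follow the rules of its own language.

```rust
/// Hypothetical token type for this sketch; not part of pretok.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Ident(&'a str),
    Number(&'a str),
    Punct(char),
}

/// True for characters allowed inside an identifier in this sketch.
fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Split the text of one pretoken into identifiers, numbers and punctuation.
fn split_pretoken(s: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < s.len() {
        let rest = &s[pos..];
        // `rest` is non-empty here, so there is always a first character.
        let c = rest.chars().next().unwrap();
        if c.is_ascii_digit() {
            // A run of ASCII digits becomes one Number token.
            let len = rest.find(|ch: char| !ch.is_ascii_digit()).unwrap_or(rest.len());
            tokens.push(Token::Number(&rest[..len]));
            pos += len;
        } else if is_word(c) {
            // A run of word characters becomes one Ident token.
            let len = rest.find(|ch: char| !is_word(ch)).unwrap_or(rest.len());
            tokens.push(Token::Ident(&rest[..len]));
            pos += len;
        } else {
            // Anything else is treated as single-character punctuation.
            tokens.push(Token::Punct(c));
            pos += c.len_utf8();
        }
    }
    tokens
}

fn main() {
    println!("{:?}", split_pretoken("x+"));
    println!("{:?}", split_pretoken("a1=42;"));
}
```

Because whitespace, comments and quoted strings have already been handled, a splitter like this only has to deal with tightly packed token text.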

Examples

Whitespace typically separates Pretokens and is stripped outside of quoted strings.

    use pretok::{Pretokenizer, Pretoken};
    let mut pt = Pretokenizer::new("Hello World!");
    assert!(pt.next() == Some(Pretoken{s:"Hello", line:1, offset:0}));
    assert!(pt.next() == Some(Pretoken{s:"World!", line:1, offset:6}));
    assert!(pt.next() == None);

Comments are stripped and may also delineate Pretokens.

    use pretok::{Pretokenizer, Pretoken};
    let mut pt = Pretokenizer::new("x/*y*/z");
    assert!(pt.next() == Some(Pretoken{s:"x", line:1, offset:0}));
    assert!(pt.next() == Some(Pretoken{s:"z", line:1, offset:6}));
    assert!(pt.next() == None);

    let mut pt = Pretokenizer::new("x\ny//z");
    assert!(pt.next() == Some(Pretoken{s:"x", line:1, offset:0}));
    assert!(pt.next() == Some(Pretoken{s:"y", line:2, offset:2}));
    assert!(pt.next() == None);

Quoted strings are a single Pretoken.

    use pretok::{Pretokenizer, Pretoken};
    let mut pt = Pretokenizer::new("Hello \"W o r l d!\"");
    assert!(pt.next() == Some(Pretoken{s:"Hello", line:1, offset:0}));
    assert!(pt.next() == Some(Pretoken{s:"\"W o r l d!\"", line:1, offset:6}));
    assert!(pt.next() == None);

Quoted strings create a single Pretoken separate from the surrounding pretoken(s).

    use pretok::{Pretokenizer, Pretoken};
    let mut pt = Pretokenizer::new("x+\"h e l l o\"+z");
    assert!(pt.next() == Some(Pretoken{s:"x+", line:1, offset:0}));
    assert!(pt.next() == Some(Pretoken{s:"\"h e l l o\"", line:1, offset:2}));
    assert!(pt.next() == Some(Pretoken{s:"+z", line:1, offset:13}));
    assert!(pt.next() == None);

Unit Testing

Pretok includes unit tests. To run them:

cargo test

Fuzz Testing

Pretok supports fuzz tests. Fuzz testing starts from a corpus of random inputs and then further randomizes those inputs to try to cause crashes and hangs. At the time of writing (Rust 1.46.0), fuzz testing requires the nightly build.

To run fuzz tests:

cargo +nightly fuzz run fuzz_target_1

Fuzz tests run until stopped with Ctrl-C. In my experience, fuzz tests will catch a problem almost immediately or not at all.

Cargo fuzz uses LLVM's libFuzzer internally, which provides a vast array of runtime options. To see the options using the nightly compiler build:

cargo +nightly fuzz run fuzz_target_1 -- -help=1

For example, to set a shorter 5-second timeout for hangs:

cargo +nightly fuzz run fuzz_target_1 -- -timeout=5

Structs

Pretoken

A Pretoken object contains a slice of the Pretokenizer input string with lifetime 'a.
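To show what holding a slice with a lifetime means in practice, here is a minimal stand-in. The field names follow the examples earlier on this page, but the field types and the helper `pretoken_at` are assumptions made for this sketch, not pretok's actual definitions.

```rust
/// Stand-in that mirrors the fields used in the examples; the field types
/// are assumptions, not pretok's actual definition.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Pretoken<'a> {
    s: &'a str,    // borrowed slice of the input, valid for 'a
    line: usize,   // 1-based line number
    offset: usize, // byte offset into the input
}

/// Hypothetical helper: build a Pretoken for `len` bytes starting at `offset`.
/// Panics if the range is out of bounds or not on UTF-8 character boundaries.
fn pretoken_at<'a>(input: &'a str, offset: usize, len: usize) -> Pretoken<'a> {
    // The line number is one plus the count of newlines before the offset.
    let line = 1 + input[..offset].matches('\n').count();
    Pretoken { s: &input[offset..offset + len], line, offset }
}

fn main() {
    let input = String::from("x\ny//z");
    let tok = pretoken_at(&input, 2, 1);
    // `tok.s` points into `input`; no text is copied, and the borrow checker
    // keeps `input` alive for as long as `tok` is in use.
    println!("{:?}", tok);
}
```

Borrowing from the input avoids allocating a new string for every token, at the cost of tying each token's lifetime to the input string.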

Pretokenizer

The Pretokenizer is an iterator that produces Option<Pretoken> objects over an input string.