Crate token_iter

This crate makes it easier to write tokenizers for textual languages. A tokenizer takes a sequence of characters as input and produces a sequence of “tokens” – instances of some type that categorizes groups of characters. For example, the text “foo < 10” might be tokenized into three tokens representing the identifier “foo”, the less-than symbol, and the integer 10.

Note that a sequence of characters that is considered a unit is called a “lexeme,” and the lexeme gets converted to a token (typically an enum value). For example, the lexeme “while” might be mapped to a Token::While enum value, and “731” to Token::Int(731).

This library was designed with the following principles in mind:

  1. The library should NOT define tokens, or place constraints on how they are represented. The library’s token type should be an opaque generic parameter, with no type constraints.
  2. The library should NOT have any ideas about how the text is interpreted, beyond that a lexeme may not span lines. For example, the library should not assume that whitespace is insignificant, have any notion of identifiers or numbers, or limit line length.
  3. The logic deciding what constitutes a lexeme should be expressed in Rust code (rather than, for example, regular expressions).
  4. The API should give you an Iterator over the tokens in an &str, or in an Iterator of &str’s.
  5. The library should automatically add the line number and column range for a token.
  6. The library should be very time-efficient.
  7. The library should be no_std and have no dependencies. In particular, it should not allocate heap memory for lexemes, instead yielding &str's that are substrings of the input.
  8. Invalid tokens are tokens, not errors. Tokenization shouldn’t stop just because it doesn’t recognize something. And you want the same line/column info for bad tokens as you do for good ones. So the tokenizer just produces tokens, not Results that contain either a token or an error.
  9. The library should work with multibyte characters.
  10. The API should support a functional programming style.

To use the library, you must

  1. define a type for your tokens (typically, but not necessarily, an enum)
  2. write a tokenizer function that uses a Lexer (a type provided by this library) to recognize lexemes and convert them to tokens
  3. call either
    tokens_in_line(line, &tokenizer), if you know your code is a single line, or
    tokens_in(line_iter, &tokenizer), to get line numbers with the tokens
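
For example, here is a minimal sketch of the single-line case. The Tok type and tok function are just illustrations (they are not part of this library), and tokens_in_line is assumed to yield (column range, token) pairs, as described under Functions below.

use token_iter::*;

#[derive(Debug)]
enum Tok { Less, Int(u64), Word(String), Other(char) }

// A tiny tokenizer: recognizes '<', unsigned integers, and alphanumeric words.
fn tok(lx: &mut Lexer) -> Option<Tok> {
    let is_digit = |c| char::is_ascii_digit(&c);
    Some(
        match lx.skip_while(char::is_whitespace).next()? {
            '<' => Tok::Less,
            c if is_digit(c) =>
                lx.take_while(is_digit).map(|s| Tok::Int(s.parse().unwrap_or(0))),
            c if c.is_alphabetic() =>
                Tok::Word(lx.take_while(char::is_alphanumeric).get().into()),
            c => Tok::Other(c)
        }
    )
}

fn main() {
    for (col_range, token) in tokens_in_line("foo < 10", &tok) {
        println!("At columns {col_range:?}: {token:?}");
    }
    // Expected output, given 0-based columns as in the example below:
    // At columns 0..3: Word("foo")
    // At columns 4..5: Less
    // At columns 6..8: Int(10)
}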

tokens_in_line works with your tokenizer function to generate the tokens one at a time. tokens_in runs tokens_in_line on each line, flattening the results and adding line numbers.

And here is a fairly beefy example:

use token_iter::*;

// A type for the tokens.
#[derive(Clone, Debug, PartialEq)]
enum Token {
    LT, LE,                 // < <=
    GT, GE,                 // > >=
    EqEq, EQ,               // == =
    LCurly, RCurly,         // { }
    While, If,
    Identifier(String),     // e.g. foo12
    Int(u64),               // e.g. 273
    BadInt(String),         // e.g. 9873487239482398477132498723423987234
    Unrecognized(char)
}

// Produces a token, using a Lexer to examine the characters.
fn tokenizer(lx: &mut Lexer) -> Option<Token> {
    use Token::*;
    let is_digit = |c| char::is_ascii_digit(&c);  // is_ascii_digit() takes &char, hence the closure wrapper
    Some(
        // Skip leading whitespace, then consume the next char (None when the line is exhausted).
        match lx.skip_while(char::is_whitespace).next()? {
            '<' => if lx.at('=') {LE} else {LT},      // at('=') consumes the '=' if it is next
            '>' => if lx.at('=') {GE} else {GT},
            '=' => if lx.at('=') {EqEq} else {EQ},
            '{' => LCurly,
            '}' => RCurly,
            c if c.is_alphabetic() =>
                   // take_while() extends the lexeme; get() returns its text as a &str
                   match lx.take_while(char::is_alphanumeric).get() {
                       "while" => While,
                       "if" => If,
                       s => Identifier(s.into())
                   },
            c if is_digit(c) =>
                   // map() applies the closure to the lexeme's text
                   lx.take_while(is_digit).map( |s|
                       if let Ok(n) = s.parse::<u64>() {
                           Int(n)
                       } else {
                           BadInt(s.into())
                       }
                   ),
            c => Unrecognized(c)
        }
    )
}

fn main() {
    let code = r#"
        if foo > bar {
            foo = 1
        }
    "#;
    for (line_num, col_range, token) in tokens_in(code.lines(), &tokenizer) {
        println!("On line {line_num} at columns {col_range:?}: {token:?}");
    }
}

// On line 1 at columns 8..10: If
// On line 1 at columns 11..14: Identifier("foo")
// On line 1 at columns 15..16: GT
// On line 1 at columns 17..20: Identifier("bar")
// On line 1 at columns 21..22: LCurly
// On line 2 at columns 12..15: Identifier("foo")
// On line 2 at columns 16..17: EQ
// On line 2 at columns 18..19: Int(1)
// On line 3 at columns 8..9: RCurly
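
Because tokens_in returns an ordinary Iterator, the usual combinators apply, which is the functional style mentioned in the design principles, and also a convenient place to handle bad tokens. Here is a small sketch that reuses code, Token, and tokenizer from the example above:

// Keep only the tokens, dropping the position info.
let tokens: Vec<Token> = tokens_in(code.lines(), &tokenizer)
    .map(|(_line, _cols, token)| token)
    .collect();
assert_eq!(tokens[0], Token::If);

// Or report anything the tokenizer didn't recognize, with its position.
for (line_num, col_range, token) in tokens_in(code.lines(), &tokenizer) {
    if let Token::Unrecognized(c) = token {
        eprintln!("Line {line_num}, columns {col_range:?}: unexpected {c:?}");
    }
}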

Structs

Lexer
The tokenizer you write will be passed a Lexer as an argument, and will call methods on it to figure out what the next lexeme is and, if needed, get its text. The Lexer keeps track of the column range of the lexeme.

Functions

tokens_in
Returns an Iterator over the tokens found in the specified lines, along with their line numbers and column ranges (both of which start at 0).
tokens_in_line
Returns an Iterator over the tokens found in the specified line, along with their column ranges. Column numbers start at 0.
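
Note that the lines passed to tokens_in need not come from str::lines; any Iterator of &str's should work. A small sketch, reusing the tokenizer from the example above:

let lines = ["if foo > bar {", "    foo = 1", "}"];
for (line_num, col_range, token) in tokens_in(lines.into_iter(), &tokenizer) {
    println!("On line {line_num} at columns {col_range:?}: {token:?}");
}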