Crate tokenise

§Tokenise

A flexible lexical analyser (tokeniser) for parsing text into configurable token types.

tokenise allows you to split text into tokens based on customisable rules for special characters, delimiters, and comments. It’s designed to be flexible enough to handle various syntax styles while remaining simple to configure.

§Basic Usage

The following example demonstrates how to configure a tokeniser with common syntax elements and process a simple code snippet:

use tokenise::Tokeniser;
 
fn main() {
    // Create a new tokeniser
    let mut tokeniser = Tokeniser::new();
     
    // Configure tokeniser with rules
    tokeniser.add_specials(".,;:!?");
    tokeniser.add_delimiter_pairs(&vec!["()", "[]", "{}"]).unwrap();
    tokeniser.add_balanced_delimiter("\"").unwrap();
    tokeniser.set_sl_comment("//").unwrap();
    tokeniser.set_ml_comment("/*", "*/").unwrap();
     
    // Tokenise some source text
    let source = "let x = 42; // The answer\nprint(\"Hello world!\");";
    let tokens = tokeniser.tokenise(source).unwrap();
     
    // Work with the resulting tokens
    for token in tokens {
        println!("{:?}: '{}'", token.get_state(), token.value());
    }
}

§Features

  • Unicode support (using grapheme clusters)
  • Configurable special characters and delimiters
  • Support for paired delimiters (e.g., parentheses, brackets)
  • Support for balanced delimiters (e.g., quotation marks)
  • Single-line and multi-line comment handling
  • Whitespace and newline preservation
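
The difference between paired and balanced delimiters can be illustrated with a small standalone sketch. It does not use the crate: `DelimSide` and `side_of` are hypothetical names invented for this example, and it assumes that each pair is written as a two-character string such as "()" (as in the basic usage example) and that balanced delimiters are single characters.

```rust
/// Hypothetical classification of a delimiter character (not the crate's own type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DelimSide {
    Left,
    Right,
    Balanced,
}

/// Returns the side of `c`, or `None` if `c` is not a configured delimiter.
/// `pairs` holds two-character strings (opener then closer); `balanced` holds
/// characters that act as both opener and closer, such as a double quote.
fn side_of(c: char, pairs: &[&str], balanced: &str) -> Option<DelimSide> {
    if balanced.contains(c) {
        return Some(DelimSide::Balanced);
    }
    for pair in pairs {
        let mut chars = pair.chars();
        // Assumption: every pair is exactly two characters long.
        if let (Some(left), Some(right)) = (chars.next(), chars.next()) {
            if c == left {
                return Some(DelimSide::Left);
            }
            if c == right {
                return Some(DelimSide::Right);
            }
        }
    }
    None
}

fn main() {
    let pairs = ["()", "[]", "{}"];
    for c in "(]\"x".chars() {
        println!("{:?} -> {:?}", c, side_of(c, &pairs, "\""));
    }
}
```

A paired delimiter has a distinct opener and closer, so its side is known from the character alone. A balanced delimiter uses the same character for both roles, so whether it opens or closes depends on context.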

§Token Types

The tokeniser recognises several token types represented by the TokenState enum:

  • Word: Non-special character sequences (anything not identified as a special character or whitespace)
  • LDelimiter/RDelimiter: Left/right delimiters of a pair (e.g., ‘(’, ‘)’)
  • BDelimiter: Balanced delimiters (e.g., quotation marks)
  • SymbolString: Special characters
  • NewLine: Line breaks
  • WhiteSpace: Spaces, tabs, etc.
  • SLComment: Single-line comments
  • MLComment: Multi-line comments

More precise definitions can be found in the documentation for each TokenState variant.
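
To make the categories above more concrete, here is a minimal standalone sketch of how text might be split into words, symbols, whitespace, and newlines. It is not the crate's implementation: `Kind`, `classify`, and `split_runs` are hypothetical names, only four of the categories are modelled, and it works on `char` values rather than grapheme clusters so that it needs nothing beyond the standard library. Grouping adjacent special characters into one token is also an assumption of the sketch.

```rust
/// Hypothetical token kinds, loosely mirroring a subset of the categories listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Word,
    SymbolString,
    WhiteSpace,
    NewLine,
}

/// Classify one character, given the characters registered as special.
fn classify(c: char, specials: &str) -> Kind {
    if c == '\n' {
        Kind::NewLine
    } else if c.is_whitespace() {
        Kind::WhiteSpace
    } else if specials.contains(c) {
        Kind::SymbolString
    } else {
        Kind::Word
    }
}

/// Split `text` into (kind, slice) pairs by grouping adjacent characters of the same kind.
fn split_runs<'a>(text: &'a str, specials: &str) -> Vec<(Kind, &'a str)> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut current: Option<Kind> = None;
    for (i, c) in text.char_indices() {
        let kind = classify(c, specials);
        match current {
            // Same kind as the run in progress: keep extending it.
            // Newlines are never merged, so each line break is its own token.
            Some(k) if k == kind && kind != Kind::NewLine => {}
            // Kind changed: close the previous run and start a new one here.
            Some(k) => {
                out.push((k, &text[start..i]));
                start = i;
                current = Some(kind);
            }
            None => current = Some(kind),
        }
    }
    // Flush the final run, if the input was non-empty.
    if let Some(k) = current {
        out.push((k, &text[start..]));
    }
    out
}

fn main() {
    for (kind, value) in split_runs("let x = 42;\n", ";=") {
        println!("{:?}: {:?}", kind, value);
    }
}
```

With ";" and "=" registered as special, the input "let x = 42;\n" yields nine runs in this sketch: three words, three single spaces, two symbols, and a trailing newline.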

Structs§

Token
Represents a token extracted from the source text during tokenisation.
Tokeniser
A configurable tokeniser for parsing text into meaningful tokens.

Enums§

Side
Represents the categorisation of delimiters into left, right, or balanced types.
TokenState
Represents the type of a token in the tokenisation process.

Functions§

is_grapheme
Checks if a string is exactly one grapheme cluster (user-perceived character).
is_whitespace
Checks if a string consists entirely of whitespace.
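
As a rough illustration of what these helpers are concerned with, the sketch below shows a standard-library approximation of a whitespace check. `all_whitespace` is a hypothetical stand-in, not the crate's `is_whitespace`, and treating the empty string as "not whitespace" is an assumption made here rather than documented behaviour. A real grapheme-cluster check needs Unicode segmentation data that the standard library does not provide, so the example only shows why counting `char` values is not a substitute for it.

```rust
/// Hypothetical stand-in: true when `s` is non-empty and every character
/// is Unicode whitespace according to `char::is_whitespace`.
fn all_whitespace(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_whitespace)
}

fn main() {
    println!("{}", all_whitespace(" \t\n")); // prints "true"
    println!("{}", all_whitespace(" a "));   // prints "false"

    // "e" followed by a combining acute accent is a single user-perceived
    // character, yet it consists of two `char` values.
    let accented = "e\u{301}";
    println!("{}", accented.chars().count()); // prints "2"
}
```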