Crate tokenise


§Tokenise

A flexible lexical analyser (tokeniser) for parsing text into configurable token types.

tokenise allows you to split text into tokens based on customisable rules for special characters, delimiters, and comments. It’s designed to be flexible enough to handle various syntax styles while remaining simple to configure.

§Basic Usage

The following example demonstrates how to configure a tokeniser with common syntax elements and process a simple code snippet:

use tokenise::Tokeniser;
 
fn main() {
    // Create a new tokeniser
    let mut tokeniser = Tokeniser::new();
     
    // Configure tokeniser with rules
    tokeniser.add_specials(".,;:!?");
    tokeniser.add_delimiter_pairs(&vec!["()", "[]", "{}"]).unwrap();
    tokeniser.add_balanced_delimiter("\"").unwrap();
    tokeniser.set_sl_comment("//").unwrap();
    tokeniser.set_ml_comment("/*", "*/").unwrap();
     
    // Tokenise some source text
    let source = "let x = 42; // The answer\nprint(\"Hello world!\");";
    let tokens = tokeniser.tokenise(source).unwrap();
     
    // Work with the resulting tokens
    for token in tokens {
        println!("{:?}: '{}'", token.get_state(), token.value());
    }
}

§Features

Unicode support (using grapheme clusters)
Configurable special characters and delimiters
Support for paired delimiters (e.g., parentheses, brackets)
Support for balanced delimiters (e.g., quotation marks)
Single-line and multi-line comment handling
Whitespace and newline preservation
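
To illustrate why grapheme-cluster handling matters, the standard-library-only sketch below compares two encodings of "é". It does not use tokenise at all; `scalar_count` is a hypothetical helper introduced here only for illustration. Both strings are one user-perceived character, yet they differ in byte length and `char` count, which is why splitting on bytes or `char`s could cut a character apart.

```rust
/// Hypothetical helper: counts Unicode scalar values (`char`s) in a string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let composed = "\u{e9}"; // "é" as a single code point
    let decomposed = "e\u{301}"; // "e" followed by a combining acute accent

    // Both render as one user-perceived character (one grapheme cluster),
    // but the byte and `char` counts differ between the two encodings.
    println!("composed:   {} bytes, {} chars", composed.len(), scalar_count(composed));
    println!("decomposed: {} bytes, {} chars", decomposed.len(), scalar_count(decomposed));
}
```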

§Token Types

The tokeniser recognises several token types represented by the TokenState enum:

Word: Sequences of characters that are neither special characters nor whitespace
LDelimiter/RDelimiter: Left/right delimiters of a pair (e.g., ‘(’, ‘)’)
BDelimiter: Balanced delimiters (e.g., quotation marks)
SymbolString: Special characters
NewLine: Line breaks
WhiteSpace: Spaces, tabs, etc.
SLComment: Single-line comments
MLComment: Multi-line comments

More precise definitions can be found in the documentation for each specific type.
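
The categories above can be pictured with a small standalone sketch. This is not the crate's implementation: `Kind`, `classify`, and `runs` are hypothetical names, the rule set is hard-coded, comments are ignored, and characters are processed as `char`s rather than grapheme clusters. It only assumes that adjacent characters of the same category are grouped, while delimiters and newlines stand alone.

```rust
/// Hypothetical mirror of some token categories (not the crate's `TokenState`).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Word,
    LDelimiter,
    RDelimiter,
    BDelimiter,
    SymbolString,
    NewLine,
    WhiteSpace,
}

/// Classify a single character against a small, fixed rule set.
fn classify(c: char) -> Kind {
    match c {
        '(' | '[' | '{' => Kind::LDelimiter,
        ')' | ']' | '}' => Kind::RDelimiter,
        '"' => Kind::BDelimiter,
        '\n' => Kind::NewLine,
        c if c.is_whitespace() => Kind::WhiteSpace,
        c if ".,;:!?".contains(c) => Kind::SymbolString,
        _ => Kind::Word,
    }
}

/// Group consecutive characters of the same kind into (kind, text) runs.
/// Delimiters and newlines are never merged with their neighbours.
fn runs(source: &str) -> Vec<(Kind, String)> {
    let mut out: Vec<(Kind, String)> = Vec::new();
    for c in source.chars() {
        let kind = classify(c);
        let standalone = matches!(
            kind,
            Kind::LDelimiter | Kind::RDelimiter | Kind::BDelimiter | Kind::NewLine
        );
        // Extend the previous run only if it has the same kind and may grow.
        let extend = !standalone && matches!(out.last(), Some((last, _)) if *last == kind);
        if extend {
            if let Some((_, text)) = out.last_mut() {
                text.push(c);
            }
        } else {
            out.push((kind, c.to_string()));
        }
    }
    out
}

fn main() {
    for (kind, text) in runs("print(x); ok") {
        println!("{:?}: {:?}", kind, text);
    }
}
```

A real tokeniser would additionally need to track comment state and quoted regions, which this sketch leaves out.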

Structs§

Token: Represents a token extracted from the source text during tokenisation.
Tokeniser: A configurable tokeniser for parsing text into meaningful tokens.

Enums§

Side: Represents the categorisation of delimiters into left, right, or balanced types.
TokenState: Represents the type of a token in the tokenisation process.

Functions§

is_grapheme: Checks if a string is exactly one grapheme cluster (user-perceived character).
is_whitespace: Checks if a string consists entirely of whitespace.
