§Tokenise
A flexible lexical analyser (tokeniser) for parsing text into configurable token types.
tokenise allows you to split text into tokens based on customisable rules for special characters,
delimiters, and comments. It’s designed to be flexible enough to handle various syntax styles
while remaining simple to configure.
§Basic Usage
The following example demonstrates how to configure a tokeniser with common syntax elements and process a simple code snippet:
use tokenise::Tokeniser;

fn main() {
    // Create a new tokeniser
    let mut tokeniser = Tokeniser::new();

    // Configure tokeniser with rules
    tokeniser.add_specials(".,;:!?");
    tokeniser.add_delimiter_pairs(&vec!["()", "[]", "{}"]).unwrap();
    tokeniser.add_balanced_delimiter("\"").unwrap();
    tokeniser.set_sl_comment("//").unwrap();
    tokeniser.set_ml_comment("/*", "*/").unwrap();

    // Tokenise some source text
    let source = "let x = 42; // The answer\nprint(\"Hello world!\");";
    let tokens = tokeniser.tokenise(source).unwrap();

    // Work with the resulting tokens
    for token in tokens {
        println!("{:?}: '{}'", token.get_state(), token.value());
    }
}
§Features
- Unicode support (using grapheme clusters)
- Configurable special characters and delimiters
- Support for paired delimiters (e.g., parentheses, brackets)
- Support for balanced delimiters (e.g., quotation marks)
- Single-line and multi-line comment handling
- Whitespace and newline preservation
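To make the difference between paired and balanced delimiters concrete, here is a minimal standalone sketch. It does not use the crate: DelimSide and delimiter_side are hypothetical names invented for this illustration, and the lookup is only one plausible way to classify a character. A paired delimiter has distinct opening and closing characters, while a balanced delimiter uses the same character on both sides.

```rust
/// Hypothetical local type for this sketch only.
/// The crate exposes its own `Side` enum, which may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DelimSide {
    Left,
    Right,
    Balanced,
}

/// Look up `c` among two-character pairs such as "()" and
/// single-character balanced delimiters such as '"'.
/// Returns `None` when `c` is not a delimiter at all.
fn delimiter_side(c: char, pairs: &[&str], balanced: &[char]) -> Option<DelimSide> {
    if balanced.contains(&c) {
        return Some(DelimSide::Balanced);
    }
    for pair in pairs {
        let mut chars = pair.chars();
        // Only consider entries that have at least two characters.
        if let (Some(left), Some(right)) = (chars.next(), chars.next()) {
            if c == left {
                return Some(DelimSide::Left);
            }
            if c == right {
                return Some(DelimSide::Right);
            }
        }
    }
    None
}

fn main() {
    let pairs = ["()", "[]", "{}"];
    let balanced = ['"'];
    for c in ['(', ']', '"', 'x'] {
        println!("{:?} -> {:?}", c, delimiter_side(c, &pairs, &balanced));
    }
}
```

The sketch works on char values for brevity; the crate itself is described as operating on grapheme clusters, so its handling of multi-codepoint delimiters may differ.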
§Token Types
The tokeniser recognises several token types represented by the TokenState enum:
- Word: Non-special character sequences (anything not identified as a special character or whitespace)
- LDelimiter/RDelimiter: Left/right delimiters of a pair (e.g., ‘(’, ‘)’)
- BDelimiter: Balanced delimiters (e.g., quotation marks)
- SymbolString: Special characters
- NewLine: Line breaks
- WhiteSpace: Spaces, tabs, etc.
- SLComment: Single-line comments
- MLComment: Multi-line comments
More precise definitions can be found in the documentation for each specific type.
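As a rough illustration of how text might fall into a few of these categories, the following standalone sketch classifies characters and groups them into runs. It is not the crate's implementation: Kind, classify, and toy_tokenise are hypothetical names, only four of the token types are modelled, and merging consecutive characters of the same kind into one token is an assumption made for this example.

```rust
/// Local stand-in for a few of the token types listed above.
/// This is NOT the crate's `TokenState`; it only borrows some of its names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Word,
    SymbolString,
    WhiteSpace,
    NewLine,
}

/// Classify one character, treating anything found in `specials` as a symbol.
fn classify(c: char, specials: &str) -> Kind {
    if c == '\n' {
        Kind::NewLine
    } else if c.is_whitespace() {
        Kind::WhiteSpace
    } else if specials.contains(c) {
        Kind::SymbolString
    } else {
        Kind::Word
    }
}

/// Group consecutive characters of the same kind into (kind, text) pairs.
/// Assumption: runs are merged, except that every newline is its own token.
fn toy_tokenise(source: &str, specials: &str) -> Vec<(Kind, String)> {
    let mut tokens: Vec<(Kind, String)> = Vec::new();
    for c in source.chars() {
        let kind = classify(c, specials);
        if let Some((last, text)) = tokens.last_mut() {
            if *last == kind && kind != Kind::NewLine {
                // Same kind as the previous character: extend the current run.
                text.push(c);
                continue;
            }
        }
        // Otherwise start a new token.
        tokens.push((kind, c.to_string()));
    }
    tokens
}

fn main() {
    for (kind, text) in toy_tokenise("x = 42;\ny", "=;") {
        println!("{:?}: {:?}", kind, text);
    }
}
```

Delimiters and comments are left out to keep the sketch short; handling them requires tracking state across characters, which is the part the crate's configuration methods are there to manage.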
Structs§
- Token
- Represents a token extracted from the source text during tokenisation.
- Tokeniser
- A configurable tokeniser for parsing text into meaningful tokens.
Enums§
- Side
- Represents the categorisation of delimiters into left, right, or balanced types.
- TokenState
- Represents the type of a token in the tokenisation process.
Functions§
- is_grapheme
- Checks if a string is exactly one grapheme cluster (user-perceived character).
- is_whitespace
- Checks if a string consists entirely of whitespace.
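For readers who want a feel for the whitespace check, a standard-library-only approximation might look like the sketch below. The name all_whitespace is hypothetical and this is not the crate's is_whitespace; in particular, returning false for an empty string is an assumption of this sketch, since the description above does not say how the crate treats that case.

```rust
/// Hypothetical stand-in, not the crate's `is_whitespace`.
/// Assumption: an empty string is reported as not being whitespace.
fn all_whitespace(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_whitespace)
}

fn main() {
    for s in [" ", " \t", "a ", ""] {
        println!("{:?} -> {}", s, all_whitespace(s));
    }
}
```

A grapheme-cluster check has no direct equivalent in the standard library, so it is not sketched here.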