Tokenise

A flexible lexical analyser (tokeniser) for parsing text into configurable token types.


Overview

tokenise allows you to split text into tokens based on customisable rules for special characters, delimiters, and comments. It's designed to be flexible enough to handle various syntax styles while remaining simple to configure.

Features

  • Unicode support (using grapheme clusters)
  • Configurable special characters and delimiters
  • Support for paired delimiters (e.g., parentheses, brackets)
  • Support for balanced delimiters (e.g., quotation marks)
  • Single-line and multi-line comment handling
  • Whitespace and newline preservation (see the round-trip sketch after the Basic Example)

Usage

Add this to your Cargo.toml:

[dependencies]
tokenise = "0.1.0"

Basic Example

use tokenise::{Tokeniser, TokenState};

fn main() {
    // Create a new tokeniser
    let mut tokeniser = Tokeniser::new();
    
    // Configure tokeniser with rules
    tokeniser.add_specials(".,;:!?");
    tokeniser.add_delimiter_pairs(&vec!["()", "[]", "{}"]).unwrap();
    tokeniser.add_balanced_delimiter("\"").unwrap();
    tokeniser.set_sl_comment("//").unwrap();
    tokeniser.set_ml_comment("/*", "*/").unwrap();
    
    // Tokenise some source text
    let source = "let x = 42; // The answer\nprint(\"Hello world!\");";
    let tokens = tokeniser.tokenise(source).unwrap();
    
    // Work with the resulting tokens
    for token in tokens {
        println!("{:?}: '{}'", token.get_state(), token.value());
    }
}
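
Round-trip Example

Because whitespace, newlines, and comments are all emitted as tokens, the token stream preserves the full input text. The sketch below illustrates this, assuming tokenisation is lossless (every grapheme of the input ends up in exactly one token); it uses only the methods shown in the Basic Example above.

use tokenise::Tokeniser;

fn main() {
    let mut tokeniser = Tokeniser::new();
    tokeniser.add_specials(".,;:!?");
    tokeniser.set_sl_comment("//").unwrap();

    let source = "hello, world! // a greeting\n";
    let tokens = tokeniser.tokenise(source).unwrap();

    // Whitespace, newlines, and comments are all kept as tokens,
    // so concatenating the token values should rebuild the input exactly.
    let rebuilt: String = tokens.iter().map(|t| t.value()).collect();
    assert_eq!(rebuilt, source);
}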

Token Types

The tokeniser recognises several token types represented by the TokenState enum:

  • Word: Non-special character sequences
  • LDelimiter/RDelimiter: Left/right delimiters of a pair (e.g., '(', ')')
  • BDelimiter: Balanced delimiters (e.g., quotation marks)
  • SymbolString: Special characters
  • NewLine: Line breaks
  • WhiteSpace: Spaces, tabs, etc.
  • SLComment: Single-line comments
  • MLComment: Multi-line comments
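
A common follow-up step is to drop layout and comment tokens before further processing. Below is a minimal sketch of such a filter, assuming get_state() can be pattern-matched against the TokenState variants listed above (if it returns a reference rather than a value, match ergonomics should still accept these patterns):

use tokenise::{Tokeniser, TokenState};

fn main() {
    let mut tokeniser = Tokeniser::new();
    tokeniser.add_specials(".,;:!?");
    tokeniser.set_sl_comment("//").unwrap();

    let tokens = tokeniser.tokenise("x = 1; // a note\n").unwrap();

    // Drop whitespace, newline, and comment tokens, keeping only
    // the tokens a parser would actually consume.
    let significant: Vec<_> = tokens
        .into_iter()
        .filter(|t| !matches!(
            t.get_state(),
            TokenState::WhiteSpace
                | TokenState::NewLine
                | TokenState::SLComment
                | TokenState::MLComment
        ))
        .collect();

    for token in &significant {
        println!("{:?}: '{}'", token.get_state(), token.value());
    }
}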

License

This project is licensed under the MIT License - see the LICENSE file for details.