
Crate use_token


§use-token

Composable tokenization primitives for RustUse.

use-token keeps tokenization explicit and small. It handles whitespace splitting, conservative word tokenization, lightweight sentence boundaries, and character spans without claiming to be a full NLP parser.

§Included primitives

  • tokenize_whitespace
  • tokenize_words
  • tokenize_sentences
  • tokenize_chars
  • token_count

§Example

use use_token::{token_count, tokenize_sentences, tokenize_words};

assert_eq!(token_count("Hello, world!"), 2);
assert_eq!(tokenize_words("don't stop").len(), 2);
assert_eq!(tokenize_sentences("One. Two!").len(), 2);
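The two primitives not shown above follow the same pattern. A minimal sketch of their likely behavior using only std, assuming `tokenize_whitespace` returns string slices and `tokenize_chars` returns Unicode scalar values (the crate's actual signatures may differ):

```rust
// Hedged sketch: approximations of the two remaining primitives.
// The crate's actual implementations and return types may differ.

// Splits input on contiguous whitespace, like the docs describe.
fn tokenize_whitespace(input: &str) -> Vec<&str> {
    input.split_whitespace().collect()
}

// Splits input into Unicode scalar values (Rust `char`s).
fn tokenize_chars(input: &str) -> Vec<char> {
    input.chars().collect()
}

fn main() {
    // Contiguous whitespace collapses to a single split point.
    assert_eq!(tokenize_whitespace("a  b\tc"), vec!["a", "b", "c"]);
    // "héllo" is five scalar values, not five bytes.
    assert_eq!(tokenize_chars("héllo").len(), 5);
}
```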

Structs§

Token
A token with its kind and byte span.
TokenSpan
A byte span in the original input string.
TokenizerOptions
Small configuration for future tokenizer extensions.

Enums§

TokenKind
The category assigned to a token.
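The types above suggest a simple span-based design. A hypothetical sketch of how these shapes could fit together, inferred from the one-line descriptions (field names and variants here are assumptions, not the crate's actual definitions):

```rust
// Hypothetical shapes inferred from the docs; the crate's real
// definitions may use different fields and variants.

// The category assigned to a token (variant names are assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Word,
    Punctuation,
    Whitespace,
}

// A byte span in the original input string.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TokenSpan {
    start: usize,
    end: usize,
}

// A token with its kind and byte span.
struct Token {
    kind: TokenKind,
    span: TokenSpan,
}

fn main() {
    let input = "Hello, world";
    let token = Token {
        kind: TokenKind::Word,
        span: TokenSpan { start: 0, end: 5 },
    };
    // A byte span lets you recover the token text from the original input.
    assert_eq!(&input[token.span.start..token.span.end], "Hello");
}
```

Byte spans (rather than owned substrings) keep tokens cheap to produce and let callers slice the original input lazily.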

Functions§

token_count
Counts conservative word tokens.
tokenize_chars
Splits input into Unicode scalar values.
tokenize_sentences
Extracts conservative sentence tokens.
tokenize_whitespace
Splits input on contiguous whitespace.
tokenize_words
Extracts conservative word tokens.
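The example at the top implies the conservative word rule: `"Hello, world!"` counts as 2 tokens and `"don't"` as one word. A sketch of a rule consistent with those assertions, where a word is a maximal alphanumeric run and an apostrophe counts only between two alphanumerics (the crate's actual rule may differ):

```rust
// Hedged sketch of a conservative word count consistent with the
// crate's doc examples; not the crate's actual implementation.
fn token_count(input: &str) -> usize {
    let chars: Vec<char> = input.chars().collect();
    let mut count = 0;
    let mut in_word = false;
    for (i, &c) in chars.iter().enumerate() {
        // An apostrophe is part of a word only when flanked by
        // alphanumerics, so "don't" stays one token.
        let is_word_char = c.is_alphanumeric()
            || (c == '\''
                && i > 0
                && i + 1 < chars.len()
                && chars[i - 1].is_alphanumeric()
                && chars[i + 1].is_alphanumeric());
        if is_word_char && !in_word {
            count += 1;
        }
        in_word = is_word_char;
    }
    count
}

fn main() {
    assert_eq!(token_count("Hello, world!"), 2);
    assert_eq!(token_count("don't stop"), 2);
}
```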