§use-token
Composable tokenization primitives for RustUse.
use-token keeps tokenization explicit and small. It handles whitespace splitting, conservative
word tokenization, lightweight sentence boundary detection, and character spans without claiming
to be a full NLP parser.
§Included primitives
- tokenize_whitespace
- tokenize_words
- tokenize_sentences
- tokenize_chars
- token_count
§Example
```rust
use use_token::{token_count, tokenize_sentences, tokenize_words};

assert_eq!(token_count("Hello, world!"), 2);
assert_eq!(tokenize_words("don't stop").len(), 2);
assert_eq!(tokenize_sentences("One. Two!").len(), 2);
```
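The whitespace and character primitives can be checked the same way. A further sketch, assuming (per the function list below) that tokenize_whitespace yields one token per run of non-whitespace and tokenize_chars yields one token per Unicode scalar value:

```rust
use use_token::{tokenize_chars, tokenize_whitespace};

// Contiguous whitespace collapses to a single boundary.
assert_eq!(tokenize_whitespace("one  two\tthree").len(), 3);
// One token per Unicode scalar value, so "héllo" has five.
assert_eq!(tokenize_chars("héllo").len(), 5);
```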
Structs§
- Token - A token with its kind and byte span.
- TokenSpan - A byte span in the original input string.
- TokenizerOptions - Small configuration for future tokenizer extensions.
Enums§
- TokenKind - The category assigned to a token.
Functions§
- token_count - Counts conservative word tokens.
- tokenize_chars - Splits input into Unicode scalar values.
- tokenize_sentences - Extracts conservative sentence tokens.
- tokenize_whitespace - Splits input on contiguous whitespace.
- tokenize_words - Extracts conservative word tokens.
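A minimal sketch of how the structs relate, under stated assumptions: that tokenize_words returns Vec\<Token\>, that Token exposes public kind and span fields, and that TokenSpan carries byte offsets start and end. Beyond the type names, none of these details are confirmed by this page.

```rust
use use_token::{tokenize_words, Token, TokenSpan};

let input = "don't stop";
for token in tokenize_words(input) {
    // Hypothetical public fields; the real accessors may differ.
    let Token { kind, span, .. } = token;
    let TokenSpan { start, end, .. } = span;
    // A byte span indexes back into the original input string,
    // assuming TokenKind derives Debug.
    println!("{:?}: {:?}", kind, &input[start..end]);
}
```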