Module tokenizer

Text tokenization utilities.

Zero-dependency tokenizer that splits text at whitespace and punctuation boundaries, normalizes to lowercase, and supports n-gram generation.
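The behavior described above can be sketched in plain Rust. This is a minimal illustration of the documented contract (whitespace/punctuation splitting plus lowercasing), not the crate's actual implementation; the function body here is an assumption:

```rust
/// Hypothetical sketch of whitespace/punctuation tokenization with
/// lowercase normalization, as described in the module docs. The real
/// implementation in this crate may differ.
fn sketch_tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric()) // split at any non-alphanumeric boundary
        .filter(|s| !s.is_empty())             // drop empty fragments between delimiters
        .map(|s| s.to_lowercase())             // normalize to lowercase
        .collect()
}

fn main() {
    let tokens = sketch_tokenize("Hello, World! 42");
    assert_eq!(tokens, vec!["hello", "world", "42"]);
    println!("{:?}", tokens);
}
```

Note that splitting on every non-alphanumeric character also breaks apart contractions and hyphenated words ("don't" becomes "don" and "t"), which is one reason real tokenizers often refine this rule.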

Functions

default_tokenize
Tokenize text into lowercase words, stripping punctuation.
ngrams
Generate n-grams from a list of tokens.
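A common way to generate n-grams from a token list is a sliding window, joining each run of `n` consecutive tokens. The sketch below assumes that behavior and that signature; the crate's actual `ngrams` function may differ:

```rust
/// Hypothetical sketch of n-gram generation over a token slice,
/// using a sliding window of size `n` (assumed signature, not
/// taken from the crate).
fn sketch_ngrams(tokens: &[String], n: usize) -> Vec<String> {
    if n == 0 || tokens.len() < n {
        return Vec::new(); // no complete window fits
    }
    tokens
        .windows(n)               // every contiguous run of `n` tokens
        .map(|w| w.join(" "))     // join each run into a single n-gram string
        .collect()
}

fn main() {
    let tokens: Vec<String> = ["the", "quick", "brown", "fox"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let bigrams = sketch_ngrams(&tokens, 2);
    assert_eq!(bigrams, vec!["the quick", "quick brown", "brown fox"]);
}
```

For a token list of length `m`, a window of size `n` yields `m - n + 1` n-grams, which is why the example's four tokens produce three bigrams.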