NLP primitives: string distance, n-grams, tokenization.
Lightweight text analysis utilities for data cleaning and LLM training pipelines. All functions are deterministic and allocation-conscious.
Functions
- ascii_lowercase - Convert a string to lowercase (ASCII-only, allocation-free for ASCII input).
- char_ngrams - Extract character-level n-grams with frequency counts.
- cosine_similarity - Cosine similarity between two term-frequency vectors.
- jaccard_ngram_similarity - Jaccard similarity between the character-level n-gram sets of two strings.
- levenshtein - Compute the Levenshtein edit distance between two strings.
- levenshtein_similarity - Normalized Levenshtein similarity in [0.0, 1.0].
- strip_punctuation - Remove ASCII punctuation from a string.
- term_frequency - Compute term frequency (TF) for each word in a string.
- tokenize_whitespace - Simple whitespace tokenizer. Returns token spans as (start, end) byte offsets.
- tokenize_words - Word-and-punctuation tokenizer. Splits on whitespace, then separates leading/trailing punctuation into their own tokens.
- word_ngrams - Extract word-level n-grams with frequency counts.
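The crate's implementations are not shown on this page. As an illustration of what `levenshtein` likely computes, here is a minimal sketch of the classic two-row dynamic-programming edit distance over Unicode scalar values; the function body is an assumption, not the crate's actual code:

```rust
// Sketch: Levenshtein edit distance with a rolling two-row DP table.
// Compares `char`s (Unicode scalar values), not raw bytes.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between a[..i] and b[..j] from the previous row.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1; // deleting i+1 chars of `a` to reach empty `b`
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost)      // substitution (or match)
                .min(prev[j + 1] + 1)           // deletion
                .min(curr[j] + 1);              // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    assert_eq!(levenshtein("", "abc"), 3);
}
```

The two-row table keeps memory at O(min-row) instead of O(n·m), which fits the module's "allocation-conscious" claim.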
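`jaccard_ngram_similarity` composes two of the listed primitives: build the character n-gram set of each string, then take |A ∩ B| / |A ∪ B|. A hedged sketch (the set-based formulation and the empty-input convention of returning 1.0 are assumptions about the crate's behavior):

```rust
use std::collections::HashSet;

// Sketch: the set of character-level n-grams of `s` (assumes n >= 1).
// A string shorter than n characters yields the empty set.
fn char_ngram_set(s: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = s.chars().collect();
    if n == 0 || chars.len() < n {
        return HashSet::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

// Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets
// are empty (assumption for the degenerate case).
fn jaccard_ngram_similarity(a: &str, b: &str, n: usize) -> f64 {
    let sa = char_ngram_set(a, n);
    let sb = char_ngram_set(b, n);
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

fn main() {
    // "night" bigrams {ni, ig, gh, ht} vs "nacht" bigrams {na, ac, ch, ht}:
    // one shared bigram out of seven distinct -> 1/7.
    let sim = jaccard_ngram_similarity("night", "nacht", 2);
    assert!((sim - 1.0 / 7.0).abs() < 1e-9);
}
```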
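`tokenize_whitespace` is described as returning (start, end) byte offsets rather than owned strings, which avoids per-token allocation. One allocation-light way to get those spans, sketched here as an assumption about the approach rather than the crate's code, recovers each subslice's offset from its pointer:

```rust
// Sketch: whitespace tokenizer returning (start, end) byte offsets
// into the original string, matching the contract described above.
fn tokenize_whitespace(s: &str) -> Vec<(usize, usize)> {
    s.split_whitespace()
        .map(|tok| {
            // `split_whitespace` yields subslices of `s`, so pointer
            // arithmetic gives the token's byte offset within `s`.
            let start = tok.as_ptr() as usize - s.as_ptr() as usize;
            (start, start + tok.len())
        })
        .collect()
}

fn main() {
    // Two spaces between the words: "world" starts at byte 7.
    assert_eq!(tokenize_whitespace("hello  world"), vec![(0, 5), (7, 12)]);
    assert!(tokenize_whitespace("   ").is_empty());
}
```

Byte offsets (not char indices) are the natural choice here, since Rust strings are indexed by byte and `&s[start..end]` recovers each token without copying.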
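`term_frequency` and `cosine_similarity` are meant to be used together: TF maps each word to its relative frequency, and cosine similarity compares two such maps as sparse vectors. A sketch under stated assumptions (whitespace word splitting, counts normalized by token total; the real crate may differ):

```rust
use std::collections::HashMap;

// Sketch: term frequency = count(word) / total tokens.
fn term_frequency(s: &str) -> HashMap<&str, f64> {
    let total = s.split_whitespace().count() as f64;
    let mut tf: HashMap<&str, f64> = HashMap::new();
    for w in s.split_whitespace() {
        *tf.entry(w).or_insert(0.0) += 1.0;
    }
    for v in tf.values_mut() {
        *v /= total; // total > 0 whenever the map is non-empty
    }
    tf
}

// Sketch: cosine similarity over sparse TF vectors; 0.0 if either
// vector is all-zero (assumed convention for the degenerate case).
fn cosine_similarity(a: &HashMap<&str, f64>, b: &HashMap<&str, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .filter_map(|(k, va)| b.get(k).map(|vb| va * vb))
        .sum();
    let na = a.values().map(|v| v * v).sum::<f64>().sqrt();
    let nb = b.values().map(|v| v * v).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    let t = term_frequency("a b a");
    assert!((t["a"] - 2.0 / 3.0).abs() < 1e-9);
    // A vector is always at cosine 1.0 with itself.
    assert!((cosine_similarity(&t, &t) - 1.0).abs() < 1e-9);
}
```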