Module nlp

NLP primitives: string distance, n-grams, tokenization.

Lightweight text analysis utilities for data cleaning and LLM training pipelines. All functions are deterministic and allocation-conscious.
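The distance primitives can be illustrated with a self-contained sketch. This is an independent re-implementation written for illustration, assuming the documented signatures take `&str` pairs; the crate's actual internals may differ.

```rust
// Illustrative sketch of `levenshtein` and `levenshtein_similarity`;
// an assumed re-implementation, not this crate's source.

/// Edit distance over Unicode scalar values, using a single
/// dynamic-programming row to stay allocation-conscious.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    // row[j] = distance between the processed prefix of `a` and b[..j]
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut prev = row[0]; // diagonal value D[i-1][j-1]
        row[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let up = row[j + 1]; // D[i-1][j]
            row[j + 1] = if ca == cb {
                prev // characters match: no extra cost
            } else {
                1 + prev.min(up).min(row[j]) // substitute / delete / insert
            };
            prev = up;
        }
    }
    *row.last().unwrap()
}

/// Distance normalized by the longer string's length, in [0.0, 1.0].
fn levenshtein_similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    println!("{:.4}", levenshtein_similarity("kitten", "sitting")); // 1 - 3/7
}
```

The single-row buffer keeps memory at O(min-side length) instead of a full distance matrix, which matches the module's allocation-conscious framing.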

Functions

ascii_lowercase
Convert a string to lowercase (ASCII-only, allocation-free for ASCII input).
char_ngrams
Extract character-level n-grams with frequency counts.
cosine_similarity
Cosine similarity between two term-frequency vectors.
jaccard_ngram_similarity
Jaccard similarity between the character-level n-gram sets of two strings.
levenshtein
Compute the Levenshtein edit distance between two strings.
levenshtein_similarity
Normalized Levenshtein similarity in [0.0, 1.0].
strip_punctuation
Remove ASCII punctuation from a string.
term_frequency
Compute term frequency (TF) for each word in a string.
tokenize_whitespace
Simple whitespace tokenizer. Returns token spans as (start, end) byte offsets.
tokenize_words
Word-and-punctuation tokenizer. Splits on whitespace, then separates leading/trailing punctuation into their own tokens.
word_ngrams
Extract word-level n-grams with frequency counts.
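The span-based tokenizer contract described above can be sketched as follows. This is an assumed re-implementation of `tokenize_whitespace` for illustration, not the crate's code:

```rust
// Illustrative sketch of `tokenize_whitespace`; an assumed
// re-implementation of the documented contract.

/// Split on Unicode whitespace, returning each token as a
/// (start, end) byte-offset pair into the original string.
fn tokenize_whitespace(s: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in s.char_indices() {
        if ch.is_whitespace() {
            // close the current token, if any
            if let Some(st) = start.take() {
                spans.push((st, i));
            }
        } else if start.is_none() {
            start = Some(i); // first byte of a new token
        }
    }
    if let Some(st) = start {
        spans.push((st, s.len())); // token running to end of input
    }
    spans
}

fn main() {
    let text = "hello  world";
    for (start, end) in tokenize_whitespace(text) {
        println!("{}..{} {:?}", start, end, &text[start..end]);
    }
}
```

Returning byte spans rather than owned `String`s keeps the tokenizer allocation-free and lets callers slice the original input lazily.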