NLP primitives: string distance, n-grams, tokenization.
Lightweight text analysis utilities for data cleaning and LLM training pipelines. All functions are deterministic and allocation-conscious.
Functions
- ascii_lowercase - Convert a string to lowercase (ASCII-only, allocation-free for ASCII input).
- char_ngrams - Extract character-level n-grams with frequency counts.
- cosine_similarity - Cosine similarity between two term-frequency vectors.
- jaccard_ngram_similarity - Jaccard similarity between the character-level n-gram sets of two strings.
- levenshtein - Compute the Levenshtein edit distance between two strings.
- levenshtein_similarity - Normalized Levenshtein similarity in [0.0, 1.0].
- strip_punctuation - Remove ASCII punctuation from a string.
- term_frequency - Compute term frequency (TF) for each word in a string.
- tokenize_whitespace - Simple whitespace tokenizer. Returns token spans as (start, end) byte offsets.
- tokenize_words - Word-and-punctuation tokenizer. Splits on whitespace, then separates leading/trailing punctuation into their own tokens.
- word_ngrams - Extract word-level n-grams with frequency counts.
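The crate's implementations are not shown on this page. As an illustration of what `levenshtein` likely computes, here is a minimal sketch of the classic two-row dynamic-programming edit distance over Unicode scalar values; the function body is an assumption, not the crate's actual code:

```rust
// Sketch: Levenshtein edit distance with a rolling two-row DP table.
// Compares `char`s (Unicode scalar values), not raw bytes.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between a[..i] and b[..j] from the previous row.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1; // deleting i+1 chars of `a` to reach empty `b`
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost)      // substitution (or match)
                .min(prev[j + 1] + 1)           // deletion
                .min(curr[j] + 1);              // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    assert_eq!(levenshtein("", "abc"), 3);
}
```

The two-row table keeps memory at O(min-row) instead of O(n·m), which fits the module's "allocation-conscious" claim.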
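`jaccard_ngram_similarity` composes two of the listed primitives: build the character n-gram set of each string, then take |A ∩ B| / |A ∪ B|. A hedged sketch (the set-based formulation and the empty-input convention of returning 1.0 are assumptions about the crate's behavior):

```rust
use std::collections::HashSet;

// Sketch: the set of character-level n-grams of `s` (assumes n >= 1).
// A string shorter than n characters yields the empty set.
fn char_ngram_set(s: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = s.chars().collect();
    if n == 0 || chars.len() < n {
        return HashSet::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

// Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets
// are empty (assumption for the degenerate case).
fn jaccard_ngram_similarity(a: &str, b: &str, n: usize) -> f64 {
    let sa = char_ngram_set(a, n);
    let sb = char_ngram_set(b, n);
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

fn main() {
    // "night" bigrams {ni, ig, gh, ht} vs "nacht" bigrams {na, ac, ch, ht}:
    // one shared bigram out of seven distinct -> 1/7.
    let sim = jaccard_ngram_similarity("night", "nacht", 2);
    assert!((sim - 1.0 / 7.0).abs() < 1e-9);
}
```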
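`tokenize_whitespace` is described as returning (start, end) byte offsets rather than owned strings, which avoids per-token allocation. One allocation-light way to get those spans, sketched here as an assumption about the approach rather than the crate's code, recovers each subslice's offset from its pointer:

```rust
// Sketch: whitespace tokenizer returning (start, end) byte offsets
// into the original string, matching the contract described above.
fn tokenize_whitespace(s: &str) -> Vec<(usize, usize)> {
    s.split_whitespace()
        .map(|tok| {
            // `split_whitespace` yields subslices of `s`, so pointer
            // arithmetic gives the token's byte offset within `s`.
            let start = tok.as_ptr() as usize - s.as_ptr() as usize;
            (start, start + tok.len())
        })
        .collect()
}

fn main() {
    // Two spaces between the words: "world" starts at byte 7.
    assert_eq!(tokenize_whitespace("hello  world"), vec![(0, 5), (7, 12)]);
    assert!(tokenize_whitespace("   ").is_empty());
}
```

Byte offsets (not char indices) are the natural choice here, since Rust strings are indexed by byte and `&s[start..end]` recovers each token without copying.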
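`term_frequency` and `cosine_similarity` are meant to be used together: TF maps each word to its relative frequency, and cosine similarity compares two such maps as sparse vectors. A sketch under stated assumptions (whitespace word splitting, counts normalized by token total; the real crate may differ):

```rust
use std::collections::HashMap;

// Sketch: term frequency = count(word) / total tokens.
fn term_frequency(s: &str) -> HashMap<&str, f64> {
    let total = s.split_whitespace().count() as f64;
    let mut tf: HashMap<&str, f64> = HashMap::new();
    for w in s.split_whitespace() {
        *tf.entry(w).or_insert(0.0) += 1.0;
    }
    for v in tf.values_mut() {
        *v /= total; // total > 0 whenever the map is non-empty
    }
    tf
}

// Sketch: cosine similarity over sparse TF vectors; 0.0 if either
// vector is all-zero (assumed convention for the degenerate case).
fn cosine_similarity(a: &HashMap<&str, f64>, b: &HashMap<&str, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .filter_map(|(k, va)| b.get(k).map(|vb| va * vb))
        .sum();
    let na = a.values().map(|v| v * v).sum::<f64>().sqrt();
    let nb = b.values().map(|v| v * v).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    let t = term_frequency("a b a");
    assert!((t["a"] - 2.0 / 3.0).abs() < 1e-9);
    // A vector is always at cosine 1.0 with itself.
    assert!((cosine_similarity(&t, &t) - 1.0).abs() < 1e-9);
}
```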