Text processing and natural language processing transformations
This module provides comprehensive text processing capabilities including tokenization, normalization, stemming, and linguistic transformations. Designed for preprocessing text data in machine learning pipelines.
§Features
- Basic text processing: Case conversion, whitespace handling, punctuation removal
- Tokenization: Flexible text splitting with custom delimiters
- Stemming: Porter stemmer implementation for word normalization
- N-gram generation: Extract bigrams, trigrams, and custom n-grams
- Filtering: Length-based and stopword filtering
- Pattern replacement: Simple string pattern substitution
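The features above compose into a typical preprocessing pipeline: lowercase, tokenize, then derive n-grams. A minimal sketch using only the standard library (the crate's actual builder APIs, such as `tokenize_whitespace` and `bigrams`, may differ in signature):

```rust
/// Lowercase text and split it on whitespace, mirroring what the
/// `ToLowercase` and `Tokenize` transforms describe.
fn tokenize_lower(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split_whitespace()
        .map(|t| t.to_string())
        .collect()
}

/// Join each window of `n` consecutive tokens into one n-gram,
/// mirroring what `NGramGenerator` describes.
fn ngrams(tokens: &[String], n: usize) -> Vec<String> {
    tokens.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    let tokens = tokenize_lower("The quick brown Fox");
    assert_eq!(tokens, vec!["the", "quick", "brown", "fox"]);

    let bigrams = ngrams(&tokens, 2);
    assert_eq!(bigrams, vec!["the quick", "quick brown", "brown fox"]);
    println!("{:?}", bigrams);
}
```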
Structs§
- ChangeCase - Convert text case according to a specified mode
- CollapseWhitespace - Collapse multiple consecutive whitespace characters into single spaces
- FilterByLength - Filter tokens by length constraints
- NGramGenerator - Generate n-grams from a sequence of tokens
- PorterStemmer - Porter stemmer implementation for English word stemming
- RemoveNumbers - Remove numeric digits from text
- RemovePunctuation - Remove ASCII punctuation from text
- RemoveStopwords - Remove stopwords from a list of tokens
- ReplacePattern - Replace string patterns with replacements
- ToLowercase - Convert text to lowercase
- Tokenize - Tokenize text into words using a specified delimiter
- TrimWhitespace - Trim whitespace from the beginning and end of text
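The whitespace and punctuation transforms listed above can be sketched with the standard library alone (the structs themselves presumably expose a transform method, which is not shown on this page):

```rust
/// Collapse runs of whitespace into single spaces, as `CollapseWhitespace`
/// describes. Note: this sketch also trims leading/trailing whitespace,
/// which the actual struct may or may not do.
fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Strip ASCII punctuation characters, as `RemovePunctuation` describes.
fn remove_punctuation(text: &str) -> String {
    text.chars().filter(|c| !c.is_ascii_punctuation()).collect()
}

fn main() {
    assert_eq!(collapse_whitespace("a   b\t c"), "a b c");
    assert_eq!(remove_punctuation("hello, world!"), "hello world");
    println!("ok");
}
```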
Enums§
- CaseMode - Text case conversion modes
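A mode enum like `CaseMode` typically drives a match inside the conversion transform. The variant names below are assumptions for illustration only; this page does not list the crate's actual variants:

```rust
/// Hypothetical case-conversion modes; the real `CaseMode` variants
/// may differ.
enum CaseMode {
    Lower,
    Upper,
    Title,
}

/// Apply a case mode to text, as a `ChangeCase`-style transform might.
fn change_case(text: &str, mode: &CaseMode) -> String {
    match mode {
        CaseMode::Lower => text.to_lowercase(),
        CaseMode::Upper => text.to_uppercase(),
        // Title case: uppercase the first letter of each word,
        // lowercase the rest.
        CaseMode::Title => text
            .split_whitespace()
            .map(|w| {
                let mut cs = w.chars();
                match cs.next() {
                    Some(f) => f.to_uppercase().collect::<String>()
                        + &cs.as_str().to_lowercase(),
                    None => String::new(),
                }
            })
            .collect::<Vec<_>>()
            .join(" "),
    }
}

fn main() {
    assert_eq!(change_case("hello WORLD", &CaseMode::Title), "Hello World");
    assert_eq!(change_case("Hi", &CaseMode::Upper), "HI");
    println!("ok");
}
```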
Functions§
- bigrams - Create a bigram generator
- filter_by_length - Create a length filter
- porter_stemmer - Create a Porter stemmer
- remove_english_stopwords - Create an English stopword remover
- tokenize - Create a tokenizer with a custom delimiter
- tokenize_whitespace - Create a tokenizer that splits on whitespace
- trigrams - Create a trigram generator