Text processing and natural language processing transformations
This module provides comprehensive text processing capabilities including tokenization, normalization, stemming, and linguistic transformations. Designed for preprocessing text data in machine learning pipelines.
§Features
- Basic text processing: Case conversion, whitespace handling, punctuation removal
- Tokenization: Flexible text splitting with custom delimiters
- Stemming: Porter stemmer implementation for word normalization
- N-gram generation: Extract bigrams, trigrams, and custom n-grams
- Filtering: Length-based and stopword filtering
- Pattern replacement: Simple string pattern substitution
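The features above compose into a typical preprocessing pipeline: lowercase, tokenize, then derive n-grams. A minimal sketch using only the standard library (the crate's actual builder APIs, such as `tokenize_whitespace` and `bigrams`, may differ in signature):

```rust
/// Lowercase text and split it on whitespace, mirroring what the
/// `ToLowercase` and `Tokenize` transforms describe.
fn tokenize_lower(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split_whitespace()
        .map(|t| t.to_string())
        .collect()
}

/// Join each window of `n` consecutive tokens into one n-gram,
/// mirroring what `NGramGenerator` describes.
fn ngrams(tokens: &[String], n: usize) -> Vec<String> {
    tokens.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    let tokens = tokenize_lower("The quick brown Fox");
    assert_eq!(tokens, vec!["the", "quick", "brown", "fox"]);

    let bigrams = ngrams(&tokens, 2);
    assert_eq!(bigrams, vec!["the quick", "quick brown", "brown fox"]);
    println!("{:?}", bigrams);
}
```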
Structs§
- ChangeCase - Convert text case according to a specified mode
- CollapseWhitespace - Collapse multiple consecutive whitespace characters into single spaces
- FilterByLength - Filter tokens by length constraints
- NGramGenerator - Generate n-grams from a sequence of tokens
- PorterStemmer - Porter stemmer implementation for English word stemming
- RemoveNumbers - Remove numeric digits from text
- RemovePunctuation - Remove ASCII punctuation from text
- RemoveStopwords - Remove stopwords from a list of tokens
- ReplacePattern - Replace string patterns with replacements
- ToLowercase - Convert text to lowercase
- Tokenize - Tokenize text into words using a specified delimiter
- TrimWhitespace - Trim whitespace from the beginning and end of text
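The whitespace and punctuation transforms listed above can be sketched with the standard library alone (the structs themselves presumably expose a transform method, which is not shown on this page):

```rust
/// Collapse runs of whitespace into single spaces, as `CollapseWhitespace`
/// describes. Note: this sketch also trims leading/trailing whitespace,
/// which the actual struct may or may not do.
fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Strip ASCII punctuation characters, as `RemovePunctuation` describes.
fn remove_punctuation(text: &str) -> String {
    text.chars().filter(|c| !c.is_ascii_punctuation()).collect()
}

fn main() {
    assert_eq!(collapse_whitespace("a   b\t c"), "a b c");
    assert_eq!(remove_punctuation("hello, world!"), "hello world");
    println!("ok");
}
```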
Enums§
- CaseMode - Text case conversion modes
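A mode enum like `CaseMode` typically drives a match inside the conversion transform. The variant names below are assumptions for illustration only; this page does not list the crate's actual variants:

```rust
/// Hypothetical case-conversion modes; the real `CaseMode` variants
/// may differ.
enum CaseMode {
    Lower,
    Upper,
    Title,
}

/// Apply a case mode to text, as a `ChangeCase`-style transform might.
fn change_case(text: &str, mode: &CaseMode) -> String {
    match mode {
        CaseMode::Lower => text.to_lowercase(),
        CaseMode::Upper => text.to_uppercase(),
        // Title case: uppercase the first letter of each word,
        // lowercase the rest.
        CaseMode::Title => text
            .split_whitespace()
            .map(|w| {
                let mut cs = w.chars();
                match cs.next() {
                    Some(f) => f.to_uppercase().collect::<String>()
                        + &cs.as_str().to_lowercase(),
                    None => String::new(),
                }
            })
            .collect::<Vec<_>>()
            .join(" "),
    }
}

fn main() {
    assert_eq!(change_case("hello WORLD", &CaseMode::Title), "Hello World");
    assert_eq!(change_case("Hi", &CaseMode::Upper), "HI");
    println!("ok");
}
```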
Functions§
- bigrams - Create a bigram generator
- filter_by_length - Create a length filter
- porter_stemmer - Create a Porter stemmer
- remove_english_stopwords - Create an English stopword remover
- tokenize - Create a tokenizer with a custom delimiter
- tokenize_whitespace - Create a tokenizer that splits on whitespace
- trigrams - Create a trigram generator