Module text_processing

Text processing and natural language processing transformations

This module provides text processing transforms for tokenization, normalization, stemming, and related linguistic operations. It is designed for preprocessing text data in machine learning pipelines.

§Features

  • Basic text processing: Case conversion, whitespace handling, punctuation removal
  • Tokenization: Flexible text splitting with custom delimiters
  • Stemming: Porter stemmer implementation for word normalization
  • N-gram generation: Extract bigrams, trigrams, and custom n-grams
  • Filtering: Length-based and stopword filtering
  • Pattern replacement: Simple string pattern substitution
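
The basic processing, tokenization, and filtering features above can be illustrated with a minimal standard-library sketch. This is not the module's API: the function names `normalize`, `tokenize_on`, and `filter_tokens` and their signatures are hypothetical, chosen only to show the kind of transformation each feature performs.

```rust
/// Hypothetical helper: lowercase the input, drop ASCII punctuation,
/// and collapse runs of whitespace into single spaces.
fn normalize(text: &str) -> String {
    let cleaned: String = text
        .to_lowercase()
        .chars()
        .filter(|c| !c.is_ascii_punctuation())
        .collect();
    // split_whitespace yields no empty pieces, so joining with one space
    // collapses internal runs and trims both ends.
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical helper: split on a custom delimiter, discarding empty tokens.
fn tokenize_on(text: &str, delimiter: char) -> Vec<String> {
    text.split(delimiter)
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Hypothetical helper: keep tokens whose character count lies in
/// `min..=max` and that do not appear in `stopwords`.
fn filter_tokens(
    tokens: Vec<String>,
    min: usize,
    max: usize,
    stopwords: &[&str],
) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|t| {
            let n = t.chars().count();
            n >= min && n <= max && !stopwords.contains(&t.as_str())
        })
        .collect()
}
```

Chaining the three helpers (normalize, then tokenize on a space, then filter) mirrors the usual preprocessing order: cleaning first keeps punctuation and case differences from producing spurious distinct tokens.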

Structs§

ChangeCase
Convert text case according to specified mode
CollapseWhitespace
Collapse multiple consecutive whitespace characters into single spaces
FilterByLength
Filter tokens by length constraints
NGramGenerator
Generate n-grams from a sequence of tokens
PorterStemmer
Porter stemmer implementation for English word stemming
RemoveNumbers
Remove numeric digits from text
RemovePunctuation
Remove ASCII punctuation from text
RemoveStopwords
Remove stopwords from a list of tokens
ReplacePattern
Replace occurrences of a string pattern with a replacement string
ToLowercase
Convert text to lowercase
Tokenize
Tokenize text into words using a specified delimiter
TrimWhitespace
Trim whitespace from beginning and end of text
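
To make the n-gram idea behind NGramGenerator concrete, here is a minimal sketch using only the standard library. The function `ngrams` and its choice to join each window with a single space are assumptions for illustration; the struct's actual interface and output format are not shown on this page.

```rust
/// Hypothetical helper: build n-grams from a token slice with a sliding
/// window, joining the tokens of each window with a single space.
fn ngrams(tokens: &[&str], n: usize) -> Vec<String> {
    // slice::windows panics on a window size of 0, so handle it up front.
    if n == 0 {
        return Vec::new();
    }
    // When n exceeds the number of tokens, windows yields nothing
    // and the result is empty.
    tokens.windows(n).map(|w| w.join(" ")).collect()
}
```

With this sketch, bigrams and trigrams are the `n = 2` and `n = 3` cases of the same sliding window.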

Enums§

CaseMode
Text case conversion modes

Functions§

bigrams
Create bigram generator
filter_by_length
Create length filter
porter_stemmer
Create Porter stemmer
remove_english_stopwords
Create English stopword remover
tokenize
Create a tokenizer with custom delimiter
tokenize_whitespace
Create a tokenizer that splits on whitespace
trigrams
Create trigram generator