§SciRS2 Text - Natural Language Processing
scirs2-text provides comprehensive text processing and NLP capabilities, offering tokenization, TF-IDF vectorization, word embeddings, sentiment analysis, topic modeling, and text classification with SIMD acceleration and parallel processing.
§🎯 Key Features
- Tokenization: Word, sentence, N-gram, BPE, regex tokenizers
- Vectorization: TF-IDF, count vectorizers, word embeddings
- Text Processing: Stemming, lemmatization, normalization, stopword removal
- Embeddings: Word2Vec (Skip-gram, CBOW), GloVe loading
- Similarity: Cosine, Jaccard, Levenshtein, phonetic algorithms
- NLP: Sentiment analysis, topic modeling (LDA), text classification
- Performance: SIMD operations, parallel processing, sparse matrices
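As a concrete illustration of two of the similarity metrics listed above, here is a dependency-free sketch. It is not the crate's implementation (which lives in `distance` with SIMD acceleration); it only shows what Levenshtein distance and Jaccard similarity compute.

```rust
use std::collections::HashSet;

/// Classic dynamic-programming edit distance over characters,
/// using a rolling row for O(min(|a|,|b|)) extra memory per step.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // min of substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Jaccard similarity over whitespace-separated token sets:
/// |intersection| / |union|.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let union = sa.union(&sb).count() as f64;
    if union == 0.0 {
        return 0.0;
    }
    sa.intersection(&sb).count() as f64 / union
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    // {the, quick, fox} vs {the, lazy, fox}: 2 shared of 4 total tokens
    let j = jaccard("the quick fox", "the lazy fox");
    assert!((j - 0.5).abs() < 1e-12);
    println!("ok");
}
```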
§📦 Module Overview
| SciRS2 Module | Python Equivalent | Description |
|---|---|---|
| tokenize | nltk.tokenize | Text tokenization utilities |
| vectorize | sklearn.feature_extraction.text.TfidfVectorizer | TF-IDF and count vectorization |
| embeddings | gensim.models.Word2Vec | Word embeddings (Word2Vec) |
| sentiment | nltk.sentiment | Sentiment analysis |
| topic_modeling | sklearn.decomposition.LatentDirichletAllocation | Topic modeling (LDA) |
| stemming | nltk.stem | Stemming and lemmatization |
§🚀 Quick Start
```toml
[dependencies]
scirs2-text = "0.1.0-rc.2"
```

```rust
use scirs2_text::{tokenize::WordTokenizer, vectorize::TfidfVectorizer, Tokenizer, Vectorizer};

// Tokenization
let tokenizer = WordTokenizer::default();
let tokens = tokenizer.tokenize("Hello, world!").unwrap();

// TF-IDF vectorization
let docs = vec!["Hello world", "Good morning world"];
let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
let matrix = vectorizer.fit_transform(&docs).unwrap();
```

§Version: 0.1.0-rc.2 (October 03, 2025)
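To make the TF-IDF step less of a black box, here is a rough, dependency-free sketch of smoothed TF-IDF with l2 normalization. The smoothing formula below (`ln((1 + n) / (1 + df)) + 1`, as in scikit-learn) is an assumption for illustration; the exact weighting used by `TfidfVectorizer` in scirs2-text may differ.

```rust
use std::collections::HashMap;

/// Compute per-document TF-IDF maps with l2 normalization.
/// Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (assumed for illustration).
fn tfidf_l2(docs: &[&str]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;
    let tokenized: Vec<Vec<String>> = docs
        .iter()
        .map(|d| d.split_whitespace().map(|t| t.to_lowercase()).collect())
        .collect();

    // Document frequency: in how many docs does each term appear?
    let mut df: HashMap<String, f64> = HashMap::new();
    for toks in &tokenized {
        let mut uniq: Vec<&String> = toks.iter().collect();
        uniq.sort();
        uniq.dedup();
        for t in uniq {
            *df.entry(t.clone()).or_insert(0.0) += 1.0;
        }
    }

    tokenized
        .iter()
        .map(|toks| {
            // Raw term frequency per document.
            let mut tf: HashMap<String, f64> = HashMap::new();
            for t in toks {
                *tf.entry(t.clone()).or_insert(0.0) += 1.0;
            }
            // Weight by smoothed IDF, then l2-normalize the row.
            let mut row: HashMap<String, f64> = tf
                .into_iter()
                .map(|(t, c)| {
                    let idf = ((1.0 + n) / (1.0 + df[&t])).ln() + 1.0;
                    (t, c * idf)
                })
                .collect();
            let norm = row.values().map(|v| v * v).sum::<f64>().sqrt();
            for v in row.values_mut() {
                *v /= norm;
            }
            row
        })
        .collect()
}

fn main() {
    let rows = tfidf_l2(&["hello world", "good morning world"]);
    // Every row has unit l2 norm after normalization.
    for row in &rows {
        let norm = row.values().map(|v| v * v).sum::<f64>().sqrt();
        assert!((norm - 1.0).abs() < 1e-9);
    }
    println!("ok");
}
```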
§Quick Start

```rust
use scirs2_text::{
    tokenize::WordTokenizer,
    vectorize::TfidfVectorizer,
    sentiment::LexiconSentimentAnalyzer,
    Tokenizer, Vectorizer,
};

// Basic tokenization
let tokenizer = WordTokenizer::default();
let tokens = tokenizer.tokenize("Hello, world! This is a test.").unwrap();

// TF-IDF vectorization
let documents = vec![
    "The quick brown fox jumps over the lazy dog",
    "A quick brown dog outpaces a quick fox",
    "The lazy dog sleeps all day",
];
let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
let matrix = vectorizer.fit_transform(&documents).unwrap();

// Sentiment analysis
let analyzer = LexiconSentimentAnalyzer::with_basiclexicon();
let sentiment = analyzer.analyze("I love this library!").unwrap();
println!("Sentiment: {:?}", sentiment.sentiment);
```

§Architecture
The module is organized into focused sub-modules:
- tokenize: Text tokenization utilities
- vectorize: Document vectorization and TF-IDF
- embeddings: Word embedding training and utilities
- sentiment: Sentiment analysis tools
- topic_modeling: Topic modeling with LDA
- string_metrics: String similarity and distance metrics
- preprocess: Text cleaning and normalization
- stemming: Stemming and lemmatization
- parallel: Parallel processing utilities
- simd_ops: SIMD-accelerated operations
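The quick-start example above scores sentiment against a lexicon. As a rough, dependency-free illustration of how lexicon-based sentiment analysis works (the tiny lexicon below is invented for the example and is not the crate's built-in one):

```rust
use std::collections::HashMap;

/// Toy lexicon-based polarity: sum the polarity of every word found
/// in the lookup table; unknown words contribute zero.
fn polarity(text: &str, lexicon: &HashMap<&str, i32>) -> i32 {
    text.split_whitespace()
        // strip surrounding punctuation and lowercase before lookup
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .map(|w| lexicon.get(w.as_str()).copied().unwrap_or(0))
        .sum()
}

fn main() {
    // Hypothetical mini-lexicon, for illustration only.
    let lexicon: HashMap<&str, i32> =
        [("love", 2), ("great", 1), ("hate", -2), ("bad", -1)].into();
    assert!(polarity("I love this great library!", &lexicon) > 0);
    assert!(polarity("I hate this bad parser", &lexicon) < 0);
    println!("ok");
}
```

Real analyzers layer negation handling, intensifiers, and larger lexicons on top of this core lookup-and-sum idea.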
§Performance
SciRS2 Text is designed for high performance:
- SIMD acceleration for string operations
- Parallel processing for large document collections
- Memory-efficient sparse matrix representations
- Zero-copy string processing where possible
- Optimized algorithms with complexity guarantees
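To see why sparse representations matter for bag-of-words data, consider that a document touches only a handful of terms out of a vocabulary of tens of thousands. A minimal sketch of a sparse vector and its dot product (not the crate's `SparseVector`, just the underlying idea):

```rust
/// Sparse vector in coordinate form: only nonzero entries are stored.
struct SparseVec {
    indices: Vec<usize>, // sorted positions of nonzeros
    values: Vec<f64>,    // values at those positions
}

impl SparseVec {
    /// Dot product via a two-pointer merge over the sorted index lists:
    /// O(nnz_a + nnz_b) instead of O(vocabulary size).
    fn dot(&self, other: &SparseVec) -> f64 {
        let (mut i, mut j, mut acc) = (0, 0, 0.0);
        while i < self.indices.len() && j < other.indices.len() {
            match self.indices[i].cmp(&other.indices[j]) {
                std::cmp::Ordering::Less => i += 1,
                std::cmp::Ordering::Greater => j += 1,
                std::cmp::Ordering::Equal => {
                    acc += self.values[i] * other.values[j];
                    i += 1;
                    j += 1;
                }
            }
        }
        acc
    }
}

fn main() {
    // A 40_001-dimensional vector stored as three entries.
    let a = SparseVec { indices: vec![3, 17, 40_000], values: vec![1.0, 2.0, 0.5] };
    let b = SparseVec { indices: vec![17, 99], values: vec![4.0, 1.0] };
    assert_eq!(a.dot(&b), 8.0); // only index 17 overlaps: 2.0 * 4.0
    println!("ok");
}
```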
Re-exports§

```rust
pub use classification::{TextClassificationMetrics, TextClassificationPipeline, TextDataset, TextFeatureSelector};
pub use cleansing::{expand_contractions, normalize_currencies, normalize_numbers, normalize_ordinals, normalize_percentages, normalize_unicode, normalize_whitespace, remove_accents, replace_emails, replace_urls, AdvancedTextCleaner};
pub use distance::{cosine_similarity, jaccard_similarity, levenshtein_distance};
pub use domain_processors::{Domain, DomainProcessorConfig, FinancialTextProcessor, LegalTextProcessor, MedicalTextProcessor, NewsTextProcessor, PatentTextProcessor, ProcessedDomainText, ScientificTextProcessor, SocialMediaTextProcessor, UnifiedDomainProcessor};
pub use embeddings::{Word2Vec, Word2VecAlgorithm, Word2VecConfig};
pub use enhanced_vectorize::{EnhancedCountVectorizer, EnhancedTfidfVectorizer};
pub use error::{Result, TextError};
pub use huggingface_compat::{ClassificationResult, FeatureExtractionPipeline, FillMaskPipeline, FillMaskResult, FormatConverter, HfConfig, HfEncodedInput, HfHub, HfModelAdapter, HfPipeline, HfTokenizer, HfTokenizerConfig, QuestionAnsweringPipeline, QuestionAnsweringResult, TextClassificationPipeline as HfTextClassificationPipeline, ZeroShotClassificationPipeline};
pub use information_extraction::{AdvancedExtractedInformation, AdvancedExtractionPipeline, ConfidenceScorer, CoreferenceChain, CoreferenceMention, CoreferenceResolver, DocumentInformationExtractor, DocumentSummary, Entity, EntityCluster, EntityLinker, EntityType, Event, ExtractedInformation, InformationExtractionPipeline, KeyPhraseExtractor, KnowledgeBaseEntry, LinkedEntity, MentionType, PatternExtractor, Relation, RelationExtractor, RuleBasedNER, StructuredDocumentInformation, TemporalExtractor, Topic};
pub use ml_integration::{BatchTextProcessor, FeatureExtractionMode, MLTextPreprocessor, TextFeatures, TextMLPipeline};
pub use ml_sentiment::{ClassMetrics, EvaluationMetrics, MLSentimentAnalyzer, MLSentimentConfig, TrainingMetrics};
pub use model_registry::{ModelMetadata, ModelRegistry, ModelType, PrebuiltModels, RegistrableModel, SerializableModelData};
pub use multilingual::{Language, LanguageDetectionResult, LanguageDetector, MultilingualProcessor, ProcessedText, StopWords};
pub use neural_architectures::{ActivationFunction, AdditiveAttention, BiLSTM, CNNLSTMHybrid, Conv1D, CrossAttention, Dropout, GRUCell, LSTMCell, LayerNorm as NeuralLayerNorm, MaxPool1D, MultiHeadAttention as NeuralMultiHeadAttention, MultiScaleCNN, PositionwiseFeedForward, ResidualBlock1D, SelfAttention, TextCNN};
pub use parallel::{ParallelCorpusProcessor, ParallelTextProcessor, ParallelTokenizer, ParallelVectorizer};
pub use performance::{AdvancedPerformanceMonitor, DetailedPerformanceReport, OptimizationRecommendation, PerformanceSummary, PerformanceThresholds};
pub use pos_tagging::{PosAwareLemmatizer, PosTagResult, PosTagger, PosTaggerConfig, PosTaggingResult};
pub use preprocess::{BasicNormalizer, BasicTextCleaner, TextCleaner, TextNormalizer};
pub use semantic_similarity::{LcsSimilarity, SemanticSimilarityEnsemble, SoftCosineSimilarity, WeightedJaccard, WordMoversDistance};
pub use sentiment::{LexiconSentimentAnalyzer, RuleBasedSentimentAnalyzer, Sentiment, SentimentLexicon, SentimentResult, SentimentRules, SentimentWordCounts};
pub use simd_ops::{AdvancedSIMDTextProcessor, SimdEditDistance, SimdStringOps, SimdTextAnalyzer, TextProcessingResult};
pub use sparse::{CsrMatrix, DokMatrix, SparseMatrixBuilder, SparseVector};
pub use sparse_vectorize::{sparse_cosine_similarity, MemoryStats, SparseCountVectorizer, SparseTfidfVectorizer};
pub use spelling::{DictionaryCorrector, DictionaryCorrectorConfig, EditOp, ErrorModel, NGramModel, SpellingCorrector, StatisticalCorrector, StatisticalCorrectorConfig};
pub use stemming::{LancasterStemmer, LemmatizerConfig, PorterStemmer, PosTag, RuleLemmatizer, RuleLemmatizerBuilder, SimpleLemmatizer, SnowballStemmer, Stemmer};
pub use streaming::{AdvancedStreamingMetrics, AdvancedStreamingProcessor, ChunkedCorpusReader, MemoryMappedCorpus, ProgressTracker, StreamingTextProcessor, StreamingVectorizer};
pub use string_metrics::{AlignmentResult, DamerauLevenshteinMetric, Metaphone, NeedlemanWunsch, Nysiis, PhoneticAlgorithm, SmithWaterman, Soundex, StringMetric};
pub use summarization::{CentroidSummarizer, KeywordExtractor, TextRank};
pub use text_coordinator::{AdvancedBatchClassificationResult, AdvancedSemanticSimilarityResult, AdvancedTextConfig, AdvancedTextCoordinator, AdvancedTextResult, AdvancedTopicModelingResult};
pub use text_statistics::{ReadabilityMetrics, TextMetrics, TextStatistics};
pub use token_filter::{CompositeFilter, CustomFilter, FrequencyFilter, LengthFilter, RegexFilter, StopwordsFilter, TokenFilter};
pub use tokenize::bpe::{BpeConfig, BpeTokenizer, BpeVocabulary};
pub use tokenize::{CharacterTokenizer, NgramTokenizer, RegexTokenizer, SentenceTokenizer, Tokenizer, WhitespaceTokenizer, WordTokenizer};
pub use topic_coherence::{TopicCoherence, TopicDiversity};
pub use topic_modeling::{LatentDirichletAllocation, LdaBuilder, LdaConfig, LdaLearningMethod, Topic as LdaTopic};
pub use transformer::{FeedForward, LayerNorm, MultiHeadAttention, PositionalEncoding, TokenEmbedding, TransformerConfig, TransformerDecoder, TransformerDecoderLayer, TransformerEncoder, TransformerEncoderLayer, TransformerModel};
pub use vectorize::{CountVectorizer, TfidfVectorizer, Vectorizer};
pub use visualization::{AttentionVisualizer, Color, ColorScheme, EmbeddingVisualizer, SentimentVisualizer, TextAnalyticsDashboard, TopicVisualizer, VisualizationConfig, WordCloud};
pub use vocabulary::Vocabulary;
pub use weighted_distance::{DamerauLevenshteinWeights, LevenshteinWeights, WeightedDamerauLevenshtein, WeightedLevenshtein, WeightedStringMetric};
```
Modules§
- classification: Text classification functionality
- cleansing: Advanced text cleansing utilities
- distance: Text distance and similarity measures
- domain_processors: Domain-specific text processors for specialized fields
- embeddings: Word Embeddings Module
- enhanced_vectorize: Enhanced text vectorization with n-gram support
- error: Error types for the text processing module
- huggingface_compat: Hugging Face compatibility layer for interoperability
- information_extraction: Information extraction utilities for structured data extraction from text
- ml_integration: Integration with machine learning modules
- ml_sentiment: Machine learning based sentiment analysis
- model_registry: Pre-trained model registry for managing and loading text processing models
- multilingual: Multilingual text processing and language detection
- neural_architectures: Advanced neural architectures for text processing
- parallel: Parallel processing utilities for text
- performance: Advanced Performance Monitoring and Optimization
- pos_tagging: Part-of-Speech (POS) tagging for English text
- preprocess: Text preprocessing utilities
- semantic_similarity: Advanced semantic similarity measures for text analysis
- sentiment: Sentiment analysis functionality
- simd_ops: SIMD-accelerated string operations for text processing
- sparse: Sparse matrix representations for memory-efficient text processing
- sparse_vectorize: Sparse vectorization for memory-efficient text representation
- spelling: Spelling correction algorithms
- stemming: Text stemming algorithms
- streaming: Memory-efficient streaming and memory-mapped text processing
- string_metrics: String metrics module for distance calculations and phonetic algorithms
- summarization: Text summarization module
- text_coordinator: Advanced Text Processing Coordinator
- text_statistics: Text statistics module for readability and text complexity metrics
- token_filter: Token filtering functionality
- tokenize: Text tokenization utilities
- topic_coherence: Topic coherence metrics for evaluating topic models
- topic_modeling: Topic Modeling Module
- transformer: Transformer Architecture Module
- utils: Utility functions for text processing
- vectorize: Text vectorization utilities
- visualization: Visualization tools for text processing and analysis
- vocabulary: Vocabulary management for text processing
- weighted_distance: Weighted distance metrics for string comparison