Crate scirs2_text


§SciRS2 Text - Natural Language Processing

scirs2-text provides comprehensive text processing and NLP capabilities, offering tokenization, TF-IDF vectorization, word embeddings, sentiment analysis, topic modeling, and text classification with SIMD acceleration and parallel processing.

§🎯 Key Features

  • Tokenization: Word, sentence, N-gram, BPE, regex tokenizers
  • Vectorization: TF-IDF, count vectorizers, word embeddings
  • Text Processing: Stemming, lemmatization, normalization, stopword removal
  • Embeddings: Word2Vec (Skip-gram, CBOW), GloVe loading
  • Similarity: Cosine, Jaccard, Levenshtein, phonetic algorithms
  • NLP: Sentiment analysis, topic modeling (LDA), text classification
  • Performance: SIMD operations, parallel processing, sparse matrices
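The similarity measures named in the list above can be illustrated with a minimal standard-library sketch. The function names `jaccard_tokens`, `levenshtein`, and `cosine` are hypothetical and chosen for this example; they are not the crate's own implementations, which may differ in signature and behavior.

```rust
use std::collections::HashSet;

/// Jaccard similarity between the sets of whitespace-separated tokens.
fn jaccard_tokens(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0; // assumption: two empty inputs count as identical
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// Levenshtein edit distance over Unicode scalar values (two-row DP).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i-1 chars of a and first j chars of b
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Cosine similarity of two equal-length dense vectors; 0.0 if either is all zeros.
fn cosine(u: &[f64], v: &[f64]) -> f64 {
    let dot: f64 = u.iter().zip(v).map(|(x, y)| x * y).sum();
    let nu = u.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nv = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    if nu == 0.0 || nv == 0.0 {
        0.0
    } else {
        dot / (nu * nv)
    }
}
```

For instance, `levenshtein("kitten", "sitting")` evaluates to 3, and `jaccard_tokens("a b c", "b c d")` to 0.5 (two shared tokens out of four distinct ones).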

§📦 Module Overview

SciRS2 Module  | Python Equivalent                               | Description
tokenize       | nltk.tokenize                                   | Text tokenization utilities
vectorize      | sklearn.feature_extraction.text.TfidfVectorizer | TF-IDF and count vectorization
embeddings     | gensim.models.Word2Vec                          | Word embeddings (Word2Vec)
sentiment      | nltk.sentiment                                  | Sentiment analysis
topic_modeling | sklearn.decomposition.LatentDirichletAllocation | Topic modeling (LDA)
stemming       | nltk.stem                                       | Stemming and lemmatization
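To make the tokenization row concrete, here is a rough standard-library sketch of word and word n-gram tokenization. The helpers `word_tokens` and `word_ngrams` are hypothetical names for this illustration; the lowercasing and the split on non-alphanumeric characters are assumptions, not a description of how the crate's tokenizers behave.

```rust
/// Split text into lowercase word tokens on any non-alphanumeric character.
fn word_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty()) // drop empty pieces between adjacent separators
        .map(|t| t.to_lowercase())
        .collect()
}

/// Build word n-grams by joining each window of `n` consecutive tokens with a space.
fn word_ngrams(tokens: &[String], n: usize) -> Vec<String> {
    // `windows(0)` would panic, and fewer than `n` tokens yields no n-grams.
    if n == 0 || tokens.len() < n {
        return Vec::new();
    }
    tokens.windows(n).map(|w| w.join(" ")).collect()
}
```

Calling `word_ngrams(&word_tokens("The quick brown fox"), 2)` yields the bigrams "the quick", "quick brown", and "brown fox".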

§🚀 Quick Start

[dependencies]
scirs2-text = "0.1.0-rc.2"

use scirs2_text::{tokenize::WordTokenizer, vectorize::TfidfVectorizer, Tokenizer, Vectorizer};

// Tokenization
let tokenizer = WordTokenizer::default();
let tokens = tokenizer.tokenize("Hello, world!").unwrap();

// TF-IDF vectorization
let docs = vec!["Hello world", "Good morning world"];
let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
let matrix = vectorizer.fit_transform(&docs).unwrap();
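
For readers unfamiliar with what a TF-IDF matrix contains, the following standard-library sketch computes one by hand. It assumes whitespace tokenization, lowercasing, raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` used by scikit-learn, and L2 row normalization; whether `TfidfVectorizer` in this crate uses exactly these choices is not confirmed here. The function name `tfidf_sketch` is hypothetical.

```rust
use std::collections::BTreeSet;

/// Return the sorted vocabulary and one L2-normalized TF-IDF row per document.
fn tfidf_sketch(docs: &[&str]) -> (Vec<String>, Vec<Vec<f64>>) {
    let tokenized: Vec<Vec<String>> = docs
        .iter()
        .map(|d| d.split_whitespace().map(|t| t.to_lowercase()).collect())
        .collect();

    // A sorted vocabulary gives every term a stable column index.
    let vocab: Vec<String> = tokenized
        .iter()
        .flatten()
        .cloned()
        .collect::<BTreeSet<String>>()
        .into_iter()
        .collect();

    // Document frequency: number of documents that contain each term.
    let mut df = vec![0usize; vocab.len()];
    for doc in &tokenized {
        let unique: BTreeSet<&String> = doc.iter().collect();
        for t in unique {
            df[vocab.binary_search(t).expect("term is in vocabulary")] += 1;
        }
    }

    let n = docs.len() as f64;
    let rows: Vec<Vec<f64>> = tokenized
        .iter()
        .map(|doc| {
            let mut row = vec![0.0f64; vocab.len()];
            for t in doc {
                // Raw term count for this document.
                row[vocab.binary_search(t).expect("term is in vocabulary")] += 1.0;
            }
            for (j, x) in row.iter_mut().enumerate() {
                // Assumed smoothed IDF weighting.
                *x *= ((1.0 + n) / (1.0 + df[j] as f64)).ln() + 1.0;
            }
            // L2 normalization so every non-empty row has unit length.
            let norm = row.iter().map(|x| x * x).sum::<f64>().sqrt();
            if norm > 0.0 {
                for x in row.iter_mut() {
                    *x /= norm;
                }
            }
            row
        })
        .collect();

    (vocab, rows)
}
```

With the two documents from the snippet above, "world" appears in both and therefore receives a lower weight in the first row than "hello", which appears in only one.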

§🔒 Version: 0.1.0-rc.2 (October 03, 2025)

§Quick Start

use scirs2_text::{
    tokenize::WordTokenizer,
    vectorize::TfidfVectorizer,
    sentiment::LexiconSentimentAnalyzer,
    Tokenizer, Vectorizer
};

// Basic tokenization
let tokenizer = WordTokenizer::default();
let tokens = tokenizer.tokenize("Hello, world! This is a test.").unwrap();

// TF-IDF vectorization
let documents = vec![
    "The quick brown fox jumps over the lazy dog",
    "A quick brown dog outpaces a quick fox",
    "The lazy dog sleeps all day"
];
let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
let matrix = vectorizer.fit_transform(&documents).unwrap();

// Sentiment analysis
let analyzer = LexiconSentimentAnalyzer::with_basiclexicon();
let sentiment = analyzer.analyze("I love this library!").unwrap();
println!("Sentiment: {:?}", sentiment.sentiment);
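
As a rough picture of how lexicon-based sentiment scoring works in general, the sketch below sums per-word scores from a small dictionary and maps the total to a label. The `Polarity` enum, the `demo_lexicon` contents, and `lexicon_polarity` are all hypothetical and exist only for this illustration; the scoring rules of `LexiconSentimentAnalyzer` are not documented on this page and may differ.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Polarity {
    Positive,
    Negative,
    Neutral,
}

/// A tiny made-up lexicon; real lexicons contain thousands of scored words.
fn demo_lexicon() -> HashMap<&'static str, f64> {
    HashMap::from([("love", 1.0), ("great", 0.5), ("hate", -1.0), ("slow", -0.5)])
}

/// Sum the lexicon scores of all words in `text` and map the total to a label.
fn lexicon_polarity(text: &str, lexicon: &HashMap<&str, f64>) -> (f64, Polarity) {
    let score: f64 = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        // Words missing from the lexicon contribute nothing.
        .filter_map(|w| lexicon.get(w.as_str()).copied())
        .sum();
    let label = if score > 0.0 {
        Polarity::Positive
    } else if score < 0.0 {
        Polarity::Negative
    } else {
        Polarity::Neutral
    };
    (score, label)
}
```

With this lexicon, `lexicon_polarity("I love this library!", &demo_lexicon())` returns a score of 1.0 and `Polarity::Positive`, since "love" is the only scored word in the sentence.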

§Architecture

The crate is organized into focused sub-modules; each is listed with a one-line description under Modules below.

§Performance

SciRS2 Text is designed for high performance:

  • SIMD acceleration for string operations
  • Parallel processing for large document collections
  • Memory-efficient sparse matrix representations
  • Zero-copy string processing where possible
  • Optimized algorithms with complexity guarantees
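To show why sparse representations save memory for text data, where most document-term entries are zero, here is a minimal compressed sparse row (CSR) sketch. The `SketchCsr` type and its methods are hypothetical and written for this example; they are not the crate's `CsrMatrix` API.

```rust
/// Minimal CSR matrix: only non-zero entries are stored.
struct SketchCsr {
    n_cols: usize,
    /// Row r occupies positions indptr[r]..indptr[r + 1] of `indices` and `data`.
    indptr: Vec<usize>,
    /// Column index of each stored value.
    indices: Vec<usize>,
    /// The stored non-zero values.
    data: Vec<f64>,
}

impl SketchCsr {
    /// Build a CSR matrix from dense rows, skipping zeros.
    fn from_dense(rows: &[Vec<f64>]) -> Self {
        let n_cols = rows.first().map_or(0, |r| r.len());
        let mut indptr = vec![0];
        let mut indices = Vec::new();
        let mut data = Vec::new();
        for row in rows {
            for (j, &x) in row.iter().enumerate() {
                if x != 0.0 {
                    indices.push(j);
                    data.push(x);
                }
            }
            indptr.push(indices.len()); // end offset of this row
        }
        SketchCsr { n_cols, indptr, indices, data }
    }

    /// Number of stored (non-zero) entries.
    fn nnz(&self) -> usize {
        self.data.len()
    }

    /// Look up entry (r, c); entries that are not stored are zero.
    fn get(&self, r: usize, c: usize) -> f64 {
        assert!(c < self.n_cols, "column out of bounds");
        let (start, end) = (self.indptr[r], self.indptr[r + 1]);
        self.indices[start..end]
            .iter()
            .position(|&j| j == c)
            .map_or(0.0, |k| self.data[start + k])
    }
}
```

A 2x3 matrix with three non-zero entries stores three values, three column indices, and three row offsets, rather than all six cells; the saving grows with vocabulary size because each document uses only a small fraction of the terms.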

Re-exports§

pub use classification::TextClassificationMetrics;
pub use classification::TextClassificationPipeline;
pub use classification::TextDataset;
pub use classification::TextFeatureSelector;
pub use cleansing::expand_contractions;
pub use cleansing::normalize_currencies;
pub use cleansing::normalize_numbers;
pub use cleansing::normalize_ordinals;
pub use cleansing::normalize_percentages;
pub use cleansing::normalize_unicode;
pub use cleansing::normalize_whitespace;
pub use cleansing::remove_accents;
pub use cleansing::replace_emails;
pub use cleansing::replace_urls;
pub use cleansing::strip_html_tags;
pub use cleansing::AdvancedTextCleaner;
pub use distance::cosine_similarity;
pub use distance::jaccard_similarity;
pub use distance::levenshtein_distance;
pub use domain_processors::Domain;
pub use domain_processors::DomainProcessorConfig;
pub use domain_processors::FinancialTextProcessor;
pub use domain_processors::LegalTextProcessor;
pub use domain_processors::MedicalTextProcessor;
pub use domain_processors::NewsTextProcessor;
pub use domain_processors::PatentTextProcessor;
pub use domain_processors::ProcessedDomainText;
pub use domain_processors::ScientificTextProcessor;
pub use domain_processors::SocialMediaTextProcessor;
pub use domain_processors::UnifiedDomainProcessor;
pub use embeddings::Word2Vec;
pub use embeddings::Word2VecAlgorithm;
pub use embeddings::Word2VecConfig;
pub use enhanced_vectorize::EnhancedCountVectorizer;
pub use enhanced_vectorize::EnhancedTfidfVectorizer;
pub use error::Result;
pub use error::TextError;
pub use huggingface_compat::ClassificationResult;
pub use huggingface_compat::FeatureExtractionPipeline;
pub use huggingface_compat::FillMaskPipeline;
pub use huggingface_compat::FillMaskResult;
pub use huggingface_compat::FormatConverter;
pub use huggingface_compat::HfConfig;
pub use huggingface_compat::HfEncodedInput;
pub use huggingface_compat::HfHub;
pub use huggingface_compat::HfModelAdapter;
pub use huggingface_compat::HfPipeline;
pub use huggingface_compat::HfTokenizer;
pub use huggingface_compat::HfTokenizerConfig;
pub use huggingface_compat::QuestionAnsweringPipeline;
pub use huggingface_compat::QuestionAnsweringResult;
pub use huggingface_compat::TextClassificationPipeline as HfTextClassificationPipeline;
pub use huggingface_compat::ZeroShotClassificationPipeline;
pub use information_extraction::AdvancedExtractedInformation;
pub use information_extraction::AdvancedExtractionPipeline;
pub use information_extraction::ConfidenceScorer;
pub use information_extraction::CoreferenceChain;
pub use information_extraction::CoreferenceMention;
pub use information_extraction::CoreferenceResolver;
pub use information_extraction::DocumentInformationExtractor;
pub use information_extraction::DocumentSummary;
pub use information_extraction::Entity;
pub use information_extraction::EntityCluster;
pub use information_extraction::EntityLinker;
pub use information_extraction::EntityType;
pub use information_extraction::Event;
pub use information_extraction::ExtractedInformation;
pub use information_extraction::InformationExtractionPipeline;
pub use information_extraction::KeyPhraseExtractor;
pub use information_extraction::KnowledgeBaseEntry;
pub use information_extraction::LinkedEntity;
pub use information_extraction::MentionType;
pub use information_extraction::PatternExtractor;
pub use information_extraction::Relation;
pub use information_extraction::RelationExtractor;
pub use information_extraction::RuleBasedNER;
pub use information_extraction::StructuredDocumentInformation;
pub use information_extraction::TemporalExtractor;
pub use information_extraction::Topic;
pub use ml_integration::BatchTextProcessor;
pub use ml_integration::FeatureExtractionMode;
pub use ml_integration::MLTextPreprocessor;
pub use ml_integration::TextFeatures;
pub use ml_integration::TextMLPipeline;
pub use ml_sentiment::ClassMetrics;
pub use ml_sentiment::EvaluationMetrics;
pub use ml_sentiment::MLSentimentAnalyzer;
pub use ml_sentiment::MLSentimentConfig;
pub use ml_sentiment::TrainingMetrics;
pub use model_registry::ModelMetadata;
pub use model_registry::ModelRegistry;
pub use model_registry::ModelType;
pub use model_registry::PrebuiltModels;
pub use model_registry::RegistrableModel;
pub use model_registry::SerializableModelData;
pub use multilingual::Language;
pub use multilingual::LanguageDetectionResult;
pub use multilingual::LanguageDetector;
pub use multilingual::MultilingualProcessor;
pub use multilingual::ProcessedText;
pub use multilingual::StopWords;
pub use neural_architectures::ActivationFunction;
pub use neural_architectures::AdditiveAttention;
pub use neural_architectures::BiLSTM;
pub use neural_architectures::CNNLSTMHybrid;
pub use neural_architectures::Conv1D;
pub use neural_architectures::CrossAttention;
pub use neural_architectures::Dropout;
pub use neural_architectures::GRUCell;
pub use neural_architectures::LSTMCell;
pub use neural_architectures::LayerNorm as NeuralLayerNorm;
pub use neural_architectures::MaxPool1D;
pub use neural_architectures::MultiHeadAttention as NeuralMultiHeadAttention;
pub use neural_architectures::MultiScaleCNN;
pub use neural_architectures::PositionwiseFeedForward;
pub use neural_architectures::ResidualBlock1D;
pub use neural_architectures::SelfAttention;
pub use neural_architectures::TextCNN;
pub use parallel::ParallelCorpusProcessor;
pub use parallel::ParallelTextProcessor;
pub use parallel::ParallelTokenizer;
pub use parallel::ParallelVectorizer;
pub use performance::AdvancedPerformanceMonitor;
pub use performance::DetailedPerformanceReport;
pub use performance::OptimizationRecommendation;
pub use performance::PerformanceSummary;
pub use performance::PerformanceThresholds;
pub use pos_tagging::PosAwareLemmatizer;
pub use pos_tagging::PosTagResult;
pub use pos_tagging::PosTagger;
pub use pos_tagging::PosTaggerConfig;
pub use pos_tagging::PosTaggingResult;
pub use preprocess::BasicNormalizer;
pub use preprocess::BasicTextCleaner;
pub use preprocess::TextCleaner;
pub use preprocess::TextNormalizer;
pub use semantic_similarity::LcsSimilarity;
pub use semantic_similarity::SemanticSimilarityEnsemble;
pub use semantic_similarity::SoftCosineSimilarity;
pub use semantic_similarity::WeightedJaccard;
pub use semantic_similarity::WordMoversDistance;
pub use sentiment::LexiconSentimentAnalyzer;
pub use sentiment::RuleBasedSentimentAnalyzer;
pub use sentiment::Sentiment;
pub use sentiment::SentimentLexicon;
pub use sentiment::SentimentResult;
pub use sentiment::SentimentRules;
pub use sentiment::SentimentWordCounts;
pub use simd_ops::AdvancedSIMDTextProcessor;
pub use simd_ops::SimdEditDistance;
pub use simd_ops::SimdStringOps;
pub use simd_ops::SimdTextAnalyzer;
pub use simd_ops::TextProcessingResult;
pub use sparse::CsrMatrix;
pub use sparse::DokMatrix;
pub use sparse::SparseMatrixBuilder;
pub use sparse::SparseVector;
pub use sparse_vectorize::sparse_cosine_similarity;
pub use sparse_vectorize::MemoryStats;
pub use sparse_vectorize::SparseCountVectorizer;
pub use sparse_vectorize::SparseTfidfVectorizer;
pub use spelling::DictionaryCorrector;
pub use spelling::DictionaryCorrectorConfig;
pub use spelling::EditOp;
pub use spelling::ErrorModel;
pub use spelling::NGramModel;
pub use spelling::SpellingCorrector;
pub use spelling::StatisticalCorrector;
pub use spelling::StatisticalCorrectorConfig;
pub use stemming::LancasterStemmer;
pub use stemming::LemmatizerConfig;
pub use stemming::PorterStemmer;
pub use stemming::PosTag;
pub use stemming::RuleLemmatizer;
pub use stemming::RuleLemmatizerBuilder;
pub use stemming::SimpleLemmatizer;
pub use stemming::SnowballStemmer;
pub use stemming::Stemmer;
pub use streaming::AdvancedStreamingMetrics;
pub use streaming::AdvancedStreamingProcessor;
pub use streaming::ChunkedCorpusReader;
pub use streaming::MemoryMappedCorpus;
pub use streaming::ProgressTracker;
pub use streaming::StreamingTextProcessor;
pub use streaming::StreamingVectorizer;
pub use string_metrics::AlignmentResult;
pub use string_metrics::DamerauLevenshteinMetric;
pub use string_metrics::Metaphone;
pub use string_metrics::NeedlemanWunsch;
pub use string_metrics::Nysiis;
pub use string_metrics::PhoneticAlgorithm;
pub use string_metrics::SmithWaterman;
pub use string_metrics::Soundex;
pub use string_metrics::StringMetric;
pub use summarization::CentroidSummarizer;
pub use summarization::KeywordExtractor;
pub use summarization::TextRank;
pub use text_coordinator::AdvancedBatchClassificationResult;
pub use text_coordinator::AdvancedSemanticSimilarityResult;
pub use text_coordinator::AdvancedTextConfig;
pub use text_coordinator::AdvancedTextCoordinator;
pub use text_coordinator::AdvancedTextResult;
pub use text_coordinator::AdvancedTopicModelingResult;
pub use text_statistics::ReadabilityMetrics;
pub use text_statistics::TextMetrics;
pub use text_statistics::TextStatistics;
pub use token_filter::CompositeFilter;
pub use token_filter::CustomFilter;
pub use token_filter::FrequencyFilter;
pub use token_filter::LengthFilter;
pub use token_filter::RegexFilter;
pub use token_filter::StopwordsFilter;
pub use token_filter::TokenFilter;
pub use tokenize::bpe::BpeConfig;
pub use tokenize::bpe::BpeTokenizer;
pub use tokenize::bpe::BpeVocabulary;
pub use tokenize::CharacterTokenizer;
pub use tokenize::NgramTokenizer;
pub use tokenize::RegexTokenizer;
pub use tokenize::SentenceTokenizer;
pub use tokenize::Tokenizer;
pub use tokenize::WhitespaceTokenizer;
pub use tokenize::WordTokenizer;
pub use topic_coherence::TopicCoherence;
pub use topic_coherence::TopicDiversity;
pub use topic_modeling::LatentDirichletAllocation;
pub use topic_modeling::LdaBuilder;
pub use topic_modeling::LdaConfig;
pub use topic_modeling::LdaLearningMethod;
pub use topic_modeling::Topic as LdaTopic;
pub use transformer::FeedForward;
pub use transformer::LayerNorm;
pub use transformer::MultiHeadAttention;
pub use transformer::PositionalEncoding;
pub use transformer::TokenEmbedding;
pub use transformer::TransformerConfig;
pub use transformer::TransformerDecoder;
pub use transformer::TransformerDecoderLayer;
pub use transformer::TransformerEncoder;
pub use transformer::TransformerEncoderLayer;
pub use transformer::TransformerModel;
pub use vectorize::CountVectorizer;
pub use vectorize::TfidfVectorizer;
pub use vectorize::Vectorizer;
pub use visualization::AttentionVisualizer;
pub use visualization::Color;
pub use visualization::ColorScheme;
pub use visualization::EmbeddingVisualizer;
pub use visualization::SentimentVisualizer;
pub use visualization::TextAnalyticsDashboard;
pub use visualization::TopicVisualizer;
pub use visualization::VisualizationConfig;
pub use visualization::WordCloud;
pub use vocabulary::Vocabulary;
pub use weighted_distance::DamerauLevenshteinWeights;
pub use weighted_distance::LevenshteinWeights;
pub use weighted_distance::WeightedDamerauLevenshtein;
pub use weighted_distance::WeightedLevenshtein;
pub use weighted_distance::WeightedStringMetric;

Modules§

classification: Text classification functionality
cleansing: Advanced text cleansing utilities
distance: Text distance and similarity measures
domain_processors: Domain-specific text processors for specialized fields
embeddings: Word Embeddings Module
enhanced_vectorize: Enhanced text vectorization with n-gram support
error: Error types for the text processing module
huggingface_compat: Hugging Face compatibility layer for interoperability
information_extraction: Information extraction utilities for structured data extraction from text
ml_integration: Integration with machine learning modules
ml_sentiment: Machine learning based sentiment analysis
model_registry: Pre-trained model registry for managing and loading text processing models
multilingual: Multilingual text processing and language detection
neural_architectures: Advanced neural architectures for text processing
parallel: Parallel processing utilities for text
performance: Advanced Performance Monitoring and Optimization
pos_tagging: Part-of-Speech (POS) tagging for English text
preprocess: Text preprocessing utilities
semantic_similarity: Advanced semantic similarity measures for text analysis
sentiment: Sentiment analysis functionality
simd_ops: SIMD-accelerated string operations for text processing
sparse: Sparse matrix representations for memory-efficient text processing
sparse_vectorize: Sparse vectorization for memory-efficient text representation
spelling: Spelling correction algorithms
stemming: Text stemming algorithms
streaming: Memory-efficient streaming and memory-mapped text processing
string_metrics: String metrics module for distance calculations and phonetic algorithms
summarization: Text summarization module
text_coordinator: Advanced Text Processing Coordinator
text_statistics: Text statistics module for readability and text complexity metrics
token_filter: Token filtering functionality
tokenize: Text tokenization utilities
topic_coherence: Topic coherence metrics for evaluating topic models
topic_modeling: Topic Modeling Module
transformer: Transformer Architecture Module
utils: Utility functions for text processing
vectorize: Text vectorization utilities
visualization: Visualization tools for text processing and analysis
vocabulary: Vocabulary management for text processing
weighted_distance: Weighted distance metrics for string comparison