Crate scirs2_text

§SciRS2 Text - Natural Language Processing

scirs2-text provides comprehensive text processing and NLP capabilities: tokenization, TF-IDF vectorization, word embeddings, sentiment analysis, topic modeling, and text classification, all with SIMD acceleration and parallel processing.

§🎯 Key Features

  • Tokenization: Word, sentence, N-gram, BPE, regex tokenizers
  • Vectorization: TF-IDF, count vectorizers, word embeddings
  • Text Processing: Stemming, lemmatization, normalization, stopword removal
  • Embeddings: Word2Vec (Skip-gram, CBOW), GloVe loading
  • Similarity: Cosine, Jaccard, Levenshtein, phonetic algorithms
  • NLP: Sentiment analysis, topic modeling (LDA), text classification
  • Performance: SIMD operations, parallel processing, sparse matrices
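
Several of the similarity metrics above (e.g. `levenshtein_distance`, re-exported from the `distance` module) can be sketched in a few lines. The following dependency-free sketch shows the classic dynamic-programming edit distance that such a function computes; the `levenshtein` function here is local to the example, not the crate's (SIMD-accelerated) implementation:

```rust
// Minimal dynamic-programming Levenshtein distance.
// Illustrative only; the crate re-exports an optimized
// `distance::levenshtein_distance` for real use.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between a[..i] and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // Minimum of substitution, deletion, and insertion.
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // "kitten" -> "sitting" takes 3 edits.
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    println!("distance = {}", levenshtein("kitten", "sitting"));
}
```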

§📦 Module Overview

| SciRS2 Module | Python Equivalent | Description |
|---------------|-------------------|-------------|
| tokenize | nltk.tokenize | Text tokenization utilities |
| vectorize | sklearn.feature_extraction.text.TfidfVectorizer | TF-IDF and count vectorization |
| embeddings | gensim.models.Word2Vec | Word embeddings (Word2Vec) |
| sentiment | nltk.sentiment | Sentiment analysis |
| topic_modeling | sklearn.decomposition.LatentDirichletAllocation | Topic modeling (LDA) |
| stemming | nltk.stem | Stemming and lemmatization |
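
To make the `vectorize` row concrete, here is a toy TF-IDF weighting sketch using raw term frequency times ln(N/df). The crate's TfidfVectorizer applies its own smoothing and normalization options, so treat this only as the underlying idea, not its exact formula:

```rust
use std::collections::HashMap;

// Toy TF-IDF: weight(t, d) = tf(t, d) * ln(N / df(t)).
// Illustrative only; the real TfidfVectorizer may smooth
// and normalize differently.
fn tfidf(docs: &[&str]) -> Vec<HashMap<String, f64>> {
    let tokenized: Vec<Vec<String>> = docs
        .iter()
        .map(|d| d.split_whitespace().map(|t| t.to_lowercase()).collect())
        .collect();
    // Document frequency of each term.
    let mut df: HashMap<String, usize> = HashMap::new();
    for toks in &tokenized {
        let mut seen: Vec<&String> = toks.iter().collect();
        seen.sort();
        seen.dedup();
        for t in seen {
            *df.entry(t.clone()).or_insert(0) += 1;
        }
    }
    let n = docs.len() as f64;
    tokenized
        .iter()
        .map(|toks| {
            let mut tf: HashMap<String, f64> = HashMap::new();
            for t in toks {
                *tf.entry(t.clone()).or_insert(0.0) += 1.0;
            }
            tf.into_iter()
                .map(|(t, c)| {
                    let idf = (n / df[&t] as f64).ln();
                    (t, c * idf)
                })
                .collect()
        })
        .collect()
}

fn main() {
    let w = tfidf(&["hello world", "good morning world"]);
    // "world" appears in every document, so ln(N/df) = ln(1) = 0.
    assert_eq!(w[0]["world"], 0.0);
    assert!(w[0]["hello"] > 0.0);
}
```

Terms occurring in every document get zero weight under this scheme, which is why TF-IDF downweights common words.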

§🚀 Quick Start

[dependencies]
scirs2-text = "0.4.0"

use scirs2_text::{tokenize::WordTokenizer, vectorize::TfidfVectorizer, Tokenizer, Vectorizer};

// Tokenization
let tokenizer = WordTokenizer::default();
let tokens = tokenizer.tokenize("Hello, world!").unwrap();

// TF-IDF vectorization
let docs = vec!["Hello world", "Good morning world"];
let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
let matrix = vectorizer.fit_transform(&docs).unwrap();

§🔒 Version: 0.1.5 (January 15, 2026)

§Quick Start

use scirs2_text::{
    tokenize::WordTokenizer,
    vectorize::TfidfVectorizer,
    sentiment::LexiconSentimentAnalyzer,
    Tokenizer, Vectorizer
};

// Basic tokenization
let tokenizer = WordTokenizer::default();
let tokens = tokenizer.tokenize("Hello, world! This is a test.").unwrap();

// TF-IDF vectorization
let documents = vec![
    "The quick brown fox jumps over the lazy dog",
    "A quick brown dog outpaces a quick fox",
    "The lazy dog sleeps all day"
];
let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
let matrix = vectorizer.fit_transform(&documents).unwrap();

// Sentiment analysis
let analyzer = LexiconSentimentAnalyzer::with_basiclexicon();
let sentiment = analyzer.analyze("I love this library!").unwrap();
println!("Sentiment: {:?}", sentiment.sentiment);
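
Rows of the resulting TF-IDF matrix are typically compared with cosine similarity, which the crate re-exports as `distance::cosine_similarity`. A minimal dense-vector version of that measure, for illustration only:

```rust
// Cosine similarity between two dense vectors, as used when
// comparing TF-IDF rows. Sketch only; use the crate's
// `distance::cosine_similarity` in practice.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    // Guard against zero vectors to avoid dividing by zero.
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Identical directions score 1, orthogonal directions score 0.
    assert!((cosine(&[1.0, 0.0], &[1.0, 0.0]) - 1.0).abs() < 1e-12);
    assert!(cosine(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-12);
}
```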

§Architecture

The module is organized into focused sub-modules; see the Modules section below for the full list.

§Performance

SciRS2 Text is designed for high performance:

  • SIMD acceleration for string operations
  • Parallel processing for large document collections
  • Memory-efficient sparse matrix representations
  • Zero-copy string processing where possible
  • Optimized algorithms with complexity guarantees
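
The parallel-processing point can be illustrated with plain std threads: split the corpus into chunks and process each chunk on its own worker. This is only a sketch of the pattern; the crate's `parallel` module provides ready-made types such as ParallelTokenizer and ParallelVectorizer for this:

```rust
use std::thread;

// Sketch of chunked parallel processing over a document collection,
// using only std threads. Each worker counts whitespace tokens in
// its chunk; the partial counts are then summed.
fn parallel_token_counts(docs: Vec<String>, workers: usize) -> usize {
    // Ceiling division so every document lands in some chunk.
    let chunk = ((docs.len() + workers - 1) / workers).max(1);
    let mut handles = Vec::new();
    for part in docs.chunks(chunk) {
        let part: Vec<String> = part.to_vec();
        handles.push(thread::spawn(move || {
            part.iter().map(|d| d.split_whitespace().count()).sum::<usize>()
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let docs = vec!["hello world".to_string(), "a b c".to_string()];
    // 2 + 3 tokens across the two documents.
    assert_eq!(parallel_token_counts(docs, 2), 5);
}
```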

Re-exports§

pub use classification::cross_validate_nb;
pub use classification::BernoulliNaiveBayes;
pub use classification::CrossValidationResult;
pub use classification::FeatureHasher;
pub use classification::FoldResult;
pub use classification::MultiLabelClassifier;
pub use classification::MultiLabelPrediction;
pub use classification::MultinomialNaiveBayes;
pub use classification::TextClassificationMetrics;
pub use classification::TextClassificationPipeline;
pub use classification::TextDataset;
pub use classification::TextFeatureSelector;
pub use classification::TfidfCosineClassifier;
pub use cleansing::expand_contractions;
pub use cleansing::normalize_currencies;
pub use cleansing::normalize_numbers;
pub use cleansing::normalize_ordinals;
pub use cleansing::normalize_percentages;
pub use cleansing::normalize_unicode;
pub use cleansing::normalize_whitespace;
pub use cleansing::remove_accents;
pub use cleansing::replace_emails;
pub use cleansing::replace_urls;
pub use cleansing::strip_html_tags;
pub use cleansing::AdvancedTextCleaner;
pub use distance::cosine_similarity;
pub use distance::jaccard_similarity;
pub use distance::levenshtein_distance;
pub use domain_processors::Domain;
pub use domain_processors::DomainProcessorConfig;
pub use domain_processors::FinancialTextProcessor;
pub use domain_processors::LegalTextProcessor;
pub use domain_processors::MedicalTextProcessor;
pub use domain_processors::NewsTextProcessor;
pub use domain_processors::PatentTextProcessor;
pub use domain_processors::ProcessedDomainText;
pub use domain_processors::ScientificTextProcessor;
pub use domain_processors::SocialMediaTextProcessor;
pub use domain_processors::UnifiedDomainProcessor;
pub use embeddings::embedding_cosine_similarity;
pub use embeddings::fasttext::FastText;
pub use embeddings::fasttext::FastTextConfig;
pub use embeddings::glove::CooccurrenceMatrix;
pub use embeddings::glove::GloVe;
pub use embeddings::glove::GloVeTrainer;
pub use embeddings::glove::GloVeTrainerConfig;
pub use embeddings::pairwise_similarity;
pub use embeddings::Word2Vec;
pub use embeddings::Word2VecAlgorithm;
pub use embeddings::Word2VecConfig;
pub use embeddings::WordEmbedding;
pub use enhanced_vectorize::EnhancedCountVectorizer;
pub use enhanced_vectorize::EnhancedTfidfVectorizer;
pub use error::Result;
pub use error::TextError;
pub use huggingface_compat::ClassificationResult;
pub use huggingface_compat::FeatureExtractionPipeline;
pub use huggingface_compat::FillMaskPipeline;
pub use huggingface_compat::FillMaskResult;
pub use huggingface_compat::FormatConverter;
pub use huggingface_compat::HfConfig;
pub use huggingface_compat::HfEncodedInput;
pub use huggingface_compat::HfHub;
pub use huggingface_compat::HfModelAdapter;
pub use huggingface_compat::HfPipeline;
pub use huggingface_compat::HfTokenizer;
pub use huggingface_compat::HfTokenizerConfig;
pub use huggingface_compat::QuestionAnsweringPipeline;
pub use huggingface_compat::QuestionAnsweringResult;
pub use huggingface_compat::TextClassificationPipeline as HfTextClassificationPipeline;
pub use huggingface_compat::ZeroShotClassificationPipeline;
pub use information_extraction::AdvancedExtractedInformation;
pub use information_extraction::AdvancedExtractionPipeline;
pub use information_extraction::ConfidenceScorer;
pub use information_extraction::CoreferenceChain;
pub use information_extraction::CoreferenceMention;
pub use information_extraction::CoreferenceResolver;
pub use information_extraction::DocumentInformationExtractor;
pub use information_extraction::DocumentSummary;
pub use information_extraction::Entity;
pub use information_extraction::EntityCluster;
pub use information_extraction::EntityLinker;
pub use information_extraction::EntityType;
pub use information_extraction::Event;
pub use information_extraction::ExtractedInformation;
pub use information_extraction::InformationExtractionPipeline;
pub use information_extraction::KeyPhraseExtractor;
pub use information_extraction::KnowledgeBaseEntry;
pub use information_extraction::LinkedEntity;
pub use information_extraction::MentionType;
pub use information_extraction::PatternExtractor;
pub use information_extraction::Relation;
pub use information_extraction::RelationExtractor;
pub use information_extraction::RuleBasedNER;
pub use information_extraction::StructuredDocumentInformation;
pub use information_extraction::TemporalExtractor;
pub use information_extraction::Topic;
pub use language_model::NgramModel;
pub use language_model::SmoothingMethod;
pub use lemmatization::Lemmatizer;
pub use lemmatization::RuleBasedLemmatizer;
pub use lemmatization::WordNetLemmatizer;
pub use ml_integration::BatchTextProcessor;
pub use ml_integration::FeatureExtractionMode;
pub use ml_integration::MLTextPreprocessor;
pub use ml_integration::TextFeatures;
pub use ml_integration::TextMLPipeline;
pub use ml_sentiment::ClassMetrics;
pub use ml_sentiment::EvaluationMetrics;
pub use ml_sentiment::MLSentimentAnalyzer;
pub use ml_sentiment::MLSentimentConfig;
pub use ml_sentiment::TrainingMetrics;
pub use model_registry::ModelMetadata;
pub use model_registry::ModelRegistry;
pub use model_registry::ModelType;
pub use model_registry::PrebuiltModels;
pub use model_registry::RegistrableModel;
pub use model_registry::SerializableModelData;
pub use multilingual::Language;
pub use multilingual::LanguageDetectionResult;
pub use multilingual::LanguageDetector;
pub use multilingual::MultilingualProcessor;
pub use multilingual::ProcessedText;
pub use multilingual::StopWords;
pub use neural_architectures::ActivationFunction;
pub use neural_architectures::AdditiveAttention;
pub use neural_architectures::BiLSTM;
pub use neural_architectures::CNNLSTMHybrid;
pub use neural_architectures::Conv1D;
pub use neural_architectures::CrossAttention;
pub use neural_architectures::Dropout;
pub use neural_architectures::GRUCell;
pub use neural_architectures::LSTMCell;
pub use neural_architectures::LayerNorm as NeuralLayerNorm;
pub use neural_architectures::MaxPool1D;
pub use neural_architectures::MultiHeadAttention as NeuralMultiHeadAttention;
pub use neural_architectures::MultiScaleCNN;
pub use neural_architectures::PositionwiseFeedForward;
pub use neural_architectures::ResidualBlock1D;
pub use neural_architectures::SelfAttention;
pub use neural_architectures::TextCNN;
pub use parallel::ParallelCorpusProcessor;
pub use parallel::ParallelTextProcessor;
pub use parallel::ParallelTokenizer;
pub use parallel::ParallelVectorizer;
pub use paraphrasing::ParaphraseConfig;
pub use paraphrasing::ParaphraseResult;
pub use paraphrasing::ParaphraseStrategy;
pub use paraphrasing::Paraphraser;
pub use performance::AdvancedPerformanceMonitor;
pub use performance::DetailedPerformanceReport;
pub use performance::OptimizationRecommendation;
pub use performance::PerformanceSummary;
pub use performance::PerformanceThresholds;
pub use pipeline::basic_pipeline;
pub use pipeline::lemmatization_pipeline;
pub use pipeline::ngram_pipeline;
pub use pipeline::stemming_pipeline;
pub use pipeline::BatchProcessor;
pub use pipeline::NlpPipeline;
pub use pipeline::PipelineBuilder;
pub use pipeline::PipelineStep;
pub use pos_tagging::PosAwareLemmatizer;
pub use pos_tagging::PosTagResult;
pub use pos_tagging::PosTagger;
pub use pos_tagging::PosTaggerConfig;
pub use pos_tagging::PosTaggingResult;
pub use preprocess::BasicNormalizer;
pub use preprocess::BasicTextCleaner;
pub use preprocess::TextCleaner;
pub use preprocess::TextNormalizer;
pub use semantic_similarity::LcsSimilarity;
pub use semantic_similarity::SemanticSimilarityEnsemble;
pub use semantic_similarity::SoftCosineSimilarity;
pub use semantic_similarity::WeightedJaccard;
pub use semantic_similarity::WordMoversDistance;
pub use sentiment::aggregate_sentiment;
pub use sentiment::analyze_and_aggregate;
pub use sentiment::AggregatedSentiment;
pub use sentiment::AspectSentiment;
pub use sentiment::AspectSentimentAnalyzer;
pub use sentiment::LexiconSentimentAnalyzer;
pub use sentiment::NaiveBayesSentiment;
pub use sentiment::RuleBasedSentimentAnalyzer;
pub use sentiment::Sentiment;
pub use sentiment::SentimentLexicon;
pub use sentiment::SentimentResult;
pub use sentiment::SentimentRules;
pub use sentiment::SentimentWordCounts;
pub use sentiment::VaderResult;
pub use sentiment::VaderSentimentAnalyzer;
pub use simd_ops::AdvancedSIMDTextProcessor;
pub use simd_ops::SimdEditDistance;
pub use simd_ops::SimdStringOps;
pub use simd_ops::SimdTextAnalyzer;
pub use simd_ops::TextProcessingResult;
pub use sparse::CsrMatrix;
pub use sparse::DokMatrix;
pub use sparse::SparseMatrixBuilder;
pub use sparse::SparseVector;
pub use sparse_vectorize::sparse_cosine_similarity;
pub use sparse_vectorize::MemoryStats;
pub use sparse_vectorize::SparseCountVectorizer;
pub use sparse_vectorize::SparseTfidfVectorizer;
pub use spelling::DictionaryCorrector;
pub use spelling::DictionaryCorrectorConfig;
pub use spelling::EditOp;
pub use spelling::ErrorModel;
pub use spelling::NGramModel;
pub use spelling::SpellingCorrector;
pub use spelling::StatisticalCorrector;
pub use spelling::StatisticalCorrectorConfig;
pub use stemming::LancasterStemmer;
pub use stemming::LemmatizerConfig;
pub use stemming::PorterStemmer;
pub use stemming::PosTag;
pub use stemming::RuleLemmatizer;
pub use stemming::RuleLemmatizerBuilder;
pub use stemming::SimpleLemmatizer;
pub use stemming::SnowballStemmer;
pub use stemming::Stemmer;
pub use streaming::AdvancedStreamingMetrics;
pub use streaming::AdvancedStreamingProcessor;
pub use streaming::ChunkedCorpusReader;
pub use streaming::MemoryMappedCorpus;
pub use streaming::ProgressTracker;
pub use streaming::StreamingTextProcessor;
pub use streaming::StreamingVectorizer;
pub use string_metrics::AlignmentResult;
pub use string_metrics::DamerauLevenshteinMetric;
pub use string_metrics::Metaphone;
pub use string_metrics::NeedlemanWunsch;
pub use string_metrics::Nysiis;
pub use string_metrics::PhoneticAlgorithm;
pub use string_metrics::SmithWaterman;
pub use string_metrics::Soundex;
pub use string_metrics::StringMetric;
pub use summarization::CentroidSummarizer;
pub use summarization::KeywordExtractor;
pub use summarization::TextRank;
pub use text_coordinator::AdvancedBatchClassificationResult;
pub use text_coordinator::AdvancedSemanticSimilarityResult;
pub use text_coordinator::AdvancedTextConfig;
pub use text_coordinator::AdvancedTextCoordinator;
pub use text_coordinator::AdvancedTextResult;
pub use text_coordinator::AdvancedTopicModelingResult;
pub use text_statistics::ReadabilityMetrics;
pub use text_statistics::TextMetrics;
pub use text_statistics::TextStatistics;
pub use token_filter::CompositeFilter;
pub use token_filter::CustomFilter;
pub use token_filter::FrequencyFilter;
pub use token_filter::LengthFilter;
pub use token_filter::RegexFilter;
pub use token_filter::StopwordsFilter;
pub use token_filter::TokenFilter;
pub use tokenize::bpe::BpeConfig;
pub use tokenize::bpe::BpeTokenizer;
pub use tokenize::bpe::BpeVocabulary;
pub use tokenize::CharacterTokenizer;
pub use tokenize::NgramTokenizer;
pub use tokenize::RegexTokenizer;
pub use tokenize::SentenceTokenizer;
pub use tokenize::Tokenizer;
pub use tokenize::WhitespaceTokenizer;
pub use tokenize::WordTokenizer;
pub use tokenizer::BPETokenizer;
pub use tokenizer::SimpleCharTokenizer;
pub use tokenizer::SimpleWhitespaceTokenizer;
pub use tokenizer::TransformerTokenizer;
pub use tokenizer::WordPieceTokenizer;
pub use topic_coherence::TopicCoherence;
pub use topic_coherence::TopicDiversity;
pub use topic_modeling::LatentDirichletAllocation;
pub use topic_modeling::LdaBuilder;
pub use topic_modeling::LdaConfig;
pub use topic_modeling::LdaLearningMethod;
pub use topic_modeling::Topic as LdaTopic;
pub use transformer::FeedForward;
pub use transformer::LayerNorm;
pub use transformer::MultiHeadAttention;
pub use transformer::PositionalEncoding;
pub use transformer::TokenEmbedding;
pub use transformer::TransformerConfig;
pub use transformer::TransformerDecoder;
pub use transformer::TransformerDecoderLayer;
pub use transformer::TransformerEncoder;
pub use transformer::TransformerEncoderLayer;
pub use transformer::TransformerModel;
pub use vectorize::CountVectorizer;
pub use vectorize::TfidfVectorizer;
pub use vectorize::Vectorizer;
pub use visualization::AttentionVisualizer;
pub use visualization::Color;
pub use visualization::ColorScheme;
pub use visualization::EmbeddingVisualizer;
pub use visualization::SentimentVisualizer;
pub use visualization::TextAnalyticsDashboard;
pub use visualization::TopicVisualizer;
pub use visualization::VisualizationConfig;
pub use visualization::WordCloud;
pub use vocabulary::Vocabulary;
pub use weighted_distance::DamerauLevenshteinWeights;
pub use weighted_distance::LevenshteinWeights;
pub use weighted_distance::WeightedDamerauLevenshtein;
pub use weighted_distance::WeightedLevenshtein;
pub use weighted_distance::WeightedStringMetric;
pub use keyword_extraction::extract_keywords;
pub use keyword_extraction::Keyword;
pub use keyword_extraction::KeywordMethod;
pub use keyword_extraction::RakeKeywordExtractor;
pub use keyword_extraction::TextRankKeywordExtractor;
pub use keyword_extraction::TfIdfKeywordExtractor;
pub use language_detection::detect_language;
pub use language_detection::detect_language_with_strategy;
pub use language_detection::DetectedLanguage;
pub use language_detection::DetectionStrategy;
pub use language_detection::LanguageDetectionOutput;
pub use named_entity_recognition::extract_entities;
pub use named_entity_recognition::NerEntity;
pub use named_entity_recognition::NerEntityType;
pub use named_entity_recognition::NerPatternConfig;
pub use text_similarity::bm25_score;
pub use text_similarity::char_ngram_jaccard_similarity;
pub use text_similarity::edit_distance_similarity;
pub use text_similarity::jaccard_token_similarity;
pub use text_similarity::text_similarity;
pub use text_similarity::tfidf_cosine_similarity;
pub use text_similarity::Bm25Config;
pub use text_similarity::Bm25Scorer;
pub use text_similarity::SimilarityMethod;
pub use text_similarity::SimilarityResult;
pub use text_similarity::TfIdfCosineSimilarity;
pub use text_summarization::score_position;
pub use text_summarization::score_textrank;
pub use text_summarization::score_tfidf;
pub use text_summarization::summarize;
pub use text_summarization::ScoredSentence;
pub use text_summarization::SummarizationMethod;

Modules§

batch_tokenizer
Batch tokenization with padding, truncation, and attention masks
bert_finetune
Lightweight BERT fine-tuning API on top of pre-computed embeddings.
classification
Text classification functionality
cleansing
Advanced text cleansing utilities
crosslingual
Cross-lingual NER transfer and language-agnostic feature utilities.
ctm
Correlated Topic Model (CTM)
distance
Text distance and similarity measures
domain_processors
Domain-specific text processors for specialized fields
dtm
Dynamic Topic Model (DTM)
embeddings
Word Embeddings Module
enhanced_vectorize
Enhanced text vectorization with n-gram support
error
Error types for the text processing module
evaluation
NLP Evaluation Metrics and Online LDA
gpt_bpe
GPT-2 byte-level BPE tokenizer
huggingface_compat
Hugging Face compatibility layer for interoperability
information_extraction
Information extraction utilities for structured data extraction from text
keyword_extraction
Keyword extraction module
language_detection
Language detection module
language_model
N-gram Language Models
lemmatization
Advanced Lemmatization for English text
ml_integration
Integration with machine learning modules
ml_sentiment
Machine learning based sentiment analysis
model_registry
Pre-trained model registry for managing and loading text processing models
multilingual
Multilingual text processing and language detection
named_entity_recognition
Named Entity Recognition module
neural_architectures
Advanced neural architectures for text processing
parallel
Parallel processing utilities for text
paraphrasing
Text Paraphrasing Module
performance
Advanced Performance Monitoring and Optimization
pipeline
Composable NLP Processing Pipelines
pos_tagging
Part-of-Speech (POS) tagging for English text
preprocess
Text preprocessing utilities
semantic_similarity
Advanced semantic similarity measures for text analysis
sentence_embeddings
Sentence embedding aggregation and SimCSE contrastive learning.
sentencepiece
SentencePiece unigram language model tokenizer
sentiment
Sentiment analysis functionality
simd_ops
SIMD-accelerated string operations for text processing
similarity
Semantic Similarity and Search.
sparse
Sparse matrix representations for memory-efficient text processing
sparse_vectorize
Sparse vectorization for memory-efficient text representation
spelling
Spelling correction algorithms
stemming
Text stemming algorithms
streaming
Memory-efficient streaming and memory-mapped text processing
string_metrics
String metrics module for distance calculations and phonetic algorithms.
summarization
Text summarization module
text_coordinator
Advanced Text Processing Coordinator
text_similarity
Text similarity module
text_statistics
Text statistics module for readability and text complexity metrics.
text_summarization
Extractive text summarization module
token_filter
Token filtering functionality
tokenize
Text tokenization utilities
tokenizer
Tokenization utilities for transformer models
tokenizers
Transformer tokenizers.
topic_coherence
Topic coherence metrics for evaluating topic models
topic_modeling
Topic Modeling Module
transformer
Transformer Architecture Module
transliteration
Transliteration utilities: convert non-Latin scripts to Latin characters.
utils
Utility functions for text processing
vectorize
Text vectorization utilities
visualization
Visualization tools for text processing and analysis
vocabulary
Vocabulary management for text processing
weighted_distance
Weighted distance metrics for string comparison.