#![allow(clippy::manual_strip)]
#![allow(clippy::needless_range_loop)]
#![allow(clippy::if_same_then_else)]
#![allow(clippy::cloned_ref_to_slice_refs)]
#![allow(dead_code)]
//! # SciRS2 Text - Natural Language Processing
//!
//! **scirs2-text** provides text processing and NLP for SciRS2: tokenization,
//! TF-IDF vectorization, word embeddings, sentiment analysis, topic modeling,
//! and text classification, with SIMD acceleration and parallel processing.
//!
//! ## 🎯 Key Features
//!
//! - **Tokenization**: Word, sentence, N-gram, BPE, regex tokenizers
//! - **Vectorization**: TF-IDF, count vectorizers, word embeddings
//! - **Text Processing**: Stemming, lemmatization, normalization, stopword removal
//! - **Embeddings**: Word2Vec (Skip-gram, CBOW), GloVe loading
//! - **Similarity**: Cosine, Jaccard, Levenshtein, phonetic algorithms
//! - **NLP**: Sentiment analysis, topic modeling (LDA), text classification
//! - **Performance**: SIMD operations, parallel processing, sparse matrices
//!
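//! For intuition, the Levenshtein edit distance listed above can be sketched in a
//! few lines of plain Rust. This is a simplified dynamic-programming version for
//! illustration only, independent of the crate's `string_metrics` implementation:
//!
//! ```rust
//! /// Minimum number of insertions, deletions, and substitutions
//! /// needed to turn `a` into `b` (two-row DP over char indices).
//! fn levenshtein(a: &str, b: &str) -> usize {
//!     let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
//!     // prev[j] = distance between a[..i] and b[..j] for the previous row.
//!     let mut prev: Vec<usize> = (0..=b.len()).collect();
//!     for (i, &ca) in a.iter().enumerate() {
//!         let mut cur = vec![i + 1];
//!         for (j, &cb) in b.iter().enumerate() {
//!             let cost = if ca == cb { 0 } else { 1 };
//!             // diagonal (substitute), above (delete), left (insert)
//!             cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
//!         }
//!         prev = cur;
//!     }
//!     prev[b.len()]
//! }
//!
//! assert_eq!(levenshtein("kitten", "sitting"), 3);
//! ```
//!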
//! ## 📦 Module Overview
//!
//! | SciRS2 Module | Python Equivalent | Description |
//! |---------------|-------------------|-------------|
//! | `tokenize` | `nltk.tokenize` | Text tokenization utilities |
//! | `vectorize` | `sklearn.feature_extraction.text.TfidfVectorizer` | TF-IDF and count vectorization |
//! | `embeddings` | `gensim.models.Word2Vec` | Word embeddings (Word2Vec) |
//! | `sentiment` | `nltk.sentiment` | Sentiment analysis |
//! | `topic_modeling` | `sklearn.decomposition.LatentDirichletAllocation` | Topic modeling (LDA) |
//! | `stemming` | `nltk.stem` | Stemming and lemmatization |
//!
//! ## 🚀 Quick Start
//!
//! ```toml
//! [dependencies]
//! scirs2-text = "0.4.3"
//! ```
//!
//! ```rust,no_run
//! use scirs2_text::{tokenize::WordTokenizer, vectorize::TfidfVectorizer, Tokenizer, Vectorizer};
//!
//! // Tokenization
//! let tokenizer = WordTokenizer::default();
//! let tokens = tokenizer.tokenize("Hello, world!").unwrap();
//!
//! // TF-IDF vectorization
//! let docs = vec!["Hello world", "Good morning world"];
//! let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
//! let matrix = vectorizer.fit_transform(&docs).unwrap();
//! ```
//!
//! ## 🔒 Version: 0.4.3 (March 27, 2026)
//!
//! ## Extended Example
//!
//! ```rust
//! use scirs2_text::{
//!     tokenize::WordTokenizer,
//!     vectorize::TfidfVectorizer,
//!     sentiment::LexiconSentimentAnalyzer,
//!     Tokenizer, Vectorizer
//! };
//!
//! // Basic tokenization
//! let tokenizer = WordTokenizer::default();
//! let tokens = tokenizer.tokenize("Hello, world! This is a test.").unwrap();
//!
//! // TF-IDF vectorization
//! let documents = vec![
//!     "The quick brown fox jumps over the lazy dog",
//!     "A quick brown dog outpaces a quick fox",
//!     "The lazy dog sleeps all day"
//! ];
//! let mut vectorizer = TfidfVectorizer::new(false, true, Some("l2".to_string()));
//! let matrix = vectorizer.fit_transform(&documents).unwrap();
//!
//! // Sentiment analysis
//! let analyzer = LexiconSentimentAnalyzer::with_basiclexicon();
//! let sentiment = analyzer.analyze("I love this library!").unwrap();
//! println!("Sentiment: {:?}", sentiment.sentiment);
//! ```
//!
//! ## Architecture
//!
//! The module is organized into focused sub-modules:
//!
//! - [`tokenize`]: Text tokenization utilities
//! - [`vectorize`]: Document vectorization and TF-IDF
//! - [`embeddings`]: Word embedding training and utilities
//! - [`sentiment`]: Sentiment analysis tools
//! - [`topic_modeling`]: Topic modeling with LDA
//! - [`string_metrics`]: String similarity and distance metrics
//! - [`preprocess`]: Text cleaning and normalization
//! - [`stemming`]: Stemming and lemmatization
//! - [`parallel`]: Parallel processing utilities
//! - [`simd_ops`]: SIMD-accelerated operations
//!
//! ## Performance
//!
//! SciRS2 Text is designed for high performance:
//!
//! - SIMD acceleration for string operations
//! - Parallel processing for large document collections
//! - Memory-efficient sparse matrix representations
//! - Zero-copy string processing where possible
//! - Optimized algorithms with complexity guarantees
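//!
//! As one concrete illustration of why sparse representations pay off, a dot
//! product over sorted `(index, value)` pairs touches only the non-zero entries.
//! This is a hedged sketch of the idea, not the crate's `sparse` module API:
//!
//! ```rust
//! /// Dot product of two sparse vectors stored as sorted (index, value) pairs.
//! /// Runs in O(nnz(a) + nnz(b)) via a two-pointer merge over the indices.
//! fn sparse_dot(a: &[(usize, f64)], b: &[(usize, f64)]) -> f64 {
//!     let (mut i, mut j, mut sum) = (0usize, 0usize, 0.0f64);
//!     while i < a.len() && j < b.len() {
//!         match a[i].0.cmp(&b[j].0) {
//!             std::cmp::Ordering::Less => i += 1,
//!             std::cmp::Ordering::Greater => j += 1,
//!             std::cmp::Ordering::Equal => {
//!                 // Indices match: this term contributes to the product.
//!                 sum += a[i].1 * b[j].1;
//!                 i += 1;
//!                 j += 1;
//!             }
//!         }
//!     }
//!     sum
//! }
//!
//! // Only index 3 overlaps: 2.0 * 4.0 = 8.0
//! assert_eq!(sparse_dot(&[(0, 1.0), (3, 2.0)], &[(3, 4.0), (5, 1.0)]), 8.0);
//! ```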

#![warn(missing_docs)]

/// Batch tokenization with padding and attention masks.
pub mod batch_tokenizer;
pub mod classification;
pub mod cleansing;
pub mod distance;
pub mod domain_processors;
pub mod embeddings;
pub mod enhanced_vectorize;
pub mod error;
pub mod evaluation;
/// GPT-2 byte-level BPE tokenizer.
pub mod gpt_bpe;
pub mod huggingface_compat;
pub mod information_extraction;
pub mod language_model;
/// Statistical language models (Unigram, Bigram, N-gram with Kneser-Ney).
pub mod language_models;
pub mod lemmatization;
pub mod ml_integration;
pub mod ml_sentiment;
pub mod model_registry;
pub mod multilingual;
pub mod neural_architectures;
pub mod parallel;
pub mod paraphrasing;
pub mod performance;
pub mod pipeline;
pub mod pos_tagging;
pub mod preprocess;
pub mod semantic_similarity;
/// SentencePiece Unigram Language Model tokenizer.
pub mod sentencepiece;
pub mod sentiment;
pub mod simd_ops;
pub mod sparse;
pub mod sparse_vectorize;
pub mod spelling;
pub mod stemming;
pub mod streaming;
pub mod string_metrics;
pub mod summarization;
pub mod text_coordinator;
pub mod text_statistics;
pub mod token_filter;
pub mod tokenize;
pub mod tokenizer;
pub mod topic_coherence;
pub mod topic_modeling;
pub mod transformer;
pub mod utils;
pub mod vectorize;
pub mod visualization;
pub mod vocabulary;
pub mod weighted_distance;

// New text processing modules
pub mod keyword_extraction;
pub mod language_detection;
pub mod named_entity_recognition;
pub mod text_similarity;
pub mod text_summarization;

// BERT fine-tuning
pub mod bert_finetune;
// Cross-lingual NER and transliteration
pub mod crosslingual;
// Correlated topic model
pub mod ctm;
// Dynamic topic model
pub mod dtm;
// Hierarchical Dirichlet Process topic model
pub mod hdp;
// Sentence embeddings
pub mod sentence_embeddings;
// Semantic similarity with embedding-based search
pub mod similarity;
// Advanced tokenization algorithms (BPE, WordPiece, language-agnostic Unicode)
pub mod tokenization;
// Tokenizer implementations (HuggingFace, byte-level BPE, Unicode)
pub mod tokenizers;
// Transliteration
pub mod transliteration;
// Topic modeling (HDP automatic topic selection)
pub mod topic;

// Additional text processing modules
pub mod abstractive_summary;
pub mod advanced_classification;
pub mod advanced_distance;
pub mod alignment;
pub mod bpe_tokenizer;
pub mod coreference;
pub mod dialog;
pub mod discourse;
pub mod doc_similarity;
pub mod event_extraction;
pub mod keywords;
pub mod multilingual_ext;
pub mod ner;
pub mod pos_tagging_original;
// TODO: pos_tagging_original_tests uses the pre-refactor API (single-arg lemmatize,
// remove_word, add_lemma, add_exception, MorphologicalAnalyzer::analyze, etc.) which
// no longer matches the current pos_tagging module. Re-enable once updated to new API.
// #[cfg(test)]
// pub mod pos_tagging_original_tests;
pub mod question_answering;
pub mod regex_lite;
pub mod segmentation;
pub mod summarize_advanced;
pub mod text_classification;
pub mod text_preprocess;
pub mod topic_model;

// Re-export commonly used items
pub use classification::{
    cross_validate_nb, BernoulliNaiveBayes, CrossValidationResult, FeatureHasher, FoldResult,
    MultiLabelClassifier, MultiLabelPrediction, MultinomialNaiveBayes, TextClassificationMetrics,
    TextClassificationPipeline, TextDataset, TextFeatureSelector, TfidfCosineClassifier,
};
pub use cleansing::{
    expand_contractions, normalize_currencies, normalize_numbers, normalize_ordinals,
    normalize_percentages, normalize_unicode, normalize_whitespace, remove_accents, replace_emails,
    replace_urls, strip_html_tags, AdvancedTextCleaner,
};
pub use distance::{cosine_similarity, jaccard_similarity, levenshtein_distance};
pub use domain_processors::{
    Domain, DomainProcessorConfig, FinancialTextProcessor, LegalTextProcessor,
    MedicalTextProcessor, NewsTextProcessor, PatentTextProcessor, ProcessedDomainText,
    ScientificTextProcessor, SocialMediaTextProcessor, UnifiedDomainProcessor,
};
pub use embeddings::{
    embedding_cosine_similarity,
    fasttext::{FastText, FastTextConfig},
    glove::{CooccurrenceMatrix, GloVe, GloVeTrainer, GloVeTrainerConfig},
    pairwise_similarity, Word2Vec, Word2VecAlgorithm, Word2VecConfig, WordEmbedding,
};
pub use enhanced_vectorize::{EnhancedCountVectorizer, EnhancedTfidfVectorizer};
pub use error::{Result, TextError};
pub use huggingface_compat::{
    ClassificationResult, FeatureExtractionPipeline, FillMaskPipeline, FillMaskResult,
    FormatConverter, HfConfig, HfEncodedInput, HfHub, HfModelAdapter, HfPipeline, HfTokenizer,
    HfTokenizerConfig, QuestionAnsweringPipeline, QuestionAnsweringResult,
    TextClassificationPipeline as HfTextClassificationPipeline, ZeroShotClassificationPipeline,
};
pub use information_extraction::{
    AdvancedExtractedInformation, AdvancedExtractionPipeline, ConfidenceScorer, CoreferenceChain,
    CoreferenceMention, CoreferenceResolver, DocumentInformationExtractor, DocumentSummary, Entity,
    EntityCluster, EntityLinker, EntityType, Event, ExtractedInformation,
    InformationExtractionPipeline, KeyPhraseExtractor, KnowledgeBaseEntry, LinkedEntity,
    MentionType, PatternExtractor, Relation, RelationExtractor, RuleBasedNER,
    StructuredDocumentInformation, TemporalExtractor, Topic,
};
pub use language_model::{NgramModel, SmoothingMethod};
pub use lemmatization::{Lemmatizer, RuleBasedLemmatizer, WordNetLemmatizer};
pub use ml_integration::{
    BatchTextProcessor, FeatureExtractionMode, MLTextPreprocessor, TextFeatures, TextMLPipeline,
};
pub use ml_sentiment::{
    ClassMetrics, EvaluationMetrics, MLSentimentAnalyzer, MLSentimentConfig, TrainingMetrics,
};
pub use model_registry::{
    ModelMetadata, ModelRegistry, ModelType, PrebuiltModels, RegistrableModel,
    SerializableModelData,
};
pub use multilingual::{
    is_cjk_char, is_combining_mark, is_cyrillic, Language, LanguageDetectionResult,
    LanguageDetector, MultilingualProcessor, ProcessedText, ScriptFamily, StopWords,
    Transliterator as MultilingualTransliterator, UnicodeTokenizer, UnicodeTokenizerConfig,
};
pub use neural_architectures::{
    ActivationFunction, AdditiveAttention, BiLSTM, CNNLSTMHybrid, Conv1D, CrossAttention, Dropout,
    GRUCell, LSTMCell, LayerNorm as NeuralLayerNorm, MaxPool1D,
    MultiHeadAttention as NeuralMultiHeadAttention, MultiScaleCNN, PositionwiseFeedForward,
    ResidualBlock1D, SelfAttention, TextCNN,
};
pub use parallel::{
    ParallelCorpusProcessor, ParallelTextProcessor, ParallelTokenizer, ParallelVectorizer,
};
pub use paraphrasing::{ParaphraseConfig, ParaphraseResult, ParaphraseStrategy, Paraphraser};
pub use performance::{
    AdvancedPerformanceMonitor, DetailedPerformanceReport, OptimizationRecommendation,
    PerformanceSummary, PerformanceThresholds,
};
pub use pipeline::{
    basic_pipeline, lemmatization_pipeline, ngram_pipeline, stemming_pipeline, BatchProcessor,
    NlpPipeline, PipelineBuilder, PipelineStep,
};
pub use pos_tagging::{
    PosAwareLemmatizer, PosTagResult, PosTagger, PosTaggerConfig, PosTaggingResult,
};
pub use preprocess::{BasicNormalizer, BasicTextCleaner, TextCleaner, TextNormalizer};
pub use semantic_similarity::{
    LcsSimilarity, SemanticSimilarityEnsemble, SoftCosineSimilarity, WeightedJaccard,
    WordMoversDistance,
};
pub use sentiment::{
    aggregate_sentiment, analyze_and_aggregate, AggregatedSentiment, AspectSentiment,
    AspectSentimentAnalyzer, LexiconSentimentAnalyzer, NaiveBayesSentiment,
    RuleBasedSentimentAnalyzer, Sentiment, SentimentLexicon, SentimentResult, SentimentRules,
    SentimentWordCounts, VaderResult, VaderSentimentAnalyzer,
};
pub use simd_ops::{
    AdvancedSIMDTextProcessor, SimdEditDistance, SimdStringOps, SimdTextAnalyzer,
    TextProcessingResult,
};
pub use sparse::{CsrMatrix, DokMatrix, SparseMatrixBuilder, SparseVector};
pub use sparse_vectorize::{
    sparse_cosine_similarity, MemoryStats, SparseCountVectorizer, SparseTfidfVectorizer,
};
pub use spelling::{
    DictionaryCorrector, DictionaryCorrectorConfig, EditOp, ErrorModel, NGramModel,
    SpellingCorrector, StatisticalCorrector, StatisticalCorrectorConfig,
};
pub use stemming::{
    LancasterStemmer, LemmatizerConfig, PorterStemmer, PosTag, RuleLemmatizer,
    RuleLemmatizerBuilder, SimpleLemmatizer, SnowballStemmer, Stemmer,
};
pub use streaming::{
    AdvancedStreamingMetrics, AdvancedStreamingProcessor, ChunkedCorpusReader, MemoryMappedCorpus,
    ProgressTracker, StreamingTextProcessor, StreamingVectorizer,
};
pub use string_metrics::{
    AlignmentResult, DamerauLevenshteinMetric, Metaphone, NeedlemanWunsch, Nysiis,
    PhoneticAlgorithm, SmithWaterman, Soundex, StringMetric,
};
pub use summarization::{CentroidSummarizer, KeywordExtractor, TextRank};
pub use text_coordinator::{
    AdvancedBatchClassificationResult, AdvancedSemanticSimilarityResult, AdvancedTextConfig,
    AdvancedTextCoordinator, AdvancedTextResult, AdvancedTopicModelingResult,
};
pub use text_statistics::{ReadabilityMetrics, TextMetrics, TextStatistics};
pub use token_filter::{
    CompositeFilter, CustomFilter, FrequencyFilter, LengthFilter, RegexFilter, StopwordsFilter,
    TokenFilter,
};
pub use tokenize::{
    bpe::{BpeConfig, BpeTokenizer, BpeVocabulary},
    CharacterTokenizer, NgramTokenizer, RegexTokenizer, SentenceTokenizer, Tokenizer,
    WhitespaceTokenizer, WordTokenizer,
};
pub use tokenizer::{
    BPETokenizer, SimpleCharTokenizer, SimpleWhitespaceTokenizer, TransformerTokenizer,
    WordPieceTokenizer,
};
pub use topic_coherence::{TopicCoherence, TopicDiversity};
pub use topic_modeling::{
    LatentDirichletAllocation, LdaBuilder, LdaConfig, LdaLearningMethod, Topic as LdaTopic,
};
pub use transformer::{
    FeedForward, LayerNorm, MultiHeadAttention, PositionalEncoding, TokenEmbedding,
    TransformerConfig, TransformerDecoder, TransformerDecoderLayer, TransformerEncoder,
    TransformerEncoderLayer, TransformerModel,
};
pub use vectorize::{CountVectorizer, TfidfVectorizer, Vectorizer};
pub use visualization::{
    AttentionVisualizer, Color, ColorScheme, EmbeddingVisualizer, SentimentVisualizer,
    TextAnalyticsDashboard, TopicVisualizer, VisualizationConfig, WordCloud,
};
pub use vocabulary::Vocabulary;
pub use weighted_distance::{
    DamerauLevenshteinWeights, LevenshteinWeights, WeightedDamerauLevenshtein, WeightedLevenshtein,
    WeightedStringMetric,
};

// Re-exports for new modules
pub use keyword_extraction::{
    extract_keywords, Keyword, KeywordMethod, RakeKeywordExtractor, TextRankKeywordExtractor,
    TfIdfKeywordExtractor,
};
pub use language_detection::{
    detect_language, detect_language_with_strategy, DetectedLanguage, DetectionStrategy,
    LanguageDetectionOutput,
};
pub use named_entity_recognition::{extract_entities, NerEntity, NerEntityType, NerPatternConfig};
pub use text_similarity::{
    bm25_score, char_ngram_jaccard_similarity, edit_distance_similarity, jaccard_token_similarity,
    text_similarity, tfidf_cosine_similarity, Bm25Config, Bm25Scorer, SimilarityMethod,
    SimilarityResult, TfIdfCosineSimilarity,
};
pub use text_summarization::{
    score_position, score_textrank, score_tfidf, summarize, ScoredSentence, SummarizationMethod,
};

// HDP topic model
pub use hdp::{HdpConfig, HdpModel, HdpResult};

// New sentence encoder (USE-style, word-level)
pub use sentence_embeddings::{
    SentenceEncoder, SentenceEncoderConfig, SentenceEncoderPooling, SimCSELoss, SimCSETrainer,
};

// topic module re-exports (HDP with automatic topic selection)
pub use topic::hdp::{
    Hdp, HdpConfig as HdpAutoConfig, HdpState, HdpTopicConfig, HdpTopicModel, TopicError,
};

// Sentence encoder re-exports from embeddings::sentence_encoder (pooling, SimCSE training, embedding-based search)
pub use embeddings::sentence_encoder::{
    PoolingStrategy as SentencePoolingStrategy, SemanticSimilarity as EmbeddingSearch,
    SentenceEncoder as ProjSentenceEncoder, SimCseConfig, SimCseTrainer as ProjSimCseTrainer,
};

// Transliteration: trait + concrete transliterators
pub use transliteration::{
    CyrillicScheme, CyrillicTransliterator, HepburnTransliterator, PinyinStyle,
    PinyinTransliterator, Transliterator,
};

// Neural NLP: transformer encoder, attention viz, BERT classifier, neural NER
pub mod neural_nlp;
pub use neural_nlp::{
    AttentionHeatmap, AttentionVisualization, BertClassifier, BertClassifierConfig, NerTag,
    NeuralNer, NeuralNerConfig, TransformerEncoderConfig as NeuralEncoderConfig,
    TransformerTextEncoder,
};