# scirs2-text
Comprehensive natural language processing and text processing library for the SciRS2 scientific computing ecosystem. Provides tokenization, vectorization, word embeddings, NER, CRF sequence labelling, BPE, dependency parsing, topic modeling, summarization, question answering, knowledge graph extraction, and much more — all in pure, safe Rust with parallel processing.
## Features
### Tokenization
- Word tokenizer: Unicode-aware, configurable punctuation handling
- Sentence tokenizer: Rule-based sentence boundary detection
- Character/grapheme tokenizer: Unicode grapheme cluster segmentation
- N-gram tokenizer: Unigrams through arbitrary n-grams, range support
- Regex tokenizer: Pattern-based tokenization and gap tokenization
- BPE tokenizer (Byte Pair Encoding): Subword tokenization with vocabulary learning
- WordPiece tokenizer: BERT-style subword tokenization
### Text Preprocessing
- Unicode normalization (NFC, NFD, NFKC, NFKD)
- Case folding, accent removal
- HTML/XML stripping, URL and email normalization
- Contraction expansion
- Number normalization (dates, currencies, percentages)
- Stopword removal, whitespace normalization
- `TextPreprocessor` pipeline: chain normalizers and cleaners
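The chain-of-cleaners idea can be illustrated with a std-only sketch; the helper names and the tiny stopword list below are ours for illustration, not the crate's `TextPreprocessor` API:

```rust
use std::collections::HashSet;

/// Collapse runs of whitespace into single spaces and trim the ends.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Drop tokens that appear in the stopword set (case-insensitive).
fn remove_stopwords(text: &str, stopwords: &HashSet<&str>) -> String {
    text.split_whitespace()
        .filter(|t| !stopwords.contains(t.to_lowercase().as_str()))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let stopwords: HashSet<&str> = ["the", "a", "an", "is"].into_iter().collect();
    let raw = "  The   quick brown fox is   fast  ";
    // Chain the cleaners: whitespace normalization, then stopword removal.
    let cleaned = remove_stopwords(&normalize_whitespace(raw), &stopwords);
    assert_eq!(cleaned, "quick brown fox fast");
    println!("{cleaned}");
}
```

Each stage is a pure `&str -> String` function, so stages compose in any order.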
### Stemming and Lemmatization
- Porter Stemmer (English)
- Snowball Stemmer (English, extensible)
- Lancaster Stemmer
- Rule-based lemmatizer with morphological analysis
### Text Vectorization
- Count Vectorizer: Bag-of-words, N-gram support, min/max document frequency filtering
- TF-IDF Vectorizer: IDF smoothing, sublinear TF scaling, L1/L2 normalization
- Binary Vectorizer: Occurrence-only representation
- Enhanced vectorizers: `EnhancedCountVectorizer`, `EnhancedTfidfVectorizer` with max_features and vocabulary pruning
- Sparse matrix representation for memory efficiency
### Word Embeddings
- Word2Vec (Skip-gram and CBOW): negative sampling, configurable window size, hierarchical softmax
- GloVe loading: Load pre-trained GloVe vectors
- FastText (pure Rust): Subword embeddings with character n-gram support
- Embedding similarity: cosine similarity, most-similar-words
- Save/load in binary and text formats
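The cosine similarity behind most-similar queries is just a normalized dot product; a self-contained sketch (independent of the crate's embedding types):

```rust
/// Cosine similarity between two dense vectors of equal length.
/// Returns 0.0 if either vector is all zeros.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Toy 3-dimensional "embeddings" (illustrative values only).
    let king = [0.8, 0.3, 0.1];
    let queen = [0.7, 0.4, 0.1];
    let banana = [0.0, 0.1, 0.9];
    // Related words should score higher than unrelated ones.
    assert!(cosine_similarity(&king, &queen) > cosine_similarity(&king, &banana));
    println!("{:.3}", cosine_similarity(&king, &queen));
}
```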
### Sequence Labelling
- CRF (Conditional Random Fields): Viterbi decoding, feature engineering, sequence-to-label
- HMM (Hidden Markov Model): Forward-backward, Viterbi, for POS tagging
- Custom feature extractors for NER, POS, chunking tasks
### Named Entity Recognition (NER)
- Rule-based NER with regex patterns
- Dictionary-based NER with gazetteer support
- CRF-based NER with feature engineering
- Standard entity types: PER, ORG, LOC, DATE, TIME, MONEY, PERCENT
### Advanced NLP (v0.3.0+)
- Coreference resolution: mention detection and clustering
- Dependency parsing: arc-factored dependency graph construction
- Discourse analysis: rhetorical structure theory primitives
- Event extraction: event trigger and argument extraction
- Question answering: extractive span detection
- Knowledge graph extraction: entity-relation-entity triples from text
- Semantic parsing: logical form generation from natural language
- Temporal extraction: date and time expression normalization
- Grammar checking: rule-based error detection
### Topic Modeling
- LDA (Latent Dirichlet Allocation): variational inference, coherence metrics (CV, UMass, UCI)
- NMF-based topic modeling: non-negative matrix factorization topics
- `TopicModel` trait for interchangeable backends
### Text Summarization
- Extractive: TextRank, centroid-based, keyword-based extraction
- Abstractive: sequence-to-sequence summarization primitives (`abstractive_summary.rs`)
### Sentiment Analysis
- Lexicon-based: VADER-style sentiment scoring with basic lexicon
- Rule-based: negation handling, intensifiers, modifiers
- ML-based adapter: integrate trained classifiers
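Lexicon scoring with simple negation handling can be sketched in a few lines; the lexicon and the `sentiment_score` helper are illustrative, not the crate's analyzer API:

```rust
use std::collections::HashMap;

/// Sum lexicon valences over tokens; a preceding "not" flips the sign
/// (a crude stand-in for fuller negation handling).
fn sentiment_score(text: &str, lexicon: &HashMap<&str, f64>) -> f64 {
    let tokens: Vec<String> = text
        .split_whitespace()
        .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .collect();
    let mut score = 0.0;
    for (i, tok) in tokens.iter().enumerate() {
        if let Some(v) = lexicon.get(tok.as_str()) {
            let negated = i > 0 && tokens[i - 1] == "not";
            score += if negated { -v } else { *v };
        }
    }
    score
}

fn main() {
    let lexicon: HashMap<&str, f64> =
        [("good", 1.0), ("great", 2.0), ("bad", -1.0)].into_iter().collect();
    assert!(sentiment_score("This library is great!", &lexicon) > 0.0);
    assert!(sentiment_score("This is not good", &lexicon) < 0.0);
    println!("ok");
}
```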
### Text Classification
- Feature extraction pipeline for classification
- Multinomial Naive Bayes (text-optimized)
- Logistic regression adapter
- Dataset handling utilities
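The multinomial Naive Bayes idea can be shown self-contained with add-one smoothing; the function shape here is illustrative, not the crate's classifier API:

```rust
use std::collections::{HashMap, HashSet};

/// Log-probability of `doc` under class `label`, with add-one smoothing.
/// `docs` holds (label, tokens) training pairs.
fn log_prob(doc: &[&str], label: &str, docs: &[(&str, Vec<&str>)]) -> f64 {
    let vocab: HashSet<&str> = docs.iter().flat_map(|(_, t)| t.iter().copied()).collect();
    let class_docs: Vec<_> = docs.iter().filter(|(l, _)| *l == label).collect();
    // Log prior: fraction of training documents carrying this label.
    let mut lp = (class_docs.len() as f64 / docs.len() as f64).ln();
    // Per-class word counts.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    let mut total = 0usize;
    for (_, toks) in &class_docs {
        for t in toks {
            *counts.entry(*t).or_insert(0) += 1;
            total += 1;
        }
    }
    // Add-one smoothed log-likelihoods for each document token.
    for w in doc {
        let c = counts.get(w).copied().unwrap_or(0);
        lp += ((c + 1) as f64 / (total + vocab.len()) as f64).ln();
    }
    lp
}

fn main() {
    let train = vec![
        ("pos", vec!["great", "fun", "great"]),
        ("neg", vec!["boring", "bad"]),
    ];
    let doc = ["great", "fun"];
    // Pick the class with the higher log-probability.
    assert!(log_prob(&doc, "pos", &train) > log_prob(&doc, "neg", &train));
    println!("ok");
}
```

Smoothing keeps unseen words from zeroing out a class score, which matters for short documents.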
### String Metrics and Phonetics
- Levenshtein distance (basic and Damerau-Levenshtein)
- Optimal String Alignment (restricted Damerau-Levenshtein)
- Weighted Levenshtein / Weighted Damerau-Levenshtein (custom operation costs)
- Jaro-Winkler similarity
- Cosine similarity, Jaccard similarity
- Soundex phonetic encoding
- Metaphone phonetic algorithm
- NYSIIS phonetic algorithm
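The basic Levenshtein distance above is a small dynamic program; a std-only sketch over `char`s (the crate's implementation may differ, e.g. in grapheme handling):

```rust
/// Classic two-row dynamic-programming Levenshtein distance
/// over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between a[..i] and b[..j] for the previous row.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let sub = prev[j] + usize::from(ca != cb); // substitution (or match)
            cur.push(sub.min(prev[j + 1] + 1).min(cur[j] + 1)); // vs. delete/insert
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    assert_eq!(levenshtein("flaw", "lawn"), 2);
    println!("ok");
}
```

Damerau-Levenshtein adds a fourth case for adjacent transpositions; the weighted variants replace the unit costs with per-operation weights.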
### Language Model Primitives
- N-gram language model with Kneser-Ney smoothing
- Character-level language model
- Perplexity computation
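Perplexity is the exponential of the average negative log-probability of a sequence; a unigram-model sketch (independent of the crate's language-model types):

```rust
use std::collections::HashMap;

/// Perplexity = exp(-(1/N) * sum_i ln p(w_i)) under a unigram model.
/// Returns f64::INFINITY if any token has zero probability.
fn perplexity(tokens: &[&str], probs: &HashMap<&str, f64>) -> f64 {
    let mut log_sum = 0.0;
    for t in tokens {
        match probs.get(t) {
            Some(p) if *p > 0.0 => log_sum += p.ln(),
            _ => return f64::INFINITY,
        }
    }
    (-log_sum / tokens.len() as f64).exp()
}

fn main() {
    // A uniform model over 4 word types has perplexity 4 on in-vocabulary text.
    let probs: HashMap<&str, f64> =
        [("a", 0.25), ("b", 0.25), ("c", 0.25), ("d", 0.25)].into_iter().collect();
    let ppl = perplexity(&["a", "b", "a", "c"], &probs);
    assert!((ppl - 4.0).abs() < 1e-9);
    println!("{ppl}");
}
```

Smoothing schemes such as Kneser-Ney exist precisely to avoid the infinite-perplexity case for unseen n-grams.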
### Multilingual Support
- Unicode-first throughout
- Language detection via N-gram character models
- Multilingual utilities (`multilingual_ext.rs`)
### Performance
- Parallel processing: Rayon-based multi-threaded tokenization and corpus vectorization
- Batch processing: Efficient large document collection handling
- Sparse matrices: Memory-efficient vectorizer outputs
- SIMD operations: `simd_ops.rs` for accelerated string comparisons and distance computation
- Memory-mapped corpus: stream corpora larger than RAM without loading them fully
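The chunk-and-join pattern behind parallel corpus processing can be sketched with scoped std threads (the crate itself uses Rayon; `tokenize_parallel` is our illustrative name, not its API):

```rust
use std::thread;

/// Whitespace-tokenize a batch of documents across up to `n_threads`
/// OS threads using scoped threads.
fn tokenize_parallel(docs: &[String], n_threads: usize) -> Vec<Vec<String>> {
    let chunk = docs.len().div_ceil(n_threads).max(1);
    thread::scope(|s| {
        // Spawn one worker per chunk; scoped threads may borrow `docs`.
        let handles: Vec<_> = docs
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .map(|d| d.split_whitespace().map(String::from).collect())
                        .collect::<Vec<Vec<String>>>()
                })
            })
            .collect();
        // Joining in spawn order preserves document order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let docs: Vec<String> = (0..100).map(|i| format!("doc {i} has words")).collect();
    let tokenized = tokenize_parallel(&docs, 4);
    assert_eq!(tokenized.len(), 100);
    assert_eq!(tokenized[0], vec!["doc", "0", "has", "words"]);
    println!("ok");
}
```

With Rayon the same structure collapses to `docs.par_iter().map(tokenize).collect()`, with work-stealing instead of fixed chunks.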
## Installation

```toml
[dependencies]
scirs2-text = "0.3.0"
```
## Quick Start

### Tokenization and Vectorization

```rust
use scirs2_text::{NgramTokenizer, PorterStemmer, TextNormalizer, TfidfVectorizer, Tokenizer, WordTokenizer};

// Normalization
let normalizer = TextNormalizer::default();
let normalized = normalizer.normalize("Héllo,  Wörld!")?;

// Word tokenization
let tokenizer = WordTokenizer::new(true); // lowercase=true
let tokens = tokenizer.tokenize("The quick brown fox jumps over the lazy dog")?;

// N-gram tokenization
let bigrams = NgramTokenizer::new(2)?.tokenize("the quick brown fox")?;
let range = NgramTokenizer::with_range(1, 3)?.tokenize("the quick brown fox")?;

// Stemming
let stemmer = PorterStemmer::new();
let stem = stemmer.stem("running")?; // "run"

// TF-IDF vectorization
let documents = vec!["the cat sat on the mat", "the dog sat on the log"];
let mut tfidf = TfidfVectorizer::new();
let matrix = tfidf.fit_transform(&documents)?;
```
### BPE Tokenizer

```rust
use scirs2_text::BpeTokenizer;

let corpus = vec!["low lower lowest", "new newer newest"];

// Learn BPE vocabulary from the corpus
let mut bpe = BpeTokenizer::new(1000); // vocab_size=1000
bpe.fit(&corpus)?;

let tokens = bpe.tokenize("lowest newest")?;
println!("{:?}", tokens);

// Save/load vocabulary
bpe.save_vocab("bpe_vocab.json")?;
let loaded = BpeTokenizer::load_vocab("bpe_vocab.json")?;
```
### CRF Sequence Labelling (NER)

```rust
use scirs2_text::CrfTagger;

// Build CRF tagger for NER
let mut crf = CrfTagger::new();
crf.add_feature_fn(|tokens, i| vec![format!("word={}", tokens[i])]);

// Train on labelled data (token sequences paired with label sequences)
crf.fit(&train_sequences, &train_labels)?;

// Predict labels for a new sequence
let tokens = vec!["John", "Smith", "visited", "London"];
let labels = crf.predict(&tokens)?;
// e.g. ["B-PER", "I-PER", "O", "B-LOC"]
```
### Named Entity Recognition

```rust
use scirs2_text::NerTagger;

let mut ner = NerTagger::new();
ner.add_gazetteer("LOC", &["London", "Paris", "Tokyo"])?;
ner.add_pattern("DATE", r"\d{4}-\d{2}-\d{2}")?;

let text = "John Smith visited London on 2024-01-15.";
let entities = ner.extract(text)?;
for entity in &entities {
    println!("{} [{}]", entity.text, entity.label);
}
```
### Topic Modeling (LDA)

```rust
use scirs2_text::LatentDirichletAllocation;

let documents = load_corpus("path/to/corpus")?;

let mut lda = LatentDirichletAllocation::new(10); // 10 topics
lda.fit(&documents)?;

// Get top words per topic
for (topic_id, words) in lda.top_words(10).iter().enumerate() {
    println!("Topic {}: {:?}", topic_id, words);
}

// Document-topic distributions
let dist = lda.transform(&documents)?;

// Coherence score
let coherence = lda.coherence_cv(&documents)?;
println!("CV coherence: {}", coherence);
```
### Sentiment Analysis

```rust
use scirs2_text::LexiconSentimentAnalyzer;

let analyzer = LexiconSentimentAnalyzer::with_basic_lexicon();
let result = analyzer.analyze("This library is absolutely wonderful!")?;
println!("{:?}", result.sentiment); // Positive
println!("score: {}", result.score);
```
### Knowledge Graph Extraction

```rust
use scirs2_text::KnowledgeGraphExtractor;

let extractor = KnowledgeGraphExtractor::new();
let text = "Albert Einstein developed the theory of relativity.";
let triples = extractor.extract(text)?;
for triple in &triples {
    println!("({}, {}, {})", triple.subject, triple.relation, triple.object);
}
// -> ("Albert Einstein", "developed", "theory of relativity")
```
### Word Embeddings (Word2Vec)

```rust
use scirs2_text::{Word2Vec, Word2VecConfig};

let corpus = vec!["the king rules the kingdom", "the queen rules too"];

let config = Word2VecConfig {
    vector_size: 100,
    window: 5,
    ..Default::default()
};
let mut model = Word2Vec::builder().config(config).build()?;
model.train(&corpus)?;

let similar = model.most_similar("king", 10)?;
for (word, score) in &similar {
    println!("{}: {:.3}", word, score);
}
```
### Advanced String Metrics

```rust
use scirs2_text::string_metrics::{DamerauLevenshtein, Soundex};
use scirs2_text::weighted_distance::{EditWeights, WeightedLevenshtein};
use std::collections::HashMap;

// Damerau-Levenshtein with transpositions
let dl = DamerauLevenshtein::new();
let d = dl.distance("abcd", "acbd")?;

// Weighted Levenshtein with custom operation costs
let weights = EditWeights::new(2.0, 1.0, 0.5); // ins=2, del=1, sub=0.5
let wl = WeightedLevenshtein::with_weights(weights);
let wd = wl.distance("kitten", "sitting")?;

// Character-pair specific substitution costs
let mut costs = HashMap::new();
costs.insert(('k', 's'), 0.1); // Make k→s substitution cheap
let char_weights = EditWeights::default().with_substitution_costs(costs);
let cwl = WeightedLevenshtein::with_weights(char_weights);

// Phonetic encoding
let soundex = Soundex::new();
println!("{}", soundex.encode("Robert")?); // R163
println!("{}", soundex.encode("Robert")? == soundex.encode("Rupert")?); // true
```
### Text Statistics and Readability

```rust
use scirs2_text::text_statistics::TextStatistics;

let stats = TextStatistics::new();
let text = "The quick brown fox jumps over the lazy dog. A simple sentence.";
let metrics = stats.get_all_metrics(text)?;
println!("words: {}", metrics.word_count);
println!("sentences: {}", metrics.sentence_count);
println!("Flesch reading ease: {}", metrics.flesch_reading_ease);
println!("Gunning Fog: {}", metrics.gunning_fog);
println!("SMOG: {}", metrics.smog);
println!("avg words/sentence: {}", metrics.avg_words_per_sentence);
```
## Module Map

| Module | Contents |
|---|---|
| `tokenize` | Word, sentence, char, n-gram, regex tokenizers |
| `bpe_tokenizer` | Byte Pair Encoding tokenizer with vocabulary learning |
| `preprocess` | Normalizers, cleaners, preprocessing pipeline |
| `stemming` | Porter, Snowball, Lancaster stemmers; lemmatizer |
| `vectorize` | Count, TF-IDF, binary vectorizers |
| `enhanced_vectorize` | Enhanced vectorizers with n-gram and filtering |
| `embeddings` | Word2Vec (Skip-gram/CBOW), GloVe loader, FastText |
| `ner` | Named entity recognition (rule, dictionary, CRF-based) |
| `sequence_labeling` | CRF and HMM for sequence labelling (POS, NER, chunking) |
| `sentiment` | Lexicon-based and rule-based sentiment analysis |
| `topic_model` | LDA, NMF-based topic modeling |
| `text_classification` | Feature extraction, Naive Bayes, evaluation |
| `string_metrics` | Levenshtein, Damerau-Levenshtein, Jaro-Winkler, phonetics |
| `weighted_distance` | Weighted edit distances with custom operation costs |
| `text_statistics` | Readability metrics (Flesch, Gunning Fog, SMOG, etc.) |
| `knowledge_graph` | Entity-relation triple extraction |
| `coreference` | Mention detection and coreference clustering |
| `dependency` | Dependency parsing |
| `discourse` | Discourse analysis and RST primitives |
| `event_extraction` | Event trigger and argument extraction |
| `question_answering` | Extractive QA |
| `abstractive_summary` | Abstractive summarization primitives |
| `temporal` | Date/time expression extraction and normalization |
| `grammar` | Rule-based grammar checking |
| `semantic_parsing` | Logical form generation |
| `multilingual_ext` | Multilingual utilities and language detection |
| `language_models` | N-gram language model, character LM, perplexity |
| `information_theory` | Entropy, mutual information, KL divergence for text |
| `simd_ops` | SIMD-accelerated string operations |
| `parallel` | Parallel corpus processing utilities |
## Performance
- Tokenization: ~1M tokens/second (parallel mode)
- TF-IDF vectorization: ~10K documents/second
- String similarity: ~100K comparisons/second
- Topic modeling: scales to 100K+ documents
- Zero-copy sparse matrix output from vectorizers
- Memory-mapped corpus support for corpora larger than RAM
## Dependencies

- `scirs2-core`: RNG, parallel utilities, SIMD primitives
- `regex`: Regular expression matching
- `unicode-segmentation`: Unicode grapheme cluster segmentation
- `unicode-normalization`: Unicode normalization forms (NFC/NFD/NFKC/NFKD)
## License
Licensed under the Apache License 2.0. See LICENSE for details.
## Authors
COOLJAPAN OU (Team KitaSan)