scirs2-text 0.3.0

Text processing module for SciRS2 (scirs2-text)

Comprehensive natural language processing library for the SciRS2 scientific computing ecosystem. Provides tokenization, vectorization, word embeddings, NER, CRF sequence labelling, BPE, dependency parsing, topic modeling, summarization, question answering, knowledge graph extraction, and more, all in pure, safe Rust with parallel processing.

Features

Tokenization

  • Word tokenizer: Unicode-aware, configurable punctuation handling
  • Sentence tokenizer: Rule-based sentence boundary detection
  • Character/grapheme tokenizer: Unicode grapheme cluster segmentation
  • N-gram tokenizer: Unigrams through arbitrary n-grams, range support
  • Regex tokenizer: Pattern-based tokenization and gap tokenization
  • BPE tokenizer (Byte Pair Encoding): Subword tokenization with vocabulary learning
  • WordPiece tokenizer: BERT-style subword tokenization
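
The sentence and regex tokenizers follow the same Tokenizer trait used in the Quick Start below. A minimal sketch; the SentenceTokenizer and RegexTokenizer names and constructors here are assumptions about the API shape, not verified signatures:

use scirs2_text::tokenize::{SentenceTokenizer, RegexTokenizer, Tokenizer};

// Rule-based sentence boundary detection (default constructor assumed)
let sentences = SentenceTokenizer::default().tokenize("First sentence. Second one!")?;
// -> ["First sentence.", "Second one!"]

// Pattern-based tokenization: keep runs of word characters (pattern is illustrative)
let words = RegexTokenizer::new(r"\w+")?.tokenize("foo-bar baz")?;
// -> ["foo", "bar", "baz"]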

Text Preprocessing

  • Unicode normalization (NFC, NFD, NFKC, NFKD)
  • Case folding, accent removal
  • HTML/XML stripping, URL and email normalization
  • Contraction expansion
  • Number normalization (dates, currencies, percentages)
  • Stopword removal, whitespace normalization
  • TextPreprocessor pipeline: chain normalizers and cleaners
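
A minimal sketch of chaining the preprocessing steps by hand, using the BasicNormalizer and BasicTextCleaner imported in the Quick Start below; the BasicTextCleaner::default() constructor and the clean method name are assumptions:

use scirs2_text::preprocess::{BasicNormalizer, BasicTextCleaner, TextNormalizer, TextCleaner};

let normalizer = BasicNormalizer::default();
let cleaner    = BasicTextCleaner::default();   // default constructor assumed

// Normalize first (case folding, Unicode forms), then clean (markup, whitespace)
let normalized = normalizer.normalize("  <b>Hello,</b>   WORLD!  ")?;
let cleaned    = cleaner.clean(&normalized)?;   // method name assumed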

Stemming and Lemmatization

  • Porter Stemmer (English)
  • Snowball Stemmer (English, extensible)
  • Lancaster Stemmer
  • Rule-based lemmatizer with morphological analysis
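
Stemming truncates by rule, while the lemmatizer maps words to dictionary forms. A minimal sketch; the RuleLemmatizer name and lemmatize signature are assumptions:

use scirs2_text::stemming::{PorterStemmer, RuleLemmatizer, Stemmer};

let stemmer = PorterStemmer::new();
println!("{}", stemmer.stem("studies")?);          // "studi" (rule-based truncation)

let lemmatizer = RuleLemmatizer::default();        // constructor assumed
println!("{}", lemmatizer.lemmatize("studies")?);  // "study" (dictionary form)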

Text Vectorization

  • Count Vectorizer: Bag-of-words, N-gram support, min/max document frequency filtering
  • TF-IDF Vectorizer: IDF smoothing, sublinear TF scaling, L1/L2 normalization
  • Binary Vectorizer: Occurrence-only representation
  • Enhanced vectorizers: EnhancedCountVectorizer, EnhancedTfidfVectorizer with max_features and vocabulary pruning (sketch after this list)
  • Sparse matrix representation for memory efficiency
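
A minimal sketch of capping the vocabulary with EnhancedTfidfVectorizer; the builder-style setters are assumptions about the API shape:

use scirs2_text::enhanced_vectorize::EnhancedTfidfVectorizer;

let docs = vec![
    "the quick brown fox",
    "the lazy dog",
];

// Keep only the 5,000 most frequent terms and include bigrams
// (setter names assumed, not verified)
let mut vectorizer = EnhancedTfidfVectorizer::default()
    .set_max_features(Some(5000))
    .set_ngram_range((1, 2));

let matrix = vectorizer.fit_transform(&docs)?;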

Word Embeddings

  • Word2Vec (Skip-gram and CBOW): negative sampling, configurable window size, hierarchical softmax
  • GloVe loading: Load pre-trained GloVe vectors (sketch after this list)
  • FastText (pure Rust): Subword embeddings with character n-gram support
  • Embedding similarity: cosine similarity, most-similar-words
  • Save/load in binary and text formats
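
Loading pre-trained GloVe vectors might look like the following. The GloVeModel type and load constructor are assumptions; most_similar mirrors the Word2Vec call shown under Quick Start:

use scirs2_text::embeddings::GloVeModel;   // type name assumed

let glove   = GloVeModel::load("glove.6B.100d.txt")?;   // loader name assumed
let similar = glove.most_similar("king", 5)?;
for (word, score) in &similar {
    println!("{}: {:.4}", word, score);
}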

Sequence Labelling

  • CRF (Conditional Random Fields): Viterbi decoding, feature engineering, sequence-to-label
  • HMM (Hidden Markov Model): Forward-backward and Viterbi algorithms, e.g. for POS tagging
  • Custom feature extractors for NER, POS, chunking tasks

Named Entity Recognition (NER)

  • Rule-based NER with regex patterns
  • Dictionary-based NER with gazetteer support
  • CRF-based NER with feature engineering
  • Standard entity types: PER, ORG, LOC, DATE, TIME, MONEY, PERCENT

Advanced NLP (v0.3.0+)

  • Coreference resolution: mention detection and clustering
  • Dependency parsing: arc-factored dependency graph construction
  • Discourse analysis: rhetorical structure theory primitives
  • Event extraction: event trigger and argument extraction
  • Question answering: extractive span detection (sketch after this list)
  • Knowledge graph extraction: entity-relation-entity triples from text
  • Semantic parsing: logical form generation from natural language
  • Temporal extraction: date and time expression normalization
  • Grammar checking: rule-based error detection
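
As one illustration of the advanced modules, extractive question answering returns an answer span from a context passage. A minimal sketch; the ExtractiveQa type and its answer method are assumptions:

use scirs2_text::question_answering::ExtractiveQa;   // type name assumed

let qa      = ExtractiveQa::default();
let context = "Marie Curie won the Nobel Prize in Physics in 1903.";
let answer  = qa.answer("When did Marie Curie win the Nobel Prize?", context)?;
println!("{} ({}..{})", answer.text, answer.start, answer.end);
// e.g. "1903" with its span in the context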

Topic Modeling

  • LDA (Latent Dirichlet Allocation): variational inference, coherence metrics (CV, UMass, UCI)
  • NMF-based topic modeling: non-negative matrix factorization topics
  • TopicModel trait for interchangeable backends

Text Summarization

  • Extractive: TextRank, centroid-based, keyword-based extraction (sketch below)
  • Abstractive: sequence-to-sequence summarization primitives (abstractive_summary.rs)
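
A minimal sketch of extractive summarization with TextRank. The module path and TextRankSummarizer type are assumptions (only the abstractive module appears in the Module Map below):

use scirs2_text::summarization::TextRankSummarizer;   // module path and type assumed

let summarizer = TextRankSummarizer::default();
let article    = "First sentence of a long article. Supporting detail. Key finding. Minor aside.";
let summary    = summarizer.summarize(article, 2)?;   // keep the 2 top-ranked sentences
println!("{}", summary);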

Sentiment Analysis

  • Lexicon-based: VADER-style sentiment scoring with basic lexicon
  • Rule-based: negation handling, intensifiers, modifiers
  • ML-based adapter: integrate trained classifiers

Text Classification

  • Feature extraction pipeline for classification
  • Multinomial Naive Bayes (text-optimized)
  • Logistic regression adapter
  • Dataset handling utilities
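
A minimal end-to-end sketch: vectorize with the bag-of-words counts described above, then fit a Multinomial Naive Bayes classifier. The CountVectorizer::default(), MultinomialNaiveBayes, fit, and predict names are assumptions about the API shape:

use scirs2_text::text_classification::MultinomialNaiveBayes;   // type name assumed
use scirs2_text::vectorize::{CountVectorizer, Vectorizer};

let docs   = vec!["great product", "terrible service", "love it", "awful"];
let labels = vec![1usize, 0, 1, 0];   // 1 = positive, 0 = negative

let mut vectorizer = CountVectorizer::default();   // constructor assumed
let x = vectorizer.fit_transform(&docs)?;

let mut clf = MultinomialNaiveBayes::default();    // constructor assumed
clf.fit(&x, &labels)?;                             // fit/predict signatures assumed

let test = vectorizer.transform(&["great service"])?;
println!("{:?}", clf.predict(&test)?);             // e.g. [1]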

String Metrics and Phonetics

  • Levenshtein distance (basic and Damerau-Levenshtein)
  • Optimal String Alignment (restricted Damerau-Levenshtein)
  • Weighted Levenshtein / Weighted Damerau-Levenshtein (custom operation costs)
  • Jaro-Winkler similarity
  • Cosine similarity, Jaccard similarity
  • Soundex phonetic encoding
  • Metaphone phonetic algorithm
  • NYSIIS phonetic algorithm

Language Model Primitives

  • N-gram language model with Kneser-Ney smoothing
  • Character-level language model
  • Perplexity computation
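
A minimal sketch of the n-gram language model with perplexity scoring; the NGramLanguageModel name, constructor arguments, and method names are assumptions:

use scirs2_text::language_models::NGramLanguageModel;   // type name assumed

let corpus = vec![
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
];

let mut lm = NGramLanguageModel::new(3);   // trigram model, Kneser-Ney smoothing assumed
lm.fit(&corpus)?;

println!("perplexity: {:.2}", lm.perplexity("the quick dog")?);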

Multilingual Support

  • Unicode-first throughout
  • Language detection via N-gram character models (sketch after this list)
  • Multilingual utilities (multilingual_ext.rs)
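
Language detection via the character n-gram models might look like this; the LanguageDetector type and detect method are assumptions:

use scirs2_text::multilingual_ext::LanguageDetector;   // type name assumed

let detector = LanguageDetector::default();
println!("{:?}", detector.detect("Der schnelle braune Fuchs")?);   // e.g. German
println!("{:?}", detector.detect("Le renard brun rapide")?);       // e.g. French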

Performance

  • Parallel processing: Rayon-based multi-threaded tokenization and corpus vectorization (sketch after this list)
  • Batch processing: Efficient large document collection handling
  • Sparse matrices: Memory-efficient vectorizer outputs
  • SIMD operations: simd_ops.rs for accelerated string comparisons and distance computation
  • Memory-mapped corpora: Stream corpora larger than RAM without loading them fully into memory
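
The per-document tokenizers compose directly with Rayon. A minimal sketch using rayon's parallel iterators over the WordTokenizer from the Quick Start; the crate's own parallel module may expose higher-level helpers, and WordTokenizer is assumed to be Sync:

use rayon::prelude::*;
use scirs2_text::tokenize::{Tokenizer, WordTokenizer};

let docs = vec!["first document", "second document", "third document"];
let tokenizer = WordTokenizer::new(true);

// Tokenize every document on the Rayon thread pool; documents are
// independent, so this parallelizes without shared mutable state.
let token_lists = docs
    .par_iter()
    .map(|doc| tokenizer.tokenize(doc))
    .collect::<Result<Vec<_>, _>>()?;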

Installation

[dependencies]
scirs2-text = "0.3.0"

Quick Start

The snippets below propagate errors with ?; each assumes a surrounding function that returns the crate's Result type.

Tokenization and Vectorization

use scirs2_text::{
    tokenize::{WordTokenizer, NgramTokenizer, Tokenizer},
    vectorize::{TfidfVectorizer, CountVectorizer, Vectorizer},
    stemming::{PorterStemmer, Stemmer},
    preprocess::{BasicNormalizer, BasicTextCleaner, TextNormalizer, TextCleaner},
};

// Normalization
let normalizer = BasicNormalizer::default();
let normalized = normalizer.normalize("Hello, World! THIS is a TEST.")?;

// Word tokenization
let tokenizer = WordTokenizer::new(true);  // lowercase=true
let tokens = tokenizer.tokenize("The quick brown fox jumps")?;

// N-gram tokenization
let bigrams = NgramTokenizer::new(2)?.tokenize("hello world test")?;
let range   = NgramTokenizer::with_range(1, 3)?.tokenize("hello world test")?;

// Stemming
let stemmer = PorterStemmer::new();
let stem    = stemmer.stem("running")?;    // "run"

// TF-IDF vectorization
let documents = vec![
    "The quick brown fox jumps over the lazy dog",
    "A quick brown dog outpaces a quick fox",
    "The lazy dog sleeps all day",
];
let mut tfidf = TfidfVectorizer::new(false, true, Some("l2".to_string()));
let matrix    = tfidf.fit_transform(&documents)?;

BPE Tokenizer

use scirs2_text::bpe_tokenizer::BpeTokenizer;

let corpus = vec![
    "the quick brown fox",
    "the lazy dog",
    "quick brown dog",
];

// Learn BPE vocabulary from corpus
let mut bpe = BpeTokenizer::new(1000);  // vocab_size=1000
bpe.fit(&corpus)?;

let tokens = bpe.tokenize("the quick fox")?;
println!("{:?}", tokens);

// Save/load vocabulary
bpe.save_vocab("bpe_vocab.json")?;
let loaded = BpeTokenizer::load_vocab("bpe_vocab.json")?;

CRF Sequence Labelling (NER)

use scirs2_text::sequence_labeling::CrfTagger;

// Build CRF tagger for NER
let mut crf = CrfTagger::new();
crf.add_feature_fn(|token, _context| {
    vec![
        format!("word={}", token.to_lowercase()),
        format!("is_upper={}", token.chars().next().map_or(false, |c| c.is_uppercase())),
    ]
});

// Train on labelled data
crf.fit(&train_sequences, &train_labels, 100, 0.01)?;  // 100 iterations, learning rate 0.01 (assumed parameter order)

// Predict labels for new sequence
let tokens = vec!["John", "Smith", "visited", "London"];
let labels = crf.predict(&tokens)?;
// e.g. ["B-PER", "I-PER", "O", "B-LOC"]

Named Entity Recognition

use scirs2_text::ner::NerTagger;

let mut ner = NerTagger::new();
ner.add_gazetteer("PER", &["John Smith", "Jane Doe"])?;
ner.add_pattern("DATE", r"\d{4}-\d{2}-\d{2}")?;

let text     = "John Smith visited London on 2024-01-15.";
let entities = ner.extract(text)?;

for entity in &entities {
    println!("{}: {} ({}..{})", entity.label, entity.text, entity.start, entity.end);
}

Topic Modeling (LDA)

use scirs2_text::topic_model::LatentDirichletAllocation;

let documents = load_corpus()?;
let mut lda = LatentDirichletAllocation::new(10, 0.1, 0.01, 1000, Some(42));  // 10 topics, alpha, beta, max iterations, seed (assumed order)
lda.fit(&documents)?;

// Get top words per topic
for (topic_id, words) in lda.top_words(10).iter().enumerate() {
    println!("Topic {}: {}", topic_id, words.join(", "));
}

// Document-topic distributions
let dist = lda.transform(&["new document text"])?;

// Coherence score
let coherence = lda.coherence_cv(&documents, 10)?;
println!("CV coherence: {:.4}", coherence);

Sentiment Analysis

use scirs2_text::sentiment::LexiconSentimentAnalyzer;

let analyzer  = LexiconSentimentAnalyzer::with_basic_lexicon();
let result    = analyzer.analyze("I love this library! It's fantastic.")?;

println!("Sentiment: {:?}", result.sentiment);   // Positive
println!("Score: {:.4}", result.compound_score);

Knowledge Graph Extraction

use scirs2_text::knowledge_graph::KnowledgeGraphExtractor;

let extractor = KnowledgeGraphExtractor::new();
let text      = "Albert Einstein developed the theory of relativity.";
let triples   = extractor.extract(text)?;

for triple in &triples {
    println!("({}, {}, {})", triple.subject, triple.relation, triple.object);
}
// -> ("Albert Einstein", "developed", "theory of relativity")

Word Embeddings (Word2Vec)

use scirs2_text::embeddings::{Word2Vec, Word2VecConfig, Word2VecAlgorithm};

let config = Word2VecConfig {
    vector_size: 100,
    window: 5,
    min_count: 2,
    algorithm: Word2VecAlgorithm::SkipGram,
    iterations: 15,
    negative_samples: 5,
    ..Default::default()
};

let mut model = Word2Vec::builder().config(config).build()?;
model.train(&documents)?;

let similar = model.most_similar("computer", 5)?;
for (word, similarity) in &similar {
    println!("{}: {:.4}", word, similarity);
}

Advanced String Metrics

use scirs2_text::string_metrics::{DamerauLevenshteinMetric, StringMetric, Soundex};
use scirs2_text::weighted_distance::{WeightedLevenshtein, LevenshteinWeights};
use std::collections::HashMap;

// Damerau-Levenshtein with transpositions
let dl = DamerauLevenshteinMetric::new();
let d  = dl.distance("kitten", "sitting")?;

// Weighted Levenshtein with custom operation costs
let weights = LevenshteinWeights::new(2.0, 1.0, 0.5);  // ins=2, del=1, sub=0.5
let wl      = WeightedLevenshtein::with_weights(weights);
let wd      = wl.distance("kitten", "sitting")?;

// Character-pair specific substitution costs
let mut costs = HashMap::new();
costs.insert(('k', 's'), 0.1);   // Make k→s substitution cheap
let char_weights = LevenshteinWeights::default().with_substitution_costs(costs);
let cwl = WeightedLevenshtein::with_weights(char_weights);

// Phonetic encoding
let soundex = Soundex::new();
println!("{}", soundex.encode("Robert")?);      // R163
println!("{}", soundex.sounds_like("Smith", "Smythe")?);  // true

Text Statistics and Readability

use scirs2_text::text_statistics::{TextStatistics, ReadabilityMetrics};

let stats   = TextStatistics::new();
let text    = "The quick brown fox jumps over the lazy dog. A simple sentence.";
let metrics = stats.get_all_metrics(text)?;

println!("Flesch Reading Ease: {:.2}",          metrics.flesch_reading_ease);
println!("Flesch-Kincaid Grade: {:.2}",         metrics.flesch_kincaid_grade_level);
println!("Gunning Fog Index: {:.2}",            metrics.gunning_fog);
println!("Lexical Diversity: {:.4}",            metrics.lexical_diversity);
println!("Word count: {}",                      metrics.text_statistics.word_count);
println!("Avg sentence length: {:.2}",          metrics.text_statistics.avg_sentence_length);

Module Map

Module                 Contents
tokenize               Word, sentence, character, n-gram, and regex tokenizers
bpe_tokenizer          Byte Pair Encoding tokenizer with vocabulary learning
preprocess             Normalizers, cleaners, preprocessing pipeline
stemming               Porter, Snowball, Lancaster stemmers; lemmatizer
vectorize              Count, TF-IDF, binary vectorizers
enhanced_vectorize     Enhanced vectorizers with n-gram and filtering support
embeddings             Word2Vec (Skip-gram/CBOW), GloVe loader, FastText
ner                    Named entity recognition (rule-, dictionary-, and CRF-based)
sequence_labeling      CRF and HMM sequence labelling (POS, NER, chunking)
sentiment              Lexicon-based and rule-based sentiment analysis
topic_model            LDA and NMF-based topic modeling
text_classification    Feature extraction, Naive Bayes, evaluation
string_metrics         Levenshtein, Damerau-Levenshtein, Jaro-Winkler, phonetics
weighted_distance      Weighted edit distances with custom operation costs
text_statistics        Readability metrics (Flesch, Gunning Fog, SMOG, etc.)
knowledge_graph        Entity-relation triple extraction
coreference            Mention detection and coreference clustering
dependency             Dependency parsing
discourse              Discourse analysis and RST primitives
event_extraction       Event trigger and argument extraction
question_answering     Extractive question answering
abstractive_summary    Abstractive summarization primitives
temporal               Date/time expression extraction and normalization
grammar                Rule-based grammar checking
semantic_parsing       Logical form generation
multilingual_ext       Multilingual utilities and language detection
language_models        N-gram language model, character LM, perplexity
information_theory     Entropy, mutual information, KL divergence for text
simd_ops               SIMD-accelerated string operations
parallel               Parallel corpus processing utilities

Performance

  • Tokenization: ~1M tokens/second (parallel mode)
  • TF-IDF vectorization: ~10K documents/second
  • String similarity: ~100K comparisons/second
  • Topic modeling: scales to 100K+ documents
  • Zero-copy sparse matrix output from vectorizers
  • Memory-mapped corpus support for corpora larger than RAM

Dependencies

  • scirs2-core — RNG, parallel utilities, SIMD primitives
  • regex — Regular expression matching
  • unicode-segmentation — Unicode grapheme cluster segmentation
  • unicode-normalization — Unicode normalization forms (NFC/NFD/NFKC/NFKD)

License

Licensed under the Apache License 2.0. See LICENSE for details.

Authors

COOLJAPAN OU (Team KitaSan)