SciRS2 Text

Text analysis and natural language processing module for the SciRS2 scientific computing library. This module provides tools for text processing, vectorization, and comparison.
Features
- Text Preprocessing: Tokenization, normalization, and cleaning utilities
- Text Vectorization: Methods for converting text to numerical representations
- Text Distance Metrics: Various string and text distance measures
- Vocabulary Management: Tools for building and managing vocabularies
- Utility Functions: Helper functions for text manipulation
Installation
Add the following to your Cargo.toml:
[dependencies]
scirs2-text = "0.1.0-alpha.2"
To enable optional features for integration with popular NLP libraries:
[dependencies]
scirs2-text = { version = "0.1.0-alpha.2", features = ["tokenizers", "wordpiece"] }
Usage
Basic usage examples:
use scirs2_text::{tokenize, preprocess, vectorize, distance, vocabulary};
use scirs2_core::error::CoreResult;
fn preprocessing_example() -> CoreResult<()> {
    let text = "Hello world! This is an example text for SciRS2 NLP module.";

    let clean_text = preprocess::clean_text(text, true, true, true)?;
    println!("Cleaned text: '{}'", clean_text);

    let tokens = tokenize::word_tokenize(&clean_text)?;
    println!("Tokens: {:?}", tokens);

    let stemmed = tokenize::stem_tokens(&tokens, "porter")?;
    println!("Stemmed tokens: {:?}", stemmed);

    let bigrams = tokenize::ngrams(&tokens, 2)?;
    println!("Bigrams: {:?}", bigrams);

    Ok(())
}
fn vectorization_example() -> CoreResult<()> {
    let documents = vec![
        "This is the first document.",
        "This document is the second document.",
        "And this is the third one.",
        "Is this the first document?",
    ];

    let (vocab, word_counts) =
        vocabulary::build_vocabulary(&documents, 1, None, None, false)?;
    println!("Vocabulary: {:?}", vocab);
    println!("Word counts: {:?}", word_counts);

    let count_vectors = vectorize::count_vectorize(&documents, &vocab)?;
    println!("Count vectors:");
    for (i, vec) in count_vectors.iter().enumerate() {
        println!("  Document {}: {:?}", i, vec);
    }

    let tfidf_vectors = vectorize::tfidf_vectorize(&documents, &vocab, None, None)?;
    println!("TF-IDF vectors:");
    for (i, vec) in tfidf_vectors.iter().enumerate() {
        println!("  Document {}: {:?}", i, vec);
    }

    Ok(())
}
fn distance_example() -> CoreResult<()> {
    let s1 = "kitten";
    let s2 = "sitting";

    let lev_dist = distance::levenshtein(s1, s2)?;
    println!("Levenshtein distance between '{}' and '{}': {}", s1, s2, lev_dist);

    let jaro_sim = distance::jaro_winkler(s1, s2)?;
    println!("Jaro-Winkler similarity between '{}' and '{}': {}", s1, s2, jaro_sim);

    let doc1 = "This is a test document about NLP";
    let doc2 = "This document is about natural language processing";
    let cos_sim = distance::cosine_similarity(doc1, doc2, None)?;
    println!("Cosine similarity between documents: {}", cos_sim);

    Ok(())
}
Components
Tokenization
Functions for text tokenization:
use scirs2_text::tokenize::{
    word_tokenize, sent_tokenize, regex_tokenize,
    stem_tokens, lemmatize_tokens, ngrams,
    stopwords, remove_stopwords,
};
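For example, sentences can be split and stopwords removed before further processing. The snippet below is a minimal sketch; apart from word_tokenize (shown in the Usage section above), the exact signatures of sent_tokenize, stopwords, and remove_stopwords are assumptions based on the function names, so check the crate documentation before relying on them.

use scirs2_text::tokenize;
use scirs2_core::error::CoreResult;

fn tokenization_sketch() -> CoreResult<()> {
    let text = "The quick brown fox. It jumps over the lazy dog!";

    // Assumed signature: sent_tokenize(&str) -> CoreResult<Vec<String>>
    let sentences = tokenize::sent_tokenize(text)?;
    println!("Sentences: {:?}", sentences);

    // word_tokenize is shown in the Usage section above
    let tokens = tokenize::word_tokenize(text)?;

    // Assumed signatures: stopwords(language) and remove_stopwords(tokens, stopword_list)
    let stop_list = tokenize::stopwords("english")?;
    let filtered = tokenize::remove_stopwords(&tokens, &stop_list)?;
    println!("Without stopwords: {:?}", filtered);

    Ok(())
}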
Preprocessing
Text preprocessing utilities:
use scirs2_text::preprocess::{
    clean_text, normalize_text, expand_contractions,
    remove_accents, remove_html_tags, remove_special_chars,
    remove_numbers, remove_whitespace, replace_urls, replace_emails,
};
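These utilities can be chained into a cleaning pipeline. The snippet below sketches one such pipeline; apart from clean_text (shown in the Usage section), the signatures of replace_urls and normalize_text are assumptions, so verify them against the crate documentation.

use scirs2_text::preprocess;
use scirs2_core::error::CoreResult;

fn preprocessing_pipeline_sketch() -> CoreResult<()> {
    let raw = "Visit https://example.com for more INFO!!!";

    // Assumed signature: replace_urls(&str, replacement) -> CoreResult<String>
    let no_urls = preprocess::replace_urls(raw, "<URL>")?;

    // Assumed signature: normalize_text(&str) -> CoreResult<String>
    let normalized = preprocess::normalize_text(&no_urls)?;

    println!("Normalized: '{}'", normalized);
    Ok(())
}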
Text Vectorization
Methods for text vectorization:
use scirs2_text::vectorize::{
    count_vectorize, tfidf_vectorize, hashing_vectorize,
    binary_vectorize, bm25_vectorize, cooccurrence_matrix,
};
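Beyond the count and TF-IDF vectors shown in the Usage section, a hashing vectorizer avoids building an explicit vocabulary. The sketch below assumes hashing_vectorize takes the documents and a number of hash buckets; treat the signature and return type as assumptions and check the crate documentation.

use scirs2_text::vectorize;
use scirs2_core::error::CoreResult;

fn hashing_sketch() -> CoreResult<()> {
    let documents = vec![
        "the cat sat on the mat",
        "the dog sat on the log",
    ];

    // Assumed signature: hashing_vectorize(&[&str], n_features) -> CoreResult<Vec<Vec<f64>>>
    let vectors = vectorize::hashing_vectorize(&documents, 16)?;
    for (i, v) in vectors.iter().enumerate() {
        println!("Document {}: {:?}", i, v);
    }
    Ok(())
}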
Distance Metrics
Text distance and similarity measures:
use scirs2_text::distance::{
    levenshtein, hamming, damerau_levenshtein,
    jaro_winkler, jaccard, sorensen_dice,
    cosine_similarity, euclidean_distance, manhattan_distance,
};
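Set-based measures such as Jaccard and Sørensen-Dice complement the edit-distance examples in the Usage section. The sketch below assumes they accept two string slices, mirroring levenshtein and jaro_winkler; the exact signatures are an assumption.

use scirs2_text::distance;
use scirs2_core::error::CoreResult;

fn set_similarity_sketch() -> CoreResult<()> {
    let a = "natural language processing";
    let b = "natural language understanding";

    // Assumed signatures mirroring levenshtein(s1, s2) from the Usage section
    let jac = distance::jaccard(a, b)?;
    let dice = distance::sorensen_dice(a, b)?;

    println!("Jaccard: {}, Sørensen-Dice: {}", jac, dice);
    Ok(())
}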
Vocabulary Management
Tools for building and managing vocabularies:
use scirs2_text::vocabulary::{
    build_vocabulary, filter_vocabulary, save_vocabulary, load_vocabulary,
    map_tokens_to_ids, map_ids_to_tokens, Vocabulary,
};
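A typical workflow builds a vocabulary from a corpus and persists it for reuse. The build_vocabulary call matches the Usage section above; the signatures of save_vocabulary and load_vocabulary below are assumptions used to illustrate the flow.

use scirs2_text::vocabulary;
use scirs2_core::error::CoreResult;

fn vocabulary_sketch() -> CoreResult<()> {
    let documents = vec!["first document", "second document"];

    // Same call as in the Usage section above
    let (vocab, _counts) = vocabulary::build_vocabulary(&documents, 1, None, None, false)?;

    // Assumed signatures: save_vocabulary(&vocab, path) / load_vocabulary(path)
    vocabulary::save_vocabulary(&vocab, "vocab.json")?;
    let reloaded = vocabulary::load_vocabulary("vocab.json")?;
    println!("Reloaded vocabulary: {:?}", reloaded);

    Ok(())
}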
Utilities
Helper functions for text processing:
use scirs2_text::utils::{
    split_text, join_tokens, is_digit, is_punctuation, is_stopword,
    detect_language, count_words, count_sentences,
};
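These helpers cover small text statistics and checks. The sketch below assumes count_words and count_sentences take a string slice and return a count, and that detect_language returns a language identifier; all three signatures are assumptions, so check the crate documentation.

use scirs2_text::utils;
use scirs2_core::error::CoreResult;

fn utils_sketch() -> CoreResult<()> {
    let text = "Rust is fast. Rust is safe.";

    // Assumed signatures: count_words(&str) / count_sentences(&str) -> CoreResult<usize>
    println!("Words: {}", utils::count_words(text)?);
    println!("Sentences: {}", utils::count_sentences(text)?);

    // Assumed signature: detect_language(&str) -> CoreResult<String>
    println!("Language: {}", utils::detect_language(text)?);

    Ok(())
}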
Integration with Other Libraries
This module provides easy integration with popular NLP libraries through optional features:
- tokenizers: Integration with HuggingFace tokenizers
- wordpiece: WordPiece tokenization for transformer models
- sentencepiece: SentencePiece tokenization
Example using feature-gated functionality:
use scirs2_text::tokenize::wordpiece_tokenize;

// Requires the "wordpiece" feature to be enabled
fn wordpiece_example() {
    let text = "Hello world, this is WordPiece tokenization.";
    let vocab_file = "path/to/wordpiece/vocab.txt";
    let tokens = wordpiece_tokenize(text, vocab_file, true, 100).unwrap();
    println!("WordPiece tokens: {:?}", tokens);
}
Contributing
See the CONTRIBUTING.md file for contribution guidelines.
License
This project is dual-licensed; you can choose to use either license. See the LICENSE file for details.