torsh-text
Text processing and NLP utilities for ToRSh, leveraging scirs2-text for efficient text operations.
Overview
This crate provides text processing building blocks in six areas:
- Tokenization: Various tokenization strategies (word, subword, character)
- Vocabulary Management: Efficient vocabulary building and management
- Text Datasets: Common NLP datasets and data loaders
- Preprocessing: Text normalization, cleaning, and augmentation
- Embeddings: Word embeddings and embedding layers
- Utilities: Text generation, metrics, and analysis tools
Note: This crate integrates with scirs2-text for optimized text processing operations.
Usage
Tokenization
use *;
// Basic tokenizer
let tokenizer = new
.do_lower_case
.strip_accents;
let text = "Hello, World! This is a test.";
let tokens = tokenizer.tokenize?;
// WordPiece tokenizer
let vocab = load_vocab?;
let wordpiece = new
.unk_token
.max_input_chars_per_word;
let tokens = wordpiece.tokenize?;
// ["un", "##aff", "##able"]
// SentencePiece tokenizer
let sp_model = from_file?;
let tokens = sp_model.encode?;
let decoded = sp_model.decode?;
// BPE tokenizer
let bpe = from_file?;
let encoding = bpe.encode?;
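The `["un", "##aff", "##able"]` output shown for the WordPiece tokenizer is what a greedy longest-match-first split typically produces. The following is a minimal standalone sketch of that idea using only the standard library. The function name `wordpiece_split` and its signature are hypothetical and are not part of the torsh-text API; the actual tokenizer may differ (for example in how it applies `max_input_chars_per_word`).

```rust
use std::collections::HashSet;

/// Hypothetical helper (not the torsh-text API): greedy longest-match-first
/// WordPiece split of a single word. Non-initial pieces carry a "##" prefix.
/// Returns `[unk]` if some part of the word has no match in the vocabulary.
fn wordpiece_split(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining substring first, then shrink from the right.
        let mut end = chars.len();
        let mut found: Option<String> = None;
        while start < end {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate = format!("##{}", candidate);
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // No prefix of the remainder is in the vocabulary.
            None => return vec![unk.to_string()],
        }
    }
    pieces
}
```

With a vocabulary containing `un`, `##aff`, and `##able`, calling `wordpiece_split("unaffable", &vocab, "[UNK]")` yields the three pieces from the comment above.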
Vocabulary Management
use *;
// Build vocabulary from corpus
let corpus = vec!;
let vocab = build_from_iterator?;
// Convert between tokens and indices
let indices = vocab.tokens_to_indices?;
let tokens = vocab.indices_to_tokens?;
// Save and load vocabulary
vocab.save?;
let loaded = from_file?;
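To make the token/index round trip concrete, here is a small standalone sketch of a vocabulary built from a corpus. The `Vocab` type below is hypothetical and only illustrates the concept; it assumes whitespace tokenization and reserves index 0 for an unknown token, neither of which is confirmed for the torsh-text implementation.

```rust
use std::collections::HashMap;

/// Hypothetical minimal vocabulary (not the torsh-text type).
struct Vocab {
    token_to_index: HashMap<String, usize>,
    index_to_token: Vec<String>,
}

impl Vocab {
    /// Index 0 is reserved for `unk`; other tokens receive indices in
    /// order of first appearance in the corpus.
    fn build<'a>(sentences: impl IntoIterator<Item = &'a str>, unk: &str) -> Self {
        let mut vocab = Vocab {
            token_to_index: HashMap::new(),
            index_to_token: Vec::new(),
        };
        vocab.insert(unk);
        for sentence in sentences {
            for token in sentence.split_whitespace() {
                vocab.insert(token);
            }
        }
        vocab
    }

    fn insert(&mut self, token: &str) {
        if !self.token_to_index.contains_key(token) {
            let next = self.index_to_token.len();
            self.token_to_index.insert(token.to_string(), next);
            self.index_to_token.push(token.to_string());
        }
    }

    /// Unknown tokens map to index 0.
    fn tokens_to_indices(&self, tokens: &[&str]) -> Vec<usize> {
        tokens
            .iter()
            .map(|t| *self.token_to_index.get(*t).unwrap_or(&0))
            .collect()
    }

    /// Out-of-range indices map back to the unknown token.
    fn indices_to_tokens(&self, indices: &[usize]) -> Vec<&str> {
        indices
            .iter()
            .map(|&i| match self.index_to_token.get(i) {
                Some(token) => token.as_str(),
                None => self.index_to_token[0].as_str(),
            })
            .collect()
    }
}
```

Keeping both a map and a vector gives constant-time lookup in each direction at the cost of storing every token twice.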
Text Datasets
use *;
// IMDB sentiment dataset
let imdb = IMDBnew?;
for in imdb.iter
// Custom text classification dataset
let dataset = from_csv?;
// Language modeling dataset
let lm_dataset = from_file?;
// Translation dataset
let translation = new?;
Text Preprocessing
use *;
// Text normalization
let normalizer = new
.lowercase
.remove_punctuation
.normalize_unicode
.expand_contractions;
let normalized = normalizer.normalize?;
// "do not forget the cafe"
// Text augmentation
let augmenter = new
.add_synonym_replacement
.add_random_insertion
.add_random_swap
.add_random_deletion;
let augmented = augmenter.augment?;
// Padding and truncation
let padded = pad_sequence?;
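Padding and truncation bring variable-length token sequences to a common length so they can be batched. A minimal sketch of the idea follows; the name `pad_to_length`, the `u32` token type, and right-side padding are assumptions for illustration and may not match the crate's `pad_sequence`.

```rust
/// Hypothetical helper (not the torsh-text API): truncates each sequence to
/// `max_len` and right-pads shorter ones with `pad_id`.
fn pad_to_length(seqs: &[Vec<u32>], max_len: usize, pad_id: u32) -> Vec<Vec<u32>> {
    seqs.iter()
        .map(|seq| {
            // Keep at most `max_len` leading tokens.
            let mut out: Vec<u32> = seq.iter().copied().take(max_len).collect();
            // Fill the remainder (if any) with the padding id.
            out.resize(max_len, pad_id);
            out
        })
        .collect()
}
```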
Embeddings
use *;
// Pre-trained embeddings
let glove = from_file?;
let word_vector = glove.get_vector?;
// Embedding layer
let embedding = new?;
// Initialize with pre-trained
embedding.from_pretrained?;
// Contextual embeddings
let bert_embeddings = new?;
Text Generation
use *;
// Text generation utilities
let generator = new
.temperature
.top_k
.top_p
.repetition_penalty;
let generated = generator.generate?;
// Beam search
let beam_output = generator.beam_search?;
// Sampling strategies
let sampled = generator.sample?;
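The `temperature` and `top_k` settings above reshape the next-token distribution before sampling. The sketch below shows one common way these two steps are computed, using only the standard library. Both function names are hypothetical and this is not the crate's generator; it omits `top_p`, the repetition penalty, and the random draw itself.

```rust
/// Hypothetical helper: softmax over `logits / temperature`.
/// `temperature` is assumed to be > 0; higher values flatten the distribution.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Hypothetical helper: keeps the `k` most probable entries, zeroes the
/// rest, and renormalizes so the kept entries sum to 1.
fn top_k_filter(probs: &[f32], k: usize) -> Vec<f32> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    // Sort indices by descending probability.
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    let mut kept = vec![0.0_f32; probs.len()];
    for &i in order.iter().take(k) {
        kept[i] = probs[i];
    }
    let sum: f32 = kept.iter().sum();
    if sum > 0.0 {
        kept.iter_mut().for_each(|p| *p /= sum);
    }
    kept
}
```

A sampler would then draw a token index from the filtered distribution; that step needs a random number generator and is left out to keep the sketch dependency-free.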
Text Analysis
use *;
// BLEU score
let bleu = calculate_bleu?;
// ROUGE scores
let rouge = calculate_rouge?;
// Perplexity
let perplexity = calculate_perplexity?;
// Text statistics
let stats = from_corpus?;
println!;
println!;
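For reference, perplexity is conventionally defined as the exponential of the mean negative log-likelihood per token. The sketch below computes it from per-token probabilities; the function name `perplexity_from_probs` and its input format are assumptions, and the crate's `calculate_perplexity` may take different inputs (for example logits or a model).

```rust
/// Hypothetical helper: perplexity = exp(-(1/N) * sum(ln p_i)), where p_i is
/// the probability the model assigned to the i-th reference token.
/// Returns `None` for empty input or probabilities outside (0, 1].
fn perplexity_from_probs(token_probs: &[f64]) -> Option<f64> {
    if token_probs.is_empty() || token_probs.iter().any(|&p| p <= 0.0 || p > 1.0) {
        return None;
    }
    let log_sum: f64 = token_probs.iter().map(|p| p.ln()).sum();
    let mean_nll = -log_sum / token_probs.len() as f64;
    Some(mean_nll.exp())
}
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, matching the intuition of choosing uniformly among four options.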
Integration with Models
use *;
use *;
// Text classification model
// Sequence-to-sequence model
// Transformer model
let transformer = new?;
Utilities
// N-gram extraction
let ngrams = extract_ngrams?;
// TF-IDF
let tfidf = fit?;
let tfidf_matrix = tfidf.transform?;
// Text similarity
let sim = cosine_similarity?;
// Sentence splitting
let sentences = split_sentences?;
// Language detection
let language = detect_language?;
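Two of the utilities above, n-gram extraction and text similarity, are simple enough to sketch with the standard library alone. The functions below are hypothetical illustrations, not the crate's `extract_ngrams` or `cosine_similarity`; they assume whitespace tokenization and raw term counts as the vector representation.

```rust
use std::collections::HashMap;

/// Hypothetical helper: all contiguous n-grams of `tokens`.
/// Returns an empty list if `n` is 0 or larger than the token count.
fn ngrams<'a>(tokens: &[&'a str], n: usize) -> Vec<Vec<&'a str>> {
    if n == 0 {
        return Vec::new(); // `windows(0)` would panic
    }
    tokens.windows(n).map(|w| w.to_vec()).collect()
}

/// Hypothetical helper: term counts of a whitespace-tokenized text.
fn term_counts(text: &str) -> HashMap<&str, f64> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token).or_insert(0.0) += 1.0;
    }
    counts
}

/// Euclidean norm of a sparse count vector.
fn l2_norm(m: &HashMap<&str, f64>) -> f64 {
    m.values().map(|v| v * v).sum::<f64>().sqrt()
}

/// Hypothetical helper: cosine similarity of two sparse count vectors.
/// Returns 0.0 when either vector is empty.
fn cosine(a: &HashMap<&str, f64>, b: &HashMap<&str, f64>) -> f64 {
    // Only terms present in both maps contribute to the dot product.
    let dot: f64 = a
        .iter()
        .map(|(k, va)| va * b.get(k).copied().unwrap_or(0.0))
        .sum();
    let denom = l2_norm(a) * l2_norm(b);
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}
```

Replacing the raw counts with TF-IDF weights leaves the cosine computation unchanged; only the values stored in the maps differ.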
Integration with SciRS2
This crate leverages scirs2-text for:
- Efficient string operations
- Optimized tokenization algorithms
- Fast vocabulary lookups
- Vectorized text processing
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.