Word embedding training library.
This crate provides tools for training word embeddings from scratch using SkipGram, CBOW, and other models. It supports:
- Mini-batch training with gradient clipping and L2 regularization
- Learning rate scheduling (constant, exponential, step, cosine)
- Early stopping and evaluation metrics
- Text preprocessing (HTML stripping, URL removal, contraction expansion)
- Source code preprocessing (comment stripping, camelCase splitting)
- BPE subword tokenization
- Export to Word2Vec, NumPy, ONNX, and binary formats
- Semantic search, analogy solving, and embedding arithmetic
- Incremental vocabulary updates and LSH-based approximate nearest neighbors
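Two of the training features listed above, learning rate scheduling and gradient clipping, can be illustrated with a small standalone sketch. It does not use this crate's API: `LrSchedule`, `lr_at`, and `clip_by_norm` are hypothetical names chosen for illustration, and the formulas are the textbook definitions of constant, exponential, step, and cosine schedules, which may differ in detail from what the crate implements.

```rust
/// Hypothetical schedule type for illustration; the crate's own
/// configuration types may look different.
#[derive(Debug, Clone, Copy)]
pub enum LrSchedule {
    /// Always return the base learning rate.
    Constant,
    /// Multiply the base rate by `decay` once per epoch.
    Exponential { decay: f32 },
    /// Multiply the base rate by `factor` every `every` epochs.
    Step { every: usize, factor: f32 },
    /// Cosine annealing from the base rate down to `min_lr`.
    Cosine { min_lr: f32 },
}

impl LrSchedule {
    /// Learning rate for a zero-based `epoch` out of `total_epochs`.
    pub fn lr_at(&self, base_lr: f32, epoch: usize, total_epochs: usize) -> f32 {
        match *self {
            LrSchedule::Constant => base_lr,
            LrSchedule::Exponential { decay } => base_lr * decay.powi(epoch as i32),
            LrSchedule::Step { every, factor } => {
                // Guard against `every == 0` to avoid a division by zero.
                base_lr * factor.powi((epoch / every.max(1)) as i32)
            }
            LrSchedule::Cosine { min_lr } => {
                // Progress through training, clamped to [0, 1].
                let t = (epoch as f32 / total_epochs.max(1) as f32).min(1.0);
                min_lr + 0.5 * (base_lr - min_lr) * (1.0 + (std::f32::consts::PI * t).cos())
            }
        }
    }
}

/// Rescale `grad` in place so that its L2 norm does not exceed `max_norm`.
pub fn clip_by_norm(grad: &mut [f32], max_norm: f32) {
    let norm = grad.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm && norm > 0.0 {
        let scale = max_norm / norm;
        for g in grad.iter_mut() {
            *g *= scale;
        }
    }
}
```

In a training loop, a function like `lr_at` would typically be called once per epoch to obtain the step size, and a function like `clip_by_norm` would be applied to each mini-batch gradient before the parameter update.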
§Example
use embedding::*;
let data = TrainingData::from_text("the cat sat on the mat");
let config = TrainingConfig::new(ModelType::SkipGram)
    .with_dim(8)
    .with_epochs(2);
let mut model = EmbeddingModel::new(config, data.vocab.len());
// model.train(&data).unwrap();
Re-exports§
- pub use mmap::MmapEmbeddings;
- pub use pretrained::PretrainedEmbeddings;
- pub use pretrained::PretrainedLoader;
- pub use config::*;
- pub use evaluation::*;
- pub use search::*;
- pub use code::*;
- pub use text::*;
- pub use tokenizer::*;
- pub use transfer::*;
- pub use model::*;
- pub use backend::*;
- pub use benchmark::*;
- pub use transformer::*;
Modules§
- backend
- benchmark
- cli
- code
- config
- evaluation
- mmap
- model
- onnx
- Low-level ONNX protobuf definitions generated by prost.
- pretrained
- search
- text
- tokenizer
- transfer
- transformer
Structs§
- IncrementalTrainer
- Supports real-time incremental training by updating a model with new sentences as they arrive, without requiring a full retrain.
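One ingredient of incremental training is a vocabulary that can grow as new sentences arrive. The following is a minimal standalone sketch of that idea only; `StreamingVocab` and its methods are hypothetical names, not part of this crate, and the sketch makes no claim about how `IncrementalTrainer` works internally.

```rust
use std::collections::HashMap;

/// Hypothetical growable vocabulary: maps each lowercased token to a
/// stable integer id and tracks how often it has been seen.
#[derive(Debug, Default)]
pub struct StreamingVocab {
    ids: HashMap<String, usize>,
    counts: Vec<u64>,
}

impl StreamingVocab {
    pub fn new() -> Self {
        Self::default()
    }

    /// Tokenize `sentence` on whitespace, assign ids to unseen tokens,
    /// update counts, and return the id sequence for the sentence.
    pub fn observe(&mut self, sentence: &str) -> Vec<usize> {
        sentence
            .split_whitespace()
            .map(|tok| {
                // A new token receives the next free id; known tokens keep theirs.
                let next_id = self.ids.len();
                let id = *self.ids.entry(tok.to_lowercase()).or_insert(next_id);
                if id == self.counts.len() {
                    self.counts.push(0);
                }
                self.counts[id] += 1;
                id
            })
            .collect()
    }

    /// Number of distinct tokens seen so far.
    pub fn len(&self) -> usize {
        self.ids.len()
    }

    /// True if no token has been observed yet.
    pub fn is_empty(&self) -> bool {
        self.ids.is_empty()
    }

    /// How many times the token with `id` has been observed (0 if unknown).
    pub fn count(&self, id: usize) -> u64 {
        self.counts.get(id).copied().unwrap_or(0)
    }
}
```

Because existing ids never change, an embedding matrix indexed by these ids could be extended with fresh rows for new tokens while the rows already trained stay valid, which is what makes updating without a full retrain possible in principle.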