Crate embedding

Word embedding training library.

This crate provides tools for training word embeddings from scratch using SkipGram, CBOW, and other models. It supports:

  • Mini-batch training with gradient clipping and L2 regularization
  • Learning rate scheduling (constant, exponential, step, cosine)
  • Early stopping and evaluation metrics
  • Text preprocessing (HTML stripping, URL removal, contraction expansion)
  • Source code preprocessing (comment stripping, camelCase splitting)
  • BPE subword tokenization
  • Export to Word2Vec, NumPy, ONNX, and binary formats
  • Semantic search, analogy solving, and embedding arithmetic
  • Incremental vocabulary updates and LSH-based approximate nearest neighbors
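
To make the four learning rate schedules named above concrete, here is a minimal standalone sketch of how such schedules are commonly computed. It does not use this crate's API: the `Schedule` enum, the `learning_rate` function, and every parameter name are hypothetical and chosen for illustration only, and the crate's own formulas may differ.

```rust
use std::f64::consts::PI;

/// Hypothetical schedule type for illustration; not this crate's API.
#[derive(Debug, Clone, Copy)]
pub enum Schedule {
    /// Keep the base rate for every epoch.
    Constant,
    /// Multiply the rate by `gamma` once per epoch.
    Exponential { gamma: f64 },
    /// Multiply the rate by `gamma` every `step_size` epochs.
    Step { step_size: usize, gamma: f64 },
    /// Anneal from the base rate down to `min_lr` over `total_epochs`.
    Cosine { total_epochs: usize, min_lr: f64 },
}

/// Learning rate to use at a zero-based `epoch`.
pub fn learning_rate(schedule: Schedule, base_lr: f64, epoch: usize) -> f64 {
    match schedule {
        Schedule::Constant => base_lr,
        Schedule::Exponential { gamma } => base_lr * gamma.powi(epoch as i32),
        Schedule::Step { step_size, gamma } => {
            // Integer division counts how many full steps have elapsed.
            let steps = epoch / step_size.max(1);
            base_lr * gamma.powi(steps as i32)
        }
        Schedule::Cosine { total_epochs, min_lr } => {
            // Clamp progress to [0, 1] so epochs past the end stay at `min_lr`.
            let progress = (epoch as f64 / total_epochs.max(1) as f64).min(1.0);
            min_lr + 0.5 * (base_lr - min_lr) * (1.0 + (PI * progress).cos())
        }
    }
}
```

In this sketch the cosine schedule starts at the base rate, reaches the midpoint between the base rate and `min_lr` halfway through, and stays at `min_lr` once `total_epochs` has passed.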

§Example

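use embedding::*;

// Build training data (including its vocabulary) from a small corpus.
let data = TrainingData::from_text("the cat sat on the mat");

// Configure a SkipGram model with 8-dimensional vectors and 2 epochs.
let config = TrainingConfig::new(ModelType::SkipGram)
    .with_dim(8)
    .with_epochs(2);

// Create the model, sized to the vocabulary.
let mut model = EmbeddingModel::new(config, data.vocab.len());

// Training is left commented out here; uncomment to run it.
// model.train(&data).unwrap();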

Re-exports§

pub use mmap::MmapEmbeddings;
pub use pretrained::PretrainedEmbeddings;
pub use pretrained::PretrainedLoader;
pub use config::*;
pub use evaluation::*;
pub use search::*;
pub use code::*;
pub use text::*;
pub use tokenizer::*;
pub use transfer::*;
pub use model::*;
pub use backend::*;
pub use benchmark::*;
pub use transformer::*;

Modules§

backend
benchmark
cli
code
config
evaluation
mmap
model
onnx
Low-level ONNX protobuf definitions generated by prost.
pretrained
search
text
tokenizer
transfer
transformer

Structs§

IncrementalTrainer
Supports real-time incremental training by updating a model with new sentences as they arrive, without requiring a full retrain.
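
One ingredient of incremental training is growing the vocabulary and embedding table as unseen words arrive. The sketch below illustrates only that part, under stated assumptions: `GrowingTable` and all of its methods are hypothetical names, it is not the implementation of `IncrementalTrainer`, and it performs no gradient updates. New rows get a small deterministic initialization so that the example needs no random number dependency.

```rust
use std::collections::HashMap;

/// Hypothetical growing embedding table; not the crate's `IncrementalTrainer`.
pub struct GrowingTable {
    dim: usize,
    index: HashMap<String, usize>,
    vectors: Vec<Vec<f32>>,
}

impl GrowingTable {
    pub fn new(dim: usize) -> Self {
        Self { dim, index: HashMap::new(), vectors: Vec::new() }
    }

    /// Return the row id for `word`, appending a new row if it is unseen.
    pub fn id_for(&mut self, word: &str) -> usize {
        if let Some(&id) = self.index.get(word) {
            return id;
        }
        let id = self.vectors.len();
        // Small deterministic initial values in place of random init.
        let row: Vec<f32> = (0..self.dim)
            .map(|j| ((id + j) % 7) as f32 * 0.01 - 0.03)
            .collect();
        self.vectors.push(row);
        self.index.insert(word.to_string(), id);
        id
    }

    /// Register every token of a new sentence; return how many were new.
    pub fn observe(&mut self, sentence: &str) -> usize {
        let before = self.vectors.len();
        for token in sentence.split_whitespace() {
            self.id_for(token);
        }
        self.vectors.len() - before
    }

    /// Number of distinct words seen so far.
    pub fn vocab_size(&self) -> usize {
        self.vectors.len()
    }

    /// Current vector for `word`, if it has been seen.
    pub fn vector(&self, word: &str) -> Option<&[f32]> {
        self.index.get(word).map(|&id| self.vectors[id].as_slice())
    }
}
```

Existing rows keep their ids and values when new words are added, which is what lets previously learned vectors survive without a full retrain.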