# mecrab-word2vec

High-performance Word2Vec training library for Japanese text, optimized for multi-core CPUs.
## Features
- ✅ Skip-gram with Negative Sampling - Industry-standard algorithm
- ✅ Hogwild! Parallelization - Lock-free multi-threading (83% efficiency on 6 cores)
- ✅ MCV1 Binary Format - Memory-mapped vector storage for instant loading
- ✅ Pure Rust - Memory-safe, no external C dependencies
- ✅ Zero-copy - Direct pointer arithmetic, minimal allocations
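At its core, skip-gram with negative sampling performs one small SGD step per (center, context) pair. A minimal sketch of that step; the function names, learning rate, and vector values are illustrative, not the crate's internals:

```rust
// Sketch of one skip-gram negative-sampling update (illustrative, not crate API).
// `input` is the center word's vector; `output` is a context vector (label 1.0)
// or a negative sample's vector (label 0.0). Both are updated in place.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn sgns_step(input: &mut [f32], output: &mut [f32], label: f32, lr: f32) {
    let dot: f32 = input.iter().zip(output.iter()).map(|(a, b)| a * b).sum();
    let grad = lr * (label - sigmoid(dot)); // gradient of the log-sigmoid loss
    for i in 0..input.len() {
        let inp = input[i];
        input[i] += grad * output[i];
        output[i] += grad * inp;
    }
}

fn main() {
    let mut center = vec![0.1_f32, -0.2, 0.3];
    let mut context = vec![0.2_f32, 0.1, -0.1];
    let before: f32 = center.iter().zip(&context).map(|(a, b)| a * b).sum();
    sgns_step(&mut center, &mut context, 1.0, 0.025);
    let after: f32 = center.iter().zip(&context).map(|(a, b)| a * b).sum();
    // A positive pair's dot product should increase after the update.
    assert!(after > before);
    println!("dot before={:.4} after={:.4}", before, after);
}
```

Because each step touches only two small vectors, these updates are cheap and mostly disjoint, which is what makes the lock-free parallelization below viable.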
## Quick Start

### Training Vectors
```rust
use mecrab_word2vec::TrainingConfig;

// All paths and parameter values below are illustrative, not required defaults.
let model = TrainingConfig::new()
    .vector_size(100)       // embedding dimensionality
    .window_size(5)         // context window radius
    .negative_samples(5)    // negatives drawn per positive pair
    .min_count(5)           // drop words rarer than this
    .epochs(5)              // passes over the corpus
    .threads(6)             // Hogwild! worker threads
    .build_from_corpus("corpus.txt")?;

model.save_mcv1("vectors.mcv1")?;
```
### Using Trained Vectors
```rust
use mecrab_word2vec::VectorStorage;

let vectors = VectorStorage::load_mcv1("vectors.mcv1")?;
let vec1 = vectors.get(42)?;   // get vector for word_id 42
let vec2 = vectors.get(123)?;  // word_id 123 is illustrative
let similarity = vectors.cosine_similarity(42, 123)?;
```
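The similarity call above computes cosine similarity between two dense vectors. A minimal sketch of that computation (not the crate's actual implementation, which operates on memory-mapped slices):

```rust
// Cosine similarity between two dense vectors: dot(a, b) / (|a| * |b|).
// Returns 0.0 for a zero-length vector to avoid dividing by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    let v1 = [1.0_f32, 0.0];
    let v2 = [0.0_f32, 1.0];
    // Identical vectors → 1.0; orthogonal vectors → 0.0.
    assert!((cosine_similarity(&v1, &v1) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&v1, &v2).abs() < 1e-6);
}
```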
## Performance

Tested on a Japanese Wikipedia corpus (1B+ words):
| Metric | Value |
|---|---|
| Training Speed | ~500K words/sec/core |
| CPU Efficiency | 83% (6 cores) |
| Memory Usage | ~2GB for 160K vocab |
| Vector Lookup | O(1) with memory-mapping |
## Input Format

### Corpus File (Word IDs)

```text
42 123 456 789
111 222 333
...
```
Each line is a sentence. Each number is a word_id (from MeCrab's vocabulary).
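A line in this format can be read with plain whitespace splitting; `parse_sentence` below is a hypothetical helper for illustration, not part of the crate:

```rust
// Parse one corpus line (e.g. "42 123 456 789") into word IDs.
// Malformed tokens are silently skipped; a stricter parser could
// return a Result instead.
fn parse_sentence(line: &str) -> Vec<u32> {
    line.split_whitespace()
        .filter_map(|tok| tok.parse::<u32>().ok())
        .collect()
}

fn main() {
    let ids = parse_sentence("42 123 456 789");
    assert_eq!(ids, vec![42, 123, 456, 789]);
    println!("{:?}", ids);
}
```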
## Output Formats

### MCV1 Binary (Recommended)
```rust
model.save_mcv1("vectors.mcv1")?; // output path is illustrative
```
Benefits:
- Memory-mapped (instant loading, no RAM copy)
- Direct indexing by word_id
- Compatible with MeCrab's semantic features
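The O(1), memory-mapped lookup reduces to offset arithmetic over the mapped bytes. The sketch below assumes a hypothetical layout (a fixed-size header followed by contiguous little-endian `f32` rows); the actual MCV1 layout may differ:

```rust
// Sketch of O(1) vector lookup in a memory-mapped file, assuming a
// hypothetical layout: `header_len` bytes of header, then `vocab * dim`
// little-endian f32 values stored row by row.
fn vector_at(bytes: &[u8], header_len: usize, dim: usize, word_id: usize) -> Vec<f32> {
    let start = header_len + word_id * dim * 4; // 4 bytes per f32
    bytes[start..start + dim * 4]
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // Build a fake 2-word, 2-dim "file" in memory: no header, then four f32s.
    let mut bytes = Vec::new();
    for v in [1.0_f32, 2.0, 3.0, 4.0] {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    assert_eq!(vector_at(&bytes, 0, 2, 1), vec![3.0, 4.0]);
}
```

With a real memory map, only the pages backing the requested row are faulted in, which is why loading is effectively instant.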
### Word2Vec Text Format

```rust
model.save_text("vectors.txt")?; // output path is illustrative
```
Format:

```text
163922 100
word1 0.123 -0.456 0.789 ...
word2 -0.234 0.567 -0.890 ...
```
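Reading this text format back is straightforward: the first line gives vocabulary size and dimensionality, and each following line is a word plus its components. `parse_header` and `parse_row` below are hypothetical helpers, not crate API:

```rust
// Parse the word2vec text-format header line ("<vocab> <dim>").
fn parse_header(line: &str) -> Option<(usize, usize)> {
    let mut it = line.split_whitespace();
    let vocab = it.next()?.parse().ok()?;
    let dim = it.next()?.parse().ok()?;
    Some((vocab, dim))
}

// Parse one vector row ("word 0.123 -0.456 ...") into (word, components).
fn parse_row(line: &str) -> Option<(String, Vec<f32>)> {
    let mut it = line.split_whitespace();
    let word = it.next()?.to_string();
    let vec = it.map(|t| t.parse().ok()).collect::<Option<Vec<f32>>>()?;
    Some((word, vec))
}

fn main() {
    assert_eq!(parse_header("163922 100"), Some((163922, 100)));
    let (word, vec) = parse_row("word1 0.123 -0.456 0.789").unwrap();
    assert_eq!(word, "word1");
    assert_eq!(vec.len(), 3);
}
```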
## Architecture
See IMPLEMENTATION.md for technical details on:
- Hogwild! lock-free parallelization
- Direct pointer arithmetic optimization
- Safety guarantees and race condition analysis
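Hogwild! works because sparse SGD updates rarely touch the same parameters at the same time, so worker threads write to shared weights with no locks at all. The crate's raw-pointer implementation is inherently `unsafe`; the sketch below substitutes relaxed atomics to stay in safe Rust while keeping the lock-free, no-coordination structure:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Each thread applies "gradient steps" (here, simple increments) to shared
// parameters without taking any lock. Relaxed atomics stand in for the
// raw-pointer writes a real Hogwild! trainer would use.
fn hogwild_add(params: Arc<Vec<AtomicU32>>, threads: usize, steps: usize) {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let p = Arc::clone(&params);
            thread::spawn(move || {
                for i in 0..steps {
                    // No mutex, no channel: every thread writes directly.
                    p[i % p.len()].fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    let params = Arc::new((0..4).map(|_| AtomicU32::new(0)).collect::<Vec<_>>());
    hogwild_add(Arc::clone(&params), 4, 1000);
    let total: u32 = params.iter().map(|p| p.load(Ordering::Relaxed)).sum();
    assert_eq!(total, 4000); // atomic fetch_add never loses an update
}
```

In real Hogwild! training the plain (non-atomic) writes can occasionally clobber each other, but with sparse gradients the lost updates are rare enough that convergence is unaffected, which is the result of Recht et al. cited below.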
## Configuration

### TrainingConfig

All training parameters are set through the `TrainingConfig` builder shown in Quick Start.
## License

See the project LICENSE file.
## References

- Mikolov et al. (2013), "Distributed Representations of Words and Phrases and their Compositionality"
- Recht et al. (2011), "Hogwild!: A Lock-Free Approach to Parallelizing SGD"