# Embellama
High-performance Rust library for generating text embeddings using llama-cpp.
## Features
- **High Performance**: Optimized for speed with parallel pre/post-processing
- **Thread Safety**: Compile-time guarantees for safe concurrent usage
- **Multiple Models**: Support for managing multiple embedding models
- **Batch Processing**: Efficient batch embedding generation
- **Flexible Configuration**: Extensive configuration options for model tuning
- **Multiple Pooling Strategies**: Mean, CLS, Max, and MeanSqrt pooling
- **Hardware Acceleration**: Support for Metal (macOS), CUDA (NVIDIA), Vulkan, and optimized CPU backends
## Quick Start
```rust
use embellama::{EmbeddingEngine, EngineConfig};

// Create configuration (paths and model names below are placeholders)
let config = EngineConfig::builder()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .with_normalize_embeddings(true)
    .build()?;

// Create engine (uses singleton pattern internally)
let engine = EmbeddingEngine::new(config)?;

// Generate single embedding
let text = "Hello, world!";
let embedding = engine.embed("my-model", text)?;

// Generate batch embeddings
let texts = vec!["First text", "Second text", "Third text"];
let embeddings = engine.embed_batch("my-model", &texts)?;
```
### Singleton Pattern (Advanced)

The engine can optionally use a singleton pattern for shared access across your application. The singleton methods return `Arc<Mutex<EmbeddingEngine>>` for thread-safe access:
```rust
// Get or initialize the singleton instance (returns Arc<Mutex<EmbeddingEngine>>)
let engine = EmbeddingEngine::get_or_init(config)?;

// Access the singleton from anywhere in your application
let engine_clone = EmbeddingEngine::instance()
    .expect("engine has not been initialized");

// Use the engine (requires locking the mutex)
let embedding = engine_clone.lock().unwrap().embed("my-model", "Hello, world!")?;
```
## Tested Models
The library has been tested with the following GGUF models:
- **MiniLM-L6-v2 (Q4_K_M)**: ~15 MB, 384-dimensional embeddings - used for integration tests
- **Jina Embeddings v2 Base Code (Q4_K_M)**: ~110 MB, 768-dimensional embeddings - used for benchmarks
Both BERT-style and LLaMA-style embedding models are supported.
## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
embellama = "0.4.0"
```
### Backend Features
The library supports multiple backends for hardware acceleration. By default, it uses OpenMP for CPU parallelization. You can enable specific backends based on your hardware:
```toml
# Default - OpenMP CPU parallelization
embellama = "0.4.0"

# macOS Metal GPU acceleration
embellama = { version = "0.4.0", features = ["metal"] }

# NVIDIA CUDA GPU acceleration
embellama = { version = "0.4.0", features = ["cuda"] }

# Vulkan GPU acceleration (cross-platform)
embellama = { version = "0.4.0", features = ["vulkan"] }

# Native CPU optimizations
embellama = { version = "0.4.0", features = ["native"] }

# CPU-optimized build (native + OpenMP)
embellama = { version = "0.4.0", features = ["cpu-optimized"] }
```
**Note**: GPU backends (Metal, CUDA, Vulkan) are mutually exclusive; enable only one of them at a time.
## Configuration

### Basic Configuration
```rust
let config = EngineConfig::builder()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .build()?;
```
### Advanced Configuration
```rust
// Values below are illustrative; tune them for your model and hardware.
let config = EngineConfig::builder()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .with_context_size(2048)                      // Model context window (usize)
    .with_n_threads(8)                            // CPU threads (usize)
    .with_use_gpu(true)                           // Enable GPU acceleration
    .with_n_gpu_layers(32)                        // Layers to offload to GPU (u32)
    .with_batch_size(64)                          // Batch processing size (usize)
    .with_normalize_embeddings(true)              // L2 normalize embeddings
    .with_pooling_strategy(PoolingStrategy::Mean) // Pooling method
    .with_add_bos_token(Some(false))              // Disable BOS for encoder models (Option<bool>)
    .build()?;
```
### Backend Auto-Detection
The library can automatically detect and use the best available backend:
```rust
use embellama::{BackendInfo, EngineConfig};

// Automatic backend detection
// (with_backend_detection() is assumed to return a pre-configured builder)
let config = EngineConfig::with_backend_detection()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .build()?;

// Check which backend was selected
// (accessor names on BackendInfo may differ; Debug output is shown here)
let backend_info = BackendInfo::new();
println!("Backend: {backend_info:?}");
println!("Config: {config:?}");
```
## Pooling Strategies
- **Mean**: Average pooling across all tokens (default)
- **CLS**: Use the CLS token embedding
- **Max**: Maximum pooling across dimensions
- **MeanSqrt**: Mean pooling with square root of sequence length normalization
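A minimal configuration sketch selecting one of these strategies (the `PoolingStrategy` enum name and variant spelling are assumed to mirror the list above):

```rust
use embellama::{EngineConfig, PoolingStrategy};

// Assumed: PoolingStrategy variants match the strategies listed above.
let config = EngineConfig::builder()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .with_pooling_strategy(PoolingStrategy::MeanSqrt) // length-normalized mean pooling
    .build()?;
```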
## Model-Specific Configuration

### BOS Token Handling
The library automatically detects model types and applies appropriate BOS token handling:
**Encoder Models** (BERT, E5, BGE, GTE, MiniLM, etc.):
- BOS token is not added (these models use CLS/SEP tokens)
- Auto-detected by model name patterns

**Decoder Models** (LLaMA, Mistral, Vicuna, etc.):
- BOS token is added (standard for autoregressive models)
- Default behavior for unknown models

**Manual Override**:
```rust
// Force disable BOS for a specific model
let config = EngineConfig::builder()
    .with_model_path("/path/to/encoder-model.gguf")
    .with_model_name("my-encoder")
    .with_add_bos_token(Some(false)) // Explicitly disable BOS
    .build()?;

// Force enable BOS
let config = EngineConfig::builder()
    .with_model_path("/path/to/decoder-model.gguf")
    .with_model_name("my-decoder")
    .with_add_bos_token(Some(true)) // Explicitly enable BOS
    .build()?;

// Auto-detect (default)
let config = EngineConfig::builder()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .with_add_bos_token(None) // Let the library decide
    .build()?;
```
## Thread Safety

⚠️ **IMPORTANT**: The `LlamaContext` from llama-cpp is `!Send` and `!Sync`, which means:

- Models cannot be moved between threads
- Models cannot be shared using `Arc` alone
- Each thread must own its model instance
- All concurrency must use message passing

The library is designed with these constraints in mind:

- Models are `!Send` due to llama-cpp constraints
- Thread-local storage is used for model instances
- Batch processing uses parallel pre/post-processing with sequential inference
Example of thread-safe usage with a regular (non-singleton) engine:

```rust
use std::thread;

// Each thread needs its own engine instance due to llama-cpp constraints
let handles: Vec<_> = (0..4)
    .map(|i| {
        // Assumed: EngineConfig is plain data and implements Clone
        let config = config.clone();
        thread::spawn(move || {
            let engine = EmbeddingEngine::new(config).expect("engine init");
            engine.embed("my-model", &format!("text from thread {i}"))
        })
    })
    .collect();

for handle in handles {
    let _embedding = handle.join().expect("worker thread panicked");
}
```
Or using the singleton pattern for shared access:

```rust
use std::sync::Arc;
use std::thread;

// Initialize the singleton once
let engine = EmbeddingEngine::get_or_init(config)?;

let handles: Vec<_> = (0..4)
    .map(|i| {
        let engine = Arc::clone(&engine);
        thread::spawn(move || {
            let guard = engine.lock().unwrap();
            guard.embed("my-model", &format!("text from thread {i}"))
        })
    })
    .collect();

for handle in handles {
    let _embedding = handle.join().expect("worker thread panicked");
}
```
## API Reference

### Model Management
The library provides granular control over model lifecycle:
#### Registration vs Loading
- **Registration**: Model configuration is stored in the registry
- **Loading**: Model is actually loaded in thread-local memory
```rust
// Check if a model is registered (has configuration)
if engine.is_model_registered("my-model") {
    // configuration is available in the registry
}

// Check if a model is loaded in the current thread
if engine.is_model_loaded_in_thread("my-model") {
    // model weights are resident in this thread
}

// Deprecated - use is_model_registered() for clarity
engine.is_model_loaded("my-model"); // Same as is_model_registered()
```
#### Granular Unload Operations
```rust
// Remove only from the current thread (keeps the registration)
engine.drop_model_from_thread("my-model")?;
// Model can be reloaded on next use

// Remove only from the registry (prevents future loads)
engine.unregister_model("my-model")?;
// Existing thread-local instances continue working

// Full unload - removes from both registry and thread
engine.unload_model("my-model")?;
// Completely removes the model
```
#### Model Loading Behavior

- Initial model (via `EmbeddingEngine::new()`): loaded immediately in the current thread
- Additional models (via `load_model()`): lazy-loaded on first use
```rust
// First model - loaded immediately (model names below are placeholders)
let engine = EmbeddingEngine::new(config)?;
assert!(engine.is_model_loaded_in_thread("first-model"));

// Additional model - lazy loaded
engine.load_model(second_config)?;
assert!(engine.is_model_registered("second-model"));
assert!(!engine.is_model_loaded_in_thread("second-model")); // Not yet loaded

// Triggers actual loading in thread
engine.embed("second-model", "some text")?;
assert!(engine.is_model_loaded_in_thread("second-model")); // Now loaded
```
## Performance
The library is optimized for high performance:
- Parallel tokenization for batch processing
- Efficient memory management
- Configurable thread counts
- GPU acceleration support
### Benchmarks

Run benchmarks with:

```bash
EMBELLAMA_BENCH_MODEL=/path/to/model.gguf cargo bench
```
### Performance Tips

- **Batch Processing**: Use `embed_batch()` for multiple texts
- **Thread Configuration**: Set `n_threads` based on CPU cores
- **GPU Acceleration**: Enable GPU for larger models
- **Warmup**: Call `warmup_model()` before processing (see the sketch below)
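A minimal sketch putting these tips together (the model name, texts, and the exact `warmup_model()` signature are assumptions):

```rust
// Hypothetical usage sketch; exact signatures may differ from the crate's API.
let config = EngineConfig::builder()
    .with_model_path("/path/to/model.gguf")
    .with_model_name("my-model")
    .with_n_threads(8)   // roughly match physical CPU cores
    .with_batch_size(64)
    .build()?;
let engine = EmbeddingEngine::new(config)?;

// Warm up once so the first real request doesn't pay initialization costs.
engine.warmup_model("my-model")?;

// Prefer one batched call over many single-text calls.
let texts = vec!["first document", "second document", "third document"];
let embeddings = engine.embed_batch("my-model", &texts)?;
```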
## Development
For development setup, testing, and contributing guidelines, please see DEVELOPMENT.md.
## Examples

See the `examples/` directory for more examples:

- `simple.rs` - Basic embedding generation
- `batch.rs` - Batch processing example
- `multi_model.rs` - Using multiple models
- `config.rs` - Configuration examples
- `error_handling.rs` - Error handling patterns
Run examples with:
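For instance, using Cargo's standard example runner (substitute any example name from the list above):

```bash
cargo run --example simple
```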
## License
Licensed under the Apache License, Version 2.0. See LICENSE for details.
## Contributing
Contributions are welcome! Please see DEVELOPMENT.md for development setup and contribution guidelines.
## Support
For issues and questions, please use the GitHub issue tracker.