# gllm: Pure Rust Local Embeddings & Reranking

gllm is a pure Rust library for local text embeddings and reranking, built on the Burn deep learning framework. It provides an OpenAI SDK-style API with zero external C dependencies and supports static compilation.
## What You Can Do With gllm

- **Generate text embeddings** - Convert text into high-dimensional vectors for semantic search
- **Rerank documents** - Sort documents by relevance to a query using cross-encoders
- **High performance** - GPU acceleration with WGPU or CPU-only inference
- **Production ready** - Pure Rust implementation with static compilation support
- **Easy to use** - OpenAI-style API with builder patterns
## Installation

### Requirements

- Rust 1.70+ (2021 edition)
- Memory - Minimum 2GB RAM, 4GB+ recommended for larger models
- GPU (optional) - For faster inference with the WGPU backend
### Step 1: Add to Cargo.toml

```toml
[dependencies]
gllm = "0.2"
```
### Step 2: Choose Your Backend

```toml
# Option 1: Default (WGPU GPU support + CPU fallback)
gllm = "0.2"

# Option 2: CPU-only (no GPU dependencies, pure Rust)
gllm = { version = "0.2", features = ["cpu"] }

# Option 3: With async support (tokio)
gllm = { version = "0.2", features = ["tokio"] }

# Option 4: CPU-only + async
gllm = { version = "0.2", features = ["cpu", "tokio"] }
```
### Step 3: Start Using (5 minutes)
See the Quick Start section below.
## Feature Flags

| Feature | Default | Description |
|---|---|---|
| `wgpu` | ✅ | GPU acceleration using WGPU (Vulkan/DX12/Metal) |
| `cpu` | ❌ | CPU-only inference using ndarray (pure Rust) |
| `tokio` | ❌ | Async interface support (same API, add `.await`) |
### GPU Support
The WGPU backend supports:
- NVIDIA GPUs - via Vulkan or CUDA
- AMD GPUs - via Vulkan or DirectX
- Intel GPUs - via Vulkan or DirectX
- Apple Silicon - via Metal
- Intel/AMD CPUs - Fallback to CPU compute
## Supported Models

gllm includes built-in aliases for 26 popular models and supports any HuggingFace SafeTensors model.

### Built-in Model Aliases (26 Models)

#### Text Embedding Models (18 models)
| Alias | HuggingFace Model | Dimensions | Speed | Best For |
|---|---|---|---|---|
| **BGE Series** | | | | |
| `bge-small-zh` | `BAAI/bge-small-zh-v1.5` | 512 | Fast | Chinese, lightweight |
| `bge-small-en` | `BAAI/bge-small-en-v1.5` | 384 | Fast | English, lightweight |
| `bge-base-en` | `BAAI/bge-base-en-v1.5` | 768 | Medium | English, balanced |
| `bge-large-en` | `BAAI/bge-large-en-v1.5` | 1024 | Slow | English, high accuracy |
| **Sentence Transformers** | | | | |
| `all-MiniLM-L6-v2` | `sentence-transformers/all-MiniLM-L6-v2` | 384 | Fast | General purpose |
| `all-mpnet-base-v2` | `sentence-transformers/all-mpnet-base-v2` | 768 | Medium | High-quality English |
| `paraphrase-MiniLM-L6-v2` | `sentence-transformers/paraphrase-MiniLM-L6-v2` | 384 | Fast | Paraphrase detection |
| `multi-qa-mpnet-base-dot-v1` | `sentence-transformers/multi-qa-mpnet-base-dot-v1` | 768 | Medium | Question answering |
| `all-MiniLM-L12-v2` | `sentence-transformers/all-MiniLM-L12-v2` | 384 | Fast | General purpose (larger) |
| `all-distilroberta-v1` | `sentence-transformers/all-distilroberta-v1` | 768 | Medium | Fast inference |
| **E5 Series** | | | | |
| `e5-large` | `intfloat/e5-large` | 1024 | Slow | Instruction tuned |
| `e5-base` | `intfloat/e5-base` | 768 | Medium | Instruction tuned |
| `e5-small` | `intfloat/e5-small` | 384 | Fast | Lightweight instruction tuned |
| **JINA Embeddings** | | | | |
| `jina-embeddings-v2-base-en` | `jinaai/jina-embeddings-v2-base-en` | 768 | Medium | Modern architecture |
| `jina-embeddings-v2-small-en` | `jinaai/jina-embeddings-v2-small-en` | 384 | Fast | Lightweight, modern |
| **Chinese Models** | | | | |
| `m3e-base` | `moka-ai/m3e-base` | 768 | Medium | Chinese, high quality |
| **Multilingual** | | | | |
| `multilingual-MiniLM-L12-v2` | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Medium | 50+ languages |
| `distiluse-base-multilingual-cased-v1` | `sentence-transformers/distiluse-base-multilingual-cased-v1` | 512 | Medium | Multilingual, cased |
#### Document Reranking Models (8 models)

| Alias | HuggingFace Model | Speed | Best For |
|---|---|---|---|
| **BGE Rerankers** | | | |
| `bge-reranker-v2` | `BAAI/bge-reranker-v2-m3` | Medium | Multilingual reranking |
| `bge-reranker-large` | `BAAI/bge-reranker-large` | Slow | High accuracy |
| `bge-reranker-base` | `BAAI/bge-reranker-base` | Fast | Fast reranking |
| **MS MARCO Rerankers** | | | |
| `ms-marco-MiniLM-L-6-v2` | `cross-encoder/ms-marco-MiniLM-L-6-v2` | Fast | Search relevance |
| `ms-marco-MiniLM-L-12-v2` | `cross-encoder/ms-marco-MiniLM-L-12-v2` | Medium | Higher-accuracy search |
| `ms-marco-TinyBERT-L-2-v2` | `cross-encoder/ms-marco-TinyBERT-L-2-v2` | Very Fast | Lightweight reranking |
| `ms-marco-electra-base` | `cross-encoder/ms-marco-electra-base` | Medium | Efficient reranking |
| **Specialized Rerankers** | | | |
| `quora-distilroberta-base` | `cross-encoder/quora-distilroberta-base` | Medium | Question similarity |
### Using Custom Models

You can use any HuggingFace SafeTensors model directly:

```rust
// Use any HuggingFace SafeTensors model by its repository id
let client = Client::new("BAAI/bge-m3")?;

// Or use colon notation as a shorthand
let client = Client::new("hf:BAAI/bge-m3")?;
```
## Model Selection Guide

### Embedding Models - Choose Based On:

**Speed & Efficiency**
- `bge-small-en` / `e5-small` / `all-MiniLM-L6-v2` - fastest, 384 dims
- Perfect for high-throughput applications

**Balance of Speed & Accuracy**
- `bge-base-en` / `e5-base` / `all-mpnet-base-v2` - 768 dims
- Great general-purpose choice

**High Accuracy**
- `bge-large-en` / `e5-large` - 1024 dims
- Best for quality-critical applications

**Multilingual & Chinese Support**
- `bge-small-zh` - 512 dims, Chinese-optimized
- `m3e-base` - 768 dims, high-quality Chinese
- `multilingual-MiniLM-L12-v2` - 384 dims, 50+ languages
### Reranking Models - Choose Based On:

**Fast Reranking**
- `bge-reranker-base` / `ms-marco-TinyBERT-L-2-v2` - Best for real-time applications

**Balanced Performance**
- `bge-reranker-v2` / `ms-marco-MiniLM-L-6-v2` - Good accuracy with reasonable speed

**High Accuracy**
- `bge-reranker-large` / `ms-marco-MiniLM-L-12-v2` - Maximum quality for batch processing
### Model Requirements

- Embedding models: BERT-style encoder models with SafeTensors weights
- Rerank models: Cross-encoder models with SafeTensors weights
- Format: SafeTensors (`.safetensors` files)
- Tokenizer: HuggingFace-compatible tokenizer files
## Quick Start

### Text Embeddings

Generate semantic embeddings for search, clustering, or similarity matching.
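Here is a minimal sketch. It follows the `Client::new` constructor and the `client.embeddings.generate(...)` call used in the Advanced Usage examples below; the exact argument and response types are assumptions, so check the crate docs.

```rust
use gllm::Client;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a built-in model alias (downloaded and cached on first use)
    let client = Client::new("all-MiniLM-L6-v2")?;

    // Embed a batch of texts in one call
    let response = client.embeddings.generate(vec![
        "Rust is a systems programming language".to_string(),
        "Semantic search matches meaning, not keywords".to_string(),
    ])?;

    // Each item carries its vector in the `embedding` field
    for item in &response.embeddings {
        println!("vector with {} dimensions", item.embedding.len());
    }
    Ok(())
}
```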
### Document Reranking

Sort documents by relevance to improve search results.
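A hedged sketch follows, assuming the reranker mirrors the embeddings interface; the `rerank.generate` accessor and the `score` / `document` field names are illustrative rather than confirmed API.

```rust
use gllm::Client;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a built-in reranker alias
    let client = Client::new("bge-reranker-base")?;

    let query = "how to learn rust";
    let documents = vec![
        "The Rust book is a great starting point".to_string(),
        "Python is popular for data science".to_string(),
        "Rustlings offers small exercises for learning Rust".to_string(),
    ];

    // Score each document against the query with the cross-encoder,
    // then print results from most to least relevant.
    // NOTE: accessor and field names here are assumptions.
    let response = client.rerank.generate(query, documents)?;
    for result in response.results {
        println!("{:.3}  {}", result.score, result.document);
    }
    Ok(())
}
```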
### Async Support

For async applications, enable the `tokio` feature:

```toml
[dependencies]
gllm = { version = "0.2", features = ["tokio"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
The API is identical to the blocking version; the same calls simply take `.await`.
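A minimal async sketch, assuming the same `Client` and `embeddings.generate` surface as the blocking examples (whether the constructor itself is async is an assumption):

```rust
use gllm::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same builder-style API as the blocking version; just .await the calls
    let client = Client::new("all-MiniLM-L6-v2")?;

    let response = client
        .embeddings
        .generate(vec!["hello world".to_string()])
        .await?;

    println!("{} embeddings generated", response.embeddings.len());
    Ok(())
}
```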
## Advanced Usage

### Custom Configuration

```rust
use gllm::{Client, ClientConfig};

// Fill in the ClientConfig fields you need.
// NOTE: the exact fields and the with_config signature are crate-specific; see the docs.
let config = ClientConfig {
    ..Default::default()
};

let client = Client::with_config(config)?;
```
### Batch Processing

```rust
// Embed many texts in a single call
let texts: Vec<String> = vec![
    "first document".to_string(),
    "second document".to_string(),
    "third document".to_string(),
];
let response = client.embeddings.generate(texts)?;

// Process embeddings efficiently
for embedding in response.embeddings {
    println!("{} dimensions", embedding.embedding.len());
}
```
### Vector Search

```rust
let query = "machine learning tutorials";
let documents = vec![
    "Introduction to machine learning".to_string(),
    "Advanced Rust programming".to_string(),
    "Deep learning tutorial for beginners".to_string(),
];

// Generate query embedding
let query_response = client.embeddings.generate(vec![query.to_string()])?;
let query_vec = &query_response.embeddings[0].embedding;

// Generate document embeddings
let doc_response = client.embeddings.generate(documents.clone())?;

// Calculate similarities and find best matches (simple cosine similarity)
let cosine = |a: &[f32], b: &[f32]| -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b))
};
let mut similarities = Vec::new();
for (i, emb) in doc_response.embeddings.iter().enumerate() {
    similarities.push((i, cosine(&query_vec[..], &emb.embedding[..])));
}

// Sort by similarity (descending)
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

println!("Results for {query:?}:");
for (i, score) in similarities.iter().take(3) {
    println!("{score:.3}  {}", documents[*i]);
}
```
## Architecture

### What Makes gllm Special

- **100% Pure Rust** - No C/C++ dependencies, enabling static compilation
- **Static Compilation Ready** - Build self-contained binaries with `--target x86_64-unknown-linux-musl`
- **SafeTensors Only** - Secure model format with built-in validation
- **Auto Model Management** - Models are downloaded and cached from HuggingFace automatically
- **Flexible Backends** - GPU (WGPU) or CPU inference based on your needs
- **OpenAI-Compatible API** - Familiar builder patterns and response structures
### Model Storage

Models are automatically downloaded and cached in `~/.gllm/models/`:

```
~/.gllm/models/
├── BAAI--bge-m3/
│   ├── model.safetensors       # Model weights
│   ├── config.json             # Model configuration
│   ├── tokenizer.json          # Tokenizer
│   └── tokenizer_config.json   # Tokenizer config
├── BAAI--bge-reranker-v2-m3/
└── ...
```
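The directory name is the HuggingFace repository id with `/` replaced by `--`. A small sketch for locating a cached model on disk (assumes a Unix-style `HOME` environment variable; `model_cache_dir` is a hypothetical helper, not part of the gllm API):

```rust
use std::path::PathBuf;

/// Compute the expected cache directory for a HuggingFace repository id,
/// e.g. "BAAI/bge-m3" -> ~/.gllm/models/BAAI--bge-m3
fn model_cache_dir(repo_id: &str) -> Option<PathBuf> {
    let home = std::env::var_os("HOME")?;
    Some(PathBuf::from(home)
        .join(".gllm")
        .join("models")
        .join(repo_id.replace('/', "--")))
}

fn main() {
    if let Some(dir) = model_cache_dir("BAAI/bge-m3") {
        println!("cached at: {}", dir.display());
    }
}
```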
## Installation & Requirements

### System Requirements

- Rust 1.70+ (2021 edition)
- GPU (optional) - For the WGPU backend: Vulkan, DirectX 12, Metal, or OpenGL 4.3+ support
- Memory - Minimum 2GB RAM, 4GB+ recommended for larger models
### Feature Flags

| Feature | Default | Description |
|---|---|---|
| `wgpu` | ✅ | GPU acceleration using WGPU |
| `cpu` | ❌ | CPU-only inference using ndarray |
| `tokio` | ❌ | Async interface support (same API, add `.await`) |
### Build Examples

```bash
# Default (GPU + CPU fallback)
cargo build --release

# CPU-only
cargo build --release --features cpu

# Async support
cargo build --release --features tokio

# Static compilation
cargo build --release --target x86_64-unknown-linux-musl
```
## Performance

### Benchmarks (BGE-M3, 512-token texts)
| Backend | Device | Throughput | Memory Usage |
|---|---|---|---|
| WGPU | RTX 4090 | ~150 texts/sec | ~1.2GB |
| WGPU | Apple M2 | ~45 texts/sec | ~800MB |
| CPU | Intel i7-12700K | ~8 texts/sec | ~600MB |
*Results vary by model size and hardware.*
## Testing

Run the complete test suite (typical cargo invocations; adjust to the repository's test layout):

```bash
# Unit tests only
cargo test --lib

# Integration tests (fast, no model downloads needed)
cargo test --tests

# E2E tests with real model downloads and inference (requires models in ~/.gllm/models/)
cargo test -- --ignored

# All tests including E2E
cargo test -- --include-ignored

# Verbose output
cargo test -- --nocapture
```
### Test Coverage

- 10 unit tests (model configs, registry, pooling)
- 14 integration tests (API, error handling, features)
- 8 E2E tests (real model downloads and inference for all 26 models)
## Contributing

Contributions are welcome! Please read our Contributing Guide for details.

### Development Setup
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Burn Framework - Pure Rust deep learning framework
- HuggingFace - Model hosting and tokenizers
- BGE Models - High-quality embedding models
- SafeTensors - Secure model format
## Related Projects

Built with ❤️ in pure Rust 🦀