gllm: Pure Rust Local Embeddings & Reranking
gllm is a pure Rust library for local text embeddings and reranking, built on the Burn deep learning framework. It provides an OpenAI SDK-style API with zero external C dependencies, supporting static compilation.
Features
- Text Embeddings - Convert text into high-dimensional vectors for semantic search
- Document Reranking - Sort documents by relevance using cross-encoders
- Code Embeddings - Specialized models for code semantic similarity (CodeXEmbed)
- GPU Acceleration - WGPU backend with automatic GPU/CPU fallback
- 50+ Built-in Models - BGE, E5, Sentence Transformers, Qwen2.5, Qwen3, GLM-4, JINA, CodeXEmbed, and more
- Encoder & Decoder Architectures - BERT-style encoders and Qwen2.5/GLM-4/Mistral-style decoders
- Quantization Support - Int4/Int8/AWQ/GPTQ/GGUF for Qwen3 series
- Pure Rust - Static compilation ready, no C dependencies
Installation
```toml
[dependencies]
gllm = "0.7"
```
Feature Flags
| Feature | Default | Description |
|---|---|---|
| wgpu | Yes | GPU acceleration (Vulkan/DX12/Metal) |
| cpu | No | CPU-only inference (pure Rust) |
| tokio | No | Async interface support |
| wgpu-detect | No | GPU capabilities detection (VRAM, batch size) |
```toml
# CPU-only
gllm = { version = "0.4", features = ["cpu"] }

# With async
gllm = { version = "0.4", features = ["tokio"] }

# With GPU detection
gllm = { version = "0.4", features = ["wgpu-detect"] }
```
Requirements
- Rust 1.70+ (2021 edition)
- Memory: 2GB minimum, 4GB+ recommended
- GPU (optional): Vulkan, DirectX 12, Metal, or OpenGL 4.3+
Quick Start
Text Embeddings
A minimal sketch follows; it assumes `Client::new` takes a model alias and reuses the `client.embeddings.generate` call shape shown in the Vector Search example further down.
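```rust
use gllm::Client;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "bge-small-en" is an alias from the model table below;
    // passing an alias to Client::new is an assumption.
    let client = Client::new("bge-small-en")?;

    // Same call shape as the Vector Search example further down.
    let response = client.embeddings.generate("Hello, world!")?;
    println!("dimensions: {}", response.embeddings[0].embedding.len());
    Ok(())
}
```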
Document Reranking
A minimal sketch, assuming a `client.rerank(query, documents)` method; the exact reranking call is not shown in this README, so the method name and `score` field below are placeholders.
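```rust
use gllm::Client;

let client = Client::new("bge-reranker-base")?;

// `rerank` and the `score` field are placeholders for the cross-encoder call.
let results = client.rerank(
    "what is vector search?",
    vec![
        "Vector search finds semantically similar items.",
        "Rust is a systems programming language.",
    ],
)?;
for result in results {
    println!("{:.4}", result.score);
}
```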
Async Usage
```toml
[dependencies]
gllm = { version = "0.5", features = ["tokio"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
With the tokio feature enabled, the same calls can be awaited. A minimal async sketch follows; which calls are actually async is an assumption based on the feature flag.
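```rust
use gllm::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Whether construction itself is async is not shown here; awaiting the
    // embedding call mirrors the FallbackEmbedder example below.
    let client = Client::new("bge-small-en")?;
    let response = client.embeddings.generate("async embedding").await?;
    println!("dimensions: {}", response.embeddings[0].embedding.len());
    Ok(())
}
```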
GPU Detection (v0.4.1+)
```rust
use gllm::detect; // import path and function name assumed

// Detect GPU capabilities (cached after first call)
let caps = detect();
println!("GPU available: {}", caps.gpu_available);
// `caps` also reports available VRAM and a recommended batch size (field names omitted here)

if caps.gpu_available {
    // Prefer the WGPU backend; otherwise fall back to the cpu feature.
}
```
FallbackEmbedder (Automatic GPU/CPU Fallback)
```rust
use gllm::FallbackEmbedder;

// Automatically falls back to CPU if the GPU runs out of memory
let embedder = FallbackEmbedder::new("bge-small-en").await?; // model alias argument assumed
let vector = embedder.embed("hello world").await?;
```
Code Embeddings (v0.5.0+)
CodeXEmbed models are optimized for code semantic similarity, outperforming Voyage-Code by more than 20% on the CoIR benchmark.
A sketch with the 400M encoder, using its alias from the table below; the embedding call mirrors the Quick Start example above.
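```rust
use gllm::Client;

// CodeXEmbed-400M (1024 dimensions, BERT-style encoder)
let client = Client::new("codexembed-400m")?;

let snippet = "fn add(a: i32, b: i32) -> i32 { a + b }";
let response = client.embeddings.generate(snippet)?;
println!("code embedding dims: {}", response.embeddings[0].embedding.len());
```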
For larger models with higher accuracy:
```rust
// CodeXEmbed-2B (1536 dimensions, Qwen2-based decoder)
let client = Client::new("codexembed-2b")?;

// CodeXEmbed-7B (4096 dimensions, Mistral-based decoder)
let client = Client::new("codexembed-7b")?;
```
Qwen3 Large Language Model Embeddings
The Qwen3 series provides state-of-the-art embeddings with a decoder architecture and quantization support.
A sketch with the smallest Qwen3 embedding model; the call shape follows the earlier examples.
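```rust
use gllm::Client;

// Qwen3-Embedding-0.6B (1024 dimensions)
let client = Client::new("qwen3-embedding-0.6b")?;
let response = client.embeddings.generate("how do I sort a Vec in Rust?")?;
println!("dims: {}", response.embeddings[0].embedding.len());
```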
With quantization support for memory efficiency:
```rust
use gllm::registry; // module path assumed

// Quantized Qwen3 models (reduced memory, maintained quality)
let info = registry::resolve("qwen3-embedding-0.6b:int4")?; // Int4 quantization
let info = registry::resolve("qwen3-embedding-0.6b:int8")?; // Int8 quantization
let info = registry::resolve("qwen3-embedding-0.6b:awq")?;  // AWQ quantization
```
Qwen3 Reranker
High-accuracy document reranking with LLM-based cross-encoder:
A sketch with the 0.6B reranker; as in the Quick Start, the `rerank` method name is a placeholder.
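```rust
use gllm::Client;

let client = Client::new("qwen3-reranker-0.6b")?;

// As above, `rerank` is a placeholder for the reranking call.
let results = client.rerank(
    "error handling best practices in Rust",
    vec![
        "Use the ? operator to propagate errors.",
        "Paris is the capital of France.",
    ],
)?;
```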
Text Generation (v0.6.0+)
Generate text using decoder-based LLMs like Qwen2.5, GLM-4, and Mistral:
A sketch assuming a builder-style call that mirrors the streaming API below; the finishing `complete()` method is a placeholder.
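```rust
use gllm::Client;

let client = Client::new("qwen2.5-0.5b-instruct")?;

// Builder-style call mirroring the streaming sketch below;
// the finishing `complete()` method is a placeholder.
let output = client
    .generate("Explain ownership in Rust in one sentence.")
    .max_tokens(128)
    .complete()?;
println!("{output}");
```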
With streaming support (coming soon):
```rust
// Future API for streaming (not yet available)
let stream = client
    .generate("Write a haiku about Rust")
    .max_tokens(64)
    .stream()?;

for token in stream {
    print!("{token}");
}
```
Supported Models
Embedding Models (27)
| Model | Alias | Dimensions | Architecture | Best For |
|---|---|---|---|---|
| BGE Small EN | bge-small-en | 384 | Encoder | Fast English |
| BGE Base EN | bge-base-en | 768 | Encoder | Balanced English |
| BGE Large EN | bge-large-en | 1024 | Encoder | High accuracy |
| BGE Small ZH | bge-small-zh | 512 | Encoder | Chinese |
| E5 Small | e5-small | 384 | Encoder | Instruction tuned |
| E5 Base | e5-base | 768 | Encoder | Instruction tuned |
| E5 Large | e5-large | 1024 | Encoder | Instruction tuned |
| MiniLM L6 | all-MiniLM-L6-v2 | 384 | Encoder | General purpose |
| MiniLM L12 | all-MiniLM-L12-v2 | 384 | Encoder | General (larger) |
| MPNet Base | all-mpnet-base-v2 | 768 | Encoder | High quality |
| JINA v2 Base | jina-embeddings-v2-base-en | 768 | Encoder | Modern arch |
| JINA v2 Small | jina-embeddings-v2-small-en | 384 | Encoder | Lightweight |
| JINA v4 | jina-embeddings-v4 | 2048 | Encoder | Latest JINA |
| Qwen3 0.6B | qwen3-embedding-0.6b | 1024 | Decoder | Lightweight |
| Qwen3 4B | qwen3-embedding-4b | 2560 | Decoder | Balanced |
| Qwen3 8B | qwen3-embedding-8b | 4096 | Decoder | High accuracy |
| Nemotron 8B | llama-embed-nemotron-8b | 4096 | Encoder | State-of-the-art |
| M3E Base | m3e-base | 768 | Encoder | Chinese quality |
| Multilingual | multilingual-MiniLM-L12-v2 | 384 | Encoder | 50+ languages |
Code Embedding Models (4) - NEW in v0.5.0
| Model | Alias | Dimensions | Architecture | Best For |
|---|---|---|---|---|
| CodeXEmbed 400M | codexembed-400m | 1024 | Encoder (BERT) | Fast code search |
| CodeXEmbed 2B | codexembed-2b | 1536 | Decoder (Qwen2) | Balanced code |
| CodeXEmbed 7B | codexembed-7b | 4096 | Decoder (Mistral) | High accuracy code |
| GraphCodeBERT | graphcodebert-base | 768 | Encoder | Legacy code |
CodeXEmbed (SFR-Embedding-Code) is the 2024 state of the art for code embedding, outperforming Voyage-Code by more than 20% on the CoIR benchmark.
Generator Models (12) - NEW in v0.6.0+
| Model | Alias | Parameters | Architecture | Best For |
|---|---|---|---|---|
| Qwen2.5 0.5B Instruct | qwen2.5-0.5b-instruct | 0.5B | Decoder (Qwen2) | Fast generation |
| Qwen2.5 1.5B Instruct | qwen2.5-1.5b-instruct | 1.5B | Decoder (Qwen2) | Lightweight |
| Qwen2.5 3B Instruct | qwen2.5-3b-instruct | 3B | Decoder (Qwen2) | Balanced |
| Qwen2.5 7B Instruct | qwen2.5-7b-instruct | 7B | Decoder (Qwen2) | High quality |
| Qwen2.5 14B Instruct | qwen2.5-14b-instruct | 14B | Decoder (Qwen2) | Very high quality |
| Qwen2.5 32B Instruct | qwen2.5-32b-instruct | 32B | Decoder (Qwen2) | Premium quality |
| Qwen2.5 72B Instruct | qwen2.5-72b-instruct | 72B | Decoder (Qwen2) | Maximum quality |
| GLM-4 9B Chat | glm-4-9b-chat | 9B | Decoder (GLM4) | Chinese & English |
| Qwen2 7B Instruct | qwen2-7b-instruct | 7B | Decoder (Qwen2) | Legacy |
| Mistral 7B Instruct | mistral-7b-instruct | 7B | Decoder (Mistral) | Legacy |
Qwen2.5 is the 2025 state-of-the-art open-source LLM family with 128K context and excellent multilingual support. GLM-4 is Zhipu AI's flagship model with 131K context and strong Chinese/English performance.
Reranking Models (12)
| Model | Alias | Speed | Best For |
|---|---|---|---|
| BGE Reranker v2 | bge-reranker-v2 | Medium | Multilingual |
| BGE Reranker Large | bge-reranker-large | Slow | High accuracy |
| BGE Reranker Base | bge-reranker-base | Fast | Quick reranking |
| MS MARCO MiniLM L6 | ms-marco-MiniLM-L-6-v2 | Fast | Search |
| MS MARCO MiniLM L12 | ms-marco-MiniLM-L-12-v2 | Medium | Better search |
| MS MARCO TinyBERT | ms-marco-TinyBERT-L-2-v2 | Very Fast | Lightweight |
| Qwen3 Reranker 0.6B | qwen3-reranker-0.6b | Fast | Lightweight |
| Qwen3 Reranker 4B | qwen3-reranker-4b | Medium | Balanced |
| Qwen3 Reranker 8B | qwen3-reranker-8b | Slow | High accuracy |
| JINA Reranker v3 | jina-reranker-v3 | Medium | Latest JINA |
Custom Models
```rust
// Any HuggingFace SafeTensors model
let client = Client::new("BAAI/bge-small-en-v1.5")?;

// Or use colon notation (a ":"-suffixed variant, as in the Quantization section below)
let client = Client::new("qwen3-embedding-0.6b:int4")?;
```
Quantization (Qwen3 Series)
```rust
use gllm::ModelRegistry; // import path assumed

let registry = ModelRegistry::new();

// Use a :suffix for quantized variants
let info = registry.resolve("qwen3-embedding-0.6b:int4")?; // Int4
let info = registry.resolve("qwen3-embedding-0.6b:awq")?;  // AWQ
let info = registry.resolve("qwen3-embedding-0.6b:gptq")?; // GPTQ
```
Supported quantization types: :int4, :int8, :awq, :gptq, :gguf, :fp8, :bnb4, :bnb8
Models with quantization: Qwen3 Embedding/Reranker series, Nemotron 8B
Advanced Usage
Custom Configuration
```rust
use gllm::{Client, ClientConfig};

let config = ClientConfig::default(); // adjust fields as needed; the full field set is not shown here
let client = Client::with_config("bge-small-en", config)?; // argument order assumed
```
Vector Search Example
```rust
let query_vec = client.embeddings.generate("what is vector search?")?.embeddings[0].embedding.clone();
let doc_vecs = client.embeddings.generate(vec!["doc one", "doc two"])?;

// Calculate cosine similarities
for (i, doc) in doc_vecs.embeddings.iter().enumerate() {
    let score = cosine_similarity(&query_vec, &doc.embedding); // helper sketched below
    println!("doc {i}: {score:.4}");
}
```
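For reference, a plain cosine-similarity helper over the returned f32 vectors (this helper is not part of gllm):

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```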
Model Storage
Models are cached in ~/.gllm/models/:
```text
~/.gllm/models/
├── BAAI--bge-small-en-v1.5/
│   ├── model.safetensors
│   ├── config.json
│   └── tokenizer.json
└── ...
```
Performance
| Backend | Device | Throughput (512 tokens) |
|---|---|---|
| WGPU | RTX 4090 | ~150 texts/sec |
| WGPU | Apple M2 | ~45 texts/sec |
| CPU | Intel i7-12700K | ~8 texts/sec |
Testing
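Run the test suite with cargo (shown here with default features; select a backend feature from the table above if needed):

```bash
cargo test
```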
License
MIT License - see LICENSE
Acknowledgments
Built with Rust and the Burn deep learning framework.