gllm 0.4.1

Pure Rust library for local text embeddings and reranking with 35 supported models and quantization support

gllm: Pure Rust Local Embeddings & Reranking

gllm is a pure Rust library for local text embeddings and reranking, built on the Burn deep learning framework. It provides an OpenAI SDK-style API, has no external C dependencies, and supports static compilation.

Features

  • Text Embeddings - Convert text into high-dimensional vectors for semantic search
  • Document Reranking - Sort documents by relevance using cross-encoders
  • GPU Acceleration - WGPU backend with automatic GPU/CPU fallback
  • 35 Built-in Models - BGE, E5, Sentence Transformers, Qwen3, JINA, and more
  • Quantization Support - Int4/Int8/AWQ/GPTQ/GGUF for Qwen3 series
  • Pure Rust - Static compilation ready, no C dependencies

Installation

[dependencies]
gllm = "0.4"

Feature Flags

Feature       Default   Description
wgpu          Yes       GPU acceleration (Vulkan/DX12/Metal)
cpu           No        CPU-only inference (pure Rust)
tokio         No        Async interface support
wgpu-detect   No        GPU capability detection (VRAM, batch size)

# CPU-only
gllm = { version = "0.4", features = ["cpu"] }

# With async
gllm = { version = "0.4", features = ["tokio"] }

# With GPU detection
gllm = { version = "0.4", features = ["wgpu-detect"] }

Requirements

  • Rust 1.70+ (2021 edition)
  • Memory: 2GB minimum, 4GB+ recommended
  • GPU (optional): Vulkan, DirectX 12, Metal, or OpenGL 4.3+

Quick Start

Text Embeddings

use gllm::Client;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("bge-small-en")?;

    let response = client
        .embeddings(["What is machine learning?", "Neural networks explained"])
        .generate()?;

    for emb in response.embeddings {
        println!("Vector: {} dimensions", emb.embedding.len());
    }
    Ok(())
}

Document Reranking

use gllm::Client;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("bge-reranker-v2")?;

    let response = client
        .rerank("What are renewable energy benefits?", [
            "Solar power is clean and sustainable.",
            "The stock market closed higher today.",
            "Wind energy reduces carbon emissions.",
        ])
        .top_n(2)
        .return_documents(true)
        .generate()?;

    for result in response.results {
        println!("Score: {:.4}", result.score);
    }
    Ok(())
}
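
Conceptually, top_n keeps only the highest-scoring documents after reranking. The sketch below illustrates that idea on raw scores using only the standard library. The helper name top_n and its signature are hypothetical and are not gllm's implementation; it only shows the sort-and-truncate step.

```rust
/// Return the `n` highest scores as (original index, score) pairs,
/// ordered from most to least relevant.
/// Hypothetical helper for illustration; not part of gllm.
fn top_n(scores: &[f32], n: usize) -> Vec<(usize, f32)> {
    // Pair every score with the index of the document it belongs to.
    let mut ranked: Vec<(usize, f32)> = scores.iter().copied().enumerate().collect();
    // Sort descending by score; total_cmp gives a total order over f32.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Keep at most `n` entries.
    ranked.truncate(n);
    ranked
}
```

For scores [0.9, 0.1, 0.7] and n = 2, this keeps documents 0 and 2, in that order.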

Async Usage

[dependencies]
gllm = { version = "0.4", features = ["tokio"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }

use gllm::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("bge-small-en").await?;

    let response = client
        .embeddings(["Hello world"])
        .generate()
        .await?;

    println!("Generated {} embedding(s)", response.embeddings.len());
    Ok(())
}

GPU Detection (v0.4.1+)

use gllm::GpuCapabilities;

// Detect GPU capabilities (cached after first call)
let caps = GpuCapabilities::detect();

println!("GPU: {} ({:?})", caps.name, caps.gpu_type);
println!("VRAM: {} MB", caps.vram_mb);
println!("Recommended batch size: {}", caps.recommended_batch_size);

if caps.gpu_available {
    println!("Using {} backend", caps.backend_name);
}
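
One way to use the recommended batch size is to split the input texts into chunks before embedding them. This is a minimal sketch using only the standard library. The helper into_batches is hypothetical and not part of gllm, and it assumes the value of recommended_batch_size can be converted to a usize.

```rust
/// Split `texts` into consecutive batches of at most `batch_size` items.
/// A `batch_size` of 0 is treated as 1, because `chunks` panics on 0.
/// Hypothetical helper for illustration; not part of gllm.
fn into_batches<'a>(texts: &'a [&'a str], batch_size: usize) -> Vec<&'a [&'a str]> {
    texts.chunks(batch_size.max(1)).collect()
}
```

Each returned slice could then be passed to a separate embeddings call, so that a single request never exceeds the size the GPU is expected to handle.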

FallbackEmbedder (Automatic GPU/CPU Fallback)

use gllm::FallbackEmbedder;

// Automatically falls back to the CPU if the GPU runs out of memory
let embedder = FallbackEmbedder::new("bge-small-en").await?;
let vector = embedder.embed("Hello world").await?;
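
The general pattern behind this kind of fallback can be sketched with two closures: try the preferred path first, and run the second only if the first returns an error. This is an illustration of the pattern, not how FallbackEmbedder is implemented; with_fallback is a hypothetical name.

```rust
/// Run `primary`; if it returns an error, discard that error and
/// return whatever `fallback` produces instead.
/// Hypothetical helper for illustration; not gllm's implementation.
fn with_fallback<T, E>(
    primary: impl FnOnce() -> Result<T, E>,
    fallback: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    match primary() {
        Ok(value) => Ok(value),
        // The fallback closure is only invoked on failure.
        Err(_) => fallback(),
    }
}
```

A real implementation would likely inspect the error and fall back only on out-of-memory conditions, rather than on every failure as this sketch does.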

Supported Models

Embedding Models (23)

Model          Alias                        Dimensions  Best For
BGE Small EN   bge-small-en                 384         Fast English
BGE Base EN    bge-base-en                  768         Balanced English
BGE Large EN   bge-large-en                 1024        High accuracy
BGE Small ZH   bge-small-zh                 512         Chinese
E5 Small       e5-small                     384         Instruction tuned
E5 Base        e5-base                      768         Instruction tuned
E5 Large       e5-large                     1024        Instruction tuned
MiniLM L6      all-MiniLM-L6-v2             384         General purpose
MiniLM L12     all-MiniLM-L12-v2            384         General (larger)
MPNet Base     all-mpnet-base-v2            768         High quality
JINA v2 Base   jina-embeddings-v2-base-en   768         Modern arch
JINA v2 Small  jina-embeddings-v2-small-en  384         Lightweight
JINA v4        jina-embeddings-v4           2048        Latest JINA
Qwen3 0.6B     qwen3-embedding-0.6b         1024        Lightweight
Qwen3 4B       qwen3-embedding-4b           2560        Balanced
Qwen3 8B       qwen3-embedding-8b           4096        High accuracy
Nemotron 8B    llama-embed-nemotron-8b      4096        State-of-the-art
M3E Base       m3e-base                     768         Chinese quality
Multilingual   multilingual-MiniLM-L12-v2   384         50+ languages


Reranking Models (12)

Model                Alias                     Speed      Best For
BGE Reranker v2      bge-reranker-v2           Medium     Multilingual
BGE Reranker Large   bge-reranker-large        Slow       High accuracy
BGE Reranker Base    bge-reranker-base         Fast       Quick reranking
MS MARCO MiniLM L6   ms-marco-MiniLM-L-6-v2    Fast       Search
MS MARCO MiniLM L12  ms-marco-MiniLM-L-12-v2   Medium     Better search
MS MARCO TinyBERT    ms-marco-TinyBERT-L-2-v2  Very Fast  Lightweight
Qwen3 Reranker 0.6B  qwen3-reranker-0.6b       Fast       Lightweight
Qwen3 Reranker 4B    qwen3-reranker-4b         Medium     Balanced
Qwen3 Reranker 8B    qwen3-reranker-8b         Slow       High accuracy
JINA Reranker v3     jina-reranker-v3          Medium     Latest JINA

Custom Models

// Any HuggingFace SafeTensors model
let client = Client::new("sentence-transformers/all-MiniLM-L6-v2")?;

// Or use colon notation
let client = Client::new("sentence-transformers:all-MiniLM-L6-v2")?;

Quantization (Qwen3 Series)

use gllm::ModelRegistry;

let registry = ModelRegistry::new();

// Use :suffix for quantized variants
let info = registry.resolve("qwen3-embedding-8b:int4")?;  // Int4
let info = registry.resolve("qwen3-embedding-8b:awq")?;   // AWQ
let info = registry.resolve("qwen3-reranker-4b:gptq")?;   // GPTQ

Supported quantization types: :int4, :int8, :awq, :gptq, :gguf, :fp8, :bnb4, :bnb8

Models with quantization: Qwen3 Embedding/Reranker series, Nemotron 8B
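
Because a colon is also accepted in custom model names such as sentence-transformers:all-MiniLM-L6-v2, a resolver has to tell a quantization suffix apart from an organization prefix. The sketch below shows one possible way to do that by checking the part after the last colon against the suffix list above. It is an assumption about how such parsing could work, not the actual ModelRegistry logic; split_quantization is a hypothetical name.

```rust
/// Quantization suffixes listed in the documentation above.
const QUANT_SUFFIXES: [&str; 8] = [
    "int4", "int8", "awq", "gptq", "gguf", "fp8", "bnb4", "bnb8",
];

/// Split "name:suffix" into (name, Some(suffix)) when the text after the
/// last ':' is a known quantization type; otherwise return the whole
/// string unchanged together with None.
/// Hypothetical helper for illustration; not gllm's resolver.
fn split_quantization(spec: &str) -> (&str, Option<&str>) {
    if let Some((base, suffix)) = spec.rsplit_once(':') {
        if QUANT_SUFFIXES.iter().any(|known| *known == suffix) {
            return (base, Some(suffix));
        }
    }
    (spec, None)
}
```

With this approach, "qwen3-embedding-8b:int4" splits into the base name and "int4", while "sentence-transformers:all-MiniLM-L6-v2" is left intact because its trailing part is not a known suffix.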

Advanced Usage

Custom Configuration

use gllm::{Client, ClientConfig, Device};

let config = ClientConfig {
    models_dir: "/custom/path".into(),
    device: Device::Auto,  // or Device::Cpu, Device::Gpu
};

let client = Client::with_config("bge-small-en", config)?;

Vector Search Example

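// `documents` (the texts to search) and `cosine_similarity` are assumed
// to be defined by the caller; this snippet does not define them.
let query = client.embeddings(["search query"]).generate()?;
let query_vec = query.embeddings[0].embedding.clone();
let doc_vecs = client.embeddings(documents).generate()?;

// Calculate the cosine similarity between the query and each document
for (i, doc) in doc_vecs.embeddings.iter().enumerate() {
    let sim = cosine_similarity(&query_vec, &doc.embedding);
    println!("Doc {}: {:.4}", i, sim);
}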
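
The snippet above calls cosine_similarity without defining it. A minimal version could look like the following sketch, which assumes embeddings are stored as f32 values (a Vec<f32> coerces to the slice type used here). Returning 0.0 for mismatched lengths or zero-norm vectors is a design choice of this sketch, not documented gllm behavior.

```rust
/// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
/// Returns 0.0 when the lengths differ, the input is empty, or either
/// vector has zero norm, so the caller never receives NaN.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}
```

Identical directions score 1.0, orthogonal vectors score 0.0, and opposite directions score -1.0, so sorting documents by this value in descending order yields a relevance ranking.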

Model Storage

Models are cached in ~/.gllm/models/:

~/.gllm/models/
├── BAAI--bge-small-en-v1.5/
│   ├── model.safetensors
│   ├── config.json
│   └── tokenizer.json
└── ...
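
The directory name in the listing suggests that a HuggingFace repository id such as BAAI/bge-small-en-v1.5 is mapped to a folder by replacing the slash with two hyphens. The sketch below builds such a path under that assumption. The replacement rule is inferred from the single example above, and model_cache_dir is a hypothetical helper, not a gllm function.

```rust
use std::path::{Path, PathBuf};

/// Build the cache directory for a repository id like "BAAI/bge-small-en-v1.5".
/// Assumption: '/' in the id becomes "--", as in the listing above.
/// Hypothetical helper for illustration; not part of gllm.
fn model_cache_dir(home: &Path, repo_id: &str) -> PathBuf {
    home.join(".gllm")
        .join("models")
        .join(repo_id.replace('/', "--"))
}
```

Checking whether model.safetensors exists inside the returned directory would be one way to tell if a model has already been downloaded.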

Performance

Backend  Device           Throughput (512 tokens)
WGPU     RTX 4090         ~150 texts/sec
WGPU     Apple M2         ~45 texts/sec
CPU      Intel i7-12700K  ~8 texts/sec
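
To relate these figures to a concrete workload, the throughput can be turned into a rough time estimate. The sketch below is simple arithmetic on the approximate numbers in the table; estimated_seconds is a hypothetical helper, and real timings will vary with model, text length, and batch size.

```rust
/// Rough time in seconds to embed `num_texts` texts at a steady
/// throughput of `texts_per_sec`. Ignores model load and warm-up time.
fn estimated_seconds(num_texts: u64, texts_per_sec: f64) -> f64 {
    num_texts as f64 / texts_per_sec
}
```

For 10,000 texts, this gives about 67 seconds at 150 texts/sec, about 222 seconds at 45 texts/sec, and 1,250 seconds at 8 texts/sec.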

Testing

cargo test --lib              # Unit tests
cargo test --test integration # Integration tests
cargo test -- --ignored       # E2E tests (downloads models)

License

MIT License - see LICENSE

Acknowledgments


Built with Rust