# RuVector ONNX Embeddings

Production-ready ONNX-based embedding generation for semantic search and RAG pipelines, in pure Rust.

This library provides a complete embedding generation system built entirely in Rust on top of ONNX Runtime. It is designed for high-performance vector databases, semantic search engines, and AI applications.
## Table of Contents
- Features
- Quick Start
- Installation
- Supported Models
- Tutorial: Step-by-Step Guide
- Configuration Reference
- Pooling Strategies
- Performance Benchmarks
- API Reference
- Architecture
- Troubleshooting
## Features
| Feature | Description | Status |
|---|---|---|
| Native ONNX Runtime | Direct ONNX model execution via ort 2.0 | ✅ |
| Pretrained Models | 8 popular sentence-transformer models | ✅ |
| HuggingFace Integration | Download any compatible model from HF Hub | ✅ |
| Multiple Pooling | Mean, CLS, Max, MeanSqrtLen, LastToken, WeightedMean | ✅ |
| Batch Processing | Efficient batch embedding with configurable size | ✅ |
| GPU Acceleration | CUDA, TensorRT, CoreML support | ✅ |
| Vector Search | Built-in similarity search (cosine, euclidean, dot) | ✅ |
| RAG Pipeline | Ready-to-use retrieval-augmented generation | ✅ |
| Thread-Safe | Safe concurrent use via RwLock | ✅ |
| Zero Python | Pure Rust - no Python dependencies | ✅ |
## Quick Start

```rust
// Minimal sketch; the code body was lost in this copy, so the crate name and
// method names below are reconstructed from this README - check the crate docs.
use ruvector_onnx_embeddings::Embedder;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load the default pretrained model (all-MiniLM-L6-v2)
    let embedder = Embedder::default_model().await?;

    // Generate a 384-dimensional embedding
    let embedding = embedder.embed("Hello, world!").await?;
    println!("Dimensions: {}", embedding.len());
    Ok(())
}
```
## Installation

### Step 1: Add Dependencies

```toml
[dependencies]
# Crate name assumed from the example path; adjust to match your checkout
ruvector-onnx-embeddings = { path = "examples/onnx-embeddings" }
tokio = { version = "1", features = ["full"] }
anyhow = "1.0"
```
### Step 2: Choose Features (Optional)

| Feature | Command | Description |
|---|---|---|
| Default | `cargo build` | CPU inference |
| CUDA | `cargo build --features cuda` | NVIDIA GPU |
| TensorRT | `cargo build --features tensorrt` | NVIDIA optimized |
| CoreML | `cargo build --features coreml` | Apple Silicon |
### Step 3: Run Examples

```shell
# Basic example
# Full demo with all features
```
## Supported Models

### Model Comparison Table
| Model | Dimension | Max Tokens | Size | Speed | Quality | Best For |
|---|---|---|---|---|---|---|
| `AllMiniLmL6V2` | 384 | 256 | 23MB | ⚡⚡⚡ | ⭐⭐⭐ | Default - Fast, general-purpose |
| `AllMiniLmL12V2` | 384 | 256 | 33MB | ⚡⚡ | ⭐⭐⭐⭐ | Better quality, balanced |
| `AllMpnetBaseV2` | 768 | 384 | 110MB | ⚡ | ⭐⭐⭐⭐⭐ | Best quality, production |
| `E5SmallV2` | 384 | 512 | 33MB | ⚡⚡⚡ | ⭐⭐⭐⭐ | Search & retrieval |
| `E5BaseV2` | 768 | 512 | 110MB | ⚡ | ⭐⭐⭐⭐⭐ | High-quality search |
| `BgeSmallEnV15` | 384 | 512 | 33MB | ⚡⚡⚡ | ⭐⭐⭐⭐ | State-of-the-art small |
| `BgeBaseEnV15` | 768 | 512 | 110MB | ⚡ | ⭐⭐⭐⭐⭐ | Best overall quality |
| `GteSmall` | 384 | 512 | 33MB | ⚡⚡⚡ | ⭐⭐⭐⭐ | Multilingual support |
### Model Selection Flowchart

```
┌───────────────────────────────────────────────────────────────┐
│                   Which Model Should I Use?                   │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Priority: Speed?     ──────►  AllMiniLmL6V2 (23MB, 384d)     │
│                                                               │
│  Priority: Quality?   ──────►  AllMpnetBaseV2 (110MB, 768d)   │
│                                                               │
│  Building search?     ──────►  BgeSmallEnV15 or E5SmallV2     │
│                                                               │
│  Multilingual?        ──────►  GteSmall                       │
│                                                               │
│  Production RAG?      ──────►  BgeBaseEnV15 or E5BaseV2       │
│                                                               │
│  Memory constrained?  ──────►  AllMiniLmL6V2                  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
## Tutorial: Step-by-Step Guide

### Step 1: Basic Embedding Generation
Goal: Generate your first embedding and understand the output.
```rust
// Sketch reconstructed from the expected output below; the crate name and
// accessor names are assumptions - check the crate docs for exact signatures.
use ruvector_onnx_embeddings::Embedder;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    println!("Loading model...");
    let embedder = Embedder::default_model().await?; // all-MiniLM-L6-v2

    let text = "The quick brown fox jumps over the lazy dog.";
    let embedding = embedder.embed(text).await?;
    println!("Output shape: [{} dimensions]", embedding.len());
    println!("First 5 values: {:?}", &embedding[..5]);

    // Compare a related and an unrelated sentence pair;
    // embeddings are L2-normalized, so the dot product is cosine similarity
    let a = embedder.embed("I love programming in Rust.").await?;
    let b = embedder.embed("Rust is my favorite programming language.").await?;
    println!("Similarity: {:.4}", a.iter().zip(&b).map(|(x, y)| x * y).sum::<f32>());
    Ok(())
}
```
Expected Output:

```text
Loading model...
Model: all-MiniLM-L6-v2
Dimension: 384
Max tokens: 256

Input: "The quick brown fox jumps over the lazy dog."
Output shape: [384 dimensions]
First 5 values: [0.0234, -0.0156, 0.0891, -0.0412, 0.0567]

Similarity comparisons:
  "I love programming in Rust." vs "Rust is my favorite programming language."
  Similarity: 0.8523 (high - related topics)

  "I love programming in Rust." vs "The weather is nice today."
  Similarity: 0.1234 (low - unrelated topics)
```
### Step 2: Batch Processing
Goal: Efficiently process multiple texts at once.
```rust
// Illustrative sketch; the code body was lost in this copy, so method names
// are reconstructed from this README.
use ruvector_onnx_embeddings::Embedder;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let embedder = Embedder::default_model().await?;

    let texts = vec![
        "First document.",
        "Second document.",
        "Third document.",
    ];

    // One call tokenizes and runs the whole batch through ONNX Runtime
    let embeddings = embedder.embed_batch(&texts).await?;
    println!("Generated {} embeddings", embeddings.len());
    Ok(())
}
```
Performance Table: Batch Size vs Throughput
| Batch Size | Time (8 texts) | Throughput | Memory |
|---|---|---|---|
| 1 | 45ms | 178/sec | 150MB |
| 8 | 35ms | 228/sec | 160MB |
| 32 | 28ms | 285/sec | 180MB |
| 64 | 25ms | 320/sec | 200MB |
### Step 3: Building a Semantic Search Engine
Goal: Create a searchable knowledge base with semantic understanding.
```rust
// Illustrative sketch; type and method names are reconstructed from this README.
use ruvector_onnx_embeddings::{Embedder, VectorIndex};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let embedder = Embedder::default_model().await?;
    let mut index = VectorIndex::new(embedder.dimension());

    // Index a small knowledge base
    let documents = [
        "JavaScript is the language of the web...",
        "Rust is a systems programming language...",
        "Python is widely used for machine learning...",
        "Swift is Apple's modern language for iOS...",
    ];
    for doc in documents {
        let vector = embedder.embed(doc).await?;
        index.insert(doc, vector);
    }

    // Semantic search: embed the query, then rank by similarity
    let query = embedder.embed("What language is best for web development?").await?;
    for (doc, score) in index.search(&query, 3) {
        println!("{score:.2}  {doc}");
    }
    Ok(())
}
```
Search Results Table:
| Query | Top Result | Score |
|---|---|---|
| "What language is best for web development?" | "JavaScript is the language of the web..." | 0.82 |
| "high-performance system application" | "Rust is a systems programming language..." | 0.78 |
| "machine learning" | "Python is widely used for machine learning..." | 0.85 |
| "mobile app development" | "Swift is Apple's modern language for iOS..." | 0.76 |
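Scores like those above are cosine similarity; because the library L2-normalizes embeddings by default, cosine reduces to a plain dot product. A standalone sketch of the math (not the library's internal code):

```rust
/// Cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|), with 0.0 returned for zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

fn main() {
    let a = [1.0, 0.0, 1.0];
    let b = [1.0, 0.0, 1.0];
    let c = [0.0, 1.0, 0.0];
    println!("identical:  {:.4}", cosine_similarity(&a, &b)); // 1.0000
    println!("orthogonal: {:.4}", cosine_similarity(&a, &c)); // 0.0000
}
```

For already-normalized vectors the two `sqrt` terms are 1, which is why the index can skip them at query time.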
### Step 4: Creating a RAG Pipeline
Goal: Build a retrieval-augmented generation system for LLM context.
```rust
// Illustrative sketch; type and method names are reconstructed from this README.
use ruvector_onnx_embeddings::{Embedder, RagPipeline};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let embedder = Embedder::default_model().await?;
    let mut rag = RagPipeline::new(embedder);

    // Index the documents the LLM should be able to cite
    rag.add_document("Rust is a systems programming language...").await?;

    // Retrieve the top chunks and format them as LLM context
    let context = rag.retrieve("What is Rust?", 3).await?;
    println!("{context}");
    Ok(())
}
```
RAG Pipeline Flow:

```
┌──────────┐     ┌─────────────┐     ┌──────────┐     ┌─────────┐
│  Query   │───► │  Embedder   │───► │  Search  │───► │ Context │
│          │     │             │     │  Index   │     │         │
└──────────┘     └─────────────┘     └──────────┘     └────┬────┘
                                                           │
                                                           v
┌──────────┐     ┌─────────────┐     ┌──────────┐     ┌─────────┐
│ Response │◄─── │     LLM     │◄─── │  Prompt  │◄─── │ Format  │
│          │     │ (external)  │     │          │     │         │
└──────────┘     └─────────────┘     └──────────┘     └─────────┘
```
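The Format step in the flow above just assembles the retrieved chunks into an LLM prompt. A standalone sketch of one possible layout (the function name and prompt format are illustrative, not the library's):

```rust
/// Assemble retrieved chunks into an LLM prompt.
/// `chunks` pairs each retrieved text with its similarity score.
fn format_prompt(query: &str, chunks: &[(&str, f32)]) -> String {
    let mut prompt = String::from("Answer using only the context below.\n\nContext:\n");
    for (i, (text, score)) in chunks.iter().enumerate() {
        // Number each chunk so the LLM can cite it
        prompt.push_str(&format!("[{}] (score {:.2}) {}\n", i + 1, score, text));
    }
    prompt.push_str(&format!("\nQuestion: {}\nAnswer:", query));
    prompt
}

fn main() {
    let chunks = [("Rust is a systems programming language...", 0.78)];
    println!("{}", format_prompt("What is Rust?", &chunks));
}
```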
### Step 5: Text Clustering
Goal: Automatically group similar texts together.
```rust
// Illustrative sketch; the code body was lost in this copy, so method names
// and sample texts are reconstructed from the output table below.
use ruvector_onnx_embeddings::Embedder;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let embedder = Embedder::default_model().await?;
    let texts = vec![
        "AI is revolutionizing every industry.",
        "Football is popular worldwide.",
        "Italian pasta is a classic dish.",
    ];
    // Embed every text, then group texts whose embeddings have
    // high pairwise cosine similarity
    let embeddings = embedder.embed_batch(&texts).await?;
    println!("Embedded {} texts to cluster", embeddings.len());
    Ok(())
}
```
Expected Clustering Output:
| Cluster | Category | Texts |
|---|---|---|
| 0 | Technology | AI revolutionizing..., ML algorithms..., Neural networks... |
| 1 | Sports | Football popular..., Basketball speed..., Tennis courts... |
| 2 | Food | Italian pasta..., Sushi traditional..., French cuisine... |
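A grouping like the one above can be reproduced with a single greedy pass over pairwise cosine similarities. A standalone sketch (the `0.8` threshold is an arbitrary value to tune for your data):

```rust
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Greedy clustering: assign each vector to the first cluster whose
/// representative (first member) is within `threshold`, else open a new cluster.
fn cluster(vectors: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    let mut reps: Vec<usize> = Vec::new(); // index of each cluster's first member
    let mut labels = Vec::with_capacity(vectors.len());
    for (i, v) in vectors.iter().enumerate() {
        match reps.iter().position(|&r| cosine(v, &vectors[r]) >= threshold) {
            Some(c) => labels.push(c),
            None => {
                reps.push(i);
                labels.push(reps.len() - 1);
            }
        }
    }
    labels
}

fn main() {
    // Two similar vectors and one unrelated vector
    let vectors = vec![vec![1.0, 0.0], vec![0.9, 0.1], vec![0.0, 1.0]];
    println!("{:?}", cluster(&vectors, 0.8)); // [0, 0, 1]
}
```

Real pipelines often use k-means or HDBSCAN instead, but the greedy pass shows the core idea: cluster membership is decided entirely by embedding similarity.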
## Configuration Reference

### EmbedderConfig Options
| Option | Type | Default | Description |
|---|---|---|---|
| `model_source` | `ModelSource` | `Pretrained` | Where to load model from |
| `batch_size` | `usize` | 32 | Texts per inference batch |
| `max_length` | `usize` | 512 | Maximum tokens per text |
| `pooling` | `PoolingStrategy` | `Mean` | Token aggregation method |
| `normalize` | `bool` | `true` | L2 normalize embeddings |
| `num_threads` | `usize` | 4 | ONNX Runtime threads |
| `cache_dir` | `PathBuf` | `~/.cache/ruvector` | Model cache directory |
| `show_progress` | `bool` | `true` | Show download progress |
| `optimize_graph` | `bool` | `true` | ONNX graph optimization |
### Using EmbedderBuilder

```rust
use ruvector_onnx_embeddings::{EmbedderBuilder, PretrainedModel, PoolingStrategy};

// Argument values are illustrative; the original snippet elided them
let embedder = EmbedderBuilder::new()
    .pretrained(PretrainedModel::AllMiniLmL6V2) // Choose model
    .batch_size(32)                             // Batch size
    .max_length(256)                            // Max tokens
    .pooling(PoolingStrategy::Mean)             // Pooling strategy
    .normalize(true)                            // L2 normalize
    .build()
    .await?;
```
## Pooling Strategies

| Strategy | Method | Best For | Example Use |
|---|---|---|---|
| `Mean` | Average all tokens | General purpose | Default choice |
| `Cls` | [CLS] token only | BERT-style models | Classification |
| `Max` | Max across tokens | Keyword matching | Entity extraction |
| `MeanSqrtLen` | Mean / sqrt(len) | Length-invariant | Mixed-length comparison |
| `LastToken` | Final token | Decoder models | GPT-style |
| `WeightedMean` | Position-weighted | Custom scenarios | Special cases |
### Choosing a Strategy

```text
Text Type           Recommended Strategy
────────────────────────────────────────
Short sentences     Mean (default)
Long documents      MeanSqrtLen
BERT fine-tuned     Cls
Keyword search      Max
Decoder models      LastToken
```
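Every strategy collapses the model's token-by-dimension output matrix into a single vector; they differ only in the reduction. A standalone sketch of three of them (not the library's internal code):

```rust
/// Mean pooling: average each dimension across all tokens.
fn mean_pool(tokens: &[Vec<f32>]) -> Vec<f32> {
    let dim = tokens[0].len();
    let mut out = vec![0.0f32; dim];
    for t in tokens {
        for (o, x) in out.iter_mut().zip(t) {
            *o += x;
        }
    }
    out.iter().map(|x| x / tokens.len() as f32).collect()
}

/// Max pooling: per-dimension maximum across tokens.
fn max_pool(tokens: &[Vec<f32>]) -> Vec<f32> {
    let dim = tokens[0].len();
    (0..dim)
        .map(|d| tokens.iter().map(|t| t[d]).fold(f32::MIN, f32::max))
        .collect()
}

/// MeanSqrtLen: token sum divided by sqrt(token count),
/// which dampens the effect of text length.
fn mean_sqrt_len_pool(tokens: &[Vec<f32>]) -> Vec<f32> {
    let n = tokens.len() as f32;
    mean_pool(tokens).iter().map(|x| x * n / n.sqrt()).collect()
}

fn main() {
    // Two "tokens" of dimension 2 stand in for a model's output matrix
    let tokens = vec![vec![1.0, 4.0], vec![3.0, 2.0]];
    println!("{:?}", mean_pool(&tokens));          // [2.0, 3.0]
    println!("{:?}", max_pool(&tokens));           // [3.0, 4.0]
    println!("{:?}", mean_sqrt_len_pool(&tokens)); // token sum / sqrt(2)
}
```

`Cls` and `LastToken` are simpler still: they just copy the first or last row of the matrix.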
## Performance Benchmarks

### Embedding Generation Speed

Tested on AMD EPYC 7763 (64-core), Ubuntu 22.04.
| Configuration | Single Text | Batch 32 | Batch 128 | Throughput |
|---|---|---|---|---|
| CPU (1 thread) | 22ms | 180ms | 680ms | 188/sec |
| CPU (8 threads) | 18ms | 85ms | 310ms | 413/sec |
| CUDA A100 | 4ms | 15ms | 45ms | 2,844/sec |
| TensorRT A100 | 2ms | 8ms | 25ms | 5,120/sec |
### Memory Usage
| Model | Parameters | ONNX Size | Runtime RAM | GPU VRAM |
|---|---|---|---|---|
| AllMiniLmL6V2 | 22M | 23MB | 150MB | 200MB |
| AllMpnetBaseV2 | 109M | 110MB | 400MB | 600MB |
| BgeBaseEnV15 | 109M | 110MB | 400MB | 600MB |
### Similarity Search Latency
| Index Size | Insert Time | Search (top-10) | Memory |
|---|---|---|---|
| 1,000 | 0.5s | 0.2ms | 2MB |
| 10,000 | 4s | 0.5ms | 15MB |
| 100,000 | 40s | 2ms | 150MB |
| 1,000,000 | 7min | 8ms | 1.5GB |
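The search latencies above grow roughly linearly with index size, which matches an exhaustive scan: every stored vector is scored against the query. A standalone brute-force top-k sketch over normalized vectors (not the library's internal index):

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Brute-force top-k: score every entry against the query, keep the best k.
/// For normalized vectors the dot product equals cosine similarity.
fn top_k(index: &[(usize, Vec<f32>)], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> =
        index.iter().map(|(id, v)| (*id, dot(v, query))).collect();
    // Sort by score, highest first, then truncate to k results
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let index = vec![
        (0, vec![1.0, 0.0]),
        (1, vec![0.0, 1.0]),
        (2, vec![0.7, 0.7]),
    ];
    let hits = top_k(&index, &[1.0, 0.0], 2);
    println!("{:?}", hits); // [(0, 1.0), (2, 0.7)]
}
```

This is O(n) per query, which is why the 1M-row latency is ~8ms while 1K rows take ~0.2ms; approximate indexes (e.g. HNSW) trade exactness for sublinear search.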
## API Reference

### Core Types

```rust
// Declarations were elided in this copy; names below are reconstructed
// from this README (Embedder appears in the tutorial; the others are assumed)
pub struct Embedder;     // Main Embedder
pub struct VectorIndex;  // Search Index
pub struct RagPipeline;  // RAG Pipeline
```
## Architecture

```
┌───────────────────────────────────────────────────────────────────────┐
│                        RuVector ONNX Embeddings                       │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐     │
│  │   Text    │ -> │ Tokenizer │ -> │   ONNX    │ -> │  Pooling  │     │
│  │   Input   │    │ (HF Rust) │    │  Runtime  │    │  Strategy │     │
│  └───────────┘    └───────────┘    └───────────┘    └─────┬─────┘     │
│                                                           │           │
│                                                           v           │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐     │
│  │  Search   │ <- │  Vector   │ <- │ Normalize │ <- │ Embedding │     │
│  │  Results  │    │   Index   │    │   (L2)    │    │  Vector   │     │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```
## Troubleshooting

### Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Model download fails | Network/firewall | Use local model or check connection |
| Out of memory | Large model/batch | Reduce batch_size or use smaller model |
| Slow inference | CPU-bound | Enable GPU or increase num_threads |
| Dimension mismatch | Different models | Ensure same model for index and query |
| CUDA not found | Missing driver | Install CUDA toolkit and drivers |
### Debugging Tips

```rust
// Enable verbose logging (assumes the `env_logger` crate)
std::env::set_var("RUST_LOG", "debug");
env_logger::init();

// Check model loading; accessor names are reconstructed from this README
let embedder = Embedder::default_model().await?;
println!("Dimension: {}", embedder.dimension());
println!("Max length: {}", embedder.max_length());
```
### Running Benchmarks

```shell
# Run all benchmarks
# Generate HTML report
```
## Examples

```shell
# Basic embedding
# Batch processing
# Semantic search
# Full interactive demo
```
## License
MIT License - See LICENSE for details.
Built with Rust for the RuVector ecosystem.