RuvLLM
The local LLM inference engine that learns from every request -- Metal, CUDA, WebGPU, no cloud APIs.
RuvLLM loads GGUF models and runs them on your hardware with full acceleration -- Apple Silicon, NVIDIA GPUs, WebAssembly, whatever you have. Unlike other local inference tools, it gets smarter over time: SONA (Self-Optimizing Neural Architecture) watches how you use it and adapts automatically, so responses improve without manual tuning. It's part of RuVector, the self-learning vector database with graph intelligence.
| | RuvLLM | OpenAI API | llama.cpp | Ollama | vLLM |
|---|---|---|---|---|---|
| Cost | Free after hardware | Per-token billing | Free | Free | Free |
| Privacy | Data stays on your machine | Sent to third party | Local | Local | Local |
| Self-learning | SONA adapts automatically | Static | Static | Static | Static |
| Per-request tuning | MicroLoRA in <1 ms | Not available | Not available | Not available | Not available |
| Hardware support | Metal, CUDA, ANE, WebGPU, CPU | N/A | Metal, CUDA, CPU | Metal, CUDA, CPU | CUDA only |
| WASM / Browser | Yes (5.5 KB runtime) | Via network call | Not available | Not available | Not available |
| Vector DB integration | Built-in (RuVector) | Separate service | Not available | Not available | Not available |
| Speculative decoding | Yes | N/A | Yes | No | Yes |
| Continuous batching | Yes | N/A | No | No | Yes |
| Production serving | mistral-rs backend | N/A | Server mode | Server mode | Native |
Key Features
| Feature | What It Does | Why It Matters |
|---|---|---|
| SONA three-tier learning | Adapts to your queries at three speeds: instant (<1 ms), background (~100 ms), deep (minutes) | Responses improve automatically without manual retraining |
| Metal + CUDA + ANE | Hardware-accelerated inference across Apple Silicon, NVIDIA GPUs, and Apple Neural Engine | Get the most out of whatever hardware you have |
| Flash Attention 2 | Memory-efficient attention with O(N) complexity and online softmax | Longer contexts with less memory |
| GGUF memory mapping | Memory-mapped model loading with quantization (Q4K, Q8, FP16) | Load large models fast, use 4-8x less RAM |
| Speculative decoding | Draft model generates candidates, target model verifies in parallel | 2-3x faster text generation |
| Continuous batching | Dynamic batch scheduling for concurrent requests | 2-3x throughput improvement for serving |
| MicroLoRA | Per-request fine-tuning with rank 1-2 adapters | Personalize responses in <1 ms without full retraining |
| HuggingFace Hub | Download and upload models directly | One-line model access, easy sharing |
| mistral-rs backend | PagedAttention, X-LoRA, ISQ for production serving | Scale to 50+ concurrent users |
| Task-specific adapters | 5 pre-trained LoRA adapters (coder, researcher, security, architect, reviewer) | Instant specialization with hot-swap |
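The online softmax behind the Flash Attention 2 row above can be sketched in a few lines: scores are consumed in one streaming pass while only a running max and a rescaled running sum are kept, so per-row memory stays constant. A standalone sketch (not the crate's internal kernel):

```rust
/// Numerically stable streaming softmax: a single pass over the scores,
/// keeping only a running max `m` and a rescaled running denominator `d`.
fn online_softmax(scores: &[f32]) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY; // running max
    let mut d = 0.0f32;            // running denominator
    for &x in scores {
        let m_new = m.max(x);
        // rescale the old sum to the new max, then add the new term
        d = d * (m - m_new).exp() + (x - m_new).exp();
        m = m_new;
    }
    scores.iter().map(|&x| (x - m).exp() / d).collect()
}
```

Flash Attention applies the same rescaling trick block by block, which is how it avoids materializing the full attention matrix.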
Part of the RuVector ecosystem -- the self-learning vector database with graph intelligence, local AI, and PostgreSQL built in.
Quick Start
A minimal sketch (type and method names are illustrative; the crate's exact API may differ):

```rust
use ruvllm::prelude::*;

// Load a GGUF model on the best available device and generate.
let mut backend = CandleBackend::with_device(Device::auto())?;
backend.load_gguf("models/qwen2.5-7b-instruct-q4k.gguf")?;

let response = backend.generate("Hello, RuvLLM!", &GenerateOptions::default())?;
println!("{}", response.text);
```
Installation
Add to your Cargo.toml:
```toml
[dependencies]

# Recommended for Apple Silicon Mac
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }

# For NVIDIA GPUs
ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }

# Minimal (CPU only)
ruvllm = { version = "2.0" }
```
Or install the npm package:
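Assuming the published package name `@ruvector/ruvllm` (as used in the npm section below):

```shell
npm install @ruvector/ruvllm
```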
What's New in v2.3
| Feature | Description | Benefit |
|---|---|---|
| RuvLTRA-Medium 3B | Purpose-built 3B model for Claude Flow | 42 layers, 256K context, speculative decode |
| HuggingFace Hub | Full Hub integration (download/upload) | Easy model sharing and distribution |
| Task-Specific LoRA | 5 pre-trained adapters for agent types | Optimized for coder/researcher/security/architect/reviewer |
| Adapter Merging | TIES, DARE, SLERP, Task Arithmetic | Combine adapters for multi-task models |
| Hot-Swap Adapters | Zero-downtime adapter switching | Runtime task specialization |
| Claude Dataset | 2,700+ Claude-style training examples | Optimized for Claude Flow integration |
| HNSW Routing | 150x faster semantic pattern matching | <25µs pattern retrieval |
| Evaluation Harness | Real model evaluation with SWE-Bench | 5 ablation modes, quality metrics |
| HNSW Auto-Dimension | Automatic embedding dimension detection | No manual config needed |
| mistral-rs Backend | Production-scale serving with PagedAttention | 5-10x concurrent users, X-LoRA, ISQ |
Previous v2.0-2.2 Features
| Feature | Description | Benefit |
|---|---|---|
| Apple Neural Engine | Core ML backend with ANE routing | 38 TOPS, 3-4x power efficiency |
| Hybrid GPU+ANE Pipeline | Intelligent operation routing | Best of both accelerators |
| Multi-threaded GEMM | Rayon parallelization | 4-12x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing, online softmax | O(N) memory, +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K/Q8_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix operations | 3x speedup on Apple Silicon |
| GGUF Support | Memory-mapped model loading | Fast loading, reduced RAM |
| Continuous Batching | Dynamic batch scheduling | 2-3x throughput improvement |
| Speculative Decoding | Draft model acceleration | 2-3x faster generation |
| Gemma-2 & Phi-3 | New model architectures | Extended model support |
Backends
| Backend | Best For | Acceleration |
|---|---|---|
| Candle | Single user, edge, WASM | Metal, CUDA, CPU |
| Core ML | Apple Silicon efficiency | Apple Neural Engine (38 TOPS) |
| Hybrid Pipeline | Maximum throughput on Mac | GPU for attention, ANE for MLP |
| mistral-rs | Production serving (10-100 users) | PagedAttention, X-LoRA, ISQ |
Feature Flags
| Feature | Description |
|---|---|
| `candle` | Enable Candle backend (HuggingFace) |
| `metal` | Apple Silicon GPU acceleration via Candle |
| `metal-compute` | Native Metal compute shaders (M4 Pro optimized) |
| `cuda` | NVIDIA GPU acceleration |
| `coreml` | Apple Neural Engine via Core ML |
| `hybrid-ane` | GPU+ANE hybrid pipeline (recommended for Mac) |
| `inference-metal` | Full Metal inference stack |
| `inference-metal-native` | Metal + native shaders (best M4 Pro perf) |
| `inference-cuda` | Full CUDA inference stack |
| `parallel` | Multi-threaded GEMM/GEMV with Rayon |
| `accelerate` | Apple Accelerate BLAS (~2x GEMV speedup) |
| `gguf-mmap` | Memory-mapped GGUF loading |
| `async-runtime` | Tokio async support |
| `wasm` | WebAssembly support |
| `mistral-rs` | mistral-rs backend (PagedAttention, X-LoRA, ISQ) |
| `mistral-rs-metal` | mistral-rs with Apple Silicon acceleration |
| `mistral-rs-cuda` | mistral-rs with NVIDIA CUDA acceleration |
Architecture
+----------------------------------+
| Application |
+----------------------------------+
|
+----------------------------------+
| RuvLLM Backend |
| +----------------------------+ |
| | Hybrid Pipeline Router | |
| | ┌─────────┐ ┌──────────┐ | |
| | │ Metal │ │ ANE │ | |
| | │ GPU │ │ Core ML │ | |
| | └────┬────┘ └────┬─────┘ | |
| | │ ↕ │ | |
| | Attention MLP/FFN | |
| | RoPE Activations | |
| | Softmax LayerNorm | |
| +----------------------------+ |
| | |
| +----------------------------+ |
| | SONA Learning | |
| | - Instant (<1ms) | |
| | - Background (~100ms) | |
| | - Deep (minutes) | |
| +----------------------------+ |
| | |
| +----------------------------+ |
| | NEON/SIMD Kernels | |
| | - Flash Attention 2 | |
| | - Paged KV Cache | |
| | - Quantized MatMul | |
| +----------------------------+ |
+----------------------------------+
Supported Models
| Model Family | Sizes | Quantization | Backend |
|---|---|---|---|
| RuvLTRA-Small | 0.5B | Q4K, Q5K, Q8, FP16 | Candle/Metal/ANE |
| RuvLTRA-Medium | 3B | Q4K, Q5K, Q8, FP16 | Candle/Metal |
| Qwen 2.5 | 0.5B-72B | Q4K, Q8, FP16 | Candle/Metal |
| Llama 3.x | 8B-70B | Q4K, Q8, FP16 | Candle/Metal |
| Mistral | 7B-22B | Q4K, Q8, FP16 | Candle/Metal |
| Phi-3 | 3.8B-14B | Q4K, Q8, FP16 | Candle/Metal |
| Gemma-2 | 2B-27B | Q4K, Q8, FP16 | Candle/Metal |
RuvLTRA Models (Claude Flow Optimized)
| Model | Parameters | Hidden | Layers | Context | Features |
|---|---|---|---|---|---|
| RuvLTRA-Small | 494M | 896 | 24 | 32K | GQA 7:1, SONA hooks |
| RuvLTRA-Medium | 3.0B | 2560 | 42 | 256K | Flash Attention 2, Speculative Decode |
Performance (M4 Pro 14-core)
Inference Benchmarks
| Model | Quant | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|---|
| Qwen2.5-7B | Q4K | 2,800 | 95 | 4.2 GB |
| Qwen2.5-7B | Q8 | 2,100 | 72 | 7.8 GB |
| Llama3-8B | Q4K | 2,600 | 88 | 4.8 GB |
| Mistral-7B | Q4K | 2,500 | 85 | 4.1 GB |
| Phi-3-3.8B | Q4K | 3,500 | 135 | 2.3 GB |
| Gemma2-9B | Q4K | 2,200 | 75 | 5.2 GB |
ANE vs GPU Performance (M4 Pro)
| Dimension | ANE | GPU | Winner |
|---|---|---|---|
| < 512 | +30-50% | - | ANE |
| 512-1024 | +10-30% | - | ANE |
| 1024-1536 | ~Similar | ~Similar | Either |
| 1536-2048 | - | +10-20% | GPU |
| > 2048 | - | +30-50% | GPU |
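The table above reduces to a simple dimension-based routing heuristic. A standalone sketch using the measured crossover points (not the crate's actual router):

```rust
/// Pick an accelerator for a matmul of the given hidden dimension,
/// following the M4 Pro crossover points measured above.
fn route_by_dim(dim: usize) -> &'static str {
    match dim {
        d if d < 1024 => "ANE",     // ANE is 10-50% faster below ~1024
        d if d <= 1536 => "Either", // roughly tied in the 1024-1536 range
        _ => "GPU",                 // GPU pulls ahead above 1536
    }
}
```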
Kernel Benchmarks
| Kernel | Single-thread | Multi-thread (10-core) |
|---|---|---|
| GEMM 4096x4096 | 1.2 GFLOPS | 12.7 GFLOPS |
| GEMV 4096x4096 | 0.8 GFLOPS | 6.4 GFLOPS |
| Flash Attention (seq=2048) | 850μs | 320μs |
| RMS Norm (4096) | 2.1μs | 0.8μs |
| RoPE (4096, 128) | 4.3μs | 1.6μs |
Apple Neural Engine (ANE) Integration
RuvLLM v2.0 includes full ANE support via Core ML:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::coreml::CoreMlBackend;
use ruvllm::hybrid::HybridPipeline;

// Create an ANE-optimized backend
let backend = CoreMlBackend::new(CoreMlConfig::default())?;

// Or use the hybrid pipeline for best performance
let pipeline = HybridPipeline::new(HybridConfig::default())?;
```
ANE Routing Recommendations
| Operation | Recommended | Reason |
|---|---|---|
| Attention | GPU | Better for variable sequence lengths |
| Flash Attention | GPU | GPU memory bandwidth advantage |
| MLP/FFN | ANE | Optimal for fixed-size matmuls |
| GELU/SiLU | ANE | Dedicated activation units |
| LayerNorm/RMSNorm | ANE | Good for small dimensions |
| Embedding | GPU | Sparse operations |
MicroLoRA Real-Time Adaptation
RuvLLM supports per-request fine-tuning using MicroLoRA:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::lora::{Feedback, MicroLora, MicroLoraConfig};

// Create a MicroLoRA adapter sized for the model's hidden dimension
let config = MicroLoraConfig::for_hidden_dim(2560);
let mut lora = MicroLora::new(config);

// Adapt on user feedback
let feedback = Feedback::from_quality(0.92);
lora.adapt(&feedback)?;

// Apply learned updates
lora.apply_updates(0.01); // learning rate

// Get adaptation stats
let stats = lora.stats();
println!("adaptations applied: {}", stats.count);
```
SONA Three-Tier Learning
Continuous improvement with three learning loops:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::sona::{SonaLlm, SonaLlmConfig};

let config = SonaLlmConfig::default();
let mut sona = SonaLlm::new(config);

// 1. Instant loop (<1 ms): per-request MicroLoRA
let result = sona.instant_adapt(&request);
println!("instant adaptation: {:?}", result);

// 2. Background loop (~100 ms): pattern consolidation
if let Some(result) = sona.maybe_background() {
    // consolidated patterns are now available for routing
}

// 3. Deep loop (minutes): full optimization
if sona.should_trigger_deep() {
    sona.run_deep_optimization();
}

// Check learning stats
let stats = sona.stats();
println!("patterns learned: {}", stats.pattern_count);
println!("adaptations applied: {}", stats.adaptation_count);
```
Two-Tier KV Cache
Memory-efficient caching with automatic tiering:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::cache::{KvCacheConfig, TwoTierKvCache};

let config = KvCacheConfig::default();
let mut cache = TwoTierKvCache::new(config);
cache.append(layer_idx, &keys, &values)?;

// Automatic migration from the FP16 tail to the quantized store
let stats = cache.stats();
println!("tail entries (FP16): {}", stats.tail_entries);
println!("quantized entries:   {}", stats.quantized_entries);
println!("bytes saved:         {}", stats.bytes_saved);
```
Continuous Batching
High-throughput serving with dynamic batching:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::serving::{BatchScheduler, SchedulerConfig};

let mut scheduler = BatchScheduler::new(SchedulerConfig::default());

// Add requests
scheduler.add_request(request)?;

// Process batches as they form
while let Some(batch) = scheduler.get_next_batch() {
    // run `batch` through the backend, feed results back to the scheduler
}

// Get throughput stats
let stats = scheduler.stats();
println!("throughput: {} tok/s", stats.tokens_per_second);
println!("active requests: {}", stats.active_requests);
```
Speculative Decoding
Accelerate generation with draft models:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::speculative::{SpeculativeConfig, SpeculativeDecoder};

let config = SpeculativeConfig::default();
let decoder = SpeculativeDecoder::new(config)?;

// Generate with speculation: the draft model proposes, the target verifies
let output = decoder.generate(prompt)?;
println!("{}", output.text);
println!("draft acceptance rate: {:.1}%", output.acceptance_rate * 100.0);
```
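The 2-3x figure follows from emitting several draft tokens per target-model pass. With per-token acceptance probability `a` and draft length `k`, the expected tokens per verification step is the geometric series `(1 - a^(k+1)) / (1 - a)`. A standalone arithmetic sketch (not the crate's API):

```rust
/// Expected tokens emitted per verification step in speculative decoding,
/// given per-token acceptance probability `a` (0 <= a < 1) and draft length `k`.
fn expected_tokens(a: f64, k: u32) -> f64 {
    // Geometric series: 1 + a + a^2 + ... + a^k
    (1.0 - a.powi(k as i32 + 1)) / (1.0 - a)
}
```

For example, with `a = 0.8` and a draft length of 4, each verification step yields about 3.4 tokens on average, which is where the 2-3x end-to-end speedup comes from.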
GGUF Model Loading
Efficient loading with memory mapping:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::gguf::GgufLoader;

let loader = GgufLoader::new("models/llama3-8b-q4k.gguf");

// Read model metadata without loading tensor data
let metadata = loader.read_metadata()?;
println!("architecture: {}", metadata.architecture);
println!("layers: {}", metadata.layer_count);
println!("quantization: {:?}", metadata.quantization);

// Memory-map tensors into the backend
let tensors = loader.load_tensors()?;
backend.load_tensors(tensors)?;
```
mistral-rs Backend (Production Serving)
RuvLLM v2.3 includes integration with mistral-rs for production-scale LLM serving with advanced memory management.
Note: The mistral-rs crate is not yet published to crates.io. The integration is designed and ready -- enable it when mistral-rs becomes available.
Key Features
| Feature | Description | Benefit |
|---|---|---|
| PagedAttention | vLLM-style KV cache management | 5-10x concurrent users, 85-95% memory utilization |
| X-LoRA | Per-token adapter routing | <1ms routing overhead, multi-task inference |
| ISQ | In-Situ Quantization (AWQ, GPTQ, RTN) | Runtime quantization without re-export |
Usage Example
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::mistral_rs::{MistralRsBackend, MistralRsConfig};

// Configure the mistral-rs backend for production serving
let config = MistralRsConfig::builder()
    // PagedAttention: serve 50+ concurrent users
    .paged_attention(true)
    // X-LoRA: per-token adapter routing
    .xlora("adapters/")
    // ISQ: runtime quantization
    .isq("Q4K")
    .build();

let mut backend = MistralRsBackend::new(config)?;
backend.load_model("path/to/model")?;

// Generate with PagedAttention + X-LoRA
let response = backend.generate(prompt)?;
```
When to Use mistral-rs vs Candle
| Scenario | Recommended Backend | Reason |
|---|---|---|
| Single user / Edge | Candle | Simpler, smaller binary |
| 10-100 concurrent users | mistral-rs | PagedAttention memory efficiency |
| Multi-task models | mistral-rs | X-LoRA per-token routing |
| Runtime quantization | mistral-rs | ISQ without model re-export |
| WASM / Browser | Candle | mistral-rs doesn't support WASM |
Feature Flags
```toml
[dependencies]

# Enable mistral-rs (when available on crates.io)
ruvllm = { version = "2.3", features = ["mistral-rs"] }

# With Metal acceleration (Apple Silicon)
ruvllm = { version = "2.3", features = ["mistral-rs-metal"] }

# With CUDA acceleration (NVIDIA)
ruvllm = { version = "2.3", features = ["mistral-rs-cuda"] }
```
See ADR-008: mistral-rs Integration for detailed architecture decisions.
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `RUVLLM_CACHE_DIR` | Model cache directory | `~/.cache/ruvllm` |
| `RUVLLM_LOG_LEVEL` | Logging level | `info` |
| `RUVLLM_METAL_DEVICE` | Metal device index | `0` |
| `RUVLLM_ANE_ENABLED` | Enable ANE routing | `true` |
| `RUVLLM_SONA_ENABLED` | Enable SONA learning | `true` |
Model Configuration
```rust
// The field list was elided here; see the ModelConfig docs for options.
let config = ModelConfig::default();
```
Benchmarks
Run benchmarks with:
(Bench target names below are illustrative.)

```shell
# Attention benchmarks
cargo bench --bench attention

# ANE benchmarks (Mac only)
cargo bench --bench ane

# LoRA benchmarks
cargo bench --bench lora

# End-to-end inference
cargo bench --bench inference

# Metal shader benchmarks
cargo bench --bench metal_shaders

# Serving benchmarks
cargo bench --bench serving
```
HuggingFace Hub Integration (v2.3)
Download and upload models to HuggingFace Hub:
(Names and repository IDs below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::hub::{HubDownloader, HubUploader, RuvLtraRegistry};

// Download from the Hub
let downloader = HubDownloader::new();
let model_path = downloader.download("org/model-name")?;

// Or use the registry for RuvLTRA models
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium")?;

// Upload to the Hub (requires HF_TOKEN)
let uploader = HubUploader::new();
let url = uploader.upload(&model_path, "username/my-model")?;
println!("uploaded to {url}");
```
Task-Specific LoRA Adapters (v2.3)
Pre-trained adapters optimized for Claude Flow agent types:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::adapters::{AdapterMerger, HotSwapManager, MergeStrategy, TaskAdapters, TaskType};

// Create adapters for specific tasks
let adapters = TaskAdapters::new();
let coder = adapters.create_lora(TaskType::Coder)?;       // Rank 16, code generation
let security = adapters.create_lora(TaskType::Security)?; // Rank 16, vulnerability detection

// Available adapters:
// - coder:      Rank 16, Alpha 32.0, targets attention (Q,K,V,O)
// - researcher: Rank 8,  Alpha 16.0, targets Q,K,V
// - security:   Rank 16, Alpha 32.0, targets attention + MLP
// - architect:  Rank 12, Alpha 24.0, targets Q,V + Gate,Up
// - reviewer:   Rank 8,  Alpha 16.0, targets Q,V

// Merge adapters for multi-task models
let merger = AdapterMerger::new(MergeStrategy::Ties);
let multi_task = merger.merge(&[&coder, &security])?;

// Hot-swap adapters at runtime
let mut manager = HotSwapManager::new();
manager.set_active(coder);
manager.prepare_standby(security);
manager.swap()?; // zero-downtime switch
```
Adapter Merging Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Average | Equal-weight averaging | Simple multi-task |
| WeightedSum | User-defined weights | Task importance weighting |
| SLERP | Spherical interpolation | Smooth transitions |
| TIES | Trim, Elect, Merge | Robust multi-adapter |
| DARE | Drop And REscale | Sparse merging |
| TaskArithmetic | Add/subtract vectors | Task composition |
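At its simplest, the WeightedSum strategy above is elementwise arithmetic on the adapter weight deltas; the other strategies refine how those deltas are trimmed, scaled, or interpolated. A toy sketch on plain vectors (not the crate's adapter types):

```rust
/// Weighted-sum merge of two adapter weight deltas of the same shape.
/// WeightedSum with w_a = w_b = 0.5 reduces to the Average strategy.
fn weighted_merge(a: &[f32], b: &[f32], w_a: f32, w_b: f32) -> Vec<f32> {
    a.iter().zip(b).map(|(&x, &y)| w_a * x + w_b * y).collect()
}
```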
Evaluation Harness (v2.3)
RuvLLM includes a comprehensive evaluation harness for benchmarking model quality:
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::eval::EvalHarness;

// Create a harness with a GGUF model
let harness = EvalHarness::with_gguf("models/ruvltra-small-q4k.gguf")?;

// Run a single evaluation
let result = harness.evaluate(&task)?;
println!("score: {:.2}", result.score);

// Run the full ablation study (5 modes)
let report = harness.run_ablation_study()?;
for (mode, metrics) in &report.mode_metrics {
    // compare quality metrics per ablation mode
}
```
Ablation Modes
| Mode | Description | Use Case |
|---|---|---|
| Baseline | No enhancements | Control baseline |
| RetrievalOnly | HNSW pattern retrieval | Measure retrieval impact |
| AdaptersOnly | LoRA adapters | Measure adaptation impact |
| RetrievalPlusAdapters | HNSW + LoRA | Combined without SONA |
| Full | All systems (SONA + HNSW + LoRA) | Production mode |
SWE-Bench Task Loader
(Names below are illustrative; see the crate docs for the exact API.)

```rust
use ruvllm::eval::SweBenchLoader;

// Load SWE-Bench tasks
let loader = SweBenchLoader::new();
let tasks = loader.load_subset("lite", 50)?; // 50 tasks from the lite subset
for task in &tasks {
    // evaluate each task with the harness
}
```
CLI Evaluation
(Flags below are illustrative.)

```shell
# Run evaluation with default settings
cargo run --release --example run_eval

# Run a SWE-Bench subset
cargo run --release --example run_eval -- --swe-bench lite

# Output a report
cargo run --release --example run_eval -- --output report.json
```
HNSW Auto-Dimension Detection
The evaluation harness automatically detects model embedding dimensions:
```rust
// The HNSW router automatically uses the model's hidden_size:
//   TinyLlama 1.1B  → 2048 dimensions
//   Qwen2 0.5B      →  896 dimensions
//   RuvLTRA-Small   →  896 dimensions
//   RuvLTRA-Medium  → 2560 dimensions
let harness = EvalHarness::with_config(config)?;
```
Examples
See the /examples directory for:
- `download_test_model.rs` - Download and validate models
- `benchmark_model.rs` - Full inference benchmarking
- `run_eval.rs` - Run evaluation harness with SWE-Bench
- Basic inference
- Streaming generation
- MicroLoRA adaptation
- Multi-turn chat
- Speculative decoding
- Continuous batching
- ANE hybrid inference
Error Handling
(Error variants below are illustrative; see the crate docs for the full error type.)

```rust
use ruvllm::RuvLlmError;

match backend.generate(prompt, &opts) {
    Ok(response) => println!("{}", response.text),
    Err(RuvLlmError::ModelNotLoaded) => eprintln!("load a model first"),
    Err(e) => eprintln!("generation failed: {e}"),
}
```
npm Package
RuvLLM is also available as an npm package with native bindings:
```typescript
import { RuvLLM } from '@ruvector/ruvllm';

const llm = new RuvLLM();
const response = llm.query('Explain quantum computing');
console.log(response.text);
```
See @ruvector/ruvllm on npm for full documentation.
License
Apache-2.0 / MIT dual license.
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Links
Part of RuVector -- the self-learning vector database.