RuvLLM v2.0 - High-Performance LLM Inference for Rust
RuvLLM is a production-ready Rust LLM inference engine optimized for Apple Silicon (M1-M4), featuring real-time fine-tuning, NEON SIMD acceleration, Apple Neural Engine integration, and the SONA self-optimizing neural architecture.
What's New in v2.0
Major Features
| Feature | Description | Benefit |
|---|---|---|
| RLM (Recursive Language Model) | Recursive query decomposition for complex reasoning | Break down complex questions, parallel sub-query processing |
| RuvLTRA-Medium 3B | Purpose-built 3B model for Claude Flow | 42 layers, 256K context, speculative decode |
| HuggingFace Hub | Full Hub integration (download/upload) | Easy model sharing & distribution |
| Task-Specific LoRA | 5 pre-trained adapters for agent types | Optimized for coder/researcher/security/architect/reviewer |
| Adapter Merging | TIES, DARE, SLERP, Task Arithmetic | Combine adapters for multi-task models |
| Hot-Swap Adapters | Zero-downtime adapter switching | Runtime task specialization |
| WASM Support | WebAssembly target for browser-based inference | Run LLMs in the browser |
| HNSW Routing | 150x faster semantic pattern matching | <25us pattern retrieval |
Performance Optimizations (NEW)
The v2.0 release includes significant performance improvements across all hot paths:
| Optimization | Description | Benefit |
|---|---|---|
| HNSW Index | O(log n) approximate nearest neighbor search | 10x faster at 10k entries vs linear scan |
| O(1) LRU Cache | Uses the lru crate for cache operations | 23.5 ns cache lookup (vs 500 ns+ HashMap) |
| Zero-Copy Types | Arc<str>, Arc<[f32]> for shared data (see sketch below) | 100-1000x improvement in cache hit paths |
| Batch SIMD | AVX2/NEON vectorized batch operations | 4x throughput for similarity search |
| Memory Pools | Pre-allocated vector/string pools | 50% fewer allocations in hot paths |
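The zero-copy entry above boils down to sharing immutable text and embedding buffers behind Arc instead of cloning them on every cache hit. A minimal sketch of the idea in plain Rust; the SharedText and SharedEmbedding wrappers shown here are illustrative stand-ins, not the crate's actual definitions:

```rust
use std::sync::Arc;

// Illustrative zero-copy wrappers: cloning copies a pointer, never the payload.
#[derive(Clone)]
struct SharedText(Arc<str>);

#[derive(Clone)]
struct SharedEmbedding(Arc<[f32]>);

fn main() {
    let text = SharedText(Arc::from("cached answer"));
    let embedding = SharedEmbedding(Arc::from(vec![0.1_f32; 384]));

    // A cache hit hands out cheap clones instead of copying 384 floats plus a string.
    let hit_text = text.clone();
    let hit_embedding = embedding.clone();
    assert_eq!(hit_text.0.len(), text.0.len());
    assert_eq!(hit_embedding.0.len(), 384);
}
```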
Benchmark Results
Measured on Apple M4 Pro with 384-dimensional embeddings:
| Operation | Performance | Notes |
|---|---|---|
| Query decomposition | 340 ns | Pattern-based keyword extraction |
| Cache lookup | 23.5 ns | O(1) LRU with FNV-1a hashing |
| Memory search (10k entries) | ~0.4 ms | With HNSW index (vs 4ms linear) |
| Embeddings (384d) | 293 ns | SIMD-accelerated dot product |
| Batch cosine (4x384d) | ~1.1 us | AVX2/NEON batch processing |
| Pool acquire/release | <100 ns | Zero-allocation in steady state |
New Optimization Modules
| Module | Purpose | Key Types |
|---|---|---|
| rlm/pool.rs | Memory pools for allocation reuse | VectorPool, StringPool, PooledVec |
| rlm/shared_types.rs | Zero-copy shared types | SharedText, SharedEmbedding, SharedQueryResult |
| rlm/simd_ops.rs | SIMD-accelerated vector operations (see sketch below) | batch_cosine_similarity_4, batch_dot_products |
| rlm/cache.rs | O(1) LRU memoization cache | MemoizationCache, CacheEntry |
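For reference, the scalar computation behind the batch routines looks like the sketch below; the real batch_cosine_similarity_4 runs the four accumulations in AVX2/NEON lanes, and the signature here is an assumption for illustration only:

```rust
/// Scalar reference for a batched cosine similarity over 4 vector pairs.
/// The SIMD version performs the four accumulations in parallel lanes.
fn batch_cosine_similarity_4(queries: &[&[f32]; 4], candidates: &[&[f32]; 4]) -> [f32; 4] {
    let mut out = [0.0_f32; 4];
    for i in 0..4 {
        let (mut dot, mut qn, mut cn) = (0.0_f32, 0.0_f32, 0.0_f32);
        for (&q, &c) in queries[i].iter().zip(candidates[i]) {
            dot += q * c; // folded into FMA lanes in the SIMD path
            qn += q * q;
            cn += c * c;
        }
        out[i] = dot / (qn.sqrt() * cn.sqrt() + f32::EPSILON);
    }
    out
}
```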
RLM (Recursive Language Model) Architecture
RLM provides a sophisticated recursive reasoning pipeline:
+------------------+
| RlmController | <-- Main entry point
+--------+---------+
|
v
+--------+---------+
| QueryDecomposer | <-- Breaks complex queries
+--------+---------+
|
+-----+-----+
| |
+--v--+ +--v--+
|Sub | |Sub | <-- Parallel sub-query processing
|Query| |Query|
+--+--+ +--+--+
| |
+-----+-----+
|
v
+--------+---------+
|AnswerSynthesizer | <-- Combines sub-answers
+--------+---------+
|
v
+--------+---------+
| RlmMemory | <-- HNSW-indexed retrieval
+------------------+
RLM Quick Start
use ruvllm::rlm::{RlmConfig, RlmController}; // module path per the rlm/ modules above
// Create controller with default config
let config = RlmConfig::default();
let controller = RlmController::new(config)?;
// Query the model with recursive decomposition
let response = controller.query("How does speculative decoding interact with a quantized KV cache?")?;
println!("{:?}", response);
// Add to memory for future retrieval
controller.add_memory("RuvLTRA-Medium uses 42 layers and a 256K context window")?;
// Search memory semantically
let results = controller.search_memory("RuvLTRA context window")?;
RLM Configuration
use ruvllm::rlm::RlmConfig;
// Builder entry point and values are illustrative
let config = RlmConfig::builder()
    .max_depth(3)               // Maximum recursion depth
    .token_budget(8192)         // Total token budget
    .enable_cache(true)         // Enable memoization
    .aggregation(Default::default())
    .parallel_subqueries(true)  // Process sub-queries in parallel
    .build()?;
Previous Features (v1.x-2.x)
| Feature | Description | Benefit |
|---|---|---|
| Apple Neural Engine | Core ML backend with ANE routing | 38 TOPS, 3-4x power efficiency |
| Hybrid GPU+ANE Pipeline | Intelligent operation routing | Best of both accelerators |
| Multi-threaded GEMM | Rayon parallelization | 4-12x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing, online softmax | O(N) memory, +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K/Q8_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix operations | 3x speedup on Apple Silicon |
| GGUF Support | Memory-mapped model loading | Fast loading, reduced RAM |
| Continuous Batching | Dynamic batch scheduling | 2-3x throughput improvement |
| Speculative Decoding | Draft model acceleration | 2-3x faster generation |
| Gemma-2 & Phi-3 | New model architectures | Extended model support |
Features
Multiple Backends
- Candle Backend: HuggingFace's Candle framework with Metal/CUDA GPU acceleration
- Core ML Backend: Apple Neural Engine for maximum efficiency on Apple Silicon
- Hybrid Pipeline: Automatic routing between GPU and ANE based on operation type
- RuvLTRA Backend: Custom backend optimized for Claude Flow integration
Optimized Kernels
- NEON SIMD: ARM64-optimized kernels with 4x loop unrolling and FMA instructions
- Flash Attention 2: Memory-efficient attention with O(N) complexity and online softmax (see the sketch after this list)
- Paged Attention: Efficient KV cache management for long-context inference
- ANE Operations: GELU, SiLU, softmax, layer norm optimized for Neural Engine
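Flash Attention 2 keeps memory at O(N) by using online softmax: the running maximum and normalizer are updated block by block, so full score rows are never materialized. A minimal, unoptimized Rust sketch of that accumulator (not the actual NEON kernel):

```rust
/// Online (streaming) softmax accumulator: folds in scores one block at a
/// time while producing the same result as a two-pass softmax.
struct OnlineSoftmax {
    max: f32,   // running maximum, for numerical stability
    denom: f32, // running sum of exp(score - max)
}

impl OnlineSoftmax {
    fn new() -> Self {
        Self { max: f32::NEG_INFINITY, denom: 0.0 }
    }

    /// Fold in one block of attention scores.
    fn update(&mut self, scores: &[f32]) {
        let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = self.max.max(block_max);
        // Rescale the old denominator to the new maximum, then add the block.
        self.denom = self.denom * (self.max - new_max).exp()
            + scores.iter().map(|s| (s - new_max).exp()).sum::<f32>();
        self.max = new_max;
    }
}
```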
Real-Time Learning (SONA)
- MicroLoRA: Per-request fine-tuning with rank 1-2 adapters (<1ms latency); see the sketch after this list
- EWC++: Elastic Weight Consolidation to prevent catastrophic forgetting
- Three-Tier Learning: Instant (<1ms), Background (~100ms), Deep (minutes)
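MicroLoRA's per-request updates are cheap because a rank-1 or rank-2 adapter adds only two tiny matrix-vector products to each adapted layer. A conceptual sketch of a rank-r LoRA forward pass in plain Rust (a hypothetical helper, not the MicroLoRA implementation):

```rust
/// Minimal rank-r LoRA forward pass: y = W·x + (alpha / r) · B·(A·x).
/// With r = 1 or 2 (MicroLoRA), the extra work is two tiny mat-vecs per layer.
fn lora_forward(
    w: &[Vec<f32>],  // frozen base weight, [out][in]
    a: &[Vec<f32>],  // LoRA down-projection, [r][in]
    b: &[Vec<f32>],  // LoRA up-projection, [out][r]
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let r = a.len();
    let scale = alpha / r as f32;
    // ax = A·x  (only r values)
    let ax: Vec<f32> = a
        .iter()
        .map(|row| row.iter().zip(x).map(|(w, x)| w * x).sum())
        .collect();
    // y = W·x + scale · B·ax
    w.iter()
        .zip(b)
        .map(|(w_row, b_row)| {
            let base: f32 = w_row.iter().zip(x).map(|(w, x)| w * x).sum();
            let delta: f32 = b_row.iter().zip(&ax).map(|(b, a)| b * a).sum();
            base + scale * delta
        })
        .collect()
}
```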
Memory Efficiency
- Two-Tier KV Cache: FP16 tail + Q4/Q8 quantized store
- Grouped-Query Attention (GQA): 4-8x KV memory reduction (see the calculation after this list)
- Memory Pool: Arena allocator for zero-allocation inference
- GGUF Memory Mapping: Efficient large model loading
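The GQA savings quoted above come straight from the KV-cache size formula: shrinking the number of KV heads shrinks the cache proportionally. A quick back-of-the-envelope calculation in Rust, with an illustrative 7B-class shape (not a RuvLLM API):

```rust
/// KV cache bytes: 2 (K and V) · layers · kv_heads · head_dim · seq_len · bytes_per_elem.
/// GQA shrinks kv_heads relative to the query heads, which is where the 4-8x saving comes from.
fn kv_cache_bytes(layers: usize, kv_heads: usize, head_dim: usize, seq_len: usize, bytes_per_elem: usize) -> usize {
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}

fn main() {
    // Illustrative 7B-class shape: 32 layers, 32 query heads, head_dim 128, 8K context, FP16.
    let mha = kv_cache_bytes(32, 32, 128, 8192, 2); // full multi-head KV: ~4.3 GB
    let gqa = kv_cache_bytes(32, 8, 128, 8192, 2);  // 4:1 grouped-query KV: ~1.1 GB
    // The two-tier cache shrinks this further by holding older entries at Q4/Q8.
    println!("MHA: {:.1} GB, GQA 4:1: {:.1} GB", mha as f64 / 1e9, gqa as f64 / 1e9);
}
```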
Quick Start
use ruvllm::*;
// Initialize backend with Metal GPU + ANE hybrid (backend type and device selector are illustrative)
let mut backend = CandleBackend::with_device(DeviceKind::HybridAne)?;
// Load a GGUF model
backend.load_gguf("models/ruvltra-3b-q4_k_m.gguf")?;
// Or load from HuggingFace
backend.load_model("Qwen/Qwen2.5-7B-Instruct")?;
// Generate text
let response = backend.generate("Explain grouped-query attention in two sentences")?;
println!("{:?}", response);
// Check SONA learning stats
if let Some(stats) = backend.sona_stats() {
    println!("SONA stats: {:?}", stats);
}
Installation
Add to your Cargo.toml:
[dependencies]
# Recommended for Apple Silicon Mac
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }
# For NVIDIA GPUs
ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }
# With RLM recursive reasoning
ruvllm = { version = "2.0", features = ["rlm-full"] }
# Minimal (CPU only)
ruvllm = { version = "2.0" }
Feature Flags
| Feature | Description |
|---|---|
| candle | Enable Candle backend (HuggingFace) |
| metal | Apple Silicon GPU acceleration via Candle |
| metal-compute | Native Metal compute shaders (M4 Pro optimized) |
| cuda | NVIDIA GPU acceleration |
| coreml | Apple Neural Engine via Core ML |
| hybrid-ane | GPU+ANE hybrid pipeline (recommended for Mac) |
| inference-metal | Full Metal inference stack |
| inference-metal-native | Metal + native shaders (best M4 Pro perf) |
| inference-cuda | Full CUDA inference stack |
| parallel | Multi-threaded GEMM/GEMV with Rayon |
| accelerate | Apple Accelerate BLAS (~2x GEMV speedup) |
| gguf-mmap | Memory-mapped GGUF loading |
| async-runtime | Tokio async support |
| wasm | WebAssembly support |
| rlm-core | RLM recursive reasoning core (includes cache, pools, SIMD) |
| rlm-wasm | RLM with WASM support for browsers |
| rlm-full | Full RLM with async runtime |
| attention | Ruvector attention mechanisms |
| graph | Ruvector graph integration |
| gnn | Graph neural network support |
| ruvector-full | All Ruvector integrations |
Architecture
+----------------------------------+
| Application |
+----------------------------------+
|
+----------------------------------+
| RuvLLM Backend |
| +----------------------------+ |
| | Hybrid Pipeline Router | |
| | +----------+ +----------+ | |
| | | Metal | | ANE | | |
| | | GPU | | Core ML | | |
| | +----+-----+ +----+-----+ | |
| | | v | | |
| | Attention MLP/FFN | |
| | RoPE Activations | |
| | Softmax LayerNorm | |
| +----------------------------+ |
| | |
| +----------------------------+ |
| | SONA Learning | |
| | - Instant (<1ms) | |
| | - Background (~100ms) | |
| | - Deep (minutes) | |
| +----------------------------+ |
| | |
| +----------------------------+ |
| | NEON/SIMD Kernels | |
| | - Flash Attention 2 | |
| | - Paged KV Cache | |
| | - Quantized MatMul | |
| +----------------------------+ |
+----------------------------------+
Supported Models
| Model Family | Sizes | Quantization | Backend |
|---|---|---|---|
| RuvLTRA-Small | 0.5B | Q4K, Q5K, Q8, FP16 | Candle/Metal/ANE |
| RuvLTRA-Medium | 3B | Q4K, Q5K, Q8, FP16 | Candle/Metal |
| Qwen 2.5 | 0.5B-72B | Q4K, Q8, FP16 | Candle/Metal |
| Llama 3.x | 8B-70B | Q4K, Q8, FP16 | Candle/Metal |
| Mistral | 7B-22B | Q4K, Q8, FP16 | Candle/Metal |
| Phi-3 | 3.8B-14B | Q4K, Q8, FP16 | Candle/Metal |
| Gemma-2 | 2B-27B | Q4K, Q8, FP16 | Candle/Metal |
RuvLTRA Models (Claude Flow Optimized)
| Model | Parameters | Hidden | Layers | Context | Features |
|---|---|---|---|---|---|
| RuvLTRA-Small | 494M | 896 | 24 | 32K | GQA 7:1, SONA hooks |
| RuvLTRA-Medium | 3.0B | 2560 | 42 | 256K | Flash Attention 2, Speculative Decode |
HuggingFace Model Links
Pre-trained RuvLTRA models are available on HuggingFace:
- Repository: huggingface.co/ruv/ruvltra
| Model | File | Size | Purpose |
|---|---|---|---|
| RuvLTRA Claude Code 0.5B | ruvltra-claude-code-0.5b-q4_k_m.gguf | ~400MB | Agent routing (100% accuracy with hybrid) |
| RuvLTRA Small 0.5B | ruvltra-0.5b-q4_k_m.gguf | ~400MB | General embeddings |
| RuvLTRA Medium 3B | ruvltra-3b-q4_k_m.gguf | ~2GB | Full LLM inference |
Download models:
# Using huggingface-cli
huggingface-cli download ruv/ruvltra ruvltra-claude-code-0.5b-q4_k_m.gguf
# Or via the API (see the HuggingFace Hub Integration section below)
Performance Benchmarks
Inference (M4 Pro 14-core)
| Model | Quant | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|---|
| Qwen2.5-7B | Q4K | 2,800 | 95 | 4.2 GB |
| Qwen2.5-7B | Q8 | 2,100 | 72 | 7.8 GB |
| Llama3-8B | Q4K | 2,600 | 88 | 4.8 GB |
| Mistral-7B | Q4K | 2,500 | 85 | 4.1 GB |
| Phi-3-3.8B | Q4K | 3,500 | 135 | 2.3 GB |
| Gemma2-9B | Q4K | 2,200 | 75 | 5.2 GB |
RLM Decomposition Performance
| Query Complexity | Sub-queries | Decomposition Time | Total Time |
|---|---|---|---|
| Simple | 1 | <1ms | 50-100ms |
| Moderate | 2-3 | 2-5ms | 150-300ms |
| Complex | 4-6 | 5-10ms | 400-800ms |
| Deep reasoning | 6-10 | 10-20ms | 1-3s |
ANE vs GPU Performance (M4 Pro)
| Dimension | ANE | GPU | Winner |
|---|---|---|---|
| < 512 | +30-50% | - | ANE |
| 512-1024 | +10-30% | - | ANE |
| 1024-1536 | ~Similar | ~Similar | Either |
| 1536-2048 | - | +10-20% | GPU |
| > 2048 | - | +30-50% | GPU |
Kernel Benchmarks
| Kernel | Single-thread | Multi-thread (10-core) |
|---|---|---|
| GEMM 4096x4096 | 1.2 GFLOPS | 12.7 GFLOPS |
| GEMV 4096x4096 | 0.8 GFLOPS | 6.4 GFLOPS |
| Flash Attention (seq=2048) | 850us | 320us |
| RMS Norm (4096) | 2.1us | 0.8us |
| RoPE (4096, 128) | 4.3us | 1.6us |
RLM Usage Examples
Basic Recursive Query
use ruvllm::rlm::{RlmConfig, RlmController};
let controller = RlmController::new(RlmConfig::default())?;
// Complex query gets automatically decomposed
let result = controller.query("Compare Q4K and Q8 memory footprints for a 7B model, then recommend one for a 16 GB laptop")?;
println!("Answer: {}", result.answer);            // field names illustrative
println!("Sub-queries: {}", result.sub_query_count);
println!("Tokens used: {}", result.tokens_used);
With Memory Context
// Add domain knowledge to memory
controller.add_memory("Our service targets p99 latency under 200ms")?;
controller.add_memory("All inference runs on M4 Pro machines with 48GB unified memory")?;
// Query now uses memory context
let result = controller.query("Which quantization level should we ship?")?;
Custom Decomposition Strategy
use ruvllm::rlm::{RlmConfig, RlmController};
// Builder entry point, values, and strategy enums are illustrative
let config = RlmConfig::builder()
    .max_depth(4)
    .token_budget(16_384)
    .decomposition_strategy(Default::default())
    .aggregation(Default::default())
    .build()?;
let controller = RlmController::new(config)?;
Using Memory Pools for High-Throughput
use ruvllm::rlm::pool::{PoolManager, VectorPool}; // PoolManager name illustrative
// Create pre-warmed pools for embedding operations
let vector_pool = VectorPool::new_warmed(64, 384); // 64 vectors of dim 384 (arguments illustrative)
// Or use the pool manager for convenience
let manager = PoolManager::warmed();
// Acquire vectors from pool (zero allocation if pool has capacity)
let mut embedding = manager.vector_pool.acquire();
embedding.extend_from_slice(&[0.1, 0.2, 0.3]);
// Vector automatically returns to pool on drop
// Check pool statistics
let stats = manager.stats();
println!("Vector reuses: {}", stats.vector_reuses);   // field names illustrative
println!("String reuses: {}", stats.string_reuses);
WASM Usage (Browser)
use ruvllm::rlm::RlmController; // built with the rlm-wasm feature
// Drive queries from an async browser context (e.g. wasm-bindgen-futures); sketch only:
async fn ask(controller: RlmController, prompt: String) -> String {
    format!("{:?}", controller.query(&prompt))
}
Apple Neural Engine (ANE) Integration
RuvLLM includes full ANE support via Core ML:
use ruvllm::coreml::CoreMlBackend; // module path and type name illustrative
// Create ANE-optimized backend
let backend = CoreMlBackend::new()?;
// Or use hybrid pipeline for best performance
use ruvllm::hybrid::HybridPipeline;
let pipeline = HybridPipeline::new()?;
ANE Routing Recommendations
| Operation | Recommended | Reason |
|---|---|---|
| Attention | GPU | Better for variable sequence lengths |
| Flash Attention | GPU | GPU memory bandwidth advantage |
| MLP/FFN | ANE | Optimal for fixed-size matmuls |
| GELU/SiLU | ANE | Dedicated activation units |
| LayerNorm/RMSNorm | ANE | Good for small dimensions |
| Embedding | GPU | Sparse operations |
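Combined with the dimension thresholds from the ANE vs GPU table, these recommendations reduce to a small routing policy. The sketch below is a hypothetical illustration of that policy, not the actual HybridPipeline router:

```rust
#[derive(Debug, PartialEq)]
enum Accelerator { Ane, Gpu }

enum Op {
    Attention,
    FlashAttention,
    MlpFfn { hidden_dim: usize },
    Activation, // GELU / SiLU
    Norm,       // LayerNorm / RMSNorm
    Embedding,
}

/// Route an operation following the table above: attention and sparse lookups
/// go to the GPU; fixed-size matmuls, activations, and norms go to the ANE,
/// with large hidden dims falling back to the GPU.
fn route(op: &Op) -> Accelerator {
    match op {
        Op::Attention | Op::FlashAttention | Op::Embedding => Accelerator::Gpu,
        Op::MlpFfn { hidden_dim } if *hidden_dim > 1536 => Accelerator::Gpu,
        Op::MlpFfn { .. } | Op::Activation | Op::Norm => Accelerator::Ane,
    }
}

fn main() {
    assert_eq!(route(&Op::MlpFfn { hidden_dim: 896 }), Accelerator::Ane);
    assert_eq!(route(&Op::MlpFfn { hidden_dim: 4096 }), Accelerator::Gpu);
}
```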
MicroLoRA Real-Time Adaptation
RuvLLM supports per-request fine-tuning using MicroLoRA:
use ruvllm::sona::{Feedback, MicroLora, MicroLoraConfig}; // type names illustrative
// Create MicroLoRA adapter
let config = MicroLoraConfig::for_hidden_dim(896);
let mut lora = MicroLora::new(config);
// Adapt on user feedback
let feedback = Feedback::from_quality(0.9);
lora.adapt(&feedback)?;
// Apply learned updates
lora.apply_updates(0.01); // learning rate
// Get adaptation stats
let stats = lora.stats();
println!("Adaptations: {:?}", stats);
SONA Three-Tier Learning
Continuous improvement with three learning loops:
use ruvllm::sona::{SonaLlm, SonaLlmConfig}; // type names illustrative
let config = SonaLlmConfig::default();
let mut sona = SonaLlm::new(config);
// 1. Instant Loop (<1ms): Per-request MicroLoRA
let result = sona.instant_adapt(&request, &feedback); // request/feedback come from your serving loop
println!("Instant adaptation: {:?}", result);
// 2. Background Loop (~100ms): Pattern consolidation
if let Some(result) = sona.maybe_background() {
    println!("Background consolidation: {:?}", result);
}
// 3. Deep Loop (minutes): Full optimization
if sona.should_trigger_deep() {
    // run the deep optimization pass
}
Two-Tier KV Cache
Memory-efficient caching with automatic tiering:
use ruvllm::kv_cache::{KvCacheConfig, TwoTierKvCache}; // type names illustrative
let config = KvCacheConfig::default();
let mut cache = TwoTierKvCache::new(config);
cache.append(&keys, &values)?; // keys/values from the current decode step
// Automatic migration from tail to quantized store
let stats = cache.stats();
println!("FP16 tail tokens: {}", stats.tail_tokens);      // field names illustrative
println!("Quantized tokens: {}", stats.quantized_tokens);
HuggingFace Hub Integration
Download and upload models to HuggingFace Hub:
use ruvllm::hub::{HubDownloader, HubUploader, RuvLtraRegistry}; // type names illustrative
// Download from Hub
let downloader = HubDownloader::new();
let model_path = downloader.download("ruv/ruvltra", "ruvltra-3b-q4_k_m.gguf")?;
// Or use the registry for RuvLTRA models
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium")?;
// Upload to Hub (requires HF_TOKEN)
let uploader = HubUploader::new();
let url = uploader.upload("my-org/my-ruvltra-finetune", &model_path)?;
println!("Uploaded to {url}");
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| RUVLLM_CACHE_DIR | Model cache directory | ~/.cache/ruvllm |
| RUVLLM_LOG_LEVEL | Logging level | info |
| RUVLLM_METAL_DEVICE | Metal device index | 0 |
| RUVLLM_ANE_ENABLED | Enable ANE routing | true |
| RUVLLM_SONA_ENABLED | Enable SONA learning | true |
| HF_TOKEN | HuggingFace API token | - |
Model Configuration
let config = ModelConfig::default(); // override individual fields as needed; see the ModelConfig docs
Benchmarks
Run benchmarks with cargo bench. Dedicated suites cover attention kernels, ANE (Mac only), LoRA, RLM, end-to-end inference, Metal shaders, serving, and the RuvLTRA router.
npm Package
RuvLLM is also available as an npm package with native bindings:
import { RuvLLM } from '@ruvector/ruvllm';
const llm = new RuvLLM();
const response = llm.query('Explain quantum computing');
console.log(response.text);
See @ruvector/ruvllm on npm for full documentation.
Error Handling
match backend.generate(&prompt) {
    Ok(text) => println!("{text}"),
    Err(e) => eprintln!("generation failed: {e}"), // match specific ruvllm error variants as needed
}
License
Apache-2.0 / MIT dual license.
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.