
RuvLLM v2.0 - High-Performance LLM Inference for Rust

RuvLLM is a production-ready Rust LLM inference engine optimized for Apple Silicon (M1-M4), featuring real-time fine-tuning, NEON SIMD acceleration, Apple Neural Engine integration, and the SONA self-optimizing neural architecture.

What's New in v2.0

Major Features

| Feature | Description | Benefit |
|---|---|---|
| RLM (Recursive Language Model) | Recursive query decomposition for complex reasoning | Break down complex questions, parallel sub-query processing |
| RuvLTRA-Medium 3B | Purpose-built 3B model for Claude Flow | 42 layers, 256K context, speculative decode |
| HuggingFace Hub | Full Hub integration (download/upload) | Easy model sharing & distribution |
| Task-Specific LoRA | 5 pre-trained adapters for agent types | Optimized for coder/researcher/security/architect/reviewer |
| Adapter Merging | TIES, DARE, SLERP, Task Arithmetic | Combine adapters for multi-task models |
| Hot-Swap Adapters | Zero-downtime adapter switching | Runtime task specialization |
| WASM Support | WebAssembly target for browser-based inference | Run LLMs in the browser |
| HNSW Routing | 150x faster semantic pattern matching | <25us pattern retrieval |

RLM (Recursive Language Model) Architecture

RLM provides a sophisticated recursive reasoning pipeline:

+------------------+
|  RlmController   |  <-- Main entry point
+--------+---------+
         |
         v
+--------+---------+
| QueryDecomposer  |  <-- Breaks complex queries
+--------+---------+
         |
   +-----+-----+
   |           |
+--v--+     +--v--+
|Sub  |     |Sub  |  <-- Parallel sub-query processing
|Query|     |Query|
+--+--+     +--+--+
   |           |
   +-----+-----+
         |
         v
+--------+---------+
|AnswerSynthesizer |  <-- Combines sub-answers
+--------+---------+
         |
         v
+--------+---------+
|   RlmMemory      |  <-- HNSW-indexed retrieval
+--------+---------+

RLM Quick Start

use ruvllm::rlm::{RlmController, RlmConfig, NativeEnvironment};

// Create controller with default config
let config = RlmConfig::default();
let controller = RlmController::<NativeEnvironment>::new(config)?;

// Query the model with recursive decomposition
let response = controller.query("What is recursive language modeling and how does it improve reasoning?")?;
println!("Answer: {}", response.text);

// Add to memory for future retrieval
controller.add_memory("RLM uses recursive decomposition.", Default::default())?;

// Search memory semantically
let results = controller.search_memory("recursive", 5)?;

RLM Configuration

use ruvllm::rlm::{RecursiveConfig, RecursiveConfigBuilder, AggregationStrategy};

let config = RecursiveConfigBuilder::new()
    .max_depth(5)                              // Maximum recursion depth
    .token_budget(16000)                       // Total token budget
    .enable_cache(true)                        // Enable memoization
    .aggregation(AggregationStrategy::WeightedMerge)
    .parallel_subqueries(true)                 // Process sub-queries in parallel
    .build()?;
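
The built config can be passed directly to `RlmController::<NativeEnvironment>::new`, as shown in the Custom Decomposition Strategy example later in this README.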

Previous Features (v1.x)

| Feature | Description | Benefit |
|---|---|---|
| Apple Neural Engine | Core ML backend with ANE routing | 38 TOPS, 3-4x power efficiency |
| Hybrid GPU+ANE Pipeline | Intelligent operation routing | Best of both accelerators |
| Multi-threaded GEMM | Rayon parallelization | 4-12x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing, online softmax | O(N) memory, +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K/Q8_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix operations | 3x speedup on Apple Silicon |
| GGUF Support | Memory-mapped model loading | Fast loading, reduced RAM |
| Continuous Batching | Dynamic batch scheduling | 2-3x throughput improvement |
| Speculative Decoding | Draft model acceleration | 2-3x faster generation |
| Gemma-2 & Phi-3 | New model architectures | Extended model support |

Features

Multiple Backends

  • Candle Backend: HuggingFace's Candle framework with Metal/CUDA GPU acceleration
  • Core ML Backend: Apple Neural Engine for maximum efficiency on Apple Silicon
  • Hybrid Pipeline: Automatic routing between GPU and ANE based on operation type
  • RuvLTRA Backend: Custom backend optimized for Claude Flow integration
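
A short sketch of picking among these backends at runtime, in the style of the Quick Start below; `DeviceType::Cpu` is assumed here as the CPU fallback variant alongside the `Metal` variant used elsewhere in this README:

use ruvllm::prelude::*;

// Prefer the Metal GPU path on macOS, fall back to CPU elsewhere
// (DeviceType::Metal appears in the Quick Start; Cpu is an assumed variant).
let mut backend = if cfg!(target_os = "macos") {
    CandleBackend::with_device(DeviceType::Metal)?
} else {
    CandleBackend::with_device(DeviceType::Cpu)?
};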

Optimized Kernels

  • NEON SIMD: ARM64-optimized kernels with 4x loop unrolling and FMA instructions
  • Flash Attention 2: Memory-efficient attention with O(N) complexity and online softmax
  • Paged Attention: Efficient KV cache management for long-context inference
  • ANE Operations: GELU, SiLU, softmax, layer norm optimized for Neural Engine

Real-Time Learning (SONA)

  • MicroLoRA: Per-request fine-tuning with rank 1-2 adapters (<1ms latency)
  • EWC++: Elastic Weight Consolidation to prevent catastrophic forgetting
  • Three-Tier Learning: Instant (<1ms), Background (~100ms), Deep (minutes)
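
All three loops are shown end-to-end in the SONA Three-Tier Learning section below.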

Memory Efficiency

  • Two-Tier KV Cache: FP16 tail + Q4/Q8 quantized store
  • Grouped-Query Attention (GQA): 4-8x KV memory reduction
  • Memory Pool: Arena allocator for zero-allocation inference
  • GGUF Memory Mapping: Efficient large model loading
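
The Two-Tier KV Cache section below shows the cache API, including its compression statistics.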

Quick Start

use ruvllm::prelude::*;

// Initialize backend with Metal GPU + ANE hybrid
let mut backend = CandleBackend::with_device(DeviceType::Metal)?;

// Load a GGUF model
backend.load_gguf("models/qwen2.5-7b-q4_k.gguf", ModelConfig::default())?;

// Or load from HuggingFace
backend.load_model("Qwen/Qwen2.5-7B-Instruct", ModelConfig {
    quantization: Quantization::Q4K,
    use_flash_attention: true,
    ..Default::default()
})?;

// Generate text
let response = backend.generate("Explain quantum computing in simple terms.",
    GenerateParams {
        max_tokens: 256,
        temperature: 0.7,
        top_p: 0.9,
        ..Default::default()
    }
)?;

println!("{}", response);

// Check SONA learning stats
if let Some(stats) = backend.sona_stats() {
    println!("Patterns learned: {}", stats.patterns_learned);
    println!("Quality improvement: {:.1}%", stats.quality_improvement * 100.0);
}

Installation

Add to your Cargo.toml:

[dependencies]
# Recommended for Apple Silicon Mac
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }

# For NVIDIA GPUs
ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }

# With RLM recursive reasoning
ruvllm = { version = "2.0", features = ["rlm-full"] }

# Minimal (CPU only)
ruvllm = { version = "2.0" }

Feature Flags

| Feature | Description |
|---|---|
| candle | Enable Candle backend (HuggingFace) |
| metal | Apple Silicon GPU acceleration via Candle |
| metal-compute | Native Metal compute shaders (M4 Pro optimized) |
| cuda | NVIDIA GPU acceleration |
| coreml | Apple Neural Engine via Core ML |
| hybrid-ane | GPU+ANE hybrid pipeline (recommended for Mac) |
| inference-metal | Full Metal inference stack |
| inference-metal-native | Metal + native shaders (best M4 Pro perf) |
| inference-cuda | Full CUDA inference stack |
| parallel | Multi-threaded GEMM/GEMV with Rayon |
| accelerate | Apple Accelerate BLAS (~2x GEMV speedup) |
| gguf-mmap | Memory-mapped GGUF loading |
| async-runtime | Tokio async support |
| wasm | WebAssembly support |
| rlm-core | RLM recursive reasoning core |
| rlm-wasm | RLM with WASM support for browsers |
| rlm-full | Full RLM with async runtime |
| attention | Ruvector attention mechanisms |
| graph | Ruvector graph integration |
| gnn | Graph neural network support |
| ruvector-full | All Ruvector integrations |

Architecture

+----------------------------------+
|         Application              |
+----------------------------------+
              |
+----------------------------------+
|        RuvLLM Backend            |
|  +----------------------------+  |
|  |   Hybrid Pipeline Router   |  |
|  |  +----------+ +----------+ |  |
|  |  |  Metal   | |   ANE    | |  |
|  |  |   GPU    | | Core ML  | |  |
|  |  +----+-----+ +----+-----+ |  |
|  |       |            |       |  |
|  |       v            v       |  |
|  |  Attention    MLP/FFN      |  |
|  |  RoPE         Activations  |  |
|  |  Softmax      LayerNorm    |  |
|  +----------------------------+  |
|              |                   |
|  +----------------------------+  |
|  |     SONA Learning          |  |
|  |  - Instant (<1ms)          |  |
|  |  - Background (~100ms)     |  |
|  |  - Deep (minutes)          |  |
|  +----------------------------+  |
|              |                   |
|  +----------------------------+  |
|  |     NEON/SIMD Kernels      |  |
|  |  - Flash Attention 2       |  |
|  |  - Paged KV Cache          |  |
|  |  - Quantized MatMul        |  |
|  +----------------------------+  |
+----------------------------------+

Supported Models

| Model Family | Sizes | Quantization | Backend |
|---|---|---|---|
| RuvLTRA-Small | 0.5B | Q4K, Q5K, Q8, FP16 | Candle/Metal/ANE |
| RuvLTRA-Medium | 3B | Q4K, Q5K, Q8, FP16 | Candle/Metal |
| Qwen 2.5 | 0.5B-72B | Q4K, Q8, FP16 | Candle/Metal |
| Llama 3.x | 8B-70B | Q4K, Q8, FP16 | Candle/Metal |
| Mistral | 7B-22B | Q4K, Q8, FP16 | Candle/Metal |
| Phi-3 | 3.8B-14B | Q4K, Q8, FP16 | Candle/Metal |
| Gemma-2 | 2B-27B | Q4K, Q8, FP16 | Candle/Metal |

RuvLTRA Models (Claude Flow Optimized)

| Model | Parameters | Hidden | Layers | Context | Features |
|---|---|---|---|---|---|
| RuvLTRA-Small | 494M | 896 | 24 | 32K | GQA 7:1, SONA hooks |
| RuvLTRA-Medium | 3.0B | 2560 | 42 | 256K | Flash Attention 2, Speculative Decode |
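
As a hedged sketch, RuvLTRA-Medium can be resolved through the `RuvLtraRegistry` shown in the Hub section below and configured for its full context window; the `max_context` value of 262_144 (256K) is inferred from the table above rather than taken from crate docs:

use ruvllm::prelude::*;
use ruvllm::hub::RuvLtraRegistry;

// Resolve the Q4_K_M build of RuvLTRA-Medium, then configure the
// 256K context window and Flash Attention 2 listed in the table.
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium", "Q4_K_M")?;
let config = ModelConfig {
    max_context: 262_144, // 256K, inferred from the table above
    use_flash_attention: true,
    ..Default::default()
};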

Performance Benchmarks

Inference (M4 Pro 14-core)

| Model | Quant | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|---|
| Qwen2.5-7B | Q4K | 2,800 | 95 | 4.2 GB |
| Qwen2.5-7B | Q8 | 2,100 | 72 | 7.8 GB |
| Llama3-8B | Q4K | 2,600 | 88 | 4.8 GB |
| Mistral-7B | Q4K | 2,500 | 85 | 4.1 GB |
| Phi-3-3.8B | Q4K | 3,500 | 135 | 2.3 GB |
| Gemma2-9B | Q4K | 2,200 | 75 | 5.2 GB |

RLM Decomposition Performance

| Query Complexity | Sub-queries | Decomposition Time | Total Time |
|---|---|---|---|
| Simple | 1 | <1ms | 50-100ms |
| Moderate | 2-3 | 2-5ms | 150-300ms |
| Complex | 4-6 | 5-10ms | 400-800ms |
| Deep reasoning | 6-10 | 10-20ms | 1-3s |

ANE vs GPU Performance (M4 Pro)

| Dimension | ANE | GPU | Winner |
|---|---|---|---|
| < 512 | +30-50% | - | ANE |
| 512-1024 | +10-30% | - | ANE |
| 1024-1536 | ~Similar | ~Similar | Either |
| 1536-2048 | - | +10-20% | GPU |
| > 2048 | - | +30-50% | GPU |
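
The crossover points are simple to encode. A standalone sketch of the routing rule this table implies (thresholds come from the table; the function itself is illustrative, not crate API):

// Route a matmul by dimension: ANE below ~1024, GPU above ~1536,
// either accelerator in the comparable 1024-1536 band.
fn prefer_ane(dim: usize) -> Option<bool> {
    if dim < 1024 {
        Some(true)   // ANE is 10-50% faster
    } else if dim > 1536 {
        Some(false)  // GPU is 10-50% faster
    } else {
        None         // roughly similar; either works
    }
}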

Kernel Benchmarks

| Kernel | Single-thread | Multi-thread (10-core) |
|---|---|---|
| GEMM 4096x4096 | 1.2 GFLOPS | 12.7 GFLOPS |
| GEMV 4096x4096 | 0.8 GFLOPS | 6.4 GFLOPS |
| Flash Attention (seq=2048) | 850us | 320us |
| RMS Norm (4096) | 2.1us | 0.8us |
| RoPE (4096, 128) | 4.3us | 1.6us |

RLM Usage Examples

Basic Recursive Query

use ruvllm::rlm::{RlmController, RlmConfig, NativeEnvironment};

let controller = RlmController::<NativeEnvironment>::new(RlmConfig::default())?;

// Complex query gets automatically decomposed
let result = controller.query(
    "Compare the economic policies of keynesian and monetarist schools,
     and explain how each would respond to stagflation"
)?;

println!("Answer: {}", result.text);
println!("Sub-queries processed: {}", result.stats.sub_queries_count);
println!("Memory retrievals: {}", result.stats.memory_hits);

With Memory Context

// Add domain knowledge to memory
controller.add_memory(
    "Keynesian economics emphasizes government intervention and aggregate demand",
    Default::default()
)?;
controller.add_memory(
    "Monetarism focuses on controlling money supply to manage inflation",
    Default::default()
)?;

// Query now uses memory context
let result = controller.query("What are the key differences between these economic schools?")?;

Custom Decomposition Strategy

use ruvllm::rlm::{RecursiveConfigBuilder, AggregationStrategy, DecomposerStrategy};

let config = RecursiveConfigBuilder::new()
    .max_depth(3)
    .token_budget(8000)
    .decomposition_strategy(DecomposerStrategy::Semantic)
    .aggregation(AggregationStrategy::Summarize)
    .build()?;

let controller = RlmController::<NativeEnvironment>::new(config)?;

WASM Usage (Browser)

#[cfg(target_arch = "wasm32")]
use ruvllm::rlm::{WasmRlmController, WasmEnvironment};
#[cfg(target_arch = "wasm32")]
use wasm_bindgen::JsValue;

#[cfg(target_arch = "wasm32")]
async fn query_in_browser() -> Result<String, JsValue> {
    let controller = WasmRlmController::new(Default::default()).await?;
    let result = controller.query("What is machine learning?").await?;
    Ok(result.text)
}
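
To run this in the browser, build with the `rlm-wasm` or `wasm` feature from the Feature Flags table (for example via wasm-pack).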

Apple Neural Engine (ANE) Integration

RuvLLM includes full ANE support via Core ML:

use ruvllm::backends::coreml::{CoreMLBackend, AneStrategy};

// Create ANE-optimized backend
let backend = CoreMLBackend::new(AneStrategy::PreferAneForMlp)?;

// Or use hybrid pipeline for best performance
use ruvllm::backends::{HybridPipeline, HybridConfig};

let pipeline = HybridPipeline::new(HybridConfig {
    ane_strategy: AneStrategy::Adaptive,
    gpu_for_attention: true,  // Attention on GPU
    ane_for_mlp: true,        // MLP/FFN on ANE
    ..Default::default()
})?;

ANE Routing Recommendations

| Operation | Recommended | Reason |
|---|---|---|
| Attention | GPU | Better for variable sequence lengths |
| Flash Attention | GPU | GPU memory bandwidth advantage |
| MLP/FFN | ANE | Optimal for fixed-size matmuls |
| GELU/SiLU | ANE | Dedicated activation units |
| LayerNorm/RMSNorm | ANE | Good for small dimensions |
| Embedding | GPU | Sparse operations |
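
The `HybridConfig` in the example above encodes this table directly: `gpu_for_attention` keeps attention on the GPU, while `ane_for_mlp` routes MLP/FFN and its activations to the Neural Engine.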

MicroLoRA Real-Time Adaptation

RuvLLM supports per-request fine-tuning using MicroLoRA:

use ruvllm::lora::{MicroLoRA, MicroLoraConfig, AdaptFeedback};

// Create MicroLoRA adapter
let config = MicroLoraConfig::for_hidden_dim(4096);
let lora = MicroLoRA::new(config);

// Adapt on user feedback; `input_embedding` here stands for the
// request's hidden-state vector (4096-dim for the config above)
let feedback = AdaptFeedback::from_quality(0.9);
lora.adapt(&input_embedding, feedback)?;

// Apply learned updates
lora.apply_updates(0.01); // learning rate

// Get adaptation stats
let stats = lora.stats();
println!("Samples: {}, Avg quality: {:.2}", stats.samples, stats.avg_quality);

SONA Three-Tier Learning

Continuous improvement with three learning loops:

use ruvllm::optimization::{SonaLlm, SonaLlmConfig, ConsolidationStrategy, OptimizationTrigger};

let config = SonaLlmConfig {
    instant_lr: 0.01,
    background_interval_ms: 100,
    deep_trigger_threshold: 100.0,
    consolidation_strategy: ConsolidationStrategy::EwcMerge,
    ..Default::default()
};

let sona = SonaLlm::new(config);

// 1. Instant Loop (<1ms): Per-request MicroLoRA
let result = sona.instant_adapt("user query", "model response", 0.85);
println!("Instant adapt: {}us", result.latency_us);

// 2. Background Loop (~100ms): Pattern consolidation
if let Some(result) = sona.maybe_background() {
    if result.applied {
        println!("Consolidated {} samples", result.samples_used);
    }
}

// 3. Deep Loop (minutes): Full optimization
if sona.should_trigger_deep() {
    let result = sona.deep_optimize(OptimizationTrigger::QualityThreshold(100.0));
    println!("Deep optimization: {:.1}s", result.latency_us as f64 / 1_000_000.0);
}

Two-Tier KV Cache

Memory-efficient caching with automatic tiering:

use ruvllm::kv_cache::{TwoTierKvCache, KvCacheConfig, Precision};

let config = KvCacheConfig {
    tail_length: 256,              // Recent tokens in FP16
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4,  // Older tokens in Q4
    max_tokens: 8192,
    num_layers: 32,
    num_kv_heads: 8,
    head_dim: 128,
};

let cache = TwoTierKvCache::new(config);
// `keys`/`values` stand for the per-layer K/V tensors produced during decode
cache.append(&keys, &values)?;

// Automatic migration from tail to quantized store
let stats = cache.stats();
println!("Tail: {} tokens, Store: {} tokens", stats.tail_tokens, stats.store_tokens);
println!("Compression ratio: {:.2}x", stats.compression_ratio);

HuggingFace Hub Integration

Download and upload models to HuggingFace Hub:

use ruvllm::hub::{ModelDownloader, ModelUploader, RuvLtraRegistry, DownloadConfig};

// Download from Hub
let downloader = ModelDownloader::new(DownloadConfig::default());
let model_path = downloader.download(
    "ruvector/ruvltra-small-q4km",
    Some("./models"),
)?;

// Or use the registry for RuvLTRA models
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium", "Q4_K_M")?;

// Upload to Hub (requires HF_TOKEN)
let uploader = ModelUploader::new("hf_your_token");
let url = uploader.upload(
    "./my-model.gguf",
    "username/my-ruvltra-model",
    Some(metadata), // model card metadata constructed elsewhere
)?;
println!("Uploaded to: {}", url);

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| RUVLLM_CACHE_DIR | Model cache directory | ~/.cache/ruvllm |
| RUVLLM_LOG_LEVEL | Logging level | info |
| RUVLLM_METAL_DEVICE | Metal device index | 0 |
| RUVLLM_ANE_ENABLED | Enable ANE routing | true |
| RUVLLM_SONA_ENABLED | Enable SONA learning | true |
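
For example, `RUVLLM_LOG_LEVEL=debug RUVLLM_ANE_ENABLED=false cargo run --release` starts a run with verbose logging and ANE routing disabled.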

Model Configuration

let config = ModelConfig {
    max_context: 8192,
    use_flash_attention: true,
    quantization: Quantization::Q4K,
    kv_cache_config: KvCacheConfig::default(),
    rope_scaling: Some(RopeScaling::Linear { factor: 2.0 }),
    sliding_window: Some(4096),
    ..Default::default()
};
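
Here `rope_scaling` with `factor: 2.0` stretches RoPE positions to roughly double the model's trained context window, while `sliding_window: Some(4096)` caps attention to the most recent 4096 tokens to bound KV cache growth.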

Benchmarks

Run benchmarks with:

# Attention benchmarks
cargo bench --bench attention_bench --features inference-metal

# ANE benchmarks (Mac only)
cargo bench --bench ane_bench --features coreml

# LoRA benchmarks
cargo bench --bench lora_bench

# RLM benchmarks
cargo bench --bench rlm_bench --features rlm-full

# End-to-end inference
cargo bench --bench e2e_bench --features inference-metal

# Metal shader benchmarks
cargo bench --bench metal_bench --features metal-compute

# Serving benchmarks
cargo bench --bench serving_bench --features inference-metal

npm Package

RuvLLM is also available as an npm package with native bindings:

npm install @ruvector/ruvllm

import { RuvLLM } from '@ruvector/ruvllm';

const llm = new RuvLLM();
const response = llm.query('Explain quantum computing');
console.log(response.text);

See @ruvector/ruvllm on npm for full documentation.

Error Handling

use ruvllm::error::{Result, RuvLLMError};

match backend.generate(prompt, params) {
    Ok(response) => println!("{}", response),
    Err(RuvLLMError::Model(e)) => eprintln!("Model error: {}", e),
    Err(RuvLLMError::OutOfMemory(e)) => eprintln!("OOM: {}", e),
    Err(RuvLLMError::Generation(e)) => eprintln!("Generation failed: {}", e),
    Err(RuvLLMError::Ane(e)) => eprintln!("ANE error: {}", e),
    Err(RuvLLMError::Gguf(e)) => eprintln!("GGUF loading error: {}", e),
    Err(e) => eprintln!("Error: {}", e),
}

License

Apache-2.0 / MIT dual license.

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
