ruvllm 2.0.3

LLM serving runtime with Ruvector integration - Paged attention, KV cache, and SONA learning

RuvLLM v2.3 - High-Performance LLM Inference for Rust

RuvLLM is a production-ready Rust LLM inference engine optimized for Apple Silicon (M1-M4), featuring real-time fine-tuning, NEON SIMD acceleration, Apple Neural Engine integration, and the SONA self-optimizing neural architecture.

What's New in v2.3

Major Features

Feature | Description | Benefit
RuvLTRA-Medium 3B | Purpose-built 3B model for Claude Flow | 42 layers, 256K context, speculative decode
HuggingFace Hub | Full Hub integration (download/upload) | Easy model sharing & distribution
Task-Specific LoRA | 5 pre-trained adapters for agent types | Optimized for coder/researcher/security/architect/reviewer
Adapter Merging | TIES, DARE, SLERP, Task Arithmetic | Combine adapters for multi-task models
Hot-Swap Adapters | Zero-downtime adapter switching | Runtime task specialization
Claude Dataset | 2,700+ Claude-style training examples | Optimized for Claude Flow integration
HNSW Routing | 150x faster semantic pattern matching | <25µs pattern retrieval
Evaluation Harness | Real model evaluation with SWE-Bench | 5 ablation modes, quality metrics
HNSW Auto-Dimension | Automatic embedding dimension detection | No manual config needed
mistral-rs Backend | Production-scale serving with PagedAttention | 5-10x concurrent users, X-LoRA, ISQ

Previous v2.0-2.2 Features

Feature | Description | Benefit
Apple Neural Engine | Core ML backend with ANE routing | 38 TOPS, 3-4x power efficiency
Hybrid GPU+ANE Pipeline | Intelligent operation routing | Best of both accelerators
Multi-threaded GEMM | Rayon parallelization | 4-12x speedup on M4 Pro
Flash Attention 2 | Auto block sizing, online softmax | O(N) memory, +10% throughput
Quantized Inference | INT8/INT4/Q4_K/Q8_K kernels | 4-8x memory reduction
Metal GPU Shaders | simdgroup_matrix operations | 3x speedup on Apple Silicon
GGUF Support | Memory-mapped model loading | Fast loading, reduced RAM
Continuous Batching | Dynamic batch scheduling | 2-3x throughput improvement
Speculative Decoding | Draft model acceleration | 2-3x faster generation
Gemma-2 & Phi-3 | New model architectures | Extended model support

Features

Multiple Backends

  • Candle Backend: HuggingFace's Candle framework with Metal/CUDA GPU acceleration
  • Core ML Backend: Apple Neural Engine for maximum efficiency on Apple Silicon
  • Hybrid Pipeline: Automatic routing between GPU and ANE based on operation type

Optimized Kernels

  • NEON SIMD: ARM64-optimized kernels with 4x loop unrolling and FMA instructions
  • Flash Attention 2: Memory-efficient attention with O(N) complexity and online softmax
  • Paged Attention: Efficient KV cache management for long-context inference
  • ANE Operations: GELU, SiLU, softmax, layer norm optimized for Neural Engine

Real-Time Learning (SONA)

  • MicroLoRA: Per-request fine-tuning with rank 1-2 adapters (<1ms latency)
  • EWC++: Elastic Weight Consolidation to prevent catastrophic forgetting
  • Three-Tier Learning: Instant (<1ms), Background (~100ms), Deep (minutes)

Memory Efficiency

  • Two-Tier KV Cache: FP16 tail + Q4/Q8 quantized store
  • Grouped-Query Attention (GQA): 4-8x KV memory reduction
  • Memory Pool: Arena allocator for zero-allocation inference
  • GGUF Memory Mapping: Efficient large model loading
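
To make the GQA figure above concrete, here is back-of-envelope arithmetic as a sketch, assuming the KvCacheConfig values from the Two-Tier KV Cache section below (32 layers, 8 KV heads, head dim 128, 8192 tokens, FP16 at 2 bytes per element):

// KV cache size = 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes/elem
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, tokens: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * tokens * bytes_per_elem
}

fn main() {
    let gqa = kv_cache_bytes(32, 8, 128, 8192, 2);  // 8 KV heads (GQA)
    let mha = kv_cache_bytes(32, 32, 128, 8192, 2); // 32 KV heads (full multi-head)
    // Prints: GQA: 1024 MiB, MHA: 4096 MiB, reduction: 4x
    println!("GQA: {} MiB, MHA: {} MiB, reduction: {}x", gqa >> 20, mha >> 20, mha / gqa);
}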

Quick Start

use ruvllm::prelude::*;

// Initialize backend with Metal GPU + ANE hybrid
let mut backend = CandleBackend::with_device(DeviceType::Metal)?;

// Load a GGUF model
backend.load_gguf("models/qwen2.5-7b-q4_k.gguf", ModelConfig::default())?;

// Or load from HuggingFace
backend.load_model("Qwen/Qwen2.5-7B-Instruct", ModelConfig {
    quantization: Quantization::Q4K,
    use_flash_attention: true,
    ..Default::default()
})?;

// Generate text
let response = backend.generate("Explain quantum computing in simple terms.",
    GenerateParams {
        max_tokens: 256,
        temperature: 0.7,
        top_p: 0.9,
        ..Default::default()
    }
)?;

println!("{}", response);

// Check SONA learning stats
if let Some(stats) = backend.sona_stats() {
    println!("Patterns learned: {}", stats.patterns_learned);
    println!("Quality improvement: {:.1}%", stats.quality_improvement * 100.0);
}

Installation

Add to your Cargo.toml:

[dependencies]
# Recommended for Apple Silicon Mac
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }

# For NVIDIA GPUs
ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }

# Minimal (CPU only)
ruvllm = { version = "2.0" }

Feature Flags

Feature | Description
candle | Enable Candle backend (HuggingFace)
metal | Apple Silicon GPU acceleration via Candle
metal-compute | Native Metal compute shaders (M4 Pro optimized)
cuda | NVIDIA GPU acceleration
coreml | Apple Neural Engine via Core ML
hybrid-ane | GPU+ANE hybrid pipeline (recommended for Mac)
inference-metal | Full Metal inference stack
inference-metal-native | Metal + native shaders (best M4 Pro perf)
inference-cuda | Full CUDA inference stack
parallel | Multi-threaded GEMM/GEMV with Rayon
accelerate | Apple Accelerate BLAS (~2x GEMV speedup)
gguf-mmap | Memory-mapped GGUF loading
async-runtime | Tokio async support
wasm | WebAssembly support
mistral-rs | mistral-rs backend (PagedAttention, X-LoRA, ISQ)
mistral-rs-metal | mistral-rs with Apple Silicon acceleration
mistral-rs-cuda | mistral-rs with NVIDIA CUDA acceleration
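
Flags compose. As an untested suggestion, a high-performance Apple Silicon build combining flags from the table above might look like:

[dependencies]
# Native Metal shaders + Accelerate BLAS + Rayon threading + memory-mapped GGUF
ruvllm = { version = "2.0", features = [
    "inference-metal-native",
    "accelerate",
    "parallel",
    "gguf-mmap",
] }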

Architecture

+----------------------------------+
|         Application              |
+----------------------------------+
               |
+----------------------------------+
|        RuvLLM Backend            |
|  +----------------------------+  |
|  |   Hybrid Pipeline Router   |  |
|  |  ┌─────────┐ ┌──────────┐  |  |
|  |  │  Metal  │ │   ANE    │  |  |
|  |  │   GPU   │ │ Core ML  │  |  |
|  |  └────┬────┘ └────┬─────┘  |  |
|  |       │    ↕      │        |  |
|  |  Attention    MLP/FFN      |  |
|  |  RoPE         Activations  |  |
|  |  Softmax      LayerNorm    |  |
|  +----------------------------+  |
|               |                  |
|  +----------------------------+  |
|  |     SONA Learning          |  |
|  |  - Instant (<1ms)          |  |
|  |  - Background (~100ms)     |  |
|  |  - Deep (minutes)          |  |
|  +----------------------------+  |
|               |                  |
|  +----------------------------+  |
|  |     NEON/SIMD Kernels      |  |
|  |  - Flash Attention 2       |  |
|  |  - Paged KV Cache          |  |
|  |  - Quantized MatMul        |  |
|  +----------------------------+  |
+----------------------------------+

Supported Models

Model Family | Sizes | Quantization | Backend
RuvLTRA-Small | 0.5B | Q4K, Q5K, Q8, FP16 | Candle/Metal/ANE
RuvLTRA-Medium | 3B | Q4K, Q5K, Q8, FP16 | Candle/Metal
Qwen 2.5 | 0.5B-72B | Q4K, Q8, FP16 | Candle/Metal
Llama 3.x | 8B-70B | Q4K, Q8, FP16 | Candle/Metal
Mistral | 7B-22B | Q4K, Q8, FP16 | Candle/Metal
Phi-3 | 3.8B-14B | Q4K, Q8, FP16 | Candle/Metal
Gemma-2 | 2B-27B | Q4K, Q8, FP16 | Candle/Metal

RuvLTRA Models (Claude Flow Optimized)

Model | Parameters | Hidden | Layers | Context | Features
RuvLTRA-Small | 494M | 896 | 24 | 32K | GQA 7:1, SONA hooks
RuvLTRA-Medium | 3.0B | 2560 | 42 | 256K | Flash Attention 2, Speculative Decode

Performance (M4 Pro 14-core)

Inference Benchmarks

Model | Quant | Prefill (tok/s) | Decode (tok/s) | Memory
Qwen2.5-7B | Q4K | 2,800 | 95 | 4.2 GB
Qwen2.5-7B | Q8 | 2,100 | 72 | 7.8 GB
Llama3-8B | Q4K | 2,600 | 88 | 4.8 GB
Mistral-7B | Q4K | 2,500 | 85 | 4.1 GB
Phi-3-3.8B | Q4K | 3,500 | 135 | 2.3 GB
Gemma2-9B | Q4K | 2,200 | 75 | 5.2 GB

ANE vs GPU Performance (M4 Pro)

Dimension | ANE | GPU | Winner
< 512 | +30-50% | - | ANE
512-1024 | +10-30% | - | ANE
1024-1536 | ~Similar | ~Similar | Either
1536-2048 | - | +10-20% | GPU
> 2048 | - | +30-50% | GPU

Kernel Benchmarks

Kernel | Single-thread | Multi-thread (10-core)
GEMM 4096x4096 | 1.2 GFLOPS | 12.7 GFLOPS
GEMV 4096x4096 | 0.8 GFLOPS | 6.4 GFLOPS
Flash Attention (seq=2048) | 850μs | 320μs
RMS Norm (4096) | 2.1μs | 0.8μs
RoPE (4096, 128) | 4.3μs | 1.6μs

Apple Neural Engine (ANE) Integration

RuvLLM v2.0 includes full ANE support via Core ML:

use ruvllm::backends::coreml::{CoreMLBackend, AneStrategy};

// Create ANE-optimized backend
let backend = CoreMLBackend::new(AneStrategy::PreferAneForMlp)?;

// Or use hybrid pipeline for best performance
use ruvllm::backends::{HybridPipeline, HybridConfig};

let pipeline = HybridPipeline::new(HybridConfig {
    ane_strategy: AneStrategy::Adaptive,
    gpu_for_attention: true,  // Attention on GPU
    ane_for_mlp: true,        // MLP/FFN on ANE
    ..Default::default()
})?;

ANE Routing Recommendations

Operation | Recommended | Reason
Attention | GPU | Better for variable sequence lengths
Flash Attention | GPU | GPU memory bandwidth advantage
MLP/FFN | ANE | Optimal for fixed-size matmuls
GELU/SiLU | ANE | Dedicated activation units
LayerNorm/RMSNorm | ANE | Good for small dimensions
Embedding | GPU | Sparse operations
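
As an illustrative sketch only, the two tables above amount to a routing rule like the following. Op and Accelerator are hypothetical types for this example; the real logic lives inside HybridPipeline under AneStrategy::Adaptive:

// Hypothetical types for illustration; not the crate's actual routing API
enum Op { Attention, FlashAttention, Mlp, Activation, Norm, Embedding }
enum Accelerator { Gpu, Ane }

fn route(op: Op, dim: usize) -> Accelerator {
    match op {
        // Variable sequence lengths and sparse lookups favor the GPU
        Op::Attention | Op::FlashAttention | Op::Embedding => Accelerator::Gpu,
        // Fixed-size matmuls, activations, and norms favor the ANE,
        // subject to the dimension crossover from the ANE vs GPU table
        Op::Mlp | Op::Activation | Op::Norm => {
            if dim <= 1536 { Accelerator::Ane } else { Accelerator::Gpu }
        }
    }
}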

MicroLoRA Real-Time Adaptation

RuvLLM supports per-request fine-tuning using MicroLoRA:

use ruvllm::lora::{MicroLoRA, MicroLoraConfig, AdaptFeedback};

// Create MicroLoRA adapter
let config = MicroLoraConfig::for_hidden_dim(4096);
let lora = MicroLoRA::new(config);

// Adapt on user feedback
let feedback = AdaptFeedback::from_quality(0.9);
lora.adapt(&input_embedding, feedback)?;

// Apply learned updates
lora.apply_updates(0.01); // learning rate

// Get adaptation stats
let stats = lora.stats();
println!("Samples: {}, Avg quality: {:.2}", stats.samples, stats.avg_quality);

SONA Three-Tier Learning

Continuous improvement with three learning loops:

use ruvllm::optimization::{SonaLlm, SonaLlmConfig, ConsolidationStrategy, OptimizationTrigger};

let config = SonaLlmConfig {
    instant_lr: 0.01,
    background_interval_ms: 100,
    deep_trigger_threshold: 100.0,
    consolidation_strategy: ConsolidationStrategy::EwcMerge,
    ..Default::default()
};

let sona = SonaLlm::new(config);

// 1. Instant Loop (<1ms): Per-request MicroLoRA
let result = sona.instant_adapt("user query", "model response", 0.85);
println!("Instant adapt: {}μs", result.latency_us);

// 2. Background Loop (~100ms): Pattern consolidation
if let Some(result) = sona.maybe_background() {
    if result.applied {
        println!("Consolidated {} samples", result.samples_used);
    }
}

// 3. Deep Loop (minutes): Full optimization
if sona.should_trigger_deep() {
    let result = sona.deep_optimize(OptimizationTrigger::QualityThreshold(100.0));
    println!("Deep optimization: {:.1}s", result.latency_us as f64 / 1_000_000.0);
}

// Check learning stats
let stats = sona.stats();
println!("Total samples: {}", stats.total_samples);
println!("Accumulated quality: {:.2}", stats.accumulated_quality);

Two-Tier KV Cache

Memory-efficient caching with automatic tiering:

use ruvllm::kv_cache::{TwoTierKvCache, KvCacheConfig, Precision};

let config = KvCacheConfig {
    tail_length: 256,              // Recent tokens in FP16
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4,  // Older tokens in Q4
    max_tokens: 8192,
    num_layers: 32,
    num_kv_heads: 8,
    head_dim: 128,
};

let cache = TwoTierKvCache::new(config);
cache.append(&keys, &values)?;

// Automatic migration from tail to quantized store
let stats = cache.stats();
println!("Tail: {} tokens, Store: {} tokens", stats.tail_tokens, stats.store_tokens);
println!("Compression ratio: {:.2}x", stats.compression_ratio);
println!("Memory saved: {:.1} MB", stats.memory_saved_mb);

Continuous Batching

High-throughput serving with dynamic batching:

use ruvllm::serving::{ContinuousBatchScheduler, SchedulerConfig, InferenceRequest, PreemptionMode};

let scheduler = ContinuousBatchScheduler::new(SchedulerConfig {
    max_batch_size: 32,
    max_batch_tokens: 4096,
    max_waiting_time_ms: 50,
    preemption_mode: PreemptionMode::Recompute,
    ..Default::default()
});

// Add requests
scheduler.add_request(InferenceRequest::new(tokens, params))?;

// Process batch
while let Some(batch) = scheduler.get_next_batch() {
    let outputs = backend.forward_batch(&batch)?;
    scheduler.process_outputs(outputs)?;
}

// Get throughput stats
let stats = scheduler.stats();
println!("Throughput: {:.1} tok/s", stats.tokens_per_second);
println!("Batch utilization: {:.1}%", stats.avg_batch_utilization * 100.0);

Speculative Decoding

Accelerate generation with draft models:

use ruvllm::speculative::{SpeculativeDecoder, SpeculativeConfig};

let config = SpeculativeConfig {
    draft_tokens: 4,           // Tokens to draft per step
    acceptance_threshold: 0.8, // Min probability for acceptance
    ..Default::default()
};

let decoder = SpeculativeDecoder::new(
    target_model,
    draft_model,
    config,
)?;

// Generate with speculation
let output = decoder.generate(prompt, GenerateParams {
    max_tokens: 256,
    ..Default::default()
})?;

println!("Acceptance rate: {:.1}%", output.stats.acceptance_rate * 100.0);
println!("Speedup: {:.2}x", output.stats.speedup);

GGUF Model Loading

Efficient loading with memory mapping:

use ruvllm::gguf::{GgufLoader, GgufConfig};

let loader = GgufLoader::new(GgufConfig {
    mmap_enabled: true,       // Memory-map for fast loading
    validate_checksum: true,  // Verify file integrity
    ..Default::default()
});

// Load model metadata
let metadata = loader.read_metadata("model.gguf")?;
println!("Model: {}", metadata.name);
println!("Parameters: {}B", metadata.parameters / 1_000_000_000);
println!("Quantization: {:?}", metadata.quantization);

// Load into backend
let tensors = loader.load_tensors("model.gguf")?;
backend.load_tensors(tensors)?;

mistral-rs Backend (Production Serving)

RuvLLM v2.3 includes integration with mistral-rs for production-scale LLM serving with advanced memory management.

Note: The mistral-rs crate is not yet published to crates.io. The integration is designed and ready; enable the feature once mistral-rs is available.

Key Features

Feature | Description | Benefit
PagedAttention | vLLM-style KV cache management | 5-10x concurrent users, 85-95% memory utilization
X-LoRA | Per-token adapter routing | <1ms routing overhead, multi-task inference
ISQ | In-Situ Quantization (AWQ, GPTQ, RTN) | Runtime quantization without re-export

Usage Example

use ruvllm::backends::mistral::{
    MistralBackend, MistralBackendConfig,
    PagedAttentionConfig, XLoraConfig, IsqConfig, IsqMethod,
};

// Configure mistral-rs backend for production serving
let config = MistralBackendConfig::builder()
    // PagedAttention: Enable 50+ concurrent users
    .paged_attention(PagedAttentionConfig {
        block_size: 16,
        max_blocks: 4096,
        gpu_memory_fraction: 0.9,
        enable_prefix_caching: true,
    })
    // X-LoRA: Per-token adapter routing
    .xlora(XLoraConfig {
        adapters: vec![
            "adapters/coder".into(),
            "adapters/researcher".into(),
        ],
        top_k: 2,
        temperature: 0.3,
    })
    // ISQ: Runtime quantization
    .isq(IsqConfig {
        bits: 4,
        method: IsqMethod::AWQ,
        calibration_samples: 128,
    })
    .build();

let mut backend = MistralBackend::new(config)?;
backend.load_model("mistralai/Mistral-7B-Instruct-v0.2", ModelConfig::default())?;

// Generate with PagedAttention + X-LoRA
let response = backend.generate("Write secure authentication code", GenerateParams {
    max_tokens: 512,
    temperature: 0.7,
    ..Default::default()
})?;

When to Use mistral-rs vs Candle

Scenario | Recommended Backend | Reason
Single user / Edge | Candle | Simpler, smaller binary
10-100 concurrent users | mistral-rs | PagedAttention memory efficiency
Multi-task models | mistral-rs | X-LoRA per-token routing
Runtime quantization | mistral-rs | ISQ without model re-export
WASM / Browser | Candle | mistral-rs doesn't support WASM

Feature Flags

# Enable mistral-rs (when available on crates.io)
ruvllm = { version = "2.3", features = ["mistral-rs"] }

# With Metal acceleration (Apple Silicon)
ruvllm = { version = "2.3", features = ["mistral-rs-metal"] }

# With CUDA acceleration (NVIDIA)
ruvllm = { version = "2.3", features = ["mistral-rs-cuda"] }

See ADR-008: mistral-rs Integration for detailed architecture decisions.

Configuration

Environment Variables

Variable | Description | Default
RUVLLM_CACHE_DIR | Model cache directory | ~/.cache/ruvllm
RUVLLM_LOG_LEVEL | Logging level | info
RUVLLM_METAL_DEVICE | Metal device index | 0
RUVLLM_ANE_ENABLED | Enable ANE routing | true
RUVLLM_SONA_ENABLED | Enable SONA learning | true
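
A typical shell setup using the variables above (the values here are examples, not defaults):

export RUVLLM_CACHE_DIR="$HOME/models/ruvllm-cache"
export RUVLLM_LOG_LEVEL=debug        # more verbose than the default "info"
export RUVLLM_METAL_DEVICE=0
export RUVLLM_ANE_ENABLED=true       # keep ANE routing on
export RUVLLM_SONA_ENABLED=true      # keep real-time learning on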

Model Configuration

let config = ModelConfig {
    max_context: 8192,
    use_flash_attention: true,
    quantization: Quantization::Q4K,
    kv_cache_config: KvCacheConfig::default(),
    rope_scaling: Some(RopeScaling::Linear { factor: 2.0 }),
    sliding_window: Some(4096),
    ..Default::default()
};

Benchmarks

Run benchmarks with:

# Attention benchmarks
cargo bench --bench attention_bench --features inference-metal

# ANE benchmarks (Mac only)
cargo bench --bench ane_bench --features coreml

# LoRA benchmarks
cargo bench --bench lora_bench

# End-to-end inference
cargo bench --bench e2e_bench --features inference-metal

# Metal shader benchmarks
cargo bench --bench metal_bench --features metal-compute

# Serving benchmarks
cargo bench --bench serving_bench --features inference-metal

HuggingFace Hub Integration (v2.3)

Download and upload models to HuggingFace Hub:

use ruvllm::hub::{ModelDownloader, ModelUploader, RuvLtraRegistry, DownloadConfig};

// Download from Hub
let downloader = ModelDownloader::new(DownloadConfig::default());
let model_path = downloader.download(
    "ruvector/ruvltra-small-q4km",
    Some("./models"),
)?;

// Or use the registry for RuvLTRA models
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium", "Q4_K_M")?;

// Upload to Hub (requires HF_TOKEN)
let uploader = ModelUploader::new("hf_your_token");
let url = uploader.upload(
    "./my-model.gguf",
    "username/my-ruvltra-model",
    Some(metadata),
)?;
println!("Uploaded to: {}", url);

Task-Specific LoRA Adapters (v2.3)

Pre-trained adapters optimized for Claude Flow agent types:

use ruvllm::lora::{RuvLtraAdapters, AdapterTrainer, AdapterMerger, MergeConfig, HotSwapManager};

// Create adapter for specific task
let adapters = RuvLtraAdapters::new();
let coder = adapters.create_lora("coder", 768)?;       // Rank 16, code generation
let security = adapters.create_lora("security", 768)?; // Rank 16, vulnerability detection

// Available adapters:
// - coder:     Rank 16, Alpha 32.0, targets attention (Q,K,V,O)
// - researcher: Rank 8, Alpha 16.0, targets Q,K,V
// - security:  Rank 16, Alpha 32.0, targets attention + MLP
// - architect: Rank 12, Alpha 24.0, targets Q,V + Gate,Up
// - reviewer:  Rank 8, Alpha 16.0, targets Q,V

// Merge adapters for multi-task models
let weights = vec![0.6, 0.4]; // relative importance of coder vs security
let merger = AdapterMerger::new(MergeConfig::weighted(weights));
let multi_task = merger.merge(&[coder, security], &output_config, 768)?;

// Hot-swap adapters at runtime
let mut manager = HotSwapManager::new();
manager.set_active(coder);
manager.prepare_standby(security);
manager.swap()?; // Zero-downtime switch

Adapter Merging Strategies

Strategy | Description | Use Case
Average | Equal-weight averaging | Simple multi-task
WeightedSum | User-defined weights | Task importance weighting
SLERP | Spherical interpolation | Smooth transitions
TIES | Trim, Elect, Merge | Robust multi-adapter
DARE | Drop And REscale | Sparse merging
TaskArithmetic | Add/subtract task vectors | Task composition
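
Only MergeConfig::weighted appears in the example above, so the constructors below are hypothetical placeholders that illustrate where strategy selection plugs in, not the crate's confirmed API:

// Hypothetical constructor names (assumed, not confirmed by these docs)
let ties = AdapterMerger::new(MergeConfig::ties(0.2));  // assumed: trim to top 20% of deltas
let dare = AdapterMerger::new(MergeConfig::dare(0.5));  // assumed: drop rate 0.5, then rescale
let robust = ties.merge(&[coder, researcher, security], &output_config, 768)?;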

Evaluation Harness (v2.3)

RuvLLM includes a comprehensive evaluation harness for benchmarking model quality:

use ruvllm::evaluation::{RealEvaluationHarness, EvalConfig, AblationMode};

// Create harness with GGUF model
let harness = RealEvaluationHarness::with_gguf(
    "./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    EvalConfig::default(),
)?;

// Run single evaluation
let result = harness.evaluate(
    "Fix the null pointer exception in this code",
    "def process(data):\n    return data.split()",
    AblationMode::Full,
)?;

println!("Success: {}, Quality: {:.2}", result.success, result.quality_score);

// Run full ablation study (5 modes)
let report = harness.run_ablation_study(&tasks)?;
for (mode, metrics) in &report.mode_metrics {
    println!("{:?}: {:.1}% success, {:.2} quality",
        mode, metrics.success_rate * 100.0, metrics.avg_quality);
}

Ablation Modes

Mode | Description | Use Case
Baseline | No enhancements | Control baseline
RetrievalOnly | HNSW pattern retrieval | Measure retrieval impact
AdaptersOnly | LoRA adapters | Measure adaptation impact
RetrievalPlusAdapters | HNSW + LoRA | Combined without SONA
Full | All systems (SONA + HNSW + LoRA) | Production mode

SWE-Bench Task Loader

use ruvllm::evaluation::swe_bench::SweBenchLoader;

// Load SWE-Bench tasks
let loader = SweBenchLoader::new();
let tasks = loader.load_subset("lite", 50)?; // 50 tasks from lite subset

for task in &tasks {
    println!("Instance: {}", task.instance_id);
    println!("Problem: {}", task.problem_statement);
}

CLI Evaluation

# Run evaluation with default settings
cargo run --example run_eval --features async-runtime -- \
    --model ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Run SWE-Bench subset
cargo run --example run_eval --features async-runtime -- \
    --model ./models/model.gguf \
    --swe-bench-path ./data/swe-bench \
    --subset lite \
    --max-tasks 100

# Output report
cargo run --example run_eval --features async-runtime -- \
    --model ./models/model.gguf \
    --output ./reports/eval-report.json

HNSW Auto-Dimension Detection

The evaluation harness automatically detects model embedding dimensions:

// HNSW router automatically uses model's hidden_size
// TinyLlama 1.1B → 2048 dimensions
// Qwen2 0.5B → 896 dimensions
// RuvLTRA-Small → 896 dimensions
// RuvLTRA-Medium → 2560 dimensions

use ruvllm::evaluation::{RealEvaluationHarness, EvalConfig, RealInferenceConfig}; // path for RealInferenceConfig assumed

let harness = RealEvaluationHarness::with_config(
    EvalConfig::default(),
    RealInferenceConfig {
        enable_hnsw: true,
        hnsw_config: None, // Auto-detect from model
        ..Default::default()
    },
)?;

Examples

See the /examples directory for:

  • download_test_model.rs - Download and validate models
  • benchmark_model.rs - Full inference benchmarking
  • run_eval.rs - Run evaluation harness with SWE-Bench
  • Basic inference
  • Streaming generation
  • MicroLoRA adaptation
  • Multi-turn chat
  • Speculative decoding
  • Continuous batching
  • ANE hybrid inference
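
Each .rs example runs as a standard Cargo example; a typical invocation (feature flag assumed from the Benchmarks section):

cargo run --release --example benchmark_model --features inference-metal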

Error Handling

use ruvllm::error::{Result, RuvLLMError};

match backend.generate(prompt, params) {
    Ok(response) => println!("{}", response),
    Err(RuvLLMError::Model(e)) => eprintln!("Model error: {}", e),
    Err(RuvLLMError::OutOfMemory(e)) => eprintln!("OOM: {}", e),
    Err(RuvLLMError::Generation(e)) => eprintln!("Generation failed: {}", e),
    Err(RuvLLMError::Ane(e)) => eprintln!("ANE error: {}", e),
    Err(RuvLLMError::Gguf(e)) => eprintln!("GGUF loading error: {}", e),
    Err(e) => eprintln!("Error: {}", e),
}

npm Package

RuvLLM is also available as an npm package with native bindings:

npm install @ruvector/ruvllm

import { RuvLLM } from '@ruvector/ruvllm';

const llm = new RuvLLM();
const response = llm.query('Explain quantum computing');
console.log(response.text);

See @ruvector/ruvllm on npm for full documentation.

License

Apache-2.0 / MIT dual license.

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

Links