# RuvLLM v2.0 - High-Performance LLM Inference for Rust

RuvLLM is a production-ready Rust LLM inference engine optimized for Apple Silicon (M1-M4), featuring real-time fine-tuning, NEON SIMD acceleration, Apple Neural Engine integration, and the SONA self-optimizing neural architecture.

## What's New in v2.0

### Major Features

| Feature | Description | Benefit |
|---------|-------------|---------|
| **RLM (Recursive Language Model)** | Recursive query decomposition for complex reasoning | Break down complex questions, parallel sub-query processing |
| **RuvLTRA-Medium 3B** | Purpose-built 3B model for Claude Flow | 42 layers, 256K context, speculative decode |
| **HuggingFace Hub** | Full Hub integration (download/upload) | Easy model sharing & distribution |
| **Task-Specific LoRA** | 5 pre-trained adapters for agent types | Optimized for coder/researcher/security/architect/reviewer |
| **Adapter Merging** | TIES, DARE, SLERP, Task Arithmetic | Combine adapters for multi-task models (sketch after this table) |
| **Hot-Swap Adapters** | Zero-downtime adapter switching | Runtime task specialization |
| **WASM Support** | WebAssembly target for browser-based inference | Run LLMs in the browser |
| **HNSW Routing** | 150x faster semantic pattern matching | <25us pattern retrieval |
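
The merging strategies in the table are established techniques rather than ruvllm inventions. The simplest, task arithmetic, merges adapters as a weighted sum of their per-task weight deltas; TIES and DARE refine this by trimming conflicting updates or randomly dropping and rescaling them. A minimal scalar sketch of the task-arithmetic step (illustrative only; not the crate's API):

```rust
/// Task-arithmetic merge: the merged adapter delta is a weighted sum of
/// per-task deltas (flattened to plain `Vec<f32>` here for clarity).
/// TIES/DARE layer trimming or random drop-and-rescale on top of this.
fn task_arithmetic_merge(deltas: &[Vec<f32>], weights: &[f32]) -> Vec<f32> {
    assert_eq!(deltas.len(), weights.len());
    let mut merged = vec![0.0f32; deltas[0].len()];
    for (delta, &w) in deltas.iter().zip(weights) {
        for (m, &d) in merged.iter_mut().zip(delta) {
            *m += w * d;
        }
    }
    merged
}
```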

### Performance Optimizations (NEW)

The v2.0 release includes significant performance improvements across all hot paths:

| Optimization | Description | Benefit |
|--------------|-------------|---------|
| **HNSW Index** | O(log n) approximate nearest neighbor search | 10x faster at 10k entries vs linear scan |
| **O(1) LRU Cache** | Uses the `lru` crate for cache operations | 23.5 ns cache lookup (vs 500 ns+ HashMap; example below) |
| **Zero-Copy Types** | `Arc<str>`, `Arc<[f32]>` for shared data | 100-1000x faster cache hit paths (illustrated below) |
| **Batch SIMD** | AVX2/NEON vectorized batch operations | 4x throughput for similarity search |
| **Memory Pools** | Pre-allocated vector/string pools | 50% fewer allocations in hot paths |
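
The `lru` crate named in the table is a published crate with a stable API; a minimal usage sketch of the O(1) pattern (capacity and key type are illustrative):

```rust
use std::num::NonZeroUsize;
use lru::LruCache;

fn main() {
    // Hash map + intrusive doubly linked list gives O(1) get/put,
    // versus rescanning or rehashing on every eviction.
    let mut cache: LruCache<u64, String> = LruCache::new(NonZeroUsize::new(1024).unwrap());
    cache.put(42, "memoized answer".to_string());
    if let Some(hit) = cache.get(&42) {
        println!("hit: {hit}");
    }
}
```

The zero-copy row is the standard `Arc` sharing pattern: cloning copies a pointer and bumps a reference count instead of duplicating the buffer (the crate's `SharedText`/`SharedEmbedding` presumably wrap this):

```rust
use std::sync::Arc;

fn main() {
    let text: Arc<str> = Arc::from("cached response");
    let embedding: Arc<[f32]> = Arc::from(vec![0.1f32; 384]);

    // A cache hit hands out cheap clones instead of copying 384 floats.
    let hit = Arc::clone(&embedding);
    assert_eq!(hit.len(), 384);
    println!("shared: {text}");
}
```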

### Benchmark Results

Measured on Apple M4 Pro with 384-dimensional embeddings:

| Operation | Performance | Notes |
|-----------|-------------|-------|
| Query decomposition | 340 ns | Pattern-based keyword extraction |
| Cache lookup | 23.5 ns | O(1) LRU with FNV-1a hashing (sketch below) |
| Memory search (10k entries) | ~0.4 ms | With HNSW index (vs 4 ms linear) |
| Embeddings (384d) | 293 ns | SIMD-accelerated dot product |
| Batch cosine (4x384d) | ~1.1 us | AVX2/NEON batch processing |
| Pool acquire/release | <100 ns | Zero-allocation in steady state |
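
For reference, the FNV-1a hash named in the cache-lookup row is a tiny XOR-and-multiply fold. The 64-bit variant below uses the standard FNV constants; this is the textbook algorithm, not code lifted from ruvllm:

```rust
/// 64-bit FNV-1a: XOR each byte into the state, then multiply by the
/// FNV prime. Cheap and branch-free, which suits hot cache-key paths.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |h, &b| (h ^ u64::from(b)).wrapping_mul(PRIME))
}
```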

### New Optimization Modules

| Module | Purpose | Key Types |
|--------|---------|-----------|
| `rlm/pool.rs` | Memory pools for allocation reuse | `VectorPool`, `StringPool`, `PooledVec` |
| `rlm/shared_types.rs` | Zero-copy shared types | `SharedText`, `SharedEmbedding`, `SharedQueryResult` |
| `rlm/simd_ops.rs` | SIMD-accelerated vector operations | `batch_cosine_similarity_4`, `batch_dot_products` (reference sketch below) |
| `rlm/cache.rs` | O(1) LRU memoization cache | `MemoizationCache`, `CacheEntry` |
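
The exact signatures in `rlm/simd_ops.rs` are not shown in this README, so as a reference for what `batch_cosine_similarity_4` computes, here is a scalar equivalent: one query scored against four candidates in a single pass. The SIMD version presumably replaces the inner loop with AVX2/NEON lanes:

```rust
/// Scalar reference for batch cosine similarity: one query scored
/// against four equal-length candidates in a single pass over the data.
fn batch_cosine_4_scalar(query: &[f32], candidates: [&[f32]; 4]) -> [f32; 4] {
    let mut dots = [0.0f32; 4];
    let mut norms = [0.0f32; 4];
    let mut q_norm = 0.0f32;
    for (i, &q) in query.iter().enumerate() {
        q_norm += q * q;
        for (k, cand) in candidates.iter().enumerate() {
            dots[k] += q * cand[i];
            norms[k] += cand[i] * cand[i];
        }
    }
    let q_norm = q_norm.sqrt();
    std::array::from_fn(|k| dots[k] / (q_norm * norms[k].sqrt()).max(f32::EPSILON))
}
```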

### RLM (Recursive Language Model) Architecture

RLM provides a sophisticated recursive reasoning pipeline:

```text
+------------------+
|  RlmController   |  <-- Main entry point
+--------+---------+
         |
         v
+--------+---------+
| QueryDecomposer  |  <-- Breaks complex queries
+--------+---------+
         |
   +-----+-----+
   |           |
+--v--+     +--v--+
|Sub  |     |Sub  |  <-- Parallel sub-query processing
|Query|     |Query|
+--+--+     +--+--+
   |           |
   +-----+-----+
         |
         v
+--------+---------+
|AnswerSynthesizer |  <-- Combines sub-answers
+--------+---------+
         |
         v
+--------+---------+
|   RlmMemory      |  <-- HNSW-indexed retrieval
+------------------+
```

### RLM Quick Start

```rust
use ruvllm::rlm::{RlmController, RlmConfig, NativeEnvironment};

// Create controller with default config
let config = RlmConfig::default();
let controller = RlmController::<NativeEnvironment>::new(config)?;

// Query the model with recursive decomposition
let response = controller.query("What is recursive language modeling and how does it improve reasoning?")?;
println!("Answer: {}", response.text);

// Add to memory for future retrieval
controller.add_memory("RLM uses recursive decomposition.", Default::default())?;

// Search memory semantically
let results = controller.search_memory("recursive", 5)?;
```

### RLM Configuration

```rust
use ruvllm::rlm::{RecursiveConfig, RecursiveConfigBuilder, AggregationStrategy};

let config = RecursiveConfigBuilder::new()
    .max_depth(5)                              // Maximum recursion depth
    .token_budget(16000)                       // Total token budget
    .enable_cache(true)                        // Enable memoization
    .aggregation(AggregationStrategy::WeightedMerge)
    .parallel_subqueries(true)                 // Process sub-queries in parallel
    .build()?;
```

### Previous Features (v1.x)

| Feature | Description | Benefit |
|---------|-------------|---------|
| **Apple Neural Engine** | Core ML backend with ANE routing | 38 TOPS, 3-4x power efficiency |
| **Hybrid GPU+ANE Pipeline** | Intelligent operation routing | Best of both accelerators |
| **Multi-threaded GEMM** | Rayon parallelization | 4-12x speedup on M4 Pro |
| **Flash Attention 2** | Auto block sizing, online softmax | O(N) memory, +10% throughput |
| **Quantized Inference** | INT8/INT4/Q4_K/Q8_K kernels | 4-8x memory reduction |
| **Metal GPU Shaders** | simdgroup_matrix operations | 3x speedup on Apple Silicon |
| **GGUF Support** | Memory-mapped model loading | Fast loading, reduced RAM |
| **Continuous Batching** | Dynamic batch scheduling | 2-3x throughput improvement |
| **Speculative Decoding** | Draft model acceleration | 2-3x faster generation (sketch after this table) |
| **Gemma-2 & Phi-3** | New model architectures | Extended model support |
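
Speculative decoding, listed above, is a general technique: a small draft model proposes several tokens, the target model verifies all of them in one batched forward pass, and the longest agreeing prefix is kept. A minimal greedy-acceptance sketch (illustrative, not ruvllm's implementation; `target_tokens[i]` is assumed to be the target model's greedy choice at position `i` given the accepted prefix):

```rust
/// Greedy speculative-decoding acceptance: keep draft tokens while they
/// match what the target model would have emitted; on the first
/// mismatch, take the target's token and stop.
fn accept_speculated(draft_tokens: &[u32], target_tokens: &[u32]) -> Vec<u32> {
    let mut accepted = Vec::new();
    for (&d, &t) in draft_tokens.iter().zip(target_tokens) {
        if d == t {
            accepted.push(d); // draft guessed correctly: a "free" token
        } else {
            accepted.push(t); // correction from the target model
            break;
        }
    }
    accepted
}
```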

## Features

### Multiple Backends
- **Candle Backend**: HuggingFace's Candle framework with Metal/CUDA GPU acceleration
- **Core ML Backend**: Apple Neural Engine for maximum efficiency on Apple Silicon
- **Hybrid Pipeline**: Automatic routing between GPU and ANE based on operation type
- **RuvLTRA Backend**: Custom backend optimized for Claude Flow integration

### Optimized Kernels
- **NEON SIMD**: ARM64-optimized kernels with 4x loop unrolling and FMA instructions
- **Flash Attention 2**: Memory-efficient attention with O(N) memory and online softmax (sketch after this list)
- **Paged Attention**: Efficient KV cache management for long-context inference
- **ANE Operations**: GELU, SiLU, softmax, layer norm optimized for Neural Engine
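
The online softmax behind Flash Attention 2 is what keeps memory O(N): instead of materializing all logits and normalizing at the end, it tracks a running max and a rescaled running sum in a single streaming pass. A sketch of just the normalizer (the attention tiling around it is omitted):

```rust
/// One-pass softmax statistics: returns (max, sum of exp(x - max)).
/// When a new max appears, the running sum is rescaled so earlier
/// contributions stay consistent - the core Flash Attention trick.
fn online_softmax_stats(logits: impl IntoIterator<Item = f32>) -> (f32, f32) {
    let (mut m, mut s) = (f32::NEG_INFINITY, 0.0f32);
    for x in logits {
        let m_new = m.max(x);
        s = s * (m - m_new).exp() + (x - m_new).exp();
        m = m_new;
    }
    (m, s) // softmax(x_i) = (x_i - m).exp() / s
}
```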

### Real-Time Learning (SONA)
- **MicroLoRA**: Per-request fine-tuning with rank 1-2 adapters (<1ms latency)
- **EWC++**: Elastic Weight Consolidation to prevent catastrophic forgetting
- **Three-Tier Learning**: Instant (<1ms), Background (~100ms), Deep (minutes)

### Memory Efficiency
- **Two-Tier KV Cache**: FP16 tail + Q4/Q8 quantized store
- **Grouped-Query Attention (GQA)**: 4-8x KV memory reduction (worked example after this list)
- **Memory Pool**: Arena allocator for zero-allocation inference
- **GGUF Memory Mapping**: Efficient large model loading
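
The GQA figure follows directly from head counts: the KV cache scales with the number of KV heads, so 32 query heads sharing 8 KV heads shrink it 4x (8x at 4 KV heads). A quick worked calculation with a Llama-3-8B-like shape (the shapes are illustrative, not taken from this README):

```rust
/// KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim
///                x seq_len x bytes per element.
fn kv_bytes(layers: usize, kv_heads: usize, head_dim: usize, seq: usize, elem: usize) -> usize {
    2 * layers * kv_heads * head_dim * seq * elem
}

fn main() {
    let (layers, head_dim, seq) = (32, 128, 8192);
    let mha = kv_bytes(layers, 32, head_dim, seq, 2); // 32 KV heads, FP16: 4 GiB
    let gqa = kv_bytes(layers, 8, head_dim, seq, 2);  // 8 KV heads, FP16:  1 GiB
    println!("MHA: {} MiB, GQA: {} MiB ({}x smaller)", mha >> 20, gqa >> 20, mha / gqa);
}
```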

## Quick Start

```rust
use ruvllm::prelude::*;

// Initialize backend with Metal GPU + ANE hybrid
let mut backend = CandleBackend::with_device(DeviceType::Metal)?;

// Load a GGUF model
backend.load_gguf("models/qwen2.5-7b-q4_k.gguf", ModelConfig::default())?;

// Or load from HuggingFace
backend.load_model("Qwen/Qwen2.5-7B-Instruct", ModelConfig {
    quantization: Quantization::Q4K,
    use_flash_attention: true,
    ..Default::default()
})?;

// Generate text
let response = backend.generate("Explain quantum computing in simple terms.",
    GenerateParams {
        max_tokens: 256,
        temperature: 0.7,
        top_p: 0.9,
        ..Default::default()
    }
)?;

println!("{}", response);

// Check SONA learning stats
if let Some(stats) = backend.sona_stats() {
    println!("Patterns learned: {}", stats.patterns_learned);
    println!("Quality improvement: {:.1}%", stats.quality_improvement * 100.0);
}
```

## Installation

Add **one** of the following variants to your `Cargo.toml`:

```toml
[dependencies]
# Recommended for Apple Silicon Mac
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }

# For NVIDIA GPUs
ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }

# With RLM recursive reasoning
ruvllm = { version = "2.0", features = ["rlm-full"] }

# Minimal (CPU only)
ruvllm = { version = "2.0" }
```

### Feature Flags

| Feature | Description |
|---------|-------------|
| `candle` | Enable Candle backend (HuggingFace) |
| `metal` | Apple Silicon GPU acceleration via Candle |
| `metal-compute` | Native Metal compute shaders (M4 Pro optimized) |
| `cuda` | NVIDIA GPU acceleration |
| `coreml` | Apple Neural Engine via Core ML |
| `hybrid-ane` | GPU+ANE hybrid pipeline (recommended for Mac) |
| `inference-metal` | Full Metal inference stack |
| `inference-metal-native` | Metal + native shaders (best M4 Pro perf) |
| `inference-cuda` | Full CUDA inference stack |
| `parallel` | Multi-threaded GEMM/GEMV with Rayon |
| `accelerate` | Apple Accelerate BLAS (~2x GEMV speedup) |
| `gguf-mmap` | Memory-mapped GGUF loading |
| `async-runtime` | Tokio async support |
| `wasm` | WebAssembly support |
| **`rlm-core`** | RLM recursive reasoning core (includes cache, pools, SIMD) |
| **`rlm-wasm`** | RLM with WASM support for browsers |
| **`rlm-full`** | Full RLM with async runtime |
| `attention` | Ruvector attention mechanisms |
| `graph` | Ruvector graph integration |
| `gnn` | Graph neural network support |
| `ruvector-full` | All Ruvector integrations |

## Architecture

```text
+----------------------------------+
|         Application              |
+----------------------------------+
              |
+----------------------------------+
|        RuvLLM Backend            |
|  +----------------------------+  |
|  |   Hybrid Pipeline Router   |  |
|  |  +----------+ +----------+ |  |
|  |  |  Metal   | |   ANE    | |  |
|  |  |   GPU    | | Core ML  | |  |
|  |  +----+-----+ +----+-----+ |  |
|  |       v           v        |  |
|  |  Attention    MLP/FFN      |  |
|  |  RoPE         Activations  |  |
|  |  Softmax      LayerNorm    |  |
|  +----------------------------+  |
|              |                   |
|  +----------------------------+  |
|  |     SONA Learning          |  |
|  |  - Instant (<1ms)          |  |
|  |  - Background (~100ms)     |  |
|  |  - Deep (minutes)          |  |
|  +----------------------------+  |
|              |                   |
|  +----------------------------+  |
|  |     NEON/SIMD Kernels      |  |
|  |  - Flash Attention 2       |  |
|  |  - Paged KV Cache          |  |
|  |  - Quantized MatMul        |  |
|  +----------------------------+  |
+----------------------------------+
```

## Supported Models

| Model Family | Sizes | Quantization | Backend |
|--------------|-------|--------------|---------|
| **RuvLTRA-Small** | 0.5B | Q4K, Q5K, Q8, FP16 | Candle/Metal/ANE |
| **RuvLTRA-Medium** | 3B | Q4K, Q5K, Q8, FP16 | Candle/Metal |
| Qwen 2.5 | 0.5B-72B | Q4K, Q8, FP16 | Candle/Metal |
| Llama 3.x | 8B-70B | Q4K, Q8, FP16 | Candle/Metal |
| Mistral | 7B-22B | Q4K, Q8, FP16 | Candle/Metal |
| Phi-3 | 3.8B-14B | Q4K, Q8, FP16 | Candle/Metal |
| Gemma-2 | 2B-27B | Q4K, Q8, FP16 | Candle/Metal |

### RuvLTRA Models (Claude Flow Optimized)

| Model | Parameters | Hidden | Layers | Context | Features |
|-------|------------|--------|--------|---------|----------|
| RuvLTRA-Small | 494M | 896 | 24 | 32K | GQA 7:1, SONA hooks |
| RuvLTRA-Medium | 3.0B | 2560 | 42 | 256K | Flash Attention 2, Speculative Decode |

### HuggingFace Model Links

Pre-trained RuvLTRA models are available on HuggingFace:

- **Repository**: [huggingface.co/ruv/ruvltra](https://huggingface.co/ruv/ruvltra)

| Model | File | Size | Purpose |
|-------|------|------|---------|
| RuvLTRA Claude Code 0.5B | `ruvltra-claude-code-0.5b-q4_k_m.gguf` | ~400MB | Agent routing (100% accuracy with hybrid) |
| RuvLTRA Small 0.5B | `ruvltra-0.5b-q4_k_m.gguf` | ~400MB | General embeddings |
| RuvLTRA Medium 3B | `ruvltra-3b-q4_k_m.gguf` | ~2GB | Full LLM inference |

Download models:
```bash
# Using huggingface-cli
huggingface-cli download ruv/ruvltra ruvltra-claude-code-0.5b-q4_k_m.gguf --local-dir ~/.ruvllm/models

# Or via the API
curl -L https://huggingface.co/ruv/ruvltra/resolve/main/ruvltra-claude-code-0.5b-q4_k_m.gguf -o ~/.ruvllm/models/ruvltra-claude-code-0.5b-q4_k_m.gguf
```

## Performance Benchmarks

### Inference (M4 Pro 14-core)

| Model | Quant | Prefill (tok/s) | Decode (tok/s) | Memory |
|-------|-------|-----------------|----------------|--------|
| Qwen2.5-7B | Q4K | 2,800 | 95 | 4.2 GB |
| Qwen2.5-7B | Q8 | 2,100 | 72 | 7.8 GB |
| Llama3-8B | Q4K | 2,600 | 88 | 4.8 GB |
| Mistral-7B | Q4K | 2,500 | 85 | 4.1 GB |
| Phi-3-3.8B | Q4K | 3,500 | 135 | 2.3 GB |
| Gemma2-9B | Q4K | 2,200 | 75 | 5.2 GB |

### RLM Decomposition Performance

| Query Complexity | Sub-queries | Decomposition Time | Total Time |
|-----------------|-------------|-------------------|------------|
| Simple | 1 | <1ms | 50-100ms |
| Moderate | 2-3 | 2-5ms | 150-300ms |
| Complex | 4-6 | 5-10ms | 400-800ms |
| Deep reasoning | 6-10 | 10-20ms | 1-3s |

### ANE vs GPU Performance (M4 Pro)

| Dimension | ANE | GPU | Winner |
|-----------|-----|-----|--------|
| < 512 | +30-50% | - | ANE |
| 512-1024 | +10-30% | - | ANE |
| 1024-1536 | ~Similar | ~Similar | Either |
| 1536-2048 | - | +10-20% | GPU |
| > 2048 | - | +30-50% | GPU |
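
Distilled into code, the table's guidance is a simple crossover rule on the matmul dimension. This helper is hypothetical (ruvllm's `AneStrategy::Adaptive` presumably makes a similar decision internally):

```rust
/// Device preference by hidden dimension, per the table above:
/// ANE below the ~1024 crossover, GPU above ~1536, either in between.
fn prefer_ane(hidden_dim: usize) -> Option<bool> {
    match hidden_dim {
        d if d < 1024 => Some(true),   // ANE ahead by 10-50%
        d if d <= 1536 => None,        // roughly even: pick either
        _ => Some(false),              // GPU ahead by 10-50%
    }
}
```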

### Kernel Benchmarks

| Kernel | Single-thread | Multi-thread (10-core) |
|--------|---------------|------------------------|
| GEMM 4096x4096 | 1.2 GFLOPS | 12.7 GFLOPS |
| GEMV 4096x4096 | 0.8 GFLOPS | 6.4 GFLOPS |
| Flash Attention (seq=2048) | 850us | 320us |
| RMS Norm (4096) | 2.1us | 0.8us |
| RoPE (4096, 128) | 4.3us | 1.6us |

## RLM Usage Examples

### Basic Recursive Query

```rust
use ruvllm::rlm::{RlmController, RlmConfig, NativeEnvironment};

let controller = RlmController::<NativeEnvironment>::new(RlmConfig::default())?;

// Complex query gets automatically decomposed
let result = controller.query(
    "Compare the economic policies of keynesian and monetarist schools,
     and explain how each would respond to stagflation"
)?;

println!("Answer: {}", result.text);
println!("Sub-queries processed: {}", result.stats.sub_queries_count);
println!("Memory retrievals: {}", result.stats.memory_hits);
```

### With Memory Context

```rust
// Add domain knowledge to memory
controller.add_memory(
    "Keynesian economics emphasizes government intervention and aggregate demand",
    Default::default()
)?;
controller.add_memory(
    "Monetarism focuses on controlling money supply to manage inflation",
    Default::default()
)?;

// Query now uses memory context
let result = controller.query("What are the key differences between these economic schools?")?;
```

### Custom Decomposition Strategy

```rust
use ruvllm::rlm::{RecursiveConfigBuilder, AggregationStrategy, DecomposerStrategy};

let config = RecursiveConfigBuilder::new()
    .max_depth(3)
    .token_budget(8000)
    .decomposition_strategy(DecomposerStrategy::Semantic)
    .aggregation(AggregationStrategy::Summarize)
    .build()?;

let controller = RlmController::<NativeEnvironment>::new(config)?;
```

### Using Memory Pools for High-Throughput

```rust
use ruvllm::rlm::pool::{VectorPool, PoolManager};

// Create pre-warmed pools for embedding operations
let vector_pool = VectorPool::new_warmed(384, 64, 32);

// Or use the pool manager for convenience
let manager = PoolManager::warmed(384, 64, 128, 32);

// Acquire vectors from pool (zero allocation if pool has capacity)
let mut embedding = manager.vector_pool.acquire();
embedding.extend_from_slice(&query_embedding);

// Vector automatically returns to pool on drop
// Check pool statistics
let stats = manager.stats();
println!("Hit rate: {:.1}%", stats.overall_hit_rate() * 100.0);
println!("Allocations saved: {}", stats.total_allocations_saved());
```

### WASM Usage (Browser)

```rust
#[cfg(target_arch = "wasm32")]
use ruvllm::rlm::{WasmRlmController, WasmEnvironment};
#[cfg(target_arch = "wasm32")]
use wasm_bindgen::JsValue;

#[cfg(target_arch = "wasm32")]
async fn query_in_browser() -> Result<String, JsValue> {
    let controller = WasmRlmController::new(Default::default()).await?;
    let result = controller.query("What is machine learning?").await?;
    Ok(result.text)
}
```

## Apple Neural Engine (ANE) Integration

RuvLLM includes full ANE support via Core ML:

```rust
use ruvllm::backends::coreml::{CoreMLBackend, AneStrategy};

// Create ANE-optimized backend
let backend = CoreMLBackend::new(AneStrategy::PreferAneForMlp)?;

// Or use hybrid pipeline for best performance
use ruvllm::backends::{HybridPipeline, HybridConfig};

let pipeline = HybridPipeline::new(HybridConfig {
    ane_strategy: AneStrategy::Adaptive,
    gpu_for_attention: true,  // Attention on GPU
    ane_for_mlp: true,        // MLP/FFN on ANE
    ..Default::default()
})?;
```

### ANE Routing Recommendations

| Operation | Recommended | Reason |
|-----------|-------------|--------|
| Attention | GPU | Better for variable sequence lengths |
| Flash Attention | GPU | GPU memory bandwidth advantage |
| MLP/FFN | ANE | Optimal for fixed-size matmuls |
| GELU/SiLU | ANE | Dedicated activation units |
| LayerNorm/RMSNorm | ANE | Good for small dimensions |
| Embedding | GPU | Sparse operations |

## MicroLoRA Real-Time Adaptation

RuvLLM supports per-request fine-tuning using MicroLoRA:

```rust
use ruvllm::lora::{MicroLoRA, MicroLoraConfig, AdaptFeedback};

// Create MicroLoRA adapter
let config = MicroLoraConfig::for_hidden_dim(4096);
let lora = MicroLoRA::new(config);

// Adapt on user feedback
let feedback = AdaptFeedback::from_quality(0.9);
lora.adapt(&input_embedding, feedback)?;

// Apply learned updates
lora.apply_updates(0.01); // learning rate

// Get adaptation stats
let stats = lora.stats();
println!("Samples: {}, Avg quality: {:.2}", stats.samples, stats.avg_quality);
```

## SONA Three-Tier Learning

Continuous improvement with three learning loops:

```rust
use ruvllm::optimization::{SonaLlm, SonaLlmConfig, ConsolidationStrategy, OptimizationTrigger};

let config = SonaLlmConfig {
    instant_lr: 0.01,
    background_interval_ms: 100,
    deep_trigger_threshold: 100.0,
    consolidation_strategy: ConsolidationStrategy::EwcMerge,
    ..Default::default()
};

let sona = SonaLlm::new(config);

// 1. Instant Loop (<1ms): Per-request MicroLoRA
let result = sona.instant_adapt("user query", "model response", 0.85);
println!("Instant adapt: {}us", result.latency_us);

// 2. Background Loop (~100ms): Pattern consolidation
if let Some(result) = sona.maybe_background() {
    if result.applied {
        println!("Consolidated {} samples", result.samples_used);
    }
}

// 3. Deep Loop (minutes): Full optimization
if sona.should_trigger_deep() {
    let result = sona.deep_optimize(OptimizationTrigger::QualityThreshold(100.0));
    println!("Deep optimization: {:.1}s", result.latency_us as f64 / 1_000_000.0);
}
```

## Two-Tier KV Cache

Memory-efficient caching with automatic tiering:

```rust
use ruvllm::kv_cache::{TwoTierKvCache, KvCacheConfig, Precision};

let config = KvCacheConfig {
    tail_length: 256,              // Recent tokens in FP16
    tail_precision: Precision::FP16,
    store_precision: Precision::Q4,  // Older tokens in Q4
    max_tokens: 8192,
    num_layers: 32,
    num_kv_heads: 8,
    head_dim: 128,
};

let cache = TwoTierKvCache::new(config);
cache.append(&keys, &values)?;

// Automatic migration from tail to quantized store
let stats = cache.stats();
println!("Tail: {} tokens, Store: {} tokens", stats.tail_tokens, stats.store_tokens);
println!("Compression ratio: {:.2}x", stats.compression_ratio);
```

## HuggingFace Hub Integration

Download and upload models to HuggingFace Hub:

```rust
use ruvllm::hub::{ModelDownloader, ModelUploader, RuvLtraRegistry, DownloadConfig};

// Download from Hub
let downloader = ModelDownloader::new(DownloadConfig::default());
let model_path = downloader.download(
    "ruvector/ruvltra-small-q4km",
    Some("./models"),
)?;

// Or use the registry for RuvLTRA models
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium", "Q4_K_M")?;

// Upload to Hub (requires HF_TOKEN)
let uploader = ModelUploader::new("hf_your_token");
let url = uploader.upload(
    "./my-model.gguf",
    "username/my-ruvltra-model",
    Some(metadata),
)?;
println!("Uploaded to: {}", url);
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `RUVLLM_CACHE_DIR` | Model cache directory | `~/.cache/ruvllm` |
| `RUVLLM_LOG_LEVEL` | Logging level | `info` |
| `RUVLLM_METAL_DEVICE` | Metal device index | `0` |
| `RUVLLM_ANE_ENABLED` | Enable ANE routing | `true` |
| `RUVLLM_SONA_ENABLED` | Enable SONA learning | `true` |
| `HF_TOKEN` | HuggingFace API token | - |
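
These are ordinary process environment variables. A small sketch of how a host application might read them with the documented defaults (the parsing here is illustrative; ruvllm itself consumes these variables internally):

```rust
use std::env;

fn main() {
    // Fall back to the documented defaults when a variable is unset.
    let log_level = env::var("RUVLLM_LOG_LEVEL").unwrap_or_else(|_| "info".into());
    let ane_enabled = env::var("RUVLLM_ANE_ENABLED")
        .map(|v| v != "false")
        .unwrap_or(true);
    let device: u32 = env::var("RUVLLM_METAL_DEVICE")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0);
    println!("log={log_level} ane={ane_enabled} metal_device={device}");
}
```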

### Model Configuration

```rust
let config = ModelConfig {
    max_context: 8192,
    use_flash_attention: true,
    quantization: Quantization::Q4K,
    kv_cache_config: KvCacheConfig::default(),
    rope_scaling: Some(RopeScaling::Linear { factor: 2.0 }),
    sliding_window: Some(4096),
    ..Default::default()
};
```

## Benchmarks

Run benchmarks with:

```bash
# Attention benchmarks
cargo bench --bench attention_bench --features inference-metal

# ANE benchmarks (Mac only)
cargo bench --bench ane_bench --features coreml

# LoRA benchmarks
cargo bench --bench lora_bench

# RLM benchmarks
cargo bench --bench rlm_bench --features rlm-full

# End-to-end inference
cargo bench --bench e2e_bench --features inference-metal

# Metal shader benchmarks
cargo bench --bench metal_bench --features metal-compute

# Serving benchmarks
cargo bench --bench serving_bench --features inference-metal

# RuvLTRA router benchmarks
cargo bench --bench ruvltra_benchmark
```

## npm Package

RuvLLM is also available as an npm package with native bindings:

```bash
npm install @ruvector/ruvllm
```

```typescript
import { RuvLLM } from '@ruvector/ruvllm';

const llm = new RuvLLM();
const response = llm.query('Explain quantum computing');
console.log(response.text);
```

See [@ruvector/ruvllm on npm](https://www.npmjs.com/package/@ruvector/ruvllm) for full documentation.

## Error Handling

```rust
use ruvllm::error::{Result, RuvLLMError};

match backend.generate(prompt, params) {
    Ok(response) => println!("{}", response),
    Err(RuvLLMError::Model(e)) => eprintln!("Model error: {}", e),
    Err(RuvLLMError::OutOfMemory(e)) => eprintln!("OOM: {}", e),
    Err(RuvLLMError::Generation(e)) => eprintln!("Generation failed: {}", e),
    Err(RuvLLMError::Ane(e)) => eprintln!("ANE error: {}", e),
    Err(RuvLLMError::Gguf(e)) => eprintln!("GGUF loading error: {}", e),
    Err(e) => eprintln!("Error: {}", e),
}
```

## License

Apache-2.0 / MIT dual license.

## Contributing

Contributions welcome! Please see [CONTRIBUTING.md](../../CONTRIBUTING.md) for guidelines.

## Links

- [GitHub Repository](https://github.com/ruvnet/ruvector)
- [API Documentation](https://docs.rs/ruvllm)
- [npm Package](https://www.npmjs.com/package/@ruvector/ruvllm)
- [Issue Tracker](https://github.com/ruvnet/ruvector/issues)
- [crates.io](https://crates.io/crates/ruvllm)
- [HuggingFace Models](https://huggingface.co/ruv/ruvltra)