llama-rs 0.17.0

A high-performance Rust implementation of llama.cpp - LLM inference engine with full GGUF support
# llama-rs

A high-performance Rust implementation of [llama.cpp](https://github.com/ggerganov/llama.cpp) - an LLM inference engine with full GGUF and ONNX support.

[![Crates.io](https://img.shields.io/crates/v/llama-rs.svg)](https://crates.io/crates/llama-rs)
[![License](https://img.shields.io/crates/l/llama-rs.svg)](LICENSE-MIT)

## Features

- **Full GGUF Support** - Load any GGUF model file compatible with llama.cpp
- **ONNX Support** - Load HuggingFace Optimum ONNX exports (F32, F16, BF16 with auto-conversion)
- **Multiple Architectures** - LLaMA, Mistral, Qwen2, Qwen3/Qwen3Next, Mixtral, TinyLlama, DeepSeek, and more
- **Quantization** - K-quant formats (Q2_K through Q6_K) plus Q8_0, F16, and F32
- **HuggingFace Integration** - Download models directly from HuggingFace Hub
- **Fast CPU Inference** - SIMD-optimized (AVX2, AVX-512, NEON)
- **GPU Inference** - Full GPU-resident inference on CUDA; Metal, DX12, and Vulkan through the `Backend` trait
- **Mixture of Experts** - MoE support with top-k routing (Mixtral, Qwen3Moe, DeepSeek)
- **DeltaNet/SSM** - Gated DeltaNet recurrent layers for hybrid attention/SSM models (Qwen3Next)
- **Distributed Inference** - Pipeline-parallel inference across multiple nodes via gRPC
- **RAG** - Retrieval-Augmented Generation with PostgreSQL/pgvector vector store
- **OpenAI-compatible API** - HTTP server with streaming support
- **Grouped Query Attention** - Efficient KV cache for GQA models
- **Streaming Output** - Token-by-token generation
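
The Grouped Query Attention entry above can be made concrete with a small sketch. Several query heads share one key/value head, which shrinks the KV cache by the ratio of query heads to KV heads. The function and parameter names below are illustrative assumptions, not part of the llama-rs API.

```rust
/// Map a query head index to the KV head it shares under grouped-query attention.
/// Assumes `n_heads` is a multiple of `n_kv_heads` (hypothetical helper).
fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    let group_size = n_heads / n_kv_heads;
    q_head / group_size
}

/// KV cache size in bytes for one sequence: keys plus values,
/// for every layer, KV head, and context position (hypothetical helper).
fn kv_cache_bytes(
    n_layers: usize,
    n_kv_heads: usize,
    head_dim: usize,
    n_ctx: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
}
```

With 32 query heads and 8 KV heads, query heads 4 through 7 all read KV head 1, and the cache is a quarter of the size it would be with one KV head per query head.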

## Installation

### From crates.io

```bash
cargo install llama-rs
```

### From Source

```bash
git clone https://github.com/Lexmata/llama-rs.git
cd llama-rs
cargo build --release
```

The binary will be at `target/release/llama-rs`.

### System Installation with Man Pages

**Option 1: Using cargo install (generates man pages from CLI)**

```bash
cargo install llama-rs

# Generate and install man pages
llama-rs manpages ~/.local/share/man/man1
mandb -u

# Or system-wide (requires sudo)
sudo llama-rs manpages /usr/local/share/man/man1
sudo mandb
```

**Option 2: Using make (includes detailed hand-written man pages)**

```bash
git clone https://github.com/Lexmata/llama-rs.git
cd llama-rs

# Build and install to /usr/local (requires sudo)
sudo make install

# Or install to a custom prefix
make PREFIX=~/.local install

# Install man pages only
sudo make install-man
```

After installation, access documentation with:

```bash
man llama-rs           # Main command overview
man llama-rs-run       # Run inference
man llama-rs-chat      # Interactive chat
man llama-rs-serve     # HTTP server
man llama-rs-rag       # RAG operations
```

### As a Library

```toml
[dependencies]
llama-rs = "0.17"
```

## Quick Start

### Download a Model

```bash
# List available files in a repository
llama-rs download Qwen/Qwen2.5-0.5B-Instruct-GGUF

# Download a specific quantized model
llama-rs download Qwen/Qwen2.5-0.5B-Instruct-GGUF -f qwen2.5-0.5b-instruct-q4_k_m.gguf
```

### Run Inference

```bash
# Basic text generation (GGUF)
llama-rs run model.gguf -p "Hello, world!" -n 50

# ONNX model (requires config.json and tokenizer.json in the same directory)
llama-rs run model.onnx -p "Hello, world!" -n 50

# With sampling parameters
llama-rs run model.gguf -p "Once upon a time" -n 100 --temperature 0.8 --top-k 40

# Deterministic output (greedy sampling)
llama-rs run model.gguf -p "1+1=" -n 5 --temperature 0
```

### Model Information

```bash
llama-rs info model.gguf
llama-rs info model.onnx
```

## Supported Models

| Model Family | Status | Notes |
|--------------|--------|-------|
| LLaMA/LLaMA2/LLaMA3 | Supported | Full support |
| Mistral | Supported | Use `[INST]...[/INST]` format |
| Qwen2/Qwen2.5 | Supported | Includes attention biases |
| Qwen3 | Supported | Dense model with QK norm, partial RoPE |
| Qwen3Moe | Supported | MoE with top-k expert routing |
| Qwen3Next | Supported | Hybrid attention + DeltaNet recurrent layers |
| Mixtral | Supported | MoE with top-2 expert routing |
| TinyLlama | Supported | GQA support |
| DeepSeek-Coder | Supported | Linear RoPE scaling |
| CodeLlama | Supported | LLaMA-based |
| Yi | Supported | LLaMA-based |

See [MODEL_COMPATIBILITY.md](docs/MODEL_COMPATIBILITY.md) for detailed compatibility information.
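
The top-k expert routing mentioned for the MoE models in the table can be sketched as follows. This is a minimal illustration that assumes a softmax over all router logits, followed by top-k selection and renormalization. Models differ in the exact order of these steps, and `route_top_k` is a hypothetical name rather than a llama-rs function.

```rust
/// Select the top-k experts for one token and renormalize their weights.
/// `router_logits` holds one entry per expert.
/// Returns (expert index, weight) pairs, best expert first.
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Softmax over all experts (subtract the max for numerical stability).
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();

    // Sort by probability, descending, and keep the k best.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);

    // Renormalize so the selected weights sum to 1.
    let kept: f32 = probs.iter().map(|p| p.1).sum();
    probs.iter().map(|&(i, p)| (i, p / kept)).collect()
}
```

With `k = 2` this corresponds to the top-2 routing listed for Mixtral: each token's output is the weighted sum of the two selected experts' outputs.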

## Quantization Formats

| Format | Bits | Quality | Size (7B) |
|--------|------|---------|-----------|
| Q2_K | 2 | Low | ~2.5 GB |
| Q3_K | 3 | Fair | ~3.0 GB |
| Q4_K_M | 4 | Good | ~4.0 GB |
| Q5_K_M | 5 | Better | ~5.0 GB |
| Q6_K | 6 | High | ~5.5 GB |
| Q8_0 | 8 | Excellent | ~7.0 GB |
| F16 | 16 | Full | ~14 GB |
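
The sizes in the table follow roughly from parameters times bits per weight. The sketch below shows the arithmetic, and `approx_size_gb` is a hypothetical helper. It uses the nominal bit width, so it is only a lower bound for block-quantized formats, which also store per-block scales.

```rust
/// Rough model file size in GB (10^9 bytes): parameters x bits per weight / 8.
fn approx_size_gb(n_params: f64, bits_per_weight: f64) -> f64 {
    n_params * bits_per_weight / 8.0 / 1e9
}
```

For a 7B model, `approx_size_gb(7e9, 16.0)` gives 14.0, matching the F16 row, while `approx_size_gb(7e9, 4.0)` gives 3.5, a little under the ~4.0 GB listed for Q4_K_M because of the extra per-block metadata.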

## Feature Flags

| Feature | Default | Description |
|---------|---------|-------------|
| `cpu` | Yes | CPU backend with SIMD (AVX2, AVX-512, NEON) |
| `huggingface` | Yes | HuggingFace Hub model downloading |
| `cli` | Yes | Command-line interface |
| `client` | Yes | HTTP client for remote inference |
| `onnx` | Yes | ONNX model loading via HuggingFace Optimum |
| `cuda` | No | NVIDIA GPU acceleration via CUDA |
| `metal` | No | Apple Silicon GPU acceleration via Metal |
| `dx12` | No | Windows GPU acceleration via DirectX 12 |
| `vulkan` | No | Cross-platform GPU acceleration via Vulkan |
| `server` | No | HTTP server with OpenAI-compatible API |
| `rag` | No | RAG with PostgreSQL/pgvector vector store |
| `distributed` | No | Pipeline-parallel inference via gRPC |

## GPU Acceleration

### CUDA (NVIDIA GPUs)

```bash
CUDA_PATH=/opt/cuda cargo build --release --features cuda
llama-rs run model.gguf -p "Hello" --gpu
```

Requires an NVIDIA GPU with compute capability 6.0+ and CUDA Toolkit 12.0+.

The CUDA backend provides full GPU-resident inference via `GpuOnlyInference`, keeping all weights, the KV cache, and intermediate tensors in VRAM. Custom kernels handle dequantization of quantized weights, fused RMS norm, RoPE, DeltaNet, and MoE expert dispatch entirely on the GPU.

### Metal (Apple Silicon / macOS)

```bash
cargo build --release --features metal
llama-rs run model.gguf -p "Hello" --gpu
```

Requires macOS with a Metal-capable GPU.

### DirectX 12 (Windows)

```bash
cargo build --release --features dx12
llama-rs run model.gguf -p "Hello" --gpu
```

Requires Windows 10+ with a DirectX 12 compatible GPU.

### Vulkan (Cross-platform)

```bash
cargo build --release --features vulkan
llama-rs run model.gguf -p "Hello" --gpu
```

Requires the Vulkan SDK and a Vulkan-capable GPU.

**GPU-accelerated operations (all backends):**
- Element-wise: add, mul, scale
- Activations: SiLU, GELU
- Normalization: RMS norm
- Softmax
- RoPE positional embeddings
- Vector-matrix multiplication (f32)

**CUDA-exclusive operations:**
- Dequantization of quantized weights (Q4_K_M, Q6_K, Q8_0, etc.) on GPU
- Fused RMS norm kernels
- DeltaNet recurrent layer kernels
- MoE expert routing and dispatch
- KV cache management on GPU
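
For reference, three of the operations listed for all backends have simple scalar definitions. The sketch below is a plain CPU version meant only to show what the kernels compute. It is not the llama-rs implementation, and the function names are assumptions.

```rust
/// RMS norm: x_i * w_i / sqrt(mean(x^2) + eps).
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Numerically stable softmax: subtract the max before exponentiating.
fn softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

A GPU kernel computes the same values but parallelizes the reductions (the mean of squares, the max, and the sum) across threads.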

## RAG (Retrieval-Augmented Generation)

pgvector-backed vector store for retrieval-augmented generation. Enable with `--features rag`.

### Setup

Requires PostgreSQL with the [pgvector](https://github.com/pgvector/pgvector) extension:

```bash
# Docker (quickstart)
docker run -d --name pgvector -p 5432:5432 \
  -e POSTGRES_PASSWORD=password \
  pgvector/pgvector:pg16
```

### Library Usage

```rust
use llama_rs::{RagConfig, RagStore, NewDocument};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = RagConfig::new("postgresql://user:pass@localhost/mydb")
        .with_table_name("documents")
        .with_dimensions(384);

    let store = RagStore::new(config).await?;
    store.create_table().await?;

    // Insert documents
    let doc = NewDocument {
        content: "Rust is a systems programming language.".into(),
        embedding: vec![0.1; 384],
        metadata: Some(serde_json::json!({"topic": "rust"})),
    };
    store.insert(&doc).await?;

    // Semantic search
    let query_embedding = vec![0.1; 384];
    let results = store.search(&query_embedding, 10, None).await?;

    for result in results {
        println!("{}: {}", result.score, result.content);
    }

    Ok(())
}
```

### Features

- **Search modes**: Semantic (vector), keyword (tsvector), and hybrid with Reciprocal Rank Fusion
- **Distance metrics**: Cosine similarity, L2 distance, inner product
- **Indexing**: HNSW and IVFFlat with configurable parameters
- **Metadata filtering**: Eq, In, Range, Contains, and compound AND/OR/NOT filters
- **KnowledgeBase**: High-level API for document ingestion, chunking, and retrieve-and-generate
- **Configuration**: TOML files with environment variable overrides
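
Hybrid search merges the semantic and keyword result lists with Reciprocal Rank Fusion. A minimal sketch of the general technique follows, assuming documents are identified by `u64` ids and using the commonly chosen constant `k = 60`. The function name and signature are illustrative and do not come from the llama-rs API.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of document ids with Reciprocal Rank Fusion:
/// score(d) = sum over lists of 1 / (k + rank), with rank starting at 1.
fn reciprocal_rank_fusion(rankings: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for ranking in rankings {
        for (i, &doc_id) in ranking.iter().enumerate() {
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest score first; break ties by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Because only ranks are used, the vector distances and the keyword scores never need to be put on a common scale. A document that appears near the top of both lists outranks one that tops only a single list.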

### CLI

```bash
# Ingest documents
llama-rs rag ingest --config rag.toml --source ./docs/

# Search
llama-rs rag search --config rag.toml --query "How does authentication work?"
```

## Council of Experts

Run a multi-round debate across network-connected agents and synthesize a single final answer. Each round, every expert answers in parallel; refinement rounds feed peer answers (anonymized) back to each expert. The council stops early on convergence (cosine similarity over answer embeddings) or when the round cap is hit. A dedicated synthesizer produces the final response.

Agents may be llama-rs servers over gRPC *or* any OpenAI-compatible HTTP endpoint — the existing `llama-rs serve` command works out of the box as a council agent.
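
The early-stop check can be pictured as a cosine-similarity comparison between the experts' answer embeddings. The sketch below assumes the council counts as converged when every pair of answers meets the threshold. The source does not state how similarities are aggregated, so that rule and both function names are assumptions.

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // treat a zero vector as dissimilar to everything
    }
    dot / (norm_a * norm_b)
}

/// Assumed rule: converged when every pair of answers is at least
/// `threshold` similar.
fn has_converged(embeddings: &[Vec<f32>], threshold: f32) -> bool {
    for i in 0..embeddings.len() {
        for j in (i + 1)..embeddings.len() {
            if cosine_similarity(&embeddings[i], &embeddings[j]) < threshold {
                return false;
            }
        }
    }
    true
}
```

Under this rule a higher threshold demands closer agreement between experts and therefore tends to run more refinement rounds before stopping.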

**Enable the feature:**

```toml
llama-rs = { version = "0.17", features = ["council"] }
```

**Example config (`council.toml`):**

```toml
min_rounds = 2
max_rounds = 4
convergence_threshold = 0.92

[embedder]
kind = "local_gguf"
path = "/path/to/all-MiniLM-L6-v2.gguf"

[[agent]]
role = "expert"
endpoint = "http://localhost:8080"
model = "qwen2.5-0.5b"
timeout_ms = 30000

[[agent]]
role = "expert"
endpoint = "http://localhost:8081"
model = "llama-3.2-1b"
timeout_ms = 30000

[[agent]]
role = "synthesizer"
endpoint = "http://localhost:8082"
model = "deepseek-v3"
timeout_ms = 60000
```

**CLI:**

```bash
# Final answer only
llama-rs council --config council.toml -p "Summarize this contract."

# Full structured transcript as NDJSON
llama-rs council --config council.toml -p "..." --transcript | jq .
```

**HTTP (orchestrator mode):** With the council feature enabled, `llama-rs serve` exposes two extra endpoints:

- `POST /v1/council/completions` — returns `{ "answer": "..." }` as JSON
- `POST /v1/council/transcript` — SSE stream of structured events (`expert_token`, `round_completed`, `convergence_check`, `final_token`, ...)

Both accept a body of the form `{ "prompt": "...", "config_toml": "..." }`.

**Known limitation (v1):** llama-rs agents are reached via the OpenAI HTTP fallback, not native gRPC. A dedicated gRPC `Council` server lands in a follow-up — the orchestrator side is already in place.

## ONNX Support

llama-rs can load models exported to ONNX format via [HuggingFace Optimum](https://huggingface.co/docs/optimum/). ONNX support is enabled by default.

**Supported formats:**
- F32, F16, and BF16 weight tensors (F16/BF16 auto-converted to F32)
- External data files (`.onnx_data`) for large models
- Graph-traced tensor name resolution for Optimum exports
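
The F16/BF16 to F32 conversion is a pure bit manipulation. The sketch below shows the standard widening for both formats from their raw 16-bit patterns. It illustrates the number formats and is not the loader's actual code, and the function names are assumptions.

```rust
/// Widen a BF16 value (raw bits) to f32: BF16 is the top 16 bits of an f32.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Widen an IEEE 754 half-precision value (raw bits) to f32.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let out = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal: value = frac * 2^-24
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, 0) => (sign << 31) | 0x7f80_0000, // infinity
        (0x1f, _) => (sign << 31) | 0x7fc0_0000 | (frac << 13), // NaN
        // Normal number: rebias the exponent from 15 to 127.
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(out)
}
```

Both conversions are exact, since every F16 and BF16 value is representable in F32. The cost is memory: the converted weights take twice the space of the 16-bit originals.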

**Requirements:**

An ONNX model directory must contain:
- `model.onnx` — the model graph and weights
- `config.json` — HuggingFace model configuration
- `tokenizer.json` — HuggingFace tokenizer

**Exporting a model to ONNX:**

```bash
pip install "optimum[onnxruntime]"
optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 ./tinyllama-onnx/
```

```bash
llama-rs run ./tinyllama-onnx/model.onnx -p "Hello!" -n 50
```

## Library Usage

```rust
use llama_rs::{
    backend::cpu::CpuBackend,
    gguf::GgufFile,
    model::{load_llama_model, InferenceContext},
    sampling::Sampler,
    tokenizer::Tokenizer,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load model
    let model = load_llama_model("model.gguf")?;
    let gguf = GgufFile::open("model.gguf")?;
    let tokenizer = Tokenizer::from_gguf(&gguf)?;
    
    // Setup inference
    let backend = CpuBackend::new();
    let mut ctx = InferenceContext::new(model.config(), Box::new(backend));
    let sampler = Sampler::new(0.8, 40, 0.9); // temperature, top_k, top_p
    
    // Encode prompt
    let tokens = tokenizer.encode("Hello, world!", true)?;
    
    // Generate: the first step feeds the whole prompt, later steps feed only the new token
    let mut output_tokens = tokens.clone();
    let mut n_evaluated = 0;
    for _ in 0..50 {
        let logits = model.forward(&output_tokens[n_evaluated..], &mut ctx)?;
        n_evaluated = output_tokens.len();
        let next_token = sampler.sample(&logits, &output_tokens);
        output_tokens.push(next_token);

        // Decode and print
        if let Ok(text) = tokenizer.decode(&[next_token]) {
            print!("{}", text);
        }
    }
    
    Ok(())
}
```

## CLI Reference

```
llama-rs <COMMAND>

Commands:
  info         Display model information
  run          Run inference on a model
  chat         Interactive chat mode
  serve        Start HTTP server (with --features server)
  quantize     Quantize a model
  bench        Benchmark model performance
  embed        Extract embeddings
  download     Download a model from HuggingFace Hub
  models       Manage cached models
  rag          RAG operations (with --features rag)
  init-config  Generate example config file
  manpages     Generate and install man pages
  help         Print help

Run Options:
  -p, --prompt <PROMPT>      Input prompt
  -n, --max-tokens <N>       Maximum tokens to generate [default: 128]
  -t, --temperature <T>      Sampling temperature [default: 0.8]
  -k, --top-k <K>            Top-k sampling [default: 40]
      --top-p <P>            Top-p (nucleus) sampling [default: 0.9]
      --repeat-penalty <R>   Repetition penalty [default: 1.1]
  -s, --seed <SEED>          Random seed for reproducibility
      --gpu                  Use GPU acceleration (requires GPU feature)
```
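
To show how the sampling options interact, here is a minimal sketch that turns logits into a probability distribution using a repetition penalty, temperature, and top-k. It assumes a temperature of 0 means greedy selection and `top_k = 0` disables the filter. Top-p is left out for brevity. `sampling_probs` is a hypothetical function, not the crate's `Sampler`.

```rust
/// Convert raw logits into sampling probabilities.
/// Tokens outside the top k get probability 0.
fn sampling_probs(
    logits: &[f32],
    recent: &[usize],
    temperature: f32,
    top_k: usize,
    repeat_penalty: f32,
) -> Vec<f32> {
    let mut l = logits.to_vec();
    let mut probs = vec![0.0; l.len()];
    if l.is_empty() {
        return probs;
    }

    // Penalize each recently generated token once: shrink positive logits,
    // push negative logits further down.
    let mut seen = vec![false; l.len()];
    for &t in recent {
        if t < l.len() && !seen[t] {
            seen[t] = true;
            l[t] = if l[t] > 0.0 { l[t] / repeat_penalty } else { l[t] * repeat_penalty };
        }
    }

    // Token indices ordered by logit, highest first.
    let mut order: Vec<usize> = (0..l.len()).collect();
    order.sort_by(|&a, &b| l[b].total_cmp(&l[a]));

    // Temperature 0: greedy, all probability mass on the best token.
    if temperature <= 0.0 {
        probs[order[0]] = 1.0;
        return probs;
    }

    // Softmax with temperature over the k highest logits only.
    let keep = if top_k == 0 { l.len() } else { top_k.min(l.len()) };
    let max = l[order[0]];
    let mut sum = 0.0;
    for &i in &order[..keep] {
        probs[i] = ((l[i] - max) / temperature).exp();
        sum += probs[i];
    }
    for p in probs.iter_mut() {
        *p /= sum;
    }
    probs
}
```

Lower temperatures sharpen the distribution toward the highest logit, while a smaller `top_k` removes the long tail of unlikely tokens before any random draw is made.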

## Performance

Benchmarked on Intel i9-13900K (24 cores, AVX2) with 64GB RAM:

| Model | Quantization | Tokens/sec | Notes |
|-------|--------------|------------|-------|
| Qwen2.5-0.5B | Q4_K_M | ~1.2 t/s | 896 hidden dim |
| TinyLlama-1.1B | Q4_K_M | ~1.5 t/s | 2048 hidden dim |
| Mistral-7B | Q4_K_M | ~0.3 t/s | 4096 hidden dim |

*Current implementation prioritizes correctness over speed. Performance optimizations (batch processing, better SIMD utilization) are planned.*

Performance varies by hardware, model size, context length, and quantization.

## Contributing

Contributions are welcome! Please see [AGENTS.md](AGENTS.md) for development guidelines.

## License

Licensed under either of:

- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
- MIT License ([LICENSE-MIT](LICENSE-MIT))

at your option.

## Acknowledgments

- [llama.cpp](https://github.com/ggerganov/llama.cpp) - The original implementation
- [GGML](https://github.com/ggerganov/ggml) - Tensor library and GGUF format
- [pgvector](https://github.com/pgvector/pgvector) - PostgreSQL vector similarity search

---

**Lexmata LLC** - [jquinn@lexmata.ai](mailto:jquinn@lexmata.ai)