llama-rs 0.17.0

# llama-rs

A high-performance Rust implementation of [llama.cpp](https://github.com/ggerganov/llama.cpp) - an LLM inference engine with full GGUF and ONNX support.

[![Crates.io](https://img.shields.io/crates/v/llama-rs.svg)](https://crates.io/crates/llama-rs)
[![License](https://img.shields.io/crates/l/llama-rs.svg)](LICENSE-MIT)

## Features

- **Full GGUF Support** - Load any GGUF model file compatible with llama.cpp
- **ONNX Support** - Load HuggingFace Optimum ONNX exports (F32, F16, BF16 with auto-conversion)
- **Multiple Architectures** - LLaMA, Mistral, Qwen2, Qwen3/Qwen3Next, Mixtral, TinyLlama, DeepSeek, and more
- **Quantization** - K-quant formats (Q2_K through Q6_K) and Q8_0, plus F16/F32
- **HuggingFace Integration** - Download models directly from HuggingFace Hub
- **Fast CPU Inference** - SIMD-optimized (AVX2, AVX-512, NEON)
- **GPU Inference** - Full GPU-resident inference on CUDA; Metal, DX12, and Vulkan via the `Backend` trait
- **Mixture of Experts** - MoE support with top-k routing (Mixtral, Qwen3Moe, DeepSeek)
- **DeltaNet/SSM** - Gated DeltaNet recurrent layers for hybrid attention/SSM models (Qwen3Next)
- **Distributed Inference** - Pipeline-parallel inference across multiple nodes via gRPC
- **RAG** - Retrieval-Augmented Generation with PostgreSQL/pgvector vector store
- **OpenAI-compatible API** - HTTP server with streaming support
- **Grouped Query Attention** - Efficient KV cache for GQA models
- **Streaming Output** - Token-by-token generation

## Installation

### From crates.io

```bash
cargo install llama-rs
```

### From Source

```bash
git clone https://github.com/Lexmata/llama-rs.git
cd llama-rs
cargo build --release
```

The binary will be at `target/release/llama-rs`.

### System Installation with Man Pages

**Option 1: Using cargo install (generates man pages from CLI)**

```bash
cargo install llama-rs

# Generate and install man pages
llama-rs manpages ~/.local/share/man/man1
mandb -u

# Or system-wide (requires sudo)
sudo llama-rs manpages /usr/local/share/man/man1
sudo mandb
```

**Option 2: Using make (includes detailed hand-written man pages)**

```bash
git clone https://github.com/Lexmata/llama-rs.git
cd llama-rs

# Build and install to /usr/local (requires sudo)
sudo make install

# Or install to a custom prefix
make PREFIX=~/.local install

# Install man pages only
sudo make install-man
```

After installation, access documentation with:

```bash
man llama-rs           # Main command overview
man llama-rs-run       # Run inference
man llama-rs-chat      # Interactive chat
man llama-rs-serve     # HTTP server
man llama-rs-rag       # RAG operations
```

### As a Library

```toml
[dependencies]
llama-rs = "0.17"
```

## Quick Start

### Download a Model

```bash
# List available files in a repository
llama-rs download Qwen/Qwen2.5-0.5B-Instruct-GGUF

# Download a specific quantized model
llama-rs download Qwen/Qwen2.5-0.5B-Instruct-GGUF -f qwen2.5-0.5b-instruct-q4_k_m.gguf
```

### Run Inference

```bash
# Basic text generation (GGUF)
llama-rs run model.gguf -p "Hello, world!" -n 50

# ONNX model (requires config.json and tokenizer.json in same directory)
llama-rs run model.onnx -p "Hello, world!" -n 50

# With sampling parameters
llama-rs run model.gguf -p "Once upon a time" -n 100 --temperature 0.8 --top-k 40

# Deterministic output (greedy sampling)
llama-rs run model.gguf -p "1+1=" -n 5 --temperature 0
```
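
The `--temperature` and `--top-k` flags correspond to a standard sampling procedure. The sketch below is illustrative only and is not llama-rs's actual sampler: the function name `sample_top_k` and the caller-supplied uniform random number `u` are assumptions made to keep the example dependency-free and deterministic.

```rust
/// Pick a token id from raw logits using temperature and top-k sampling.
/// `u` is a uniform random number in [0, 1) supplied by the caller.
/// Hypothetical helper for illustration; not part of the llama-rs API.
fn sample_top_k(logits: &[f32], temperature: f32, top_k: usize, u: f32) -> usize {
    assert!(!logits.is_empty(), "logits must not be empty");

    // Sort token ids by logit, highest first.
    let mut ids: Vec<usize> = (0..logits.len()).collect();
    ids.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));

    // Temperature 0 means greedy decoding: always take the most likely token.
    if temperature <= 0.0 {
        return ids[0];
    }

    // Keep only the top-k candidates (at least one).
    ids.truncate(top_k.clamp(1, logits.len()));

    // Softmax over the temperature-scaled logits, subtracting the max for stability.
    let max = logits[ids[0]];
    let weights: Vec<f32> = ids
        .iter()
        .map(|&i| ((logits[i] - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();

    // Walk the cumulative distribution until it passes `u`.
    let mut acc = 0.0;
    for (&id, &w) in ids.iter().zip(&weights) {
        acc += w / total;
        if u < acc {
            return id;
        }
    }
    // Fallback in case rounding keeps `acc` slightly below 1.0.
    *ids.last().unwrap()
}
```

Lower temperatures sharpen the distribution toward the highest logit, which is why `--temperature 0` yields deterministic output.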

### Model Information

```bash
llama-rs info model.gguf
llama-rs info model.onnx
```

## Supported Models

| Model Family | Status | Notes |
|--------------|--------|-------|
| LLaMA/LLaMA2/LLaMA3 | ✅ | Full support |
| Mistral | ✅ | Use `[INST]...[/INST]` format |
| Qwen2/Qwen2.5 | ✅ | Includes attention biases |
| Qwen3 | ✅ | Dense model with QK norm, partial RoPE |
| Qwen3Moe | ✅ | MoE with top-k expert routing |
| Qwen3Next | ✅ | Hybrid attention + DeltaNet recurrent layers |
| Mixtral | ✅ | MoE with top-2 expert routing |
| TinyLlama | ✅ | GQA support |
| DeepSeek-Coder | ✅ | Linear RoPE scaling |
| CodeLlama | ✅ | LLaMA-based |
| Yi | ✅ | LLaMA-based |

See [MODEL_COMPATIBILITY.md](docs/MODEL_COMPATIBILITY.md) for detailed compatibility information.
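
The MoE rows above refer to top-k expert routing. As a rough sketch of the idea (not the crate's implementation), a router produces one logit per expert, the `k` highest-scoring experts are selected, and their outputs are mixed with softmax weights. The function names are hypothetical, and normalizing over only the selected experts is an assumption; the exact normalization differs between model families.

```rust
/// Select the `k` highest-scoring experts and return (expert index, weight)
/// pairs whose weights sum to 1. Hypothetical helper, not the llama-rs API.
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Rank experts by router logit, highest first.
    let mut ids: Vec<usize> = (0..router_logits.len()).collect();
    ids.sort_by(|&a, &b| router_logits[b].total_cmp(&router_logits[a]));
    ids.truncate(k.min(router_logits.len()));
    if ids.is_empty() {
        return Vec::new();
    }

    // Softmax over the selected logits only (assumed normalization).
    let max = router_logits[ids[0]];
    let exps: Vec<f32> = ids.iter().map(|&i| (router_logits[i] - max).exp()).collect();
    let total: f32 = exps.iter().sum();
    ids.into_iter().zip(exps).map(|(i, e)| (i, e / total)).collect()
}

/// Combine expert outputs as a weighted sum: y = sum_i weight_i * expert_i(x).
/// For simplicity `expert_outputs` is indexed by expert id; a real engine
/// would only evaluate the experts that were actually selected.
fn mix_expert_outputs(routes: &[(usize, f32)], expert_outputs: &[Vec<f32>]) -> Vec<f32> {
    let dim = expert_outputs.first().map_or(0, |v| v.len());
    let mut out = vec![0.0; dim];
    for &(expert, weight) in routes {
        for (o, &v) in out.iter_mut().zip(&expert_outputs[expert]) {
            *o += weight * v;
        }
    }
    out
}
```

With `k = 2` this mirrors the top-2 routing listed for Mixtral: only two experts contribute to each token, so compute per token stays far below the total parameter count.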

## Quantization Formats

| Format | Bits | Quality | Size (7B) |
|--------|------|---------|-----------|
| Q2_K | 2 | Low | ~2.5 GB |
| Q3_K | 3 | Fair | ~3.0 GB |
| Q4_K_M | 4 | Good | ~4.0 GB |
| Q5_K_M | 5 | Better | ~5.0 GB |
| Q6_K | 6 | High | ~5.5 GB |
| Q8_0 | 8 | Excellent | ~7.0 GB |
| F16 | 16 | Full | ~14 GB |
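
To give a feel for what a block-quantized format stores, here is a sketch of an 8-bit block scheme in the spirit of Q8_0: weights are grouped into blocks of 32, and each block keeps one scale plus 32 signed 8-bit values. This is a simplified illustration, not the crate's code. The type and function names are hypothetical, and the scale is kept as `f32` here, whereas GGUF stores it as a 16-bit float.

```rust
const BLOCK_SIZE: usize = 32;

/// One quantized block: a shared scale and 32 signed 8-bit values.
/// Hypothetical type for illustration only.
struct BlockQ8 {
    scale: f32, // GGUF's Q8_0 stores this as f16; f32 keeps the sketch std-only
    quants: [i8; BLOCK_SIZE],
}

/// Quantize 32 weights so that the largest magnitude maps to +/-127.
fn quantize_block(values: &[f32; BLOCK_SIZE]) -> BlockQ8 {
    let max_abs = values.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    let scale = max_abs / 127.0;
    // An all-zero block has scale 0; avoid dividing by it.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };

    let mut quants = [0i8; BLOCK_SIZE];
    for (q, v) in quants.iter_mut().zip(values) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { scale, quants }
}

/// Recover approximate f32 weights: w ~= q * scale.
fn dequantize_block(block: &BlockQ8) -> [f32; BLOCK_SIZE] {
    let mut out = [0.0_f32; BLOCK_SIZE];
    for (o, &q) in out.iter_mut().zip(&block.quants) {
        *o = q as f32 * block.scale;
    }
    out
}
```

The round-trip error of each weight is at most half a quantization step (`scale / 2`). Lower-bit formats follow the same idea with fewer bits per value and more elaborate scale layouts, trading accuracy for size.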

## Feature Flags

| Feature | Default | Description |
|---------|---------|-------------|
| `cpu` | ✅ | CPU backend with SIMD (AVX2, AVX-512, NEON) |
| `huggingface` | ✅ | HuggingFace Hub model downloading |
| `cli` | ✅ | Command-line interface |
| `client` | ✅ | HTTP client for remote inference |
| `onnx` | ✅ | ONNX model loading via HuggingFace Optimum |
| `cuda` | | NVIDIA GPU acceleration via CUDA |
| `metal` | | Apple Silicon GPU acceleration via Metal |
| `dx12` | | Windows GPU acceleration via DirectX 12 |
| `vulkan` | | Cross-platform GPU acceleration via Vulkan |
| `server` | | HTTP server with OpenAI-compatible API |
| `rag` | | RAG with PostgreSQL/pgvector vector store |
| `distributed` | | Pipeline-parallel inference via gRPC |
| `council` | | Council of Experts multi-agent debate and synthesis |

## GPU Acceleration

### CUDA (NVIDIA GPUs)

```bash
CUDA_PATH=/opt/cuda cargo build --release --features cuda
llama-rs run model.gguf -p "Hello" --gpu
```

Requires NVIDIA GPU with compute capability 6.0+ and CUDA Toolkit 12.0+.

The CUDA backend provides full GPU-resident inference via `GpuOnlyInference`, keeping all weights, KV cache, and intermediate tensors in VRAM. Custom kernels handle dequantization of quantized weights, fused RMS norm, RoPE, DeltaNet, and MoE expert dispatch entirely on the GPU.

### Metal (Apple Silicon / macOS)

```bash
cargo build --release --features metal
llama-rs run model.gguf -p "Hello" --gpu
```

Requires macOS with Metal-capable GPU.

### DirectX 12 (Windows)

```bash
cargo build --release --features dx12
llama-rs run model.gguf -p "Hello" --gpu
```

Requires Windows 10+ with a DirectX 12 compatible GPU.

### Vulkan (Cross-platform)

```bash
cargo build --release --features vulkan
llama-rs run model.gguf -p "Hello" --gpu
```

Requires Vulkan SDK and a Vulkan-capable GPU.

**GPU-accelerated operations (all backends):**
- Element-wise: add, mul, scale
- Activations: SiLU, GELU
- Normalization: RMS norm
- Softmax
- RoPE positional embeddings
- Vector-matrix multiplication (f32)

**CUDA-exclusive operations:**
- Dequantization of quantized weights (Q4_K_M, Q6_K, Q8_0, etc.) on GPU
- Fused RMS norm kernels
- DeltaNet recurrent layer kernels
- MoE expert routing and dispatch
- KV cache management on GPU
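
For reference, several of the operations listed above have simple scalar definitions. The following CPU sketch writes out RMS norm, SiLU, and softmax in plain Rust. It is meant to document the math rather than mirror llama-rs's kernels; the function names and the `eps` parameter are assumptions.

```rust
/// RMS norm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty(), "input must not be empty");
    assert_eq!(x.len(), weight.len(), "weight must match input length");
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU (also called swish): x * sigmoid(x)
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Softmax, subtracting the maximum first so large inputs do not overflow.
fn softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

A GPU kernel computes the same quantities, but parallelizes the reductions (the sum of squares and the softmax denominator) across threads.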

## RAG (Retrieval-Augmented Generation)

pgvector-backed vector store for retrieval-augmented generation. Enable with `--features rag`.

### Setup

Requires PostgreSQL with the [pgvector](https://github.com/pgvector/pgvector) extension:

```bash
# Docker (quickstart)
docker run -d --name pgvector -p 5432:5432 \
  -e POSTGRES_PASSWORD=password \
  pgvector/pgvector:pg16
```

### Library Usage

```rust
use llama_rs::{RagConfig, RagStore, NewDocument};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = RagConfig::new("postgresql://user:pass@localhost/mydb")
        .with_table_name("documents")
        .with_dimensions(384);

    let store = RagStore::new(config).await?;
    store.create_table().await?;

    // Insert documents
    let doc = NewDocument {
        content: "Rust is a systems programming language.".into(),
        embedding: vec![0.1; 384],
        metadata: Some(serde_json::json!({"topic": "rust"})),
    };
    store.insert(&doc).await?;

    // Semantic search
    let query_embedding = vec![0.1; 384];
    let results = store.search(&query_embedding, 10, None).await?;

    for result in results {
        println!("{}: {}", result.score, result.content);
    }

    Ok(())
}
```

### Features

- **Search modes**: Semantic (vector), keyword (tsvector), and hybrid with Reciprocal Rank Fusion
- **Distance metrics**: Cosine similarity, L2 distance, inner product
- **Indexing**: HNSW and IVFFlat with configurable parameters
- **Metadata filtering**: Eq, In, Range, Contains, and compound AND/OR/NOT filters
- **KnowledgeBase**: High-level API for document ingestion, chunking, and retrieve-and-generate
- **Configuration**: TOML files with environment variable overrides
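
Hybrid search merges the semantic and keyword result lists with Reciprocal Rank Fusion (RRF), which scores each document as the sum of `1 / (k + rank)` over the lists it appears in. The sketch below illustrates the idea. The function name, the `i64` document IDs, and the constant `k = 60` (a conventional default) are assumptions, not the crate's actual implementation.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of document IDs into one list of
/// (id, score) pairs, best first. Hypothetical helper for illustration.
fn reciprocal_rank_fusion(rankings: &[Vec<i64>], k: f32) -> Vec<(i64, f32)> {
    let mut scores: HashMap<i64, f32> = HashMap::new();
    for ranking in rankings {
        for (index, &doc_id) in ranking.iter().enumerate() {
            // Ranks are 1-based in the RRF formula, hence the `+ 1.0`.
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + index as f32 + 1.0);
        }
    }

    let mut fused: Vec<(i64, f32)> = scores.into_iter().collect();
    // Highest score first; break ties by ID so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Because RRF uses only ranks, it can combine cosine-similarity scores and full-text scores without having to put them on a common scale.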

### CLI

```bash
# Ingest documents
llama-rs rag ingest --config rag.toml --source ./docs/

# Search
llama-rs rag search --config rag.toml --query "How does authentication work?"
```

## Council of Experts

Run a multi-round debate across network-connected agents and synthesize a single final answer. Each round, every expert answers in parallel; refinement rounds feed peer answers (anonymized) back to each expert. The council stops early on convergence (cosine similarity over answer embeddings) or when the round cap is hit. A dedicated synthesizer produces the final response.

Agents may be llama-rs servers over gRPC *or* any OpenAI-compatible HTTP endpoint — the existing `llama-rs serve` command works out of the box as a council agent.

**Enable the feature:**

```toml
llama-rs = { version = "0.17", features = ["council"] }
```

**Example config (`council.toml`):**

```toml
min_rounds = 2
max_rounds = 4
convergence_threshold = 0.92

[embedder]
kind = "local_gguf"
path = "/path/to/all-MiniLM-L6-v2.gguf"

[[agent]]
role = "expert"
endpoint = "http://localhost:8080"
model = "qwen2.5-0.5b"
timeout_ms = 30000

[[agent]]
role = "expert"
endpoint = "http://localhost:8081"
model = "llama-3.2-1b"
timeout_ms = 30000

[[agent]]
role = "synthesizer"
endpoint = "http://localhost:8082"
model = "deepseek-v3"
timeout_ms = 60000
```
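
How `min_rounds`, `max_rounds`, and `convergence_threshold` interact can be sketched as follows. This is an illustration under stated assumptions rather than the council's real logic: the description above only says that convergence uses cosine similarity over answer embeddings, so requiring every pair of answers to meet the threshold, and all function names, are hypothetical.

```rust
/// Cosine similarity between two embedding vectors (0.0 if either is all zeros).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Assumed rule: the experts have converged when every pair of answer
/// embeddings is at least `threshold` similar.
fn has_converged(embeddings: &[Vec<f32>], threshold: f32) -> bool {
    for i in 0..embeddings.len() {
        for j in (i + 1)..embeddings.len() {
            if cosine_similarity(&embeddings[i], &embeddings[j]) < threshold {
                return false;
            }
        }
    }
    true
}

/// Decide whether to stop after `round` completed rounds (1-based):
/// always stop at the cap, and stop early only once `min_rounds` have run.
fn should_stop(
    round: usize,
    min_rounds: usize,
    max_rounds: usize,
    embeddings: &[Vec<f32>],
    threshold: f32,
) -> bool {
    round >= max_rounds || (round >= min_rounds && has_converged(embeddings, threshold))
}
```

Under this reading, the example config always runs at least two rounds, and runs a third or fourth only while some pair of expert answers stays below 0.92 similarity.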

**CLI:**

```bash
# Final answer only
llama-rs council --config council.toml -p "Summarize this contract."

# Full structured transcript as NDJSON
llama-rs council --config council.toml -p "..." --transcript | jq .
```

**HTTP (orchestrator mode):** With the council feature enabled, `llama-rs serve` exposes two extra endpoints:

- `POST /v1/council/completions` — returns `{ "answer": "..." }` as JSON
- `POST /v1/council/transcript` — SSE stream of structured events (`expert_token`, `round_completed`, `convergence_check`, `final_token`, ...)

Both accept a body of the form `{ "prompt": "...", "config_toml": "..." }`.

**Known limitation (v1):** llama-rs agents are currently reached through the OpenAI-compatible HTTP fallback rather than native gRPC. A dedicated gRPC `Council` server is planned for a follow-up release; the orchestrator side is already in place.

## ONNX Support

llama-rs can load models exported to ONNX format via [HuggingFace Optimum](https://huggingface.co/docs/optimum/). ONNX support is enabled by default.

**Supported formats:**
- F32, F16, and BF16 weight tensors (F16/BF16 auto-converted to F32)
- External data files (`.onnx_data`) for large models
- Graph-traced tensor name resolution for Optimum exports

**Requirements:**

An ONNX model directory must contain:
- `model.onnx` — the model graph and weights
- `config.json` — HuggingFace model configuration
- `tokenizer.json` — HuggingFace tokenizer

**Exporting a model to ONNX:**

```bash
# Quote the extras spec so shells such as zsh do not expand the brackets
pip install "optimum[onnxruntime]"
optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 ./tinyllama-onnx/
```

Then run the exported model:

```bash
llama-rs run ./tinyllama-onnx/model.onnx -p "Hello!" -n 50
```

## Library Usage

```rust
use std::io::Write;

use llama_rs::{
    backend::cpu::CpuBackend,
    gguf::GgufFile,
    model::{load_llama_model, InferenceContext},
    sampling::Sampler,
    tokenizer::Tokenizer,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the model weights and the tokenizer from the same GGUF file
    let model = load_llama_model("model.gguf")?;
    let gguf = GgufFile::open("model.gguf")?;
    let tokenizer = Tokenizer::from_gguf(&gguf)?;

    // Set up inference
    let backend = CpuBackend::new();
    let mut ctx = InferenceContext::new(model.config(), Box::new(backend));
    let sampler = Sampler::new(0.8, 40, 0.9); // temperature, top_k, top_p

    // Encode the prompt
    let tokens = tokenizer.encode("Hello, world!", true)?;

    // Generate up to 50 tokens
    let mut output_tokens = tokens.clone();
    let mut fed = 0; // number of tokens already passed to the model
    for _ in 0..50 {
        // The first pass feeds the whole prompt so every prompt token reaches
        // the context; later passes feed only the newly sampled token.
        // (Assumes `forward` accepts a multi-token slice.)
        let logits = model.forward(&output_tokens[fed..], &mut ctx)?;
        fed = output_tokens.len();
        let next_token = sampler.sample(&logits, &output_tokens);
        output_tokens.push(next_token);

        // Decode and print immediately so output streams token by token
        if let Ok(text) = tokenizer.decode(&[next_token]) {
            print!("{}", text);
            std::io::stdout().flush()?;
        }
    }

    Ok(())
}
```

## CLI Reference

```
llama-rs <COMMAND>

Commands:
  info         Display model information
  run          Run inference on a model
  chat         Interactive chat mode
  serve        Start HTTP server (with --features server)
  quantize     Quantize a model
  bench        Benchmark model performance
  embed        Extract embeddings
  download     Download a model from HuggingFace Hub
  models       Manage cached models
  rag          RAG operations (with --features rag)
  init-config  Generate example config file
  manpages     Generate and install man pages
  help         Print help

Run Options:
  -p, --prompt <PROMPT>      Input prompt
  -n, --max-tokens <N>       Maximum tokens to generate [default: 128]
  -t, --temperature <T>      Sampling temperature [default: 0.8]
  -k, --top-k <K>            Top-k sampling [default: 40]
      --top-p <P>            Top-p (nucleus) sampling [default: 0.9]
      --repeat-penalty <R>   Repetition penalty [default: 1.1]
  -s, --seed <SEED>          Random seed for reproducibility
      --gpu                  Use GPU acceleration (requires GPU feature)
```
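
The `--top-p` and `--repeat-penalty` options follow widely used sampling conventions. The sketch below shows one common formulation of each. It is not taken from the llama-rs source; the function names, and the choice to penalize each previously seen token exactly once, are assumptions.

```rust
/// Repetition penalty: push down the logits of tokens that were already
/// generated. Positive logits are divided by `penalty`, negative ones are
/// multiplied, so both move toward "less likely" when `penalty > 1`.
fn apply_repeat_penalty(logits: &mut [f32], seen: &[usize], penalty: f32) {
    // Deduplicate so a token seen many times is penalized only once.
    let mut unique: Vec<usize> = seen.to_vec();
    unique.sort_unstable();
    unique.dedup();

    for id in unique {
        if let Some(logit) = logits.get_mut(id) {
            *logit = if *logit > 0.0 { *logit / penalty } else { *logit * penalty };
        }
    }
}

/// Top-p (nucleus) filtering: return the smallest set of token ids, most
/// likely first, whose cumulative probability reaches `top_p`.
fn top_p_candidates(probs: &[f32], top_p: f32) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..probs.len()).collect();
    ids.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    let mut cumulative = 0.0;
    let mut kept = Vec::new();
    for id in ids {
        kept.push(id);
        cumulative += probs[id];
        if cumulative >= top_p {
            break;
        }
    }
    kept
}
```

In a typical pipeline the penalty is applied to the raw logits first, and the top-k and top-p filters then narrow the candidate set before a token is drawn.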

## Performance

Benchmarked on Intel i9-13900K (24 cores, AVX2) with 64GB RAM:

| Model | Quantization | Tokens/sec | Notes |
|-------|--------------|------------|-------|
| Qwen2.5-0.5B | Q4_K_M | ~1.2 t/s | 896 hidden dim |
| TinyLlama-1.1B | Q4_K_M | ~1.5 t/s | 2048 hidden dim |
| Mistral-7B | Q4_K_M | ~0.3 t/s | 4096 hidden dim |

*Current implementation prioritizes correctness over speed. Performance optimizations (batch processing, better SIMD utilization) are planned.*

Performance varies by hardware, model size, context length, and quantization.

## Contributing

Contributions are welcome! Please see [AGENTS.md](AGENTS.md) for development guidelines.

## License

Licensed under either of:

- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
- MIT License ([LICENSE-MIT](LICENSE-MIT))

at your option.

## Acknowledgments

- [llama.cpp](https://github.com/ggerganov/llama.cpp) - The original implementation
- [GGML](https://github.com/ggerganov/ggml) - Tensor library and GGUF format
- [pgvector](https://github.com/pgvector/pgvector) - PostgreSQL vector similarity search

---

**Lexmata LLC** - [jquinn@lexmata.ai](mailto:jquinn@lexmata.ai)