# llama-gguf
A high-performance Rust reimplementation of llama.cpp: an LLM inference engine with full GGUF and ONNX support.
## Features
- Full GGUF Support - Load any GGUF model file compatible with llama.cpp
- ONNX Support - Load HuggingFace Optimum ONNX exports (F32, F16, BF16 with auto-conversion)
- Multiple Architectures - LLaMA, Mistral, Qwen2, Qwen3/Qwen3Next, Mixtral, TinyLlama, DeepSeek, and more
- Quantization - All K-quant formats (Q2_K through Q6_K) plus Q8_0, F16, and F32
- HuggingFace Integration - Download models directly from HuggingFace Hub
- Fast CPU Inference - SIMD-optimized (AVX2, AVX-512, NEON)
- GPU Inference - Full GPU-resident inference on CUDA; Metal, DX12, and Vulkan via the `Backend` trait
- Mixture of Experts - MoE support with top-k routing (Mixtral, Qwen3Moe, DeepSeek)
- DeltaNet/SSM - Gated DeltaNet recurrent layers for hybrid attention/SSM models (Qwen3Next)
- Distributed Inference - Pipeline-parallel inference across multiple nodes via gRPC
- RAG - Retrieval-Augmented Generation with PostgreSQL/pgvector vector store
- OpenAI-compatible API - HTTP server with streaming support
- Grouped Query Attention - Efficient KV cache for GQA models
- Streaming Output - Token-by-token generation
## Installation
### From crates.io
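Install the CLI with Cargo:

```bash
cargo install llama-gguf
```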
### From Source
The binary will be at target/release/llama-gguf.
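Assuming you are inside a checkout of the repository, a release build is:

```bash
cargo build --release
```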
### System Installation with Man Pages
#### Option 1: Using cargo install (generates man pages from CLI)
Generate man pages from the CLI itself and install them either per-user or system-wide (system-wide requires sudo).
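A sketch of this flow; the `manpages` subcommand is listed in the CLI reference, but the output flag and paths here are assumptions:

```bash
# Install the binary from crates.io
cargo install llama-gguf

# Generate man pages via the built-in subcommand (output directory flag is assumed)
llama-gguf manpages --output ./man

# Or system-wide (requires sudo): copy into the conventional man1 directory
sudo cp ./man/*.1 /usr/local/share/man/man1/
```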
#### Option 2: Using make (includes detailed hand-written man pages)
The Makefile can build and install to /usr/local (requires sudo), install to a custom prefix, or install the man pages alone.
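A sketch assuming conventional Makefile targets (`install`, `install-man`) and `PREFIX` support; check the repository's Makefile for the actual target names:

```bash
# Build and install to /usr/local (requires sudo)
sudo make install

# Or install to a custom prefix
make install PREFIX=$HOME/.local

# Install man pages only (target name is an assumption)
sudo make install-man
```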
After installation, access documentation with:
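For example, assuming the man page is installed under the binary's name:

```bash
man llama-gguf
```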
### As a Library

Add the crate to your `Cargo.toml`:

    [dependencies]
    llama-gguf = "0.10"
## Quick Start
### Download a Model
Models can be fetched straight from the HuggingFace Hub: list the files in a repository, then download a specific quantized file.
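Illustrative commands; the `download` subcommand is in the CLI reference, but the `--list` and `--file` flags are assumptions (the repository and file names are a real GGUF upload):

```bash
# List available files in a repository (flag is an assumption)
llama-gguf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --list

# Download a specific quantized model (flag is an assumption)
llama-gguf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  --file tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```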
### Run Inference
Typical scenarios: basic text generation from a GGUF file, running an ONNX model (requires `config.json` and `tokenizer.json` in the same directory), custom sampling parameters, and deterministic greedy output.
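Illustrative invocations built from the Run options in the CLI reference; treating the model path as a positional argument, and temperature 0 as greedy sampling, are assumptions:

```bash
# Basic text generation (GGUF)
llama-gguf run model.gguf -p "Explain GGUF quantization in one paragraph"

# ONNX model (config.json and tokenizer.json must sit next to model.onnx)
llama-gguf run ./model-onnx/ -p "Hello"

# With sampling parameters
llama-gguf run model.gguf -p "Write a haiku about Rust" \
  -t 0.7 -k 40 --top-p 0.9 --repeat-penalty 1.1 -n 64

# Deterministic output (temperature 0 assumed to select greedy sampling)
llama-gguf run model.gguf -p "2+2=" -t 0.0 -s 42
```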
### Model Information
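The `info` subcommand prints model metadata (architecture, quantization, tensor layout); the exact output fields are not shown here:

```bash
llama-gguf info model.gguf
```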
## Supported Models
| Model Family | Status | Notes |
|---|---|---|
| LLaMA/LLaMA2/LLaMA3 | ✅ | Full support |
| Mistral | ✅ | Use [INST]...[/INST] format |
| Qwen2/Qwen2.5 | ✅ | Includes attention biases |
| Qwen3 | ✅ | Dense model with QK norm, partial RoPE |
| Qwen3Moe | ✅ | MoE with top-k expert routing |
| Qwen3Next | ✅ | Hybrid attention + DeltaNet recurrent layers |
| Mixtral | ✅ | MoE with top-2 expert routing |
| TinyLlama | ✅ | GQA support |
| DeepSeek-Coder | ✅ | Linear RoPE scaling |
| CodeLlama | ✅ | LLaMA-based |
| Yi | ✅ | LLaMA-based |
See MODEL_COMPATIBILITY.md for detailed compatibility information.
## Quantization Formats
| Format | Bits | Quality | Size (7B) |
|---|---|---|---|
| Q2_K | 2 | Low | ~2.5 GB |
| Q3_K | 3 | Fair | ~3.0 GB |
| Q4_K_M | 4 | Good | ~4.0 GB |
| Q5_K_M | 5 | Better | ~5.0 GB |
| Q6_K | 6 | High | ~5.5 GB |
| Q8_0 | 8 | Excellent | ~7.0 GB |
| F16 | 16 | Full | ~14 GB |
## Feature Flags
| Feature | Default | Description |
|---|---|---|
| `cpu` | ✅ | CPU backend with SIMD (AVX2, AVX-512, NEON) |
| `huggingface` | ✅ | HuggingFace Hub model downloading |
| `cli` | ✅ | Command-line interface |
| `client` | ✅ | HTTP client for remote inference |
| `onnx` | ✅ | ONNX model loading via HuggingFace Optimum |
| `cuda` | | NVIDIA GPU acceleration via CUDA |
| `metal` | | Apple Silicon GPU acceleration via Metal |
| `dx12` | | Windows GPU acceleration via DirectX 12 |
| `vulkan` | | Cross-platform GPU acceleration via Vulkan |
| `server` | | HTTP server with OpenAI-compatible API |
| `rag` | | RAG with PostgreSQL/pgvector vector store |
| `distributed` | | Pipeline-parallel inference via gRPC |
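Non-default features are enabled with Cargo's standard `--features` flag, for example:

```bash
# Build the CLI with the HTTP server and RAG support enabled
cargo build --release --features "server rag"
```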
## GPU Acceleration
### CUDA (NVIDIA GPUs)

Requires an NVIDIA GPU with compute capability 6.0+ and CUDA Toolkit 12.0+. If the toolkit is in a non-standard location, point the build at it with `CUDA_PATH` (e.g. `CUDA_PATH=/opt/cuda`).
The CUDA backend provides full GPU-resident inference via `GpuOnlyInference`, keeping all weights, the KV cache, and intermediate tensors in VRAM. Custom kernels handle dequantization of quantized weights, fused RMS norm, RoPE, DeltaNet, and MoE expert dispatch entirely on the GPU.
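A typical build, using the `CUDA_PATH` override when the toolkit lives outside the default search path:

```bash
CUDA_PATH=/opt/cuda cargo build --release --features cuda
```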
### Metal (Apple Silicon / macOS)
Requires macOS with Metal-capable GPU.
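Build with the Metal backend enabled:

```bash
cargo build --release --features metal
```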
### DirectX 12 (Windows)
Requires Windows 10+ with a DirectX 12 compatible GPU.
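Build with the DirectX 12 backend enabled:

```bash
cargo build --release --features dx12
```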
### Vulkan (Cross-platform)
Requires Vulkan SDK and a Vulkan-capable GPU.
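Build with the Vulkan backend enabled:

```bash
cargo build --release --features vulkan
```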
GPU-accelerated operations (all backends):
- Element-wise: add, mul, scale
- Activations: SiLU, GELU
- Normalization: RMS norm
- Softmax
- RoPE positional embeddings
- Vector-matrix multiplication (f32)
CUDA-exclusive operations:
- On-GPU dequantization of quantized weights (Q4_K_M, Q6_K, Q8_0, etc.)
- Fused RMS norm kernels
- DeltaNet recurrent layer kernels
- MoE expert routing and dispatch
- KV cache management on GPU
## RAG (Retrieval-Augmented Generation)
pgvector-backed vector store for retrieval-augmented generation. Enable with --features rag.
### Setup
Requires PostgreSQL with the pgvector extension:
The quickest setup is the official pgvector Docker image.
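A quickstart using the official pgvector image; the container name and password below are placeholders:

```bash
# Docker (quickstart): PostgreSQL 16 with pgvector preinstalled
docker run -d --name pgvector \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```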
### Library Usage

The vector store and the high-level `KnowledgeBase` API are available as a Rust library; see the crate documentation on docs.rs for usage examples.
### Features
- Search modes: Semantic (vector), keyword (tsvector), and hybrid with Reciprocal Rank Fusion
- Distance metrics: Cosine similarity, L2 distance, inner product
- Indexing: HNSW and IVFFlat with configurable parameters
- Metadata filtering: Eq, In, Range, Contains, and compound AND/OR/NOT filters
- KnowledgeBase: High-level API for document ingestion, chunking, and retrieve-and-generate
- Configuration: TOML files with environment variable overrides
### CLI
The `rag` subcommand handles document ingestion and search from the command line.
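Illustrative only; the `rag` subcommand appears in the CLI reference, but the `ingest`/`search` sub-subcommands and their arguments here are assumptions:

```bash
# Ingest documents
llama-gguf rag ingest ./docs/

# Search
llama-gguf rag search "how do I tune the HNSW index?"
```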
## ONNX Support
llama-gguf can load models exported to ONNX format via HuggingFace Optimum. ONNX support is enabled by default.
Supported formats:
- F32, F16, and BF16 weight tensors (F16/BF16 auto-converted to F32)
- External data files (`.onnx_data`) for large models
- Graph-traced tensor name resolution for Optimum exports
Requirements:
An ONNX model directory must contain:
- `model.onnx` - the model graph and weights
- `config.json` - HuggingFace model configuration
- `tokenizer.json` - HuggingFace tokenizer
Exporting a model to ONNX:
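Models can be exported with HuggingFace Optimum's CLI; the model id and task below are just an example:

```bash
# Export a causal LM to ONNX (produces model.onnx, config.json, tokenizer.json)
optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --task text-generation ./tinyllama-onnx/
```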
### Library Usage

ONNX models can also be loaded from Rust; see the crate documentation on docs.rs for the loading API.
## CLI Reference
    llama-gguf <COMMAND>

    Commands:
      info         Display model information
      run          Run inference on a model
      chat         Interactive chat mode
      serve        Start HTTP server (with --features server)
      quantize     Quantize a model
      bench        Benchmark model performance
      embed        Extract embeddings
      download     Download a model from HuggingFace Hub
      models       Manage cached models
      rag          RAG operations (with --features rag)
      init-config  Generate example config file
      manpages     Generate and install man pages
      help         Print help

    Run Options:
      -p, --prompt <PROMPT>       Input prompt
      -n, --max-tokens <N>        Maximum tokens to generate [default: 128]
      -t, --temperature <T>       Sampling temperature [default: 0.8]
      -k, --top-k <K>             Top-k sampling [default: 40]
          --top-p <P>             Top-p (nucleus) sampling [default: 0.9]
          --repeat-penalty <R>    Repetition penalty [default: 1.1]
      -s, --seed <SEED>           Random seed for reproducibility
          --gpu                   Use GPU acceleration (requires a GPU feature)
## Performance
Benchmarked on Intel i9-13900K (24 cores, AVX2) with 64GB RAM:
| Model | Quantization | Tokens/sec | Notes |
|---|---|---|---|
| Qwen2.5-0.5B | Q4_K_M | ~1.2 t/s | 896 hidden dim |
| TinyLlama-1.1B | Q4_K_M | ~1.5 t/s | 2048 hidden dim |
| Mistral-7B | Q4_K_M | ~0.3 t/s | 4096 hidden dim |
Current implementation prioritizes correctness over speed. Performance optimizations (batch processing, better SIMD utilization) are planned.
Performance varies by hardware, model size, context length, and quantization.
## Contributing
Contributions are welcome! Please see AGENTS.md for development guidelines.
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
## Acknowledgments
- llama.cpp - The original implementation
- GGML - Tensor library and GGUF format
- pgvector - PostgreSQL vector similarity search
Lexmata LLC - jquinn@lexmata.ai