# llama-gguf
A high-performance Rust reimplementation of llama.cpp: an LLM inference engine with full GGUF and ONNX support.
## Features
- Full GGUF Support - Load any GGUF model file compatible with llama.cpp
- ONNX Support - Load HuggingFace Optimum ONNX exports (F32, F16, BF16 with auto-conversion)
- Multiple Architectures - LLaMA, Mistral, Qwen2, Qwen3/Qwen3Next, Mixtral, TinyLlama, DeepSeek, and more
- Quantization - All K-quant formats (Q2_K through Q6_K) plus Q8_0, F16, and F32
- HuggingFace Integration - Download models directly from HuggingFace Hub
- Fast CPU Inference - SIMD-optimized (AVX2, AVX-512, NEON)
- GPU Inference - Full GPU-resident inference on CUDA; Metal, DX12, and Vulkan via the `Backend` trait
- Mixture of Experts - MoE support with top-k routing (Mixtral, Qwen3Moe, DeepSeek)
- DeltaNet/SSM - Gated DeltaNet recurrent layers for hybrid attention/SSM models (Qwen3Next)
- Distributed Inference - Pipeline-parallel inference across multiple nodes via gRPC
- RAG - Retrieval-Augmented Generation with PostgreSQL/pgvector vector store
- OpenAI-compatible API - HTTP server with streaming support
- Grouped Query Attention - Efficient KV cache for GQA models
- Streaming Output - Token-by-token generation
## Installation
### From crates.io
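Install the CLI with Cargo:

```bash
cargo install llama-gguf
```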
### From Source
The binary will be at target/release/llama-gguf.
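Assuming you are inside a checkout of the repository, a release build is:

```bash
cargo build --release
```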
### System Installation with Man Pages
#### Option 1: Using cargo install (generates man pages from CLI)
Generate man pages from the CLI itself and install them either per-user or system-wide (system-wide requires sudo).
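A sketch of this flow; the `manpages` subcommand is listed in the CLI reference, but the output flag and paths here are assumptions:

```bash
# Install the binary from crates.io
cargo install llama-gguf

# Generate man pages via the built-in subcommand (output directory flag is assumed)
llama-gguf manpages --output ./man

# Or system-wide (requires sudo): copy into the conventional man1 directory
sudo cp ./man/*.1 /usr/local/share/man/man1/
```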
#### Option 2: Using make (includes detailed hand-written man pages)
The Makefile can build and install to /usr/local (requires sudo), install to a custom prefix, or install the man pages alone.
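A sketch assuming conventional Makefile targets (`install`, `install-man`) and `PREFIX` support; check the repository's Makefile for the actual target names:

```bash
# Build and install to /usr/local (requires sudo)
sudo make install

# Or install to a custom prefix
make install PREFIX=$HOME/.local

# Install man pages only (target name is an assumption)
sudo make install-man
```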
After installation, access documentation with:
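For example, assuming the man page is installed under the binary's name:

```bash
man llama-gguf
```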
### As a Library

Add the crate to your `Cargo.toml`:

    [dependencies]
    llama-gguf = "0.10"
## Quick Start
### Download a Model
Models can be fetched straight from the HuggingFace Hub: list the files in a repository, then download a specific quantized file.
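Illustrative commands; the `download` subcommand is in the CLI reference, but the `--list` and `--file` flags are assumptions (the repository and file names are a real GGUF upload):

```bash
# List available files in a repository (flag is an assumption)
llama-gguf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --list

# Download a specific quantized model (flag is an assumption)
llama-gguf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  --file tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```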
### Run Inference
Typical scenarios: basic text generation from a GGUF file, running an ONNX model (requires `config.json` and `tokenizer.json` in the same directory), custom sampling parameters, and deterministic greedy output.
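Illustrative invocations built from the Run options in the CLI reference; treating the model path as a positional argument, and temperature 0 as greedy sampling, are assumptions:

```bash
# Basic text generation (GGUF)
llama-gguf run model.gguf -p "Explain GGUF quantization in one paragraph"

# ONNX model (config.json and tokenizer.json must sit next to model.onnx)
llama-gguf run ./model-onnx/ -p "Hello"

# With sampling parameters
llama-gguf run model.gguf -p "Write a haiku about Rust" \
  -t 0.7 -k 40 --top-p 0.9 --repeat-penalty 1.1 -n 64

# Deterministic output (temperature 0 assumed to select greedy sampling)
llama-gguf run model.gguf -p "2+2=" -t 0.0 -s 42
```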
### Model Information
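The `info` subcommand prints model metadata (architecture, quantization, tensor layout); the exact output fields are not shown here:

```bash
llama-gguf info model.gguf
```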
## Supported Models
| Model Family | Status | Notes |
|---|---|---|
| LLaMA/LLaMA2/LLaMA3 | ✅ | Full support |
| Mistral | ✅ | Use [INST]...[/INST] format |
| Qwen2/Qwen2.5 | ✅ | Includes attention biases |
| Qwen3 | ✅ | Dense model with QK norm, partial RoPE |
| Qwen3Moe | ✅ | MoE with top-k expert routing |
| Qwen3Next | ✅ | Hybrid attention + DeltaNet recurrent layers |
| Mixtral | ✅ | MoE with top-2 expert routing |
| TinyLlama | ✅ | GQA support |
| DeepSeek-Coder | ✅ | Linear RoPE scaling |
| CodeLlama | ✅ | LLaMA-based |
| Yi | ✅ | LLaMA-based |
See MODEL_COMPATIBILITY.md for detailed compatibility information.
## Quantization Formats
| Format | Bits | Quality | Size (7B) |
|---|---|---|---|
| Q2_K | 2 | Low | ~2.5 GB |
| Q3_K | 3 | Fair | ~3.0 GB |
| Q4_K_M | 4 | Good | ~4.0 GB |
| Q5_K_M | 5 | Better | ~5.0 GB |
| Q6_K | 6 | High | ~5.5 GB |
| Q8_0 | 8 | Excellent | ~7.0 GB |
| F16 | 16 | Full | ~14 GB |
## Feature Flags
| Feature | Default | Description |
|---|---|---|
| `cpu` | ✅ | CPU backend with SIMD (AVX2, AVX-512, NEON) |
| `huggingface` | ✅ | HuggingFace Hub model downloading |
| `cli` | ✅ | Command-line interface |
| `client` | ✅ | HTTP client for remote inference |
| `onnx` | ✅ | ONNX model loading via HuggingFace Optimum |
| `cuda` | | NVIDIA GPU acceleration via CUDA |
| `metal` | | Apple Silicon GPU acceleration via Metal |
| `dx12` | | Windows GPU acceleration via DirectX 12 |
| `vulkan` | | Cross-platform GPU acceleration via Vulkan |
| `server` | | HTTP server with OpenAI-compatible API |
| `rag` | | RAG with PostgreSQL/pgvector vector store |
| `distributed` | | Pipeline-parallel inference via gRPC |
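Non-default features are enabled with Cargo's standard `--features` flag, for example:

```bash
# Build the CLI with the HTTP server and RAG support enabled
cargo build --release --features "server rag"
```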
## GPU Acceleration
### CUDA (NVIDIA GPUs)

Requires an NVIDIA GPU with compute capability 6.0+ and CUDA Toolkit 12.0+. If the toolkit is in a non-standard location, point the build at it with `CUDA_PATH` (e.g. `CUDA_PATH=/opt/cuda`).
The CUDA backend provides full GPU-resident inference via `GpuOnlyInference`, keeping all weights, the KV cache, and intermediate tensors in VRAM. Custom kernels handle dequantization of quantized weights, fused RMS norm, RoPE, DeltaNet, and MoE expert dispatch entirely on the GPU.
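A typical build, using the `CUDA_PATH` override when the toolkit lives outside the default search path:

```bash
CUDA_PATH=/opt/cuda cargo build --release --features cuda
```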
### Metal (Apple Silicon / macOS)
Requires macOS with Metal-capable GPU.
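Build with the Metal backend enabled:

```bash
cargo build --release --features metal
```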
### DirectX 12 (Windows)
Requires Windows 10+ with a DirectX 12 compatible GPU.
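Build with the DirectX 12 backend enabled:

```bash
cargo build --release --features dx12
```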
### Vulkan (Cross-platform)
Requires Vulkan SDK and a Vulkan-capable GPU.
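Build with the Vulkan backend enabled:

```bash
cargo build --release --features vulkan
```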
GPU-accelerated operations (all backends):
- Element-wise: add, mul, scale
- Activations: SiLU, GELU
- Normalization: RMS norm
- Softmax
- RoPE positional embeddings
- Vector-matrix multiplication (f32)
CUDA-exclusive operations:
- On-GPU dequantization of quantized weights (Q4_K_M, Q6_K, Q8_0, etc.)
- Fused RMS norm kernels
- DeltaNet recurrent layer kernels
- MoE expert routing and dispatch
- KV cache management on GPU
## RAG (Retrieval-Augmented Generation)
pgvector-backed vector store for retrieval-augmented generation. Enable with --features rag.
### Setup
Requires PostgreSQL with the pgvector extension:
The quickest setup is the official pgvector Docker image.
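A quickstart using the official pgvector image; the container name and password below are placeholders:

```bash
# Docker (quickstart): PostgreSQL 16 with pgvector preinstalled
docker run -d --name pgvector \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```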
### Library Usage

The vector store and the high-level `KnowledgeBase` API are available as a Rust library; see the crate documentation on docs.rs for usage examples.
### Features
- Search modes: Semantic (vector), keyword (tsvector), and hybrid with Reciprocal Rank Fusion
- Distance metrics: Cosine similarity, L2 distance, inner product
- Indexing: HNSW and IVFFlat with configurable parameters
- Metadata filtering: Eq, In, Range, Contains, and compound AND/OR/NOT filters
- KnowledgeBase: High-level API for document ingestion, chunking, and retrieve-and-generate
- Configuration: TOML files with environment variable overrides
### CLI
The `rag` subcommand handles document ingestion and search from the command line.
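Illustrative only; the `rag` subcommand appears in the CLI reference, but the `ingest`/`search` sub-subcommands and their arguments here are assumptions:

```bash
# Ingest documents
llama-gguf rag ingest ./docs/

# Search
llama-gguf rag search "how do I tune the HNSW index?"
```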
## ONNX Support
llama-gguf can load models exported to ONNX format via HuggingFace Optimum. ONNX support is enabled by default.
Supported formats:
- F32, F16, and BF16 weight tensors (F16/BF16 auto-converted to F32)
- External data files (`.onnx_data`) for large models
- Graph-traced tensor name resolution for Optimum exports
Requirements:
An ONNX model directory must contain:
- `model.onnx` - the model graph and weights
- `config.json` - HuggingFace model configuration
- `tokenizer.json` - HuggingFace tokenizer
Exporting a model to ONNX:
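Models can be exported with HuggingFace Optimum's CLI; the model id and task below are just an example:

```bash
# Export a causal LM to ONNX (produces model.onnx, config.json, tokenizer.json)
optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --task text-generation ./tinyllama-onnx/
```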
### Library Usage

ONNX models can also be loaded from Rust; see the crate documentation on docs.rs for the loading API.
## CLI Reference
    llama-gguf <COMMAND>

    Commands:
      info         Display model information
      run          Run inference on a model
      chat         Interactive chat mode
      serve        Start HTTP server (with --features server)
      quantize     Quantize a model
      bench        Benchmark model performance
      embed        Extract embeddings
      download     Download a model from HuggingFace Hub
      models       Manage cached models
      rag          RAG operations (with --features rag)
      init-config  Generate example config file
      manpages     Generate and install man pages
      help         Print help

    Run Options:
      -p, --prompt <PROMPT>       Input prompt
      -n, --max-tokens <N>        Maximum tokens to generate [default: 128]
      -t, --temperature <T>       Sampling temperature [default: 0.8]
      -k, --top-k <K>             Top-k sampling [default: 40]
          --top-p <P>             Top-p (nucleus) sampling [default: 0.9]
          --repeat-penalty <R>    Repetition penalty [default: 1.1]
      -s, --seed <SEED>           Random seed for reproducibility
          --gpu                   Use GPU acceleration (requires a GPU feature)
## Performance
Benchmarked on Intel i9-13900K (24 cores, AVX2) with 64GB RAM:
| Model | Quantization | Tokens/sec | Notes |
|---|---|---|---|
| Qwen2.5-0.5B | Q4_K_M | ~1.2 t/s | 896 hidden dim |
| TinyLlama-1.1B | Q4_K_M | ~1.5 t/s | 2048 hidden dim |
| Mistral-7B | Q4_K_M | ~0.3 t/s | 4096 hidden dim |
Current implementation prioritizes correctness over speed. Performance optimizations (batch processing, better SIMD utilization) are planned.
Performance varies by hardware, model size, context length, and quantization.
## Contributing
Contributions are welcome! Please see AGENTS.md for development guidelines.
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
## Acknowledgments
- llama.cpp - The original implementation
- GGML - Tensor library and GGUF format
- pgvector - PostgreSQL vector similarity search
Lexmata LLC - jquinn@lexmata.ai