# llama-gguf

A high-performance Rust implementation of llama.cpp: an LLM inference engine with full GGUF support.
## Features
- **Full GGUF Support**: Load any GGUF model file compatible with llama.cpp
- **Multiple Architectures**: LLaMA, Mistral, Qwen2, TinyLlama, DeepSeek, and more
- **Quantization**: All K-quant formats (Q2_K through Q6_K) plus Q8_0, F16, and F32
- **HuggingFace Integration**: Download models directly from HuggingFace Hub
- **Fast CPU Inference**: SIMD-optimized (AVX2, AVX-512, NEON)
- **CUDA GPU Acceleration**: NVIDIA GPU support with custom CUDA kernels
- **Grouped Query Attention**: Efficient KV cache for GQA models
- **Streaming Output**: Token-by-token generation
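The GQA feature above can be illustrated with a back-of-the-envelope KV-cache calculation. The shapes below are illustrative (roughly Mistral-7B-like), not read from any model file:

```rust
// Why GQA shrinks the KV cache: the cache stores K and V per layer for the
// KV heads only, so fewer KV heads means a proportionally smaller cache.
// bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_elem
fn kv_cache_bytes(layers: usize, kv_heads: usize, head_dim: usize, ctx: usize, elem_size: usize) -> usize {
    2 * layers * kv_heads * head_dim * ctx * elem_size
}

fn main() {
    // Illustrative shapes: 32 layers, 32 query heads, head_dim 128, 4096 context, f16 cache.
    let mha = kv_cache_bytes(32, 32, 128, 4096, 2); // full multi-head attention: 32 KV heads
    let gqa = kv_cache_bytes(32, 8, 128, 4096, 2);  // grouped-query attention: 8 KV heads
    assert_eq!(mha / gqa, 4); // 4x smaller cache with 8 KV-head groups
    println!("MHA: {} MiB, GQA: {} MiB", mha >> 20, gqa >> 20);
}
```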
## Installation

### From Source

```bash
git clone https://github.com/pegasusheavy/llama-gguf.git
cd llama-gguf
cargo build --release
```

The binary will be at `target/release/llama-gguf`.

### As a Library

Add to your `Cargo.toml`:

```toml
[dependencies]
llama-gguf = { git = "https://github.com/pegasusheavy/llama-gguf.git" }
```
## Quick Start

### Download a Model

Use the `download` command to list the files available in a HuggingFace repository and to fetch a specific quantized model file (see the CLI Reference below for the exact arguments).

### Run Inference

Illustrative invocations built from the options in the CLI Reference below; the exact model-path argument shape may differ:

```bash
# Basic text generation
llama-gguf run model.gguf -p "Once upon a time"

# With sampling parameters
llama-gguf run model.gguf -p "Once upon a time" -n 256 -t 0.7 -k 40 --top-p 0.9

# Deterministic output (greedy sampling)
llama-gguf run model.gguf -p "Once upon a time" -t 0.0 -s 42
```
## Model Information

Display model information with the `info` command:

```bash
llama-gguf info model.gguf
```

## Supported Models
| Model Family | Status | Notes |
|---|---|---|
| LLaMA/LLaMA2/LLaMA3 | ✅ | Full support |
| Mistral | ✅ | Use [INST]...[/INST] format |
| Qwen2/Qwen2.5 | ✅ | Includes attention biases |
| TinyLlama | ✅ | GQA support |
| DeepSeek-Coder | ✅ | Linear RoPE scaling |
| CodeLlama | ✅ | LLaMA-based |
| Yi | ✅ | LLaMA-based |
See MODEL_COMPATIBILITY.md for detailed compatibility information.
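For the Mistral row above, a minimal sketch of wrapping a plain prompt in the instruction format. This is a simplification: real chat templates also handle BOS tokens, system prompts, and multi-turn history, and the exact template comes from the model's tokenizer configuration:

```rust
/// Wrap a user prompt in Mistral's [INST]...[/INST] instruction format.
/// Sketch only; the canonical template lives in the model's metadata.
fn mistral_prompt(user: &str) -> String {
    format!("[INST] {} [/INST]", user.trim())
}

fn main() {
    let p = mistral_prompt("Explain GGUF in one sentence.");
    assert_eq!(p, "[INST] Explain GGUF in one sentence. [/INST]");
    println!("{p}");
}
```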
## Quantization Formats
| Format | Bits | Quality | Size (7B) |
|---|---|---|---|
| Q2_K | 2 | Low | ~2.5 GB |
| Q3_K | 3 | Fair | ~3.0 GB |
| Q4_K_M | 4 | Good | ~4.0 GB |
| Q5_K_M | 5 | Better | ~5.0 GB |
| Q6_K | 6 | High | ~5.5 GB |
| Q8_0 | 8 | Excellent | ~7.0 GB |
| F16 | 16 | Full | ~14 GB |
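The sizes above follow roughly size ≈ parameter count × bits per weight / 8. A quick sketch of that estimate; the parameter count and effective bits-per-weight figures below are illustrative (K-quants carry per-block scale overhead, so their effective bit widths sit slightly above the nominal ones):

```rust
/// Rough GGUF file-size estimate: params * bits / 8, ignoring the tensors
/// that stay unquantized (embeddings, norms) and file metadata.
fn approx_size_gib(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    let params = 7.24e9; // illustrative 7B-class parameter count
    for (name, bits) in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("F16", 16.0)] {
        println!("{name}: ~{:.1} GiB", approx_size_gib(params, bits));
    }
}
```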
## Library Usage

The crate can also be embedded as a library; see the crate documentation for the current API surface.
## CLI Reference

```text
llama-gguf <COMMAND>

Commands:
  run       Run inference on a model
  info      Display model information
  download  Download a model from HuggingFace Hub
  models    Manage cached models
  help      Print help

Run options:
  -p, --prompt <PROMPT>      Input prompt
  -n, --max-tokens <N>       Maximum tokens to generate [default: 128]
  -t, --temperature <T>      Sampling temperature [default: 0.8]
  -k, --top-k <K>            Top-k sampling [default: 40]
      --top-p <P>            Top-p (nucleus) sampling [default: 0.9]
      --repeat-penalty <R>   Repetition penalty [default: 1.1]
  -s, --seed <SEED>          Random seed for reproducibility
      --gpu                  Use GPU acceleration (requires CUDA build)
```
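The sampling options above typically compose in order: repetition penalty, then temperature, then top-k/top-p truncation. A condensed sketch of the temperature + top-k stage (top-p and the penalty are omitted for brevity, and this is not the crate's actual sampler):

```rust
/// Divide logits by the temperature, then mask everything outside the
/// top-k to negative infinity so it can never be sampled.
/// Illustrative sketch; assumes k >= 1.
fn temperature_top_k(logits: &mut [f32], temperature: f32, k: usize) {
    for l in logits.iter_mut() {
        *l /= temperature;
    }
    // Find the k-th largest logit and mask everything strictly below it.
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let threshold = sorted[k.min(sorted.len()) - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

fn main() {
    let mut logits = vec![2.0, 0.5, 1.0, 3.0];
    temperature_top_k(&mut logits, 1.0, 2);
    // Only the two largest survive: indices 0 (2.0) and 3 (3.0).
    assert_eq!(logits, vec![2.0, f32::NEG_INFINITY, f32::NEG_INFINITY, 3.0]);
    println!("{logits:?}");
}
```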
## Building with Features

Feature names below are illustrative; check `Cargo.toml` for the exact feature list:

```bash
# CPU only (default)
cargo build --release

# With CUDA GPU support (requires CUDA toolkit)
CUDA_PATH=/opt/cuda cargo build --release --features cuda

# With Vulkan support (experimental)
cargo build --release --features vulkan

# With HTTP server
cargo build --release --features server
```
## GPU Acceleration

### CUDA (NVIDIA GPUs)

Enable GPU acceleration with the `--gpu` flag:

```bash
# Build with CUDA support (feature name illustrative; check Cargo.toml)
CUDA_PATH=/opt/cuda cargo build --release --features cuda

# Run with GPU acceleration
llama-gguf run model.gguf -p "Hello" --gpu
```
Requirements:
- NVIDIA GPU with compute capability 6.0+
- CUDA Toolkit 12.0+ installed
- cudarc crate for CUDA bindings
Currently GPU-accelerated operations:
- Element-wise: add, mul, scale
- Activations: SiLU, GELU
- Normalization: RMS norm
- Softmax
- Vector-matrix multiplication (f32)
Still using CPU fallback:
- Quantized matrix operations (vec_mat_q)
- Attention computation
- RoPE positional embeddings
Note: Performance gains are currently limited as quantized operations remain on CPU. Full GPU acceleration of quantized inference is planned.
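RMS norm, one of the offloaded operations listed above, is simple enough to sketch in its reference CPU form (the epsilon value here is illustrative; models ship their own):

```rust
/// RMS normalization: y_i = x_i * w_i / sqrt(mean(x^2) + eps).
/// Reference form of one of the GPU-offloaded operations.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}

fn main() {
    let y = rms_norm(&[3.0, 4.0], &[1.0, 1.0], 1e-6);
    // mean(x^2) = (9 + 16) / 2 = 12.5, so each element is divided by sqrt(12.5)
    assert!((y[0] - 3.0 / 12.5f32.sqrt()).abs() < 1e-4);
    assert!((y[1] - 4.0 / 12.5f32.sqrt()).abs() < 1e-4);
    println!("{y:?}");
}
```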
## Performance
Benchmarked on Intel i9-13900K (24 cores, AVX2) with 64GB RAM:
| Model | Quantization | Tokens/sec | Notes |
|---|---|---|---|
| Qwen2.5-0.5B | Q4_K_M | ~1.2 t/s | 896 hidden dim |
| TinyLlama-1.1B | Q4_K_M | ~1.5 t/s | 2048 hidden dim |
| Mistral-7B | Q4_K_M | ~0.3 t/s | 4096 hidden dim |
Current implementation prioritizes correctness over speed. Performance optimizations (batch processing, better SIMD utilization) are planned.
Performance varies by hardware, model size, context length, and quantization.
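Single-stream decoding is typically memory-bandwidth bound: each generated token must stream the full set of weights through memory once, so bandwidth / model size gives a theoretical ceiling on tokens per second. A quick sanity check (the bandwidth figure is illustrative):

```rust
/// Upper bound on single-stream decode speed for a memory-bandwidth-bound
/// model: tokens/sec <= memory bandwidth / bytes of weights read per token.
fn decode_ceiling_tps(bandwidth_gb_s: f64, model_gb: f64) -> f64 {
    bandwidth_gb_s / model_gb
}

fn main() {
    // Illustrative figures: ~80 GB/s dual-channel DDR5, Q4_K_M 7B from the table above.
    let ceiling = decode_ceiling_tps(80.0, 4.0);
    // Measured ~0.3 t/s sits far below this ceiling, which is why the
    // planned optimizations have substantial headroom.
    println!("theoretical ceiling: ~{ceiling:.0} t/s");
}
```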
## Contributing
Contributions are welcome! Please see AGENTS.md for development guidelines.
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
## Acknowledgments
Pegasus Heavy Industries LLC - pegasusheavyindustries@gmail.com