# rig-llama-cpp
Run GGUF models locally inside your Rig agents.
A Rig completion provider that runs GGUF models locally via llama.cpp and its Rust bindings, llama-cpp-2.
Drop it in wherever you'd use a cloud provider: same `CompletionModel` trait, same agent API, but inference happens on your hardware with no API keys, no rate limits, and no data leaving the machine.
## Usage

```rust
use rig::client::CompletionClient;
use rig::completion::Prompt;
use rig_llama_cpp::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Construction sketch; see the ClientBuilder docs for the exact signature and options.
    let client = Client::builder("./model.gguf").build()?;
    let agent = client.agent("local").build();
    println!("{}", agent.prompt("Why is the sky blue?").await?);
    Ok(())
}
```
The legacy positional `Client::from_gguf(...)` constructor is still
available for callers pinned to the 0.1.x API.
## Features
- Local inference with any GGUF model
- Completion and streaming support
- Tool calling (for models with OpenAI-compatible chat templates)
- Reasoning / thinking output
- Vision (multimodal) inference for models with an `mmproj` projector; opt in via the `mtmd` feature
- Automatic GPU/CPU layer fitting: llama.cpp probes available device memory and picks `n_gpu_layers` for you, no manual tuning required
- Backend selection via Cargo feature flags
- Configurable sampling parameters (top-p, top-k, min-p, temperature, penalties)
## Feature Flags

No default GPU backend is enabled; pick the one that matches your hardware:

| Feature | Use for |
|---|---|
| (none) | CPU-only inference |
| `vulkan` | Cross-vendor GPU on Linux/Windows |
| `cuda` | NVIDIA GPUs |
| `metal` | Apple Silicon / macOS |
| `rocm` | AMD GPUs on Linux |
| `openmp` | OpenMP CPU threading; combine with any GPU backend |
| `mtmd` | Multimodal (vision) inference; enables `ClientBuilder::mmproj` |
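For example, selecting the CUDA backend in `Cargo.toml` might look like this (the version number is illustrative; depend on the latest release):

```toml
[dependencies]
# Pick exactly one GPU backend feature, or none for CPU-only inference.
rig-llama-cpp = { version = "0.2", features = ["cuda"] }
```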
Toolchain and runtime requirements per backend are documented upstream
in llama.cpp's build guide.
A successful build does not guarantee a successful run: if backend
initialization fails at runtime, `LoadError::BackendInit` is returned rather
than panicking, so the application can fall back gracefully.
## Examples

The examples are driven by environment variables and run with `cargo run --example <example>`; substitute `<example>` with a name from the `examples/` directory:

```sh
# Most examples just need a model path
MODEL_PATH=./model.gguf cargo run --example <example>

# Embedding
MODEL_PATH=./embedding-model.gguf cargo run --example <example>

# Vision (requires mtmd feature + mmproj file)
MODEL_PATH=./vision-model.gguf MMPROJ_PATH=./mmproj.gguf IMAGE_PATH=./image.jpg \
  cargo run --example <example> --features mtmd

# Hot-swap the loaded model on the same worker thread
RIG_MODEL_A=./model_a.gguf RIG_MODEL_B=./model_b.gguf cargo run --example <example>
```

Add `N_GPU_LAYERS=20` to any invocation to offload 20 layers to the GPU.
By default, llama.cpp backend logs are suppressed so streaming and test output stay readable.
Set `RIG_LLAMA_CPP_LOGS=1` to re-enable raw backend logs when debugging model startup or decode issues.
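For example, to watch the raw backend output while debugging a single run (the example name is again a placeholder):

```sh
RIG_LLAMA_CPP_LOGS=1 MODEL_PATH=./model.gguf cargo run --example <example>
```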
## Testing

The fast unit tests and doctests need no model and run on every CI build.
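With the standard Cargo layout this is just the default test run:

```sh
cargo test
```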
The full integration suite (`tests/e2e/`) covers streaming completions,
vision, tool roundtrips, structured output, KV-cache quantization,
embedding, and sequential model reload. All tests are `#[ignore]`d and
auto-download their fixtures via `hf-hub` into the standard Hugging Face
cache (`~/.cache/huggingface/hub`) on first run; plan for ~20 GB.
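To run it locally, use the same invocation the Contributing section references:

```sh
cargo test --test e2e --features mtmd -- --ignored --nocapture
```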
Backend compilation is already covered upstream by llama-cpp-rs, and
the model fixtures are too large for hosted runners, so the e2e suite
does not run in CI.
## Contributing
Issues and pull requests are welcome at github.com/camperking/rig-llama-cpp.
Before opening a PR, please run the same checks CI does
(`.github/workflows/ci.yml`), including the docs build with
`RUSTDOCFLAGS="-D warnings"`.
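A rough local equivalent, assuming the workflow runs the usual formatting, lint, test, and doc steps (check `ci.yml` for the authoritative list):

```sh
cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo test
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps
```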
If your change touches inference behaviour, validate it locally with
`cargo test --test e2e --features mtmd -- --ignored --nocapture`; the
fixtures auto-download on first run (~20 GB; see the
Testing section).
For changes that affect the public API or the embedded llama-cpp-2
version, add an entry to CHANGELOG.md under
[Unreleased]. The crate's pre-1.0 SemVer policy is documented at the
top of that file.
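A minimal sketch of such an entry, assuming the Keep a Changelog layout that the `[Unreleased]` heading implies (the wording is illustrative):

```markdown
## [Unreleased]

### Changed
- Bumped the embedded llama-cpp-2 dependency.
```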
## License
Licensed under the MIT License. See Cargo.toml for dependency details.