§rig-llama-cpp
A Rig provider that runs GGUF models locally via llama.cpp, with optional Vulkan GPU acceleration.
This crate implements Rig’s rig::completion::CompletionModel and rig::embeddings::EmbeddingModel traits
so that any GGUF model can be used as a drop-in replacement for cloud-based providers. It supports:
- Completion and streaming — both one-shot and token-by-token responses.
- Tool calling — models with OpenAI-compatible chat templates can invoke tools.
- Reasoning / thinking — extended thinking output is forwarded when the model supports it.
- Configurable sampling — top-p, top-k, min-p, temperature, presence and repetition penalties.
- Embeddings — generate text embeddings using GGUF embedding models.
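The embeddings path can be sketched as follows. This is a hedged sketch, not the crate's verbatim API: the `EmbeddingClient::builder` call mirrors the completion client's builder and the model name `"local"` is a placeholder; the `embed_text` call and the `Embedding` shape come from Rig's `rig::embeddings::EmbeddingModel` trait.

```rust
use rig::client::EmbeddingsClient;
use rig::embeddings::EmbeddingModel;

// Sketch only: assumes EmbeddingClient exposes a builder like the
// completion Client; check the EmbeddingClient docs for the exact API.
let client = rig_llama_cpp::EmbeddingClient::builder("path/to/embedding-model.gguf")
    .build()?;
let model = client.embedding_model("local");

// Rig's EmbeddingModel trait yields one vector per input text.
let embedding = model.embed_text("Hello, world!").await?;
println!("dims = {}", embedding.vec.len());
```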
§Feature flags
There is no default GPU backend — pick exactly the one that matches your hardware. With no feature enabled the build is CPU-only.
GPU backends (forwarded to llama-cpp-2):
- vulkan — cross-vendor GPU (recommended on Linux/Windows when CUDA/ROCm aren’t set up).
- cuda — NVIDIA GPUs with the CUDA toolkit installed.
- metal — Apple Silicon / macOS.
- rocm — AMD GPUs on Linux with the ROCm toolchain.
Other:
- openmp — OpenMP CPU threading; orthogonal to the GPU backends and may be combined with any of them.
- mtmd — multimodal (vision) inference; required for Client::from_gguf_with_mmproj and ClientBuilder::mmproj.
Examples:
cargo build --features vulkan
cargo build --features cuda
cargo build --features "vulkan,mtmd"
Backend support depends on the corresponding llama-cpp-2 feature and any required native toolchain or system libraries being available on the host machine.
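The same selection can be made in Cargo.toml instead of on the command line. A sketch (the version strings are placeholders; pin real versions in practice):

```toml
[dependencies]
# Pick exactly one GPU backend feature; add "mtmd" only for vision models.
rig-llama-cpp = { version = "*", features = ["vulkan", "mtmd"] }
rig-core = "*"
```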
§Quick start
use rig::client::CompletionClient;
use rig::completion::Prompt;
let client = rig_llama_cpp::Client::builder("path/to/model.gguf")
.n_ctx(8192)
.build()?;
let agent = client
.agent("local")
.preamble("You are a helpful assistant.")
.max_tokens(512)
.build();
let response = agent.prompt("Hello!").await?;
println!("{response}");
Structs§
- CheckpointParams — Tunable parameters for the in-memory state-checkpoint cache used to preserve KV/recurrent state across chat turns for hybrid models.
- Client — The llama.cpp completion client.
- ClientBuilder — Builder for Client.
- EmbeddingClient — The llama.cpp embedding client.
- EmbeddingModelHandle — A handle to a loaded embedding model that implements Rig’s rig::embeddings::EmbeddingModel trait.
- FitParams — Configuration for automatic GPU/CPU layer fitting.
- KvCacheParams — KV cache quantization configuration.
- Model — A handle to a loaded model that implements Rig’s CompletionModel trait.
- RawResponse — Raw completion response returned by the model.
- SamplingParams — Sampling parameters that control token generation.
- StreamChunk — A single chunk emitted during streaming inference.
Enums§
- KvCacheType — Data type used for an entry in the attention KV cache.
- LoadError — Failure modes returned when constructing or reloading a crate::Client or crate::EmbeddingClient.