Crate rig_llama_cpp

§rig-llama-cpp

A Rig provider that runs GGUF models locally via llama.cpp, with optional GPU acceleration (Vulkan, CUDA, Metal, or ROCm).

This crate implements Rig’s rig::completion::CompletionModel and rig::embeddings::EmbeddingModel traits so that any GGUF model can be used as a drop-in replacement for cloud-based providers. It supports:

  • Completion and streaming — both one-shot and token-by-token responses.
  • Tool calling — models with OpenAI-compatible chat templates can invoke tools (see the sketch after this list).
  • Reasoning / thinking — extended thinking output is forwarded when the model supports it.
  • Configurable sampling — top-p, top-k, min-p, temperature, presence and repetition penalties.
  • Embeddings — generate text embeddings using GGUF embedding models.
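
For tool calling specifically, nothing crate-specific is needed: tools are defined through Rig's standard rig::tool::Tool trait and attached to the agent builder. A minimal sketch, assuming the trait shape of recent Rig releases (Adder and AddArgs are illustrative names; double-check the trait against the rig version you depend on):

use rig::completion::ToolDefinition;
use rig::tool::Tool;
use serde::Deserialize;

#[derive(Deserialize)]
struct AddArgs {
    a: i64,
    b: i64,
}

struct Adder;

impl Tool for Adder {
    const NAME: &'static str = "add";
    type Error = std::convert::Infallible;
    type Args = AddArgs;
    type Output = i64;

    async fn definition(&self, _prompt: String) -> ToolDefinition {
        ToolDefinition {
            name: Self::NAME.to_string(),
            description: "Add two integers.".to_string(),
            // JSON Schema for the arguments the model must produce.
            parameters: serde_json::json!({
                "type": "object",
                "properties": {
                    "a": { "type": "integer" },
                    "b": { "type": "integer" }
                },
                "required": ["a", "b"]
            }),
        }
    }

    async fn call(&self, args: Self::Args) -> Result<Self::Output, Self::Error> {
        Ok(args.a + args.b)
    }
}

Registering it is then one extra call on the agent builder shown in the quick start below, e.g. .tool(Adder) before .build().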

§Feature flags

There is no default GPU backend — pick exactly the one that matches your hardware. With no feature enabled the build is CPU-only.

GPU backends (forwarded to llama-cpp-2):

  • vulkan — cross-vendor GPU (recommended on Linux/Windows when CUDA/ROCm aren’t set up).
  • cuda — NVIDIA GPUs with the CUDA toolkit installed.
  • metal — Apple Silicon / macOS.
  • rocm — AMD GPUs on Linux with the ROCm toolchain.

Other:

  • openmp — OpenMP CPU threading; orthogonal to the GPU backends and may be combined with any of them.
  • mtmd — multimodal (vision) inference; required for Client::from_gguf_with_mmproj and ClientBuilder::mmproj (see the sketch at the end of this section).

Examples:

cargo build --features vulkan
cargo build --features cuda
cargo build --features "vulkan,mtmd"

Backend support depends on the corresponding llama-cpp-2 feature and any required native toolchain or system libraries being available on the host machine.
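
With mtmd enabled, a vision model can be paired with its multimodal projector file. A hedged sketch using the two entry points named above; the exact signatures (and the argument order of from_gguf_with_mmproj) are assumptions, so consult the Client and ClientBuilder docs:

// Via the builder:
let client = rig_llama_cpp::Client::builder("path/to/model.gguf")
    .mmproj("path/to/mmproj.gguf")
    .build()?;

// Or via the convenience constructor (argument order assumed):
let client = rig_llama_cpp::Client::from_gguf_with_mmproj(
    "path/to/model.gguf",
    "path/to/mmproj.gguf",
)?;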

§Quick start

use rig::client::CompletionClient;
use rig::completion::Prompt;

// Rig's prompt API is async; a Tokio runtime drives the example.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = rig_llama_cpp::Client::builder("path/to/model.gguf")
        .n_ctx(8192)
        .build()?;

    let agent = client
        .agent("local")
        .preamble("You are a helpful assistant.")
        .max_tokens(512)
        .build();

    let response = agent.prompt("Hello!").await?;
    println!("{response}");
    Ok(())
}
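
The same agent can also stream token by token. The return and item types of stream_prompt differ across Rig releases (this provider surfaces incremental output via its StreamChunk type), so treat this as a version-agnostic sketch:

use futures::StreamExt;
use rig::streaming::StreamingPrompt;

// Each item is one incremental piece of the response; a Debug print
// keeps the sketch independent of the concrete chunk type.
let mut stream = agent.stream_prompt("Tell me a short story.").await;
while let Some(chunk) = stream.next().await {
    println!("{chunk:?}");
}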

Structs§

CheckpointParams
Tunable parameters for the in-memory state-checkpoint cache used to preserve KV/recurrent state across chat turns for hybrid models.
Client
The llama.cpp completion client.
ClientBuilder
Builder for Client.
EmbeddingClient
The llama.cpp embedding client (usage sketch after this list).
EmbeddingModelHandle
A handle to a loaded embedding model that implements Rig’s rig::embeddings::EmbeddingModel trait.
FitParams
Configuration for automatic GPU/CPU layer fitting.
KvCacheParams
KV cache quantization configuration.
Model
A handle to a loaded model that implements Rig’s CompletionModel trait.
RawResponse
Raw completion response returned by the model.
SamplingParams
Sampling parameters that control token generation.
StreamChunk
A single chunk emitted during streaming inference.
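
To round off the structs above, a hedged sketch of the embedding path. The EmbeddingClient constructor shown is an assumption modeled on Client::builder, while embedding_model and embed_text come from Rig's EmbeddingsClient and EmbeddingModel traits:

use rig::client::EmbeddingsClient;
use rig::embeddings::EmbeddingModel;

// Hypothetical constructor; check the EmbeddingClient docs for the real one.
let client = rig_llama_cpp::EmbeddingClient::builder("path/to/embedding-model.gguf")
    .build()?;
let model = client.embedding_model("local");
let embedding = model.embed_text("Hello, world!").await?;
println!("{} dimensions", embedding.vec.len());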

Enums§

KvCacheType
Data type used for an entry in the attention KV cache.
LoadError
Failure modes returned when constructing or reloading a crate::Client or crate::EmbeddingClient.