Crate rig_llama_cpp

§rig-llama-cpp

A Rig provider that runs GGUF models locally via llama.cpp, with optional GPU acceleration (Vulkan, CUDA, Metal, or ROCm).

This crate implements Rig’s rig::completion::CompletionModel and rig::embeddings::EmbeddingModel traits so that any GGUF model can be used as a drop-in replacement for cloud-based providers. It supports:

  • Completion and streaming — both one-shot and token-by-token responses.
  • Tool calling — models with OpenAI-compatible chat templates can invoke tools (see the sketch after this list).
  • Reasoning / thinking — extended thinking output is forwarded when the model supports it.
  • Configurable sampling — top-p, top-k, min-p, temperature, presence and repetition penalties.
  • Embeddings — generate text embeddings using GGUF embedding models.
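
For tool calling specifically, nothing crate-specific is needed: tools are defined through Rig's standard rig::tool::Tool trait and attached to the agent builder. A minimal sketch, assuming the trait shape of recent Rig releases (Adder and AddArgs are illustrative names; double-check the trait against the rig version you depend on):

use rig::completion::ToolDefinition;
use rig::tool::Tool;
use serde::Deserialize;

#[derive(Deserialize)]
struct AddArgs {
    a: i64,
    b: i64,
}

struct Adder;

impl Tool for Adder {
    const NAME: &'static str = "add";
    type Error = std::convert::Infallible;
    type Args = AddArgs;
    type Output = i64;

    async fn definition(&self, _prompt: String) -> ToolDefinition {
        ToolDefinition {
            name: Self::NAME.to_string(),
            description: "Add two integers.".to_string(),
            // JSON Schema for the arguments the model must produce.
            parameters: serde_json::json!({
                "type": "object",
                "properties": {
                    "a": { "type": "integer" },
                    "b": { "type": "integer" }
                },
                "required": ["a", "b"]
            }),
        }
    }

    async fn call(&self, args: Self::Args) -> Result<Self::Output, Self::Error> {
        Ok(args.a + args.b)
    }
}

Registering it is then one extra call on the agent builder shown in the quick start below, e.g. .tool(Adder) before .build().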

§Feature flags

There is no default GPU backend — pick exactly the one that matches your hardware. With no feature enabled the build is CPU-only.

GPU backends (forwarded to llama-cpp-2):

  • vulkan — cross-vendor GPU (recommended on Linux/Windows when CUDA/ROCm aren’t set up).
  • cuda — NVIDIA GPUs with the CUDA toolkit installed.
  • metal — Apple Silicon / macOS.
  • rocm — AMD GPUs on Linux with the ROCm toolchain.

Other:

  • openmp — OpenMP CPU threading; orthogonal to the GPU backends and may be combined with any of them.
  • mtmd — multimodal (vision) inference; required for Client::from_gguf_with_mmproj and ClientBuilder::mmproj (see the sketch at the end of this section).

Examples:

cargo build --features vulkan
cargo build --features cuda
cargo build --features "vulkan,mtmd"

Backend support depends on the corresponding llama-cpp-2 feature and any required native toolchain or system libraries being available on the host machine.
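
With mtmd enabled, a vision model can be paired with its multimodal projector file. A hedged sketch using the two entry points named above; the exact signatures (and the argument order of from_gguf_with_mmproj) are assumptions, so consult the Client and ClientBuilder docs:

// Via the builder:
let client = rig_llama_cpp::Client::builder("path/to/model.gguf")
    .mmproj("path/to/mmproj.gguf")
    .build()?;

// Or via the convenience constructor (argument order assumed):
let client = rig_llama_cpp::Client::from_gguf_with_mmproj(
    "path/to/model.gguf",
    "path/to/mmproj.gguf",
)?;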

§Quick start

use rig::client::CompletionClient;
use rig::completion::Prompt;

// Rig's prompt API is async; a Tokio runtime drives the example.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = rig_llama_cpp::Client::builder("path/to/model.gguf")
        .n_ctx(8192)
        .build()?;

    let agent = client
        .agent("local")
        .preamble("You are a helpful assistant.")
        .max_tokens(512)
        .build();

    let response = agent.prompt("Hello!").await?;
    println!("{response}");
    Ok(())
}
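
The same agent can also stream token by token. The return and item types of stream_prompt differ across Rig releases (this provider surfaces incremental output via its StreamChunk type), so treat this as a version-agnostic sketch:

use futures::StreamExt;
use rig::streaming::StreamingPrompt;

// Each item is one incremental piece of the response; a Debug print
// keeps the sketch independent of the concrete chunk type.
let mut stream = agent.stream_prompt("Tell me a short story.").await;
while let Some(chunk) = stream.next().await {
    println!("{chunk:?}");
}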

Structs§

CheckpointParams
Tunable parameters for the in-memory state-checkpoint cache used to preserve KV/recurrent state across chat turns for hybrid models.
Client
The llama.cpp completion client.
ClientBuilder
Builder for Client.
EmbeddingClient
The llama.cpp embedding client (usage sketch after this list).
EmbeddingModelHandle
A handle to a loaded embedding model that implements Rig’s rig::embeddings::EmbeddingModel trait.
FitParams
Configuration for automatic GPU/CPU layer fitting.
KvCacheParams
KV cache quantization configuration.
Model
A handle to a loaded model that implements Rig’s CompletionModel trait.
RawResponse
Raw completion response returned by the model.
SamplingParams
Sampling parameters that control token generation.
StreamChunk
A single chunk emitted during streaming inference.
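
To round off the structs above, a hedged sketch of the embedding path. The EmbeddingClient constructor shown is an assumption modeled on Client::builder, while embedding_model and embed_text come from Rig's EmbeddingsClient and EmbeddingModel traits:

use rig::client::EmbeddingsClient;
use rig::embeddings::EmbeddingModel;

// Hypothetical constructor; check the EmbeddingClient docs for the real one.
let client = rig_llama_cpp::EmbeddingClient::builder("path/to/embedding-model.gguf")
    .build()?;
let model = client.embedding_model("local");
let embedding = model.embed_text("Hello, world!").await?;
println!("{} dimensions", embedding.vec.len());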

Enums§

KvCacheType
Data type used for an entry in the attention KV cache.
LoadError
Failure modes returned when constructing or reloading a crate::Client or crate::EmbeddingClient.