llama-cpp-4

Safe Rust bindings to llama.cpp.
Tracks upstream closely — designed to stay current rather than provide a thick abstraction layer.

llama.cpp version: 4fc4ec5 (b9859) · Crate version: 0.4.0

Add to your project

[dependencies]
llama-cpp-4 = "0.4.0"

# GPU support (pick one or more)
# llama-cpp-4 = { version = "0.4.0", features = ["cuda"] }
# llama-cpp-4 = { version = "0.4.0", features = ["metal"] }
# llama-cpp-4 = { version = "0.4.0", features = ["vulkan"] }

Prelude

Import the common inference types in one line:

use llama_cpp_4::prelude::*;

The prelude re-exports backend, model, context, batching, sampling, errors, fit/memory helpers, speculative-decoding types, and quantization symbols. The same core types are also at the crate root (llama_cpp_4::LlamaModel, etc.) if you prefer explicit paths.

Category	Types
Inference	`LlamaBackend`, `LlamaModel`, `LlamaContext`, `LlamaBatch`, `LlamaSampler`, `LlamaSamplerParams`, `LlamaToken`, `LlamaTokenDataArray`
Tokenising	`AddBos`, `Special`
Chat	`LlamaChatMessage`
Model introspection	`LlamaBackendDevice`, `LlamaBackendDeviceType`
Context tuning	`LlamaFlashAttnType`, `LlamaContextType`, `LlamaAttentionType`, `RopeScalingType`, `ParamsCloneError`
KV overrides	`ParamOverrideValue`
Memory / fit	`get_device_memory_data`, `fit_params`, `FitParams`, `MemoryBreakdownEntry`
Tensor capture	`TensorCapture`, `CapturedTensor`
Speculative	`MtpSession`, `Eagle3Session` (+ configs)
Quantization	`QuantizeParams`, `TensorTypeOverride`, `GgmlType`, `LlamaFtype`, `model_quantize`

See prelude on docs.rs for runnable examples (generation, chat, embeddings, memory estimation).

Feature flags

Feature	Default	Description
`openmp`	✅	Multi-threaded CPU inference via OpenMP
`mtmd`	✅	Multimodal (vision / audio) via `libmtmd`
`dynamic-link`	✅	Link llama.cpp as a shared library
`cuda`		NVIDIA GPU via CUDA
`metal`		Apple GPU via Metal
`vulkan`		Cross-platform GPU via Vulkan
`native`		CPU auto-tune for current arch (AVX2, NEON, …)
`rpc`		Remote compute backend

API overview

All snippets below assume use llama_cpp_4::prelude::*;.

Backend

// Initialise once per process. Configures hardware backends (CUDA, Metal, …).
let backend = LlamaBackend::init()?;

Loading a model

use std::pin::pin;

let mut params = LlamaModelParams::default().with_n_gpu_layers(99);
let params = pin!(params);

let model = LlamaModel::load_from_file(&backend, "model.gguf", &params)?;

println!("vocab size : {}", model.n_vocab());
println!("context len: {}", model.n_ctx_train());
println!("embed dim  : {}", model.n_embd());

// Multi-GPU / MoE introspection
println!("devices    : {}", model.n_devices());
println!("experts    : {}", model.n_expert());
for dev in model.devices() {
    let (free, total) = dev.memory();
    println!("  {} — {} / {} bytes free", dev.name()?, free, total);
}

Memory estimation (before full load)

use std::path::Path;

let report = get_device_memory_data(
    Path::new("model.gguf"),
    &LlamaModelParams::default().with_n_gpu_layers(99),
    &LlamaContextParams::default(),
    llama_cpp_sys_4::GGML_LOG_LEVEL_ERROR,
)?;
for entry in &report.entries {
    println!("projected: {} bytes", entry.used());
}

Auto-fit parameters to device memory

use llama_cpp_4::fit::{fit_params, FitParams};

let backend = LlamaBackend::init()?;
let fitted = fit_params(
    &backend,
    Path::new("model.gguf"),
    FitParams::default().with_n_ctx_min(512),
)?;

let model = LlamaModel::load_from_file(&backend, "model.gguf", &fitted.model_params)?;
let ctx = model.new_context(&backend, fitted.context_params)?;

Tokenising

let tokens = model.str_to_token("Hello, world!", AddBos::Always)?;
let text   = model.token_to_str(tokens[0], Special::Plaintext)?;
let bytes  = model.token_to_bytes(tokens[0], Special::Plaintext)?;

Chat template

let messages = vec![
    LlamaChatMessage::new("system".into(), "You are helpful.".into())?,
    LlamaChatMessage::new("user".into(),   "What is 2+2?".into())?,
];
let prompt = model.apply_chat_template(None, messages, true)?;

Creating a context

use std::num::NonZeroU32;

let params = LlamaContextParams::default()
    .with_n_ctx(NonZeroU32::new(4096))
    .with_n_batch(512)
    .with_n_threads(8)
    .with_flash_attn_type(LlamaFlashAttnType::Auto);

let mut ctx = model.new_context(&backend, params)?;

Batched decode (prefill + generation)

let mut batch = LlamaBatch::new(512, 1);

for (i, &tok) in tokens.iter().enumerate() {
    let last = i == tokens.len() - 1;
    batch.add(tok, i as i32, &[0], last)?;
}
ctx.decode(&mut batch)?;

batch.clear();
batch.add(new_token, pos, &[0], true)?;
ctx.decode(&mut batch)?;

Sampling

let sampler = LlamaSampler::chain_simple([
    LlamaSampler::top_k(40),
    LlamaSampler::top_p(0.95, 1),
    LlamaSampler::temp(0.8),
    LlamaSampler::dist(42),
]);

let token = sampler.sample(&ctx, batch.n_tokens() - 1);
if model.is_eog_token(token) { /* done */ }
let bytes = model.token_to_bytes(token, Special::Plaintext)?;

KV cache

ctx.clear_kv_cache_seq(Some(0), None, None)?; // clear sequence 0
ctx.clear_kv_cache();                          // clear all sequences

Embeddings

use std::num::NonZeroU32;

let params = LlamaContextParams::default()
    .with_embeddings(true)
    .with_n_ctx(NonZeroU32::new(512));
let mut ctx = model.new_context(&backend, params)?;

// ... fill batch, decode ...
let vec = ctx.embeddings_seq_ith(0)?;

Runtime memory breakdown

for entry in ctx.memory_breakdown() {
    println!("{}: {} bytes", entry.buft_name, entry.total());
}

Tensor capture (hidden states)

Hook cb_eval during decode to copy per-layer outputs ("l_out-N") or other named graph nodes:

use llama_cpp_4::prelude::*;

let mut capture = TensorCapture::for_layers(&[13, 20, 27]);
let ctx_params = LlamaContextParams::default().with_tensor_capture(&mut capture);
let mut ctx = model.new_context(&backend, ctx_params)?;

// ... fill batch, decode ...
ctx.decode(&mut batch)?;

if let Some(layer) = capture.get_layer(13) {
    println!("{} tokens × {} dims", layer.n_tokens(), layer.n_embd());
    let hidden = layer.token_embedding(0).unwrap();
}

See also context::tensor_capture and examples/eagle (EAGLE-3 uses specific anchor layers).

LoRA adapters

let adapter = model.load_lora_adapter("adapter.gguf", 1.0)?;
ctx.set_lora_adapter(&adapter, 1.0)?;
ctx.lora_adapter_remove()?;

Performance counters

let perf = ctx.timings();
println!("prompt eval: {:.2} ms", perf.t_p_eval_ms());
ctx.perf_context_reset();

Full example: text generation

use llama_cpp_4::prelude::*;
use std::num::NonZeroU32;

fn main() -> anyhow::Result<()> {
    let backend = LlamaBackend::init()?;
    let model = LlamaModel::load_from_file(
        &backend,
        "model.gguf",
        &LlamaModelParams::default(),
    )?;

    let ctx_params = LlamaContextParams::default()
        .with_n_ctx(NonZeroU32::new(2048));
    let mut ctx = model.new_context(&backend, ctx_params)?;

    let tokens = model.str_to_token("The answer is", AddBos::Always)?;
    let n_prompt = tokens.len();

    let mut batch = LlamaBatch::new(2048, 1);
    for (i, &tok) in tokens.iter().enumerate() {
        batch.add(tok, i as i32, &[0], i == n_prompt - 1)?;
    }
    ctx.decode(&mut batch)?;

    let sampler = LlamaSampler::chain_simple([
        LlamaSampler::temp(0.8),
        LlamaSampler::dist(0),
    ]);

    let mut pos = n_prompt as i32;
    let mut decoder = encoding_rs::UTF_8.new_decoder();

    for _ in 0..256 {
        let token = sampler.sample(&ctx, 0);
        if model.is_eog_token(token) {
            break;
        }

        let bytes = model.token_to_bytes(token, Special::Plaintext)?;
        let mut piece = String::new();
        decoder.decode_to_string(&bytes, &mut piece, false);
        print!("{piece}");

        batch.clear();
        batch.add(token, pos, &[0], true)?;
        ctx.decode(&mut batch)?;
        pos += 1;
    }
    Ok(())
}

Safety

This crate wraps a C++ library via FFI. The safe API prevents most misuse, but some patterns (e.g. using a context after its model is dropped) can still cause UB. File an issue if you spot any.

Examples in this repo

Crate	Description
`simple`	Single-turn completion
`chat`	Interactive multi-turn REPL
`openai-server`	OpenAI-compatible HTTP API
`mtp`	MTP speculative decoding
`eagle`	EAGLE-3 speculative decoding
`incremental-chat`	Incremental prefill while typing
`fit-params`	Auto-fit `n_ctx` / GPU layers to device memory

Requirements

Rust 1.75+
clang (for bindgen at build time)
A C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
For CUDA: CUDA toolkit 11.8+
For Metal: Xcode 14+

Testing

Unit tests run without a model (vocab-only fixtures from the build tree when available):

cargo test -p llama-cpp-4

End-to-end integration tests load a real GGUF and exercise decode, generation, embeddings, memory breakdown, fit helpers, and tensor capture:

./scripts/fetch-test-model.sh
cargo test -p llama-cpp-4 --test test_integration -- --test-threads=1

Or point at any local checkpoint:

LLAMA_TEST_MODEL=/path/to/model.gguf \
  cargo test -p llama-cpp-4 --test test_integration -- --test-threads=1

Use --test-threads=1 because llama_decode is not safe to exercise in parallel across contexts in the same process.

llama-cpp-4 0.4.0