llama-cpp-4
Safe Rust bindings to llama.cpp.
Tracks upstream closely: the crate is designed to stay current with llama.cpp rather than provide a thick abstraction layer.
llama.cpp version: 4fc4ec5 (b9859) · Crate version: 0.4.0
Add to your project
[dependencies]
llama-cpp-4 = "0.4.0"
# GPU support (pick one or more)
# llama-cpp-4 = { version = "0.4.0", features = ["cuda"] }
# llama-cpp-4 = { version = "0.4.0", features = ["metal"] }
# llama-cpp-4 = { version = "0.4.0", features = ["vulkan"] }
Prelude
Import the common inference types in one line:
use llama_cpp_4::prelude::*;
The prelude re-exports backend, model, context, batching, sampling, errors, fit/memory
helpers, speculative-decoding types, and quantization symbols. The same core types are
also at the crate root (llama_cpp_4::LlamaModel, etc.) if you prefer explicit paths.
| Category | Types |
|---|---|
| Inference | LlamaBackend, LlamaModel, LlamaContext, LlamaBatch, LlamaSampler, LlamaSamplerParams, LlamaToken, LlamaTokenDataArray |
| Tokenising | AddBos, Special |
| Chat | LlamaChatMessage |
| Model introspection | LlamaBackendDevice, LlamaBackendDeviceType |
| Context tuning | LlamaFlashAttnType, LlamaContextType, LlamaAttentionType, RopeScalingType, ParamsCloneError |
| KV overrides | ParamOverrideValue |
| Memory / fit | get_device_memory_data, fit_params, FitParams, MemoryBreakdownEntry |
| Tensor capture | TensorCapture, CapturedTensor |
| Speculative | MtpSession, Eagle3Session (+ configs) |
| Quantization | QuantizeParams, TensorTypeOverride, GgmlType, LlamaFtype, model_quantize |
See prelude on docs.rs for runnable examples (generation, chat, embeddings, memory estimation).
Feature flags
| Feature | Default | Description |
|---|---|---|
| openmp | ✅ | Multi-threaded CPU inference via OpenMP |
| mtmd | ✅ | Multimodal (vision / audio) via libmtmd |
| dynamic-link | ✅ | Link llama.cpp as a shared library |
| cuda | | NVIDIA GPU via CUDA |
| metal | | Apple GPU via Metal |
| vulkan | | Cross-platform GPU via Vulkan |
| native | | CPU auto-tune for current arch (AVX2, NEON, …) |
| rpc | | Remote compute backend |
API overview
All snippets below assume use llama_cpp_4::prelude::*;.
Backend
// Initialise once per process. Configures hardware backends (CUDA, Metal, …).
let backend = LlamaBackend::init()?;
Loading a model
use std::pin::pin;
let mut params = default.with_n_gpu_layers;
let params = pin!(params);
let model = load_from_file?;
println!;
println!;
println!;
// Multi-GPU / MoE introspection
println!;
println!;
for dev in model.devices
Memory estimation (before full load)
use std::path::Path;
let report = get_device_memory_data?;
for entry in &report.entries
Auto-fit parameters to device memory
use ;
let backend = LlamaBackend::init()?;
let fitted = fit_params?;
let model = load_from_file?;
let ctx = model.new_context?;
Tokenising
let tokens = model.str_to_token?;
let text = model.token_to_str?;
let bytes = model.token_to_bytes?;
Chat template
let messages = vec!;
let prompt = model.apply_chat_template?;
Creating a context
use std::num::NonZeroU32;
let params = default
.with_n_ctx
.with_n_batch
.with_n_threads
.with_flash_attn_type;
let mut ctx = model.new_context?;
Batched decode (prefill + generation)
let mut batch = new;
for in tokens.iter.enumerate
ctx.decode?;
batch.clear;
batch.add?;
ctx.decode?;
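The prefill-then-generate pattern above comes down to position and logits bookkeeping. The following is a model-free sketch of that bookkeeping only. `BatchEntry`, `prefill_entries`, and `generation_entry` are hypothetical names that are not part of llama-cpp-4, and the sketch assumes the common llama.cpp convention of requesting logits only for the last prompt token.

```rust
/// Hypothetical record of what one batch slot carries (not a crate type).
#[derive(Debug, PartialEq)]
pub struct BatchEntry {
    pub token: i32,
    pub pos: usize,
    pub logits: bool,
}

/// Prefill: every prompt token gets its own position; only the last one
/// asks for logits, since that is the only place sampling will happen.
pub fn prefill_entries(tokens: &[i32]) -> Vec<BatchEntry> {
    let last = tokens.len().saturating_sub(1);
    tokens
        .iter()
        .enumerate()
        .map(|(i, &token)| BatchEntry { token, pos: i, logits: i == last })
        .collect()
}

/// Generation: a single freshly sampled token placed at the next free
/// position (`n_past` tokens are already in the KV cache), with logits on.
pub fn generation_entry(token: i32, n_past: usize) -> BatchEntry {
    BatchEntry { token, pos: n_past, logits: true }
}
```

After each generation step the caller would increment `n_past` by one, so positions stay contiguous across the prefill and every later single-token decode.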
Sampling
let sampler = chain_simple;
let token = sampler.sample;
if model.is_eog_token
let bytes = model.token_to_bytes?;
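One reason to work with token bytes rather than strings when streaming output is that a single token may end in the middle of a multi-byte UTF-8 character. The sketch below buffers raw bytes and emits only complete text. `Utf8Stream` is a hypothetical helper built on the standard library alone; it is not part of llama-cpp-4, and it assumes the per-token bytes arrive as a `&[u8]`.

```rust
/// Accumulates raw token bytes and yields only complete UTF-8 text.
/// Hypothetical helper, not a crate type.
#[derive(Default)]
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Append the bytes of one token and return whatever text is now complete.
    pub fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let mut out = String::new();
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    // Everything buffered is valid: emit it all.
                    out.push_str(s);
                    self.pending.clear();
                    return out;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    // The prefix up to `valid` was just validated, so this cannot fail.
                    out.push_str(std::str::from_utf8(&self.pending[..valid]).unwrap());
                    match e.error_len() {
                        // Incomplete sequence at the end: keep it for the next token.
                        None => {
                            self.pending.drain(..valid);
                            return out;
                        }
                        // Genuinely invalid bytes: emit U+FFFD and skip them.
                        Some(len) => {
                            out.push('\u{FFFD}');
                            self.pending.drain(..valid + len);
                        }
                    }
                }
            }
        }
    }
}
```

A generation loop would call `push` once per sampled token and print the returned string, which may be empty while a character is still incomplete.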
KV cache
ctx.clear_kv_cache_seq?; // clear sequence 0
ctx.clear_kv_cache; // clear all sequences
Embeddings
use std::num::NonZeroU32;
let params = default
.with_embeddings
.with_n_ctx;
let mut ctx = model.new_context?;
// ... fill batch, decode ...
let vec = ctx.embeddings_seq_ith?;
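Once two sequence embeddings are available, a typical next step is to compare them. The following is a minimal sketch of cosine similarity, assuming each embedding can be viewed as a `&[f32]` slice. `cosine_similarity` is a hypothetical helper, not something the crate is stated to provide.

```rust
/// Cosine similarity between two equal-length embedding vectors.
/// Returns `None` when the lengths differ or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Returning `Option` rather than `NaN` keeps the degenerate cases (mismatched dimensions, all-zero vectors) explicit at the call site.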
Runtime memory breakdown
for entry in ctx.memory_breakdown
Tensor capture (hidden states)
Hook cb_eval during decode to copy per-layer outputs ("l_out-N") or other
named graph nodes:
use *;
let mut capture = for_layers;
let ctx_params = default.with_tensor_capture;
let mut ctx = model.new_context?;
// ... fill batch, decode ...
ctx.decode?;
if let Some = capture.get_layer
See also context::tensor_capture and
examples/eagle (EAGLE-3 uses specific anchor layers).
LoRA adapters
let adapter = model.load_lora_adapter?;
ctx.set_lora_adapter?;
ctx.lora_adapter_remove?;
Performance counters
let perf = ctx.timings;
println!;
ctx.perf_context_reset;
Full example: text generation
use llama_cpp_4::prelude::*;
use std::num::NonZeroU32;
Safety
This crate wraps a C++ library via FFI. The safe API prevents most misuse, but some patterns (e.g. using a context after its model is dropped) can still cause undefined behavior (UB). Please file an issue if you find one.
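The use-after-drop hazard mentioned above is the kind of misuse that Rust lifetimes can rule out when a context borrows its model. The sketch below illustrates the general pattern with hypothetical `ToyModel` and `ToyContext` types. It does not show the crate's actual types, and it makes no claim about how llama-cpp-4 ties contexts to models internally.

```rust
/// Hypothetical stand-in for a loaded model (not the crate's `LlamaModel`).
pub struct ToyModel {
    vocab: Vec<String>,
}

/// Hypothetical context that borrows its model for the lifetime `'m`,
/// so the borrow checker refuses to let the model be dropped first.
pub struct ToyContext<'m> {
    model: &'m ToyModel,
    n_decoded: usize,
}

impl ToyModel {
    pub fn new(vocab: &[&str]) -> Self {
        ToyModel { vocab: vocab.iter().map(|s| s.to_string()).collect() }
    }

    pub fn new_context(&self) -> ToyContext<'_> {
        ToyContext { model: self, n_decoded: 0 }
    }
}

impl<'m> ToyContext<'m> {
    /// Look up a token's text in the borrowed model and count the call.
    pub fn decode(&mut self, token: usize) -> Option<&'m str> {
        self.n_decoded += 1;
        let model: &'m ToyModel = self.model;
        model.vocab.get(token).map(String::as_str)
    }

    pub fn n_decoded(&self) -> usize {
        self.n_decoded
    }
}

// The following would be rejected at compile time, because `model` is
// dropped while the returned context still borrows it:
//
// let ctx = { let model = ToyModel::new(&["a"]); model.new_context() };
```

Wherever a wrapper cannot express such a borrow in the type system, the ordering has to be maintained by the caller, which is where patterns like the one described above can slip through.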
Examples in this repo
| Crate | Description |
|---|---|
| simple | Single-turn completion |
| chat | Interactive multi-turn REPL |
| openai-server | OpenAI-compatible HTTP API |
| mtp | MTP speculative decoding |
| eagle | EAGLE-3 speculative decoding |
| incremental-chat | Incremental prefill while typing |
| fit-params | Auto-fit n_ctx / GPU layers to device memory |
Requirements
- Rust 1.75+
- clang (for bindgen at build time)
- A C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
- For CUDA: CUDA toolkit 11.8+
- For Metal: Xcode 14+
Testing
Unit tests run without a model (vocab-only fixtures from the build tree when available):
End-to-end integration tests load a real GGUF and exercise decode, generation, embeddings, memory breakdown, fit helpers, and tensor capture:
Or point at any local checkpoint:
LLAMA_TEST_MODEL=/path/to/model.gguf \
Use --test-threads=1 because llama_decode is not safe to exercise in parallel across
contexts in the same process.
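Where passing --test-threads=1 is inconvenient, a process-wide lock is another way to keep decode-heavy tests from overlapping. The following is a sketch of that general Rust pattern under stated assumptions: `DECODE_LOCK` and `with_decode_lock` are hypothetical names, and nothing here is claimed to be what this repository's test suite does.

```rust
use std::sync::{Mutex, MutexGuard, PoisonError};

/// Hypothetical global lock shared by every test that calls into decode.
static DECODE_LOCK: Mutex<()> = Mutex::new(());

/// Run `f` while holding the global lock, so only one caller at a time
/// executes its body even when the test harness uses several threads.
pub fn with_decode_lock<T>(f: impl FnOnce() -> T) -> T {
    // A panicking test poisons the mutex; recover the guard so the
    // remaining tests still run instead of failing on the poison error.
    let _guard: MutexGuard<'_, ()> = DECODE_LOCK
        .lock()
        .unwrap_or_else(PoisonError::into_inner);
    f()
}
```

Each integration test would wrap its body in `with_decode_lock(|| { ... })`. The trade-off is that every such test must remember to take the lock, whereas the command-line flag covers the whole run.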