llama-crab

Safe, ergonomic Rust bindings for llama.cpp.

llama-crab provides:

Low-level FFI bindings to llama.cpp, ggml, gguf and mtmd through llama-crab-sys.
A safe high-level Rust API for model loading, text completion, chat completion and infill.
Sampling chains, grammar-constrained decoding and JSON-Schema to GBNF conversion.
Chat templates, tool-call parsing and OpenAI-compatible data structures.
Embeddings, reranking, prompt cache, session state and speculative decoding.
Multimodal support through mtmd for vision and audio capable GGUF models.
Hardware backends for CPU, Metal, CUDA, Vulkan and ROCm through Cargo features.

Documentation is available at docs.rs/llama-crab and in the mdBook user guide.

Installation

Add the crate to your Cargo.toml:

[dependencies]
llama-crab = "0.1"

By default, llama-crab enables CPU OpenMP support and Apple Metal on aarch64 macOS. To choose backends explicitly, disable default features and enable the ones you need:

[dependencies]
llama-crab = { version = "0.1", default-features = false, features = ["cuda", "openmp"] }

The crate builds the bundled llama.cpp sources through CMake. You need:

Rust 1.88 or newer.
CMake 3.18 or newer.
A C and C++ compiler supported by llama.cpp.
A platform SDK when using GPU backends such as Metal, CUDA, Vulkan or ROCm.

Cargo Features

Feature	Description
`openmp`	CPU backend with OpenMP. Enabled by default.
`metal`	Apple Metal backend. Enabled by default on `aarch64` macOS.
`cuda`	NVIDIA CUDA backend.
`cuda-no-vmm`	CUDA backend without virtual memory management.
`vulkan`	Vulkan backend.
`rocm`	AMD ROCm/HIP backend.
`mtmd`	Multimodal support through `mtmd.h`; enables image/audio helpers.
`common`	Builds llama.cpp common utilities used by chat and grammar helpers.
`llguidance`	Enables the llguidance sampler integration.
`hf-tokenizer`	Enables Hugging Face tokenizer support.
`disk-cache`	Enables the persistent `sled`-backed prompt cache.
`dynamic-link`	Links llama.cpp as a shared object.
`dynamic-backends`	Loads GGML backends dynamically.
`system-ggml`	Uses a system GGML installation instead of the bundled copy.

Basic Usage

Load a GGUF model and generate a text completion:

use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(
        LlamaParams::new("models/model.gguf")
            .with_n_ctx(2048)
            .with_n_gpu_layers(99),
    )?;

    let response = llama.create_completion("The capital of France is", 32)?;
    println!("{}", response.text);

    Ok(())
}

Chat Completion

Chat completion accepts a list of role-based messages. Built-in templates can be selected explicitly when you need deterministic formatting.

use llama_crab::chat::BuiltinTemplate;
use llama_crab::high_level::chat_completion::{create_chat_completion_with, ChatMessage};
use llama_crab::{Llama, LlamaParams, Role};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(LlamaParams::new("models/instruct.gguf").with_n_ctx(4096))?;

    let messages = vec![
        ChatMessage::new(Role::System, "You are a concise assistant."),
        ChatMessage::new(Role::User, "Explain Rust ownership in one paragraph."),
    ];

    let response = create_chat_completion_with(
        &mut llama,
        &messages,
        BuiltinTemplate::ChatMl,
        &[],
        128,
    )?;

    println!("{}", response.content);
    Ok(())
}

JSON Schema and Grammar-Constrained Decoding

llama-crab can convert JSON Schema into GBNF grammar and use grammar samplers to constrain model output.

use llama_crab::high_level::completion::json_schema_grammar;
use serde_json::json;

let schema = json!({
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "age": { "type": "integer" }
    },
    "required": ["name", "age"]
});

let grammar = json_schema_grammar(&schema)?;
# let _ = grammar;
# Ok::<(), Box<dyn std::error::Error>>(())

See the structured example for a complete program.

Tool Calling

The chat module includes incremental tool-call parsing for common model formats, including ChatML, Mistral, Llama 3, Functionary and plain JSON object output.

use llama_crab::chat::tool_call::{ToolFormat, ToolParser};

let mut parser = ToolParser::new(ToolFormat::ChatMl);
let calls = parser.feed("<tool_call>{\"name\":\"get_weather\",\"arguments\":{\"city\":\"Tokyo\"}}</tool_call>");
# let _ = calls;
# Ok::<(), Box<dyn std::error::Error>>(())

See Chat & tool calling for supported formats and parser behavior.

Embeddings and Reranking

Enable embeddings when loading the model, then call Llama::embed to get an optionally L2-normalized vector.

use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut llama = Llama::load(
        LlamaParams::new("models/bge-small.gguf")
            .with_n_ctx(512)
            .with_embeddings(true),
    )?;

    let embedding = llama.embed("Rust is a systems programming language.", true)?;
    println!("dim = {}", embedding.len());

    Ok(())
}

See Embeddings & reranking, embeddings and embedding_search.

Multimodal Models

The mtmd feature exposes llama.cpp's multimodal pipeline for GGUF models that use a paired projector.

[dependencies]
llama-crab = { version = "0.1", features = ["mtmd"] }

Supported workflows include:

Loading a text model and an mmproj projector.
Decoding local images into MtmdBitmap.
Tokenizing text and media together with MtmdContext.
Evaluating multimodal chunks and continuing generation with normal samplers.

See Multimodal, vision, mtmd, and the integration tests under llama-crab/tests.

Speculative Decoding

Prompt-lookup speculative decoding is available through the speculative module. It can draft candidate tokens from repeated n-grams in the prompt and verify them with the main model.

See Speculative decoding and the speculative example.

Examples

The repository contains runnable example crates under examples/. The helper script downloads known-good GGUF fixtures on first run.

./examples/run.sh quickstart
./examples/run.sh chat
./examples/run.sh stateful_chat
./examples/run.sh embeddings
./examples/run.sh embedding_search
./examples/run.sh reranker
./examples/run.sh vision gemma4
./examples/run.sh vision lfm-vl
./examples/run.sh mtmd gemma4
./examples/run.sh tools
./examples/run.sh structured
./examples/run.sh speculative

Each example is a standalone Cargo crate and can be copied into another project.

Documentation

To serve the guide locally:

mdbook serve docs

Crates

Crate	Description
`llama-crab`	Safe high-level API and Rust abstractions.
`llama-crab-sys`	Low-level FFI package that builds and links llama.cpp.

Most applications should depend on llama-crab. Use llama-crab-sys only when you need direct access to raw llama.cpp symbols.

Development

Clone with submodules:

git clone --recursive https://github.com/DominguesM/llama-crab.git
cd llama-crab

Common checks:

cargo fmt --all -- --check
cargo test --workspace
cargo clippy -p llama-crab --all-features --all-targets
cargo doc -p llama-crab --no-deps --all-features

The minimum supported Rust version is 1.88 and is pinned in rust-toolchain.toml.

License

Licensed under the MIT License. See LICENSE-MIT.

Acknowledgements

llama-crab builds on llama.cpp.

Inspired by llama-cpp-rs and the feature completeness of llama-cpp-python.

llama-crab 0.1.201