llama-crab 0.1.1


llama-crab

Safe, ergonomic and complete Rust bindings for llama.cpp.

Inspired by llama-cpp-rs and the feature completeness of llama-cpp-python.

License: MIT or Apache 2.0. MSRV: 1.80.

llama-crab provides two crates:

| Crate | Purpose |
| --- | --- |
| llama-crab-sys | Low-level, hand-curated FFI over llama.h, ggml.h, gguf.h (and mtmd.h), generated via bindgen and cmake. |
| llama-crab | Safe, idiomatic Rust API: LlamaModel, LlamaContext, sampling chains, chat templates, tool calling, multimodal, speculative decoding, caching, embeddings, reranking. |

Quickstart

Add to your Cargo.toml:

[dependencies]
llama-crab = "0.1"

Load a GGUF model and generate text:

use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the GGUF model with a 2048-token context window and request
    // that up to 99 layers be offloaded to the GPU.
    let llama = Llama::load(
        LlamaParams::default()
            .with_model_path("models/llama-3.1-8b-instruct-q4_k_m.gguf")
            .with_n_ctx(2048)
            .with_n_gpu_layers(99),
    )?;

    // Complete the prompt and print the generated text.
    let response = llama.create_completion("Once upon a time", 64)?;
    println!("{}", response.text);
    Ok(())
}

Feature matrix

| Feature | Status |
| --- | --- |
| GGUF model loading (mmap, mlock) | |
| Multi-GPU layer offload (Metal, CUDA, Vulkan, HIP) | |
| KV cache quantization (Q2_K … Q8_K, IQ*) | |
| RoPE scaling (linear, yarn, longrope) | |
| Flash attention, SWA, MTP | |
| All sampling strategies (greedy, top-k/p, min-p, typical, xtc, mirostat v1/v2, dry, adaptive_p, infill, logit-bias, grammar, …) | |
| Custom samplers (Rust C-ABI vtable) | |
| GBNF grammar + JSON schema constrained decoding | |
| Chat templates (Jinja2 subset + 20+ builtins) | |
| Tool calling (functionary v1/v2, chatml, hermes, qwen, llama-3) | |
| Streaming JSON parsers (incremental tool-call deltas) | |
| Embeddings (mean/cls/last pooling + L2 normalize) | |
| Reranking (rank pooling) | |
| FIM infill (PSM/SPM) | |
| Speculative decoding (prompt-lookup n-gram + custom draft models) | |
| State save/load (full + per-sequence, with flags) | |
| Prompt + KV cache (RAM/Disk, prefix-match) | |
| Multimodal (mtmd): vision + audio chat handlers | ✅ (feature mtmd) |
| HF AutoTokenizer | (feature hf-tokenizer) |
| llguidance | (feature llguidance) |
| OpenAI-compatible HTTP server | ⛔ out of v0.1 (planned as llama-crab-server) |
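
To make the "mean/cls/last pooling + L2 normalize" entry concrete, here is a minimal sketch in plain Rust of what those pooling modes usually mean. It does not use or reproduce the llama-crab API: the names `Pooling`, `pool`, and `l2_normalize` are hypothetical, and treating CLS as "first token" follows the common convention rather than anything stated in this README.

```rust
/// Pooling modes named in the feature matrix (hypothetical enum, not the crate's type).
#[derive(Clone, Copy)]
enum Pooling {
    Mean,
    Cls,
    Last,
}

/// Collapse per-token embeddings (one `Vec<f32>` per token) into a single vector.
/// Returns `None` when there are no tokens.
fn pool(tokens: &[Vec<f32>], mode: Pooling) -> Option<Vec<f32>> {
    let first = tokens.first()?;
    match mode {
        // Assumption: CLS pooling takes the first token's embedding.
        Pooling::Cls => Some(first.clone()),
        Pooling::Last => tokens.last().cloned(),
        Pooling::Mean => {
            // Sum each dimension across tokens, then divide by the token count.
            let mut acc = vec![0.0f32; first.len()];
            for tok in tokens {
                for (a, x) in acc.iter_mut().zip(tok) {
                    *a += *x;
                }
            }
            let n = tokens.len() as f32;
            acc.iter_mut().for_each(|a| *a /= n);
            Some(acc)
        }
    }
}

/// Scale a vector to unit Euclidean length; a zero vector is left unchanged.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

fn main() {
    let tokens = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];
    for mode in [Pooling::Mean, Pooling::Cls, Pooling::Last] {
        let mut emb = pool(&tokens, mode).expect("non-empty input");
        l2_normalize(&mut emb);
        println!("{:?}", emb);
    }
}
```

Normalizing after pooling means the dot product of two pooled embeddings equals their cosine similarity, which is why the two steps are commonly paired.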

Backends

| Backend | Feature | Default? |
| --- | --- | --- |
| CPU (OpenMP) | openmp | |
| Apple Metal (macOS/iOS) | metal | ✅ on macOS aarch64 |
| NVIDIA CUDA | cuda | |
| NVIDIA CUDA (no VMM) | cuda-no-vmm | |
| Vulkan | vulkan | |
| AMD ROCm/HIP | rocm | |
| Dynamic linking | dynamic-link | |
| System GGML | system-ggml | |
| Dynamic backends | dynamic-backends | |
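
A minimal sketch of selecting a backend, assuming the names in the Feature column are ordinary Cargo features of the llama-crab crate (`cuda` is used here only as an example):

```toml
[dependencies]
# Enable the CUDA backend; swap "cuda" for another feature name listed above.
llama-crab = { version = "0.1", features = ["cuda"] }
```

Backends like these typically need the matching vendor toolchain installed at build time, so it usually makes sense to enable only the one you target.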

License

Dual-licensed under either of:

- MIT License
- Apache License, Version 2.0

at your option.