llama-crab

Safe, ergonomic and complete Rust bindings for llama.cpp.

Inspired by llama-cpp-rs and the feature completeness of llama-cpp-python.

llama-crab provides two crates:

Crate	Purpose
`llama-crab-sys`	Low-level, hand-curated FFI over `llama.h`, `ggml.h`, `gguf.h` (and `mtmd.h`), generated via `bindgen` and built with `cmake`.
`llama-crab`	Safe, idiomatic Rust API: `LlamaModel`, `LlamaContext`, sampling chains, chat templates, tool calling, multimodal, speculative decoding, caching, embeddings, reranking.

Quickstart

Add to your Cargo.toml:

[dependencies]
llama-crab = "0.1"

Load a GGUF model and generate text:

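use llama_crab::{Llama, LlamaParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure the model: GGUF path, context window, and GPU offload.
    let llama = Llama::load(LlamaParams::default()
        .with_model_path("models/llama-3.1-8b-instruct-q4_k_m.gguf")
        .with_n_ctx(2048) // context window, in tokens
        .with_n_gpu_layers(99))?; // offload up to 99 layers to the GPU

    // Generate a completion for the prompt and print its text.
    let response = llama.create_completion("Once upon a time", 64)?;
    println!("{}", response.text);
    Ok(())
}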

Feature matrix

Feature	Status
GGUF model loading (mmap, mlock)	✅
Multi-GPU layer offload (Metal, CUDA, Vulkan, HIP)	✅
KV cache quantization (Q2_K … Q8_K, IQ*)	✅
RoPE scaling (linear, yarn, longrope)	✅
Flash attention, SWA, MTP	✅
All sampling strategies (greedy, top-k/p, min-p, typical, xtc, mirostat v1/v2, dry, adaptive_p, infill, logit-bias, grammar, …)	✅
Custom samplers (Rust C-ABI vtable)	✅
GBNF grammar + JSON schema constrained decoding	✅
Chat templates (Jinja2 subset + 20+ builtins)	✅
Tool calling (functionary v1/v2, chatml, hermes, qwen, llama-3)	✅
Streaming JSON parsers (incremental tool-call deltas)	✅
Embeddings (mean/cls/last pooling + L2 normalize)	✅
Reranking (rank pooling)	✅
FIM infill (PSM/SPM)	✅
Speculative decoding (prompt-lookup n-gram + custom draft models)	✅
State save/load (full + per-sequence, with flags)	✅
Prompt + KV cache (RAM/Disk, prefix-match)	✅
Multimodal (mtmd): vision + audio chat handlers	✅ (feature `mtmd`)
HF AutoTokenizer (feature `hf-tokenizer`)	✅
llguidance (feature `llguidance`)	✅
OpenAI-compatible HTTP server	⛔ out of scope for v0.1 (planned as `llama-crab-server`)
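
To make the "Embeddings (mean/cls/last pooling + L2 normalize)" row concrete, here is a minimal, standalone sketch of what those pooling modes and the normalization step compute. It deliberately does not use the llama-crab API: the `Pooling` enum and the `pool` and `l2_normalize` functions are hypothetical names introduced only for illustration, and the crate's actual types and signatures may differ.

```rust
/// Pooling strategies named in the feature matrix (hypothetical enum, not the crate's type).
#[derive(Clone, Copy, Debug)]
enum Pooling {
    Mean,
    Cls,
    Last,
}

/// Collapse per-token embeddings (one row per token) into a single vector.
/// Returns `None` when there are no tokens.
fn pool(tokens: &[Vec<f32>], mode: Pooling) -> Option<Vec<f32>> {
    let first = tokens.first()?;
    match mode {
        // CLS pooling: take the embedding of the first token.
        Pooling::Cls => Some(first.clone()),
        // Last pooling: take the embedding of the final token.
        Pooling::Last => tokens.last().cloned(),
        // Mean pooling: average every dimension over all tokens.
        Pooling::Mean => {
            let mut acc = vec![0.0f32; first.len()];
            for row in tokens {
                for (a, x) in acc.iter_mut().zip(row) {
                    *a += *x;
                }
            }
            let n = tokens.len() as f32;
            acc.iter_mut().for_each(|a| *a /= n);
            Some(acc)
        }
    }
}

/// Scale a vector to unit Euclidean length; a zero vector is left unchanged.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

fn main() {
    // Three tokens, each with a toy 2-dimensional embedding.
    let tokens = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 0.0]];
    let mut pooled = pool(&tokens, Pooling::Mean).expect("non-empty input");
    l2_normalize(&mut pooled);
    println!("{:?}", pooled);
}
```

Normalizing after pooling means the dot product of two such vectors equals their cosine similarity, which is why the two steps are usually paired.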

Backends

Backend	Feature	Default?
CPU (OpenMP)	`openmp`	✅
Apple Metal (macOS/iOS)	`metal`	✅ on macOS aarch64
NVIDIA CUDA	`cuda`	–
NVIDIA CUDA (no VMM)	`cuda-no-vmm`	–
Vulkan	`vulkan`	–
AMD ROCm/HIP	`rocm`	–
Dynamic linking	`dynamic-link`	–
System GGML	`system-ggml`	–
Dynamic backends	`dynamic-backends`	–
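
As a sketch of how a non-default backend might be selected, the fragment below enables the `cuda` feature from the table using standard Cargo feature syntax. It assumes the backends are exposed as ordinary Cargo features on the `llama-crab` crate under exactly the names listed above; check the crate's own Cargo.toml before relying on it.

```toml
[dependencies]
# Assumption: backend names from the table map directly to Cargo features.
llama-crab = { version = "0.1", features = ["cuda"] }
```

Swapping `"cuda"` for `"vulkan"` or `"rocm"` would follow the same pattern under the same assumption.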

License

Dual-licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)

at your option.
