# llama-cpp-v3
Safe and ergonomic Rust wrapper for llama.cpp with runtime dynamic loading and auto-downloading support.
## Features
- 🚀 Runtime Backend Switching: Switch between CPU, CUDA, Vulkan, and SYCL without recompiling.
- 📦 Zero-Configuration Build: No need for a C++ compiler or the `llama.cpp` source locally.
- ⏬ Auto-Download: Automatically downloads pre-built `llama.cpp` binaries from GitHub releases based on the selected backend and OS.
- 🛡️ Safe API: RAII-style wrappers for models, contexts, and samplers.
- 🔄 Latest llama.cpp: Support for the modern GGUF and vocabulary APIs.
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
llama-cpp-v3 = "0.1.1" # Example version
```
## Quick Start
```rust
use llama_cpp_v3::LlamaBackend;
```
## How it Works
This crate uses `llama-cpp-sys-v3` to dynamically load `llama.dll` (Windows) or `libllama.so` (Linux). When you call `LlamaBackend::load`, the library:
- Checks the local cache for the specified backend.
- If missing, uses the GitHub API to find the latest `llama.cpp` release.
- Downloads the appropriate binary zip for your OS and architecture.
- Extracts and loads the library at runtime.
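The resolution flow above can be sketched roughly as follows. This is an illustrative model, not the crate's actual implementation: the `Backend` enum, `cached_library_path`, `release_asset_name`, and the cache/asset naming layout are all assumptions made for the example.

```rust
use std::path::PathBuf;

// Hypothetical backend selector; the crate's real type may differ.
#[derive(Debug, Clone, Copy)]
enum Backend {
    Cpu,
    Cuda,
    Vulkan,
    Sycl,
}

impl Backend {
    fn as_str(&self) -> &'static str {
        match self {
            Backend::Cpu => "cpu",
            Backend::Cuda => "cuda",
            Backend::Vulkan => "vulkan",
            Backend::Sycl => "sycl",
        }
    }
}

/// Platform-specific dynamic-library file name, as described above.
fn lib_name() -> &'static str {
    if cfg!(windows) { "llama.dll" } else { "libllama.so" }
}

/// Step 1: where a cached copy of the library might live
/// (illustrative layout: <cache_root>/<release_tag>/<backend>/<lib>).
fn cached_library_path(cache_root: &str, backend: Backend, tag: &str) -> PathBuf {
    PathBuf::from(cache_root)
        .join(tag)
        .join(backend.as_str())
        .join(lib_name())
}

/// Steps 2-3: the release asset a downloader could look for
/// (illustrative naming scheme, not the real release asset names).
fn release_asset_name(backend: Backend, os: &str, arch: &str) -> String {
    format!("llama-{}-{}-{}.zip", backend.as_str(), os, arch)
}

fn main() {
    // On a cache miss for this path, step 4 would download, extract,
    // and dlopen/LoadLibrary the file at runtime.
    let path = cached_library_path("/home/user/.cache/llama-cpp-v3", Backend::Cuda, "b4500");
    println!("cache path: {}", path.display());
    println!("asset: {}", release_asset_name(Backend::Cuda, "linux", "x86_64"));
}
```

The key design point is that the backend choice is a runtime value rather than a compile-time feature flag, which is what makes switching between CPU, CUDA, Vulkan, and SYCL possible without recompiling.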
## License
MIT