Crate llama_cpp

High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.

Along with llama.cpp itself, this crate is still in an early state, and breaking changes may occur between versions. The high-level API, however, is fairly stable.

To get started, create a LlamaModel and a LlamaSession:

use std::io::{self, Write};

use llama_cpp::{LlamaModel, LlamaParams, SessionParams};
use llama_cpp::standard_sampler::StandardSampler;

// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()).expect("Could not load model");

// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session(SessionParams::default()).expect("Failed to create session");

// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();

// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;

// `ctx.start_completing_with` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!

let completions = ctx.start_completing_with(StandardSampler::default(), max_tokens).into_strings();

for completion in completions {
    print!("{completion}");
    let _ = io::stdout().flush();

    decoded_tokens += 1;

    if decoded_tokens > max_tokens {
        break;
    }
}
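
Since the weights live in the `LlamaModel`, several independent sessions can be created on top of a single model. A minimal sketch, reusing the `model` loaded above (the prompt strings are just placeholders):

// Each session keeps its own context and state; the weights are only loaded once.
let mut chat_ctx = model.create_session(SessionParams::default()).expect("Failed to create session");
let mut summary_ctx = model.create_session(SessionParams::default()).expect("Failed to create session");

// The sessions advance independently of one another.
chat_ctx.advance_context("You are a helpful assistant.").unwrap();
summary_ctx.advance_context("Summarize the following text:").unwrap();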

§Dependencies

This crate depends on (and builds atop) llama_cpp_sys, which builds llama.cpp from source. At minimum, you’ll need libclang, cmake, and a C/C++ toolchain (clang is preferred). See llama_cpp_sys for more details.

The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr. If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!
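
For example, a minimal sketch of surfacing those logs with the tracing-subscriber crate (not pulled in by this crate; you add it to your own Cargo.toml):

// A sketch, assuming `tracing` and `tracing-subscriber` are dependencies of your project.
fn main() {
    // Install a basic fmt subscriber so llama.cpp/GGML logs become visible.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();

    // ... load the model and create sessions as shown above ...
}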

§Undefined Behavior / Panic Safety

It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.

While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!

§Minimum Stable Rust Version (MSRV) Policy

This crate supports Rust 1.73.0 and above.

§License

MIT or Apache 2.0 (the “Rust” license), at your option.
