Crate llama_cpp

High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.

Like llama.cpp itself, this crate is still at an early stage, and breaking changes may occur between versions. The high-level API, however, is fairly settled.

To get started, create a LlamaModel and a LlamaSession:

use std::io::{self, Write};

use llama_cpp::{LlamaModel, LlamaParams, SessionParams};
use llama_cpp::standard_sampler::StandardSampler;

// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()).expect("Could not load model");

// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session(SessionParams::default()).expect("Failed to create session");

// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();

// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;

// `ctx.start_completing_with` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!

let completions = ctx.start_completing_with(StandardSampler::default(), max_tokens).into_strings();

for completion in completions {
    print!("{completion}");
    let _ = io::stdout().flush();

    decoded_tokens += 1;

    if decoded_tokens >= max_tokens {
        break;
    }
}
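
The comment above says that generation runs on a worker thread and stops when the completion handle is dropped. The following is a minimal sketch of that pattern using only the standard library. It is not this crate's implementation: `Handle` and `start_completing` are hypothetical names, and the "tokens" are placeholder strings rather than model output.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical stand-in for a completion handle: it owns the receiving
/// end of a channel that a worker thread feeds.
struct Handle {
    rx: Receiver<String>,
}

impl Iterator for Handle {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Yields `None` once the worker has finished and dropped its sender.
        self.rx.recv().ok()
    }
}

/// Spawns a worker that produces up to `max_predictions` placeholder tokens.
fn start_completing(max_predictions: usize) -> Handle {
    // Capacity 0 makes this a rendezvous channel, so the worker never
    // runs ahead of the consumer.
    let (tx, rx) = sync_channel(0);
    thread::spawn(move || {
        for i in 0..max_predictions {
            // `send` fails once the `Handle` (and its receiver) is dropped;
            // the worker treats that as its signal to stop.
            if tx.send(format!("tok{i} ")).is_err() {
                break;
            }
        }
    });
    Handle { rx }
}

fn main() {
    // Take three tokens, then drop the handle; the worker exits on its next send.
    let pieces: Vec<String> = start_completing(1024).take(3).collect();
    println!("{}", pieces.concat());
}
```

Tying cancellation to `Drop` means the caller needs no explicit "stop" call: breaking out of the loop, as the example above does, is enough to end the work.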

§Dependencies

This crate depends on (and builds atop) llama_cpp_sys, and builds llama.cpp from source. At a minimum, you'll need libclang, cmake, and a C/C++ toolchain (clang is preferred). See llama_cpp_sys for more details.

The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr. If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!

§Undefined Behavior / Panic Safety

It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.

While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!

§Minimum Stable Rust Version (MSRV) Policy

This crate supports Rust 1.73.0 and above.

§License

MIT or Apache 2.0 (the “Rust” license), at your option.

Modules§

standard_sampler
The standard sampler implementation.

Structs§

CompletionHandle
A handle (and channel) to an ongoing completion job on an off thread.
EmbeddingsParams
Embeddings inference specific parameters.
LlamaInternalError
An error that occurred on the other side of the C FFI boundary.
LlamaModel
A llama.cpp model.
LlamaParams
Parameters for llama.
LlamaSession
An evaluation session for a llama.cpp model.
SessionParams
Session-specific parameters.
Token
A single token produced or consumed by a LlamaModel, without its associated context.
TokensToBytes
A wrapper struct around an iterator or stream of tokens, yielding Vec<u8> byte pieces for each token.
TokensToStrings
A wrapper struct around a CompletionHandle, yielding String tokens for each byte piece of the model’s output.
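
TokensToBytes and TokensToStrings differ in whether they yield raw byte pieces or text. One likely reason for the split (an assumption; this page does not say so) is that a single token's bytes need not form complete UTF-8, so a multi-byte character can arrive split across two pieces. The sketch below shows one way to buffer such pieces using only the standard library. `Utf8Buffer` is a hypothetical helper, not a type from this crate, and the crate's own conversion may behave differently.

```rust
/// Hypothetical helper: accumulates byte pieces and releases text only
/// once it forms complete UTF-8.
#[derive(Default)]
struct Utf8Buffer {
    pending: Vec<u8>,
}

impl Utf8Buffer {
    /// Appends one byte piece and returns whatever text is now complete.
    fn push(&mut self, piece: &[u8]) -> String {
        self.pending.extend_from_slice(piece);
        // Work out how many leading bytes can be emitted right now.
        let ready = match std::str::from_utf8(&self.pending) {
            Ok(_) => self.pending.len(),
            // `error_len() == None` means the input ended mid-character:
            // emit the valid prefix and keep the tail for the next piece.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Genuinely invalid bytes: flush everything; the lossy
            // conversion below substitutes U+FFFD for bad sequences.
            Err(_) => self.pending.len(),
        };
        let out = String::from_utf8_lossy(&self.pending[..ready]).into_owned();
        self.pending.drain(..ready);
        out
    }
}

fn main() {
    let mut buf = Utf8Buffer::default();
    // "é" is the two bytes 0xC3 0xA9, split here across two pieces.
    let pieces: [&[u8]; 3] = [b"caf", &[0xC3], &[0xA9]];
    for piece in pieces {
        print!("{}", buf.push(piece));
    }
    println!();
}
```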

Enums§

LlamaContextError
An error raised while advancing the context in a LlamaSession.
LlamaLoadError
An error raised while loading a llama.cpp model.
LlamaTokenizationError
An error raised while tokenizing some input for a model.
SplitMode
A policy to split the model across multiple GPUs.

Traits§

Sampler
This needs to be documented!