High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.
Along with llama.cpp, this crate is still in an early state, and breaking changes may occur between versions. The high-level API, however, is fairly stable.
To get started, create a LlamaModel and a LlamaSession:
use llama_cpp::LlamaModel;
// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf").expect("Could not load model");
// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session();
// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();
// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;
// `ctx.start_completing` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!
let mut completions = ctx.start_completing();
while let Some(next_token) = completions.next_token() {
    println!("{}", String::from_utf8_lossy(next_token.as_bytes()));
    decoded_tokens += 1;
    if decoded_tokens > max_tokens {
        break;
    }
}
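Since `next_token` yields raw bytes, completions can also be accumulated into a buffer rather than printed token by token. A minimal sketch reusing the same session and completion calls from the example above (the 4096-byte cutoff is arbitrary):
// Collect the generated bytes and decode them once at the end.
let mut output: Vec<u8> = Vec::new();
let mut completions = ctx.start_completing();
while let Some(next_token) = completions.next_token() {
    output.extend_from_slice(next_token.as_bytes());
    if output.len() >= 4096 {
        // Dropping `completions` at the end of this scope stops the worker thread.
        break;
    }
}
println!("{}", String::from_utf8_lossy(&output));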
Dependencies
This crate depends on (and builds atop) llama_cpp_sys, and builds llama.cpp from source. You’ll need libclang, cmake, and a C/C++ toolchain (clang is preferred) at a minimum. See llama_cpp_sys for more details.
The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr.
If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!
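As a minimal sketch, assuming the separate tracing-subscriber crate, a subscriber that surfaces the crate’s debug-level logs (including those forwarded from llama.cpp and GGML) could be installed like this:
use tracing_subscriber::filter::LevelFilter;

fn main() {
    // Install a formatting subscriber that prints events at DEBUG and above,
    // so the re-routed llama.cpp/GGML logs become visible.
    tracing_subscriber::fmt()
        .with_max_level(LevelFilter::DEBUG)
        .init();

    // ... load the model and run a session as shown above ...
}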
Undefined Behavior / Panic Safety
It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.
While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!
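Downstream code can stay panic-free as well by propagating this crate’s Result values instead of calling expect or unwrap. A minimal sketch reusing the calls from the example above, assuming the error types implement std::error::Error so they can be boxed:
use std::error::Error;

use llama_cpp::LlamaModel;

fn run(prompt: &str) -> Result<(), Box<dyn Error>> {
    // Propagate load and context errors to the caller rather than panicking.
    let model = LlamaModel::load_from_file("path_to_model.gguf")?;
    let mut ctx = model.create_session();
    ctx.advance_context(prompt)?;
    Ok(())
}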
Minimum Stable Rust Version (MSRV) Policy
This crate supports Rust 1.73.0 and above.
License
MIT or Apache 2.0 (the “Rust” license), at your option.
Structs
- A handle (and channel) to an ongoing completion job on an off thread.
- An intermediate token generated during an LLM completion.
- An error that occurred on the other side of the C FFI boundary.
- A llama.cpp model.
- An evaluation session for a llama.cpp model.
- A single token produced or consumed by a LlamaModel, without its associated context.
Enums
- An error raised while advancing the context in a LlamaSession.
- An error raised while loading a llama.cpp model.
- An error raised while tokenizing some input for a model.