Crate llama_cpp

High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.

Along with llama.cpp, this crate is still in an early state, and breaking changes may occur between versions. The high-level API, however, is fairly settled.

To get started, create a LlamaModel and a LlamaSession:

use std::io::{self, Write};
use llama_cpp::{LlamaModel, LlamaParams, SessionParams};
use llama_cpp::standard_sampler::StandardSampler;

// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()).expect("Could not load model");

// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session(SessionParams::default()).expect("Failed to create session");

// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();

// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;

// `ctx.start_completing_with` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!

let completions = ctx.start_completing_with(StandardSampler::default(), 1024).into_strings();

for completion in completions {
    print!("{completion}");
    let _ = io::stdout().flush();

    decoded_tokens += 1;

    if decoded_tokens > max_tokens {
        break;
    }
}

Dependencies

This crate depends on (and builds atop) llama_cpp_sys, and builds llama.cpp from source. You’ll need at least libclang and a C/C++ toolchain (clang is preferred). See llama_cpp_sys for more details.

The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr. If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!
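
For example, a minimal sketch using the tracing-subscriber crate (an assumed extra dependency in your own project; any tracing subscriber works) is enough to make that output visible:

use tracing::Level;

// Install a global subscriber before loading any models; llama.cpp and GGML log
// lines will then show up as `tracing` events at their corresponding levels.
tracing_subscriber::fmt()
    .with_max_level(Level::DEBUG)
    .init();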

Undefined Behavior / Panic Safety

It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.

While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!

Building

Keep in mind that llama.cpp is very computationally heavy, meaning standard debug builds (running just cargo build/cargo run) will suffer greatly from the lack of optimisations. Therefore, unless debugging is really necessary, it is highly recommended to build and run using Cargo’s --release flag.
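
Concretely, that means passing --release to the usual Cargo commands:

# Build and run with optimisations enabled:
cargo build --release
cargo run --release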

Minimum Stable Rust Version (MSRV) Policy

This crate supports Rust 1.73.0 and above.

License

MIT or Apache 2.0 (the “Rust” license), at your option.

Modules

grammar
The grammar parser and the grammar struct.
standard_sampler
The standard sampler implementation.

Structs

CompletionHandle
A handle (and channel) to an ongoing completion job running on a background thread.
EmbeddingsParams
Embeddings inference specific parameters.
LlamaInternalError
An error that occurred on the other side of the C FFI boundary.
LlamaModel
A llama.cpp model.
LlamaParams
Parameters for loading a LlamaModel.
LlamaSession
An evaluation session for a llama.cpp model.
ResourceUsage
Estimated memory requirements for a resource, such as a model or session.
SessionParams
Session-specific parameters.
Token
A single token produced or consumed by a LlamaModel, without its associated context.
TokensToBytes
A wrapper struct around an iterator or stream of tokens, yielding Vec<u8> byte pieces for each token.
TokensToStrings
A wrapper struct around a CompletionHandle, yielding String tokens for each byte piece of the model’s output.

Enums

CacheType
The type of key or value in the cache.
LlamaContextError
An error raised while advancing the context in a LlamaSession.
LlamaLoadError
An error raised while loading a llama.cpp model.
LlamaTokenizationError
An error raised while tokenizing some input for a model.
PoolingType
Whether to pool (sum) embedding results by sequence ID (ignored if no pooling layer).
RopeScaling
A RoPE scaling type.
SplitMode
A policy for splitting the model across multiple GPUs.

Traits

Sampler
Something which selects a Token from the distribution output by a LlamaModel.