Crate llama_cpp

High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.

Along with llama.cpp, this crate is still in an early state, and breaking changes may occur between versions. The high-level API, however, is fairly settled.

To get started, create a LlamaModel and a LlamaSession:

use std::io::{self, Write};
use llama_cpp::{LlamaModel, LlamaParams, SessionParams};
use llama_cpp::standard_sampler::StandardSampler;

// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()).expect("Could not load model");

// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session(SessionParams::default()).expect("Failed to create session");

// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();

// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;

// `ctx.start_completing_with` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!

let completions = ctx.start_completing_with(StandardSampler::default(), 1024).into_strings();

for completion in completions {
    print!("{completion}");
    let _ = io::stdout().flush();

    decoded_tokens += 1;

    if decoded_tokens > max_tokens {
        break;
    }
}

Dependencies

This crate depends on (and builds atop) llama_cpp_sys, and builds llama.cpp from source. You’ll need at least libclang and a C/C++ toolchain (clang is preferred). See llama_cpp_sys for more details.

The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr. If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!
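
For example, a minimal sketch using the tracing-subscriber crate (an assumed extra dependency in your own project; any tracing subscriber works) is enough to make that output visible:

use tracing::Level;

// Install a global subscriber before loading any models; llama.cpp and GGML log
// lines will then show up as `tracing` events at their corresponding levels.
tracing_subscriber::fmt()
    .with_max_level(Level::DEBUG)
    .init();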

Undefined Behavior / Panic Safety

It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.

While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!

Building

Keep in mind that llama.cpp is very computationally heavy, meaning standard debug builds (running just cargo build/cargo run) will suffer greatly from the lack of optimisations. Therefore, unless debugging is really necessary, it is highly recommended to build and run using Cargo’s --release flag.
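
Concretely, that means passing --release to the usual Cargo commands:

# Build and run with optimisations enabled:
cargo build --release
cargo run --release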

Minimum Stable Rust Version (MSRV) Policy

This crate supports Rust 1.73.0 and above.

License

MIT or Apache 2.0 (the “Rust” license), at your option.

Modules

grammar
The grammar parser and the grammar struct.
standard_sampler
The standard sampler implementation.

Structs

CompletionHandle
A handle (and channel) to an ongoing completion job running on a background thread.
EmbeddingsParams
Embeddings inference specific parameters.
LlamaInternalError
An error that occurred on the other side of the C FFI boundary.
LlamaModel
A llama.cpp model.
LlamaParams
Parameters for loading a LlamaModel.
LlamaSession
An evaluation session for a llama.cpp model.
ResourceUsage
Estimated memory requirements for a resource, such as a model or session.
SessionParams
Session-specific parameters.
Token
A single token produced or consumed by a LlamaModel, without its associated context.
TokensToBytes
A wrapper struct around an iterator or stream of tokens, yielding Vec<u8> byte pieces for each token.
TokensToStrings
A wrapper struct around a CompletionHandle, yielding String tokens for each byte piece of the model’s output.

Enums

CacheType
The type of key or value in the cache.
LlamaContextError
An error raised while advancing the context in a LlamaSession.
LlamaLoadError
An error raised while loading a llama.cpp model.
LlamaTokenizationError
An error raised while tokenizing some input for a model.
PoolingType
Whether to pool (sum) embedding results by sequence ID (ignored if no pooling layer).
RopeScaling
A RoPE scaling type.
SplitMode
A policy for splitting the model across multiple GPUs.

Traits

Sampler
Something which selects a Token from the distribution output by a LlamaModel.