High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.
Like llama.cpp itself, this crate is still in an early state, and breaking changes may occur between versions. The high-level API, however, is fairly settled.
To get started, create a LlamaModel and a LlamaSession:
use std::io::{self, Write};
use llama_cpp::{LlamaModel, LlamaParams, SessionParams};
use llama_cpp::standard_sampler::StandardSampler;
// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()).expect("Could not load model");
// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session(SessionParams::default()).expect("Failed to create session");
// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();
// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;
// `ctx.start_completing_with` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!
let completions = ctx.start_completing_with(StandardSampler::default(), max_tokens).into_strings();
for completion in completions {
    print!("{completion}");
    // Flush so each piece appears immediately instead of waiting for a newline.
    let _ = io::stdout().flush();
    decoded_tokens += 1;
    if decoded_tokens >= max_tokens {
        break;
    }
}
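The comments above say that completion runs on a worker thread and stops when the handle is dropped. The following is a minimal, standard-library-only sketch of that general pattern, not this crate's actual implementation. `ToyCompletion` and `start_toy_completion` are hypothetical names invented for illustration, and the "tokens" are fake strings rather than model output.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a completion handle: it owns the receiving
/// end of a channel that a worker thread writes tokens into.
pub struct ToyCompletion {
    rx: Receiver<String>,
}

impl Iterator for ToyCompletion {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Yields `None` once the worker has finished and hung up.
        self.rx.recv().ok()
    }
}

/// Spawns a worker that "generates" up to `max_predictions` fake tokens.
/// The returned join handle reports how many tokens were actually delivered.
pub fn start_toy_completion(max_predictions: usize) -> (ToyCompletion, JoinHandle<usize>) {
    // Rendezvous channel: the worker blocks until each token is received.
    let (tx, rx) = sync_channel::<String>(0);
    let worker = thread::spawn(move || {
        let mut sent = 0;
        for i in 0..max_predictions {
            // `send` fails once the receiver has been dropped, so the
            // worker stops generating as soon as the handle goes away.
            if tx.send(format!("tok{i} ")).is_err() {
                break;
            }
            sent += 1;
        }
        sent
    });
    (ToyCompletion { rx }, worker)
}
```

Taking only a few items from the iterator and then letting it go out of scope ends the worker early, which mirrors the drop-to-cancel behavior described in the comments.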
§Dependencies
This crate depends on (and builds atop) llama_cpp_sys, and builds llama.cpp from source.
You’ll need libclang, cmake, and a C/C++ toolchain (clang is preferred) at the minimum.
See llama_cpp_sys for more details.
The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr.
If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!
§Undefined Behavior / Panic Safety
It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.
While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!
§Minimum Stable Rust Version (MSRV) Policy
This crate supports Rust 1.73.0 and above.
§License
MIT or Apache 2.0 (the “Rust” license), at your option.
Modules§
- The standard sampler implementation.
Structs§
- A handle (and channel) to an ongoing completion job on an off thread.
- Embeddings inference specific parameters.
- An error that occurred on the other side of the C FFI boundary.
- A llama.cpp model.
- Parameters for llama.
- An evaluation session for a llama.cpp model.
- Session-specific parameters.
- A single token produced or consumed by a LlamaModel, without its associated context.
- A wrapper struct around an iterator or stream of tokens, yielding Vec<u8> byte pieces for each token.
- A wrapper struct around a CompletionHandle, yielding String tokens for each byte piece of the model’s output.
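The last two entries distinguish raw byte pieces from strings. A token's bytes need not end on a UTF-8 character boundary, so turning byte pieces into text generally requires buffering. The sketch below shows one way to do that using only the standard library. It is an illustration under that assumption, not this crate's implementation, and `Utf8Assembler` is a hypothetical name.

```rust
/// Hypothetical helper: accumulates byte pieces and emits only text that
/// is complete, valid UTF-8, holding back any unfinished trailing sequence.
#[derive(Default)]
pub struct Utf8Assembler {
    pending: Vec<u8>,
}

impl Utf8Assembler {
    /// Feeds one byte piece and returns whatever text is now complete.
    pub fn push(&mut self, piece: &[u8]) -> String {
        self.pending.extend_from_slice(piece);
        let mut out = String::new();
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    // Everything buffered is valid: emit it all.
                    out.push_str(s);
                    self.pending.clear();
                    return out;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    // The prefix up to `valid` is guaranteed to be valid UTF-8.
                    out.push_str(std::str::from_utf8(&self.pending[..valid]).unwrap());
                    match e.error_len() {
                        // Incomplete sequence at the end: keep it for the next piece.
                        None => {
                            self.pending.drain(..valid);
                            return out;
                        }
                        // Invalid bytes: substitute U+FFFD and keep scanning.
                        Some(len) => {
                            out.push('\u{FFFD}');
                            self.pending.drain(..valid + len);
                        }
                    }
                }
            }
        }
    }
}
```

Feeding the two bytes of `é` in separate pieces yields nothing for the first byte and the full character once the second arrives, so callers never see a partial character.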
Enums§
- An error raised while advancing the context in a LlamaSession.
- An error raised while loading a llama.cpp model.
- An error raised while tokenizing some input for a model.
- A policy to split the model across multiple GPUs.
Traits§
- This needs to be documented!