High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.
Along with llama.cpp, this crate is still in an early state, and breaking changes may occur between versions. The high-level API, however, is fairly stable.
To get started, create a LlamaModel and a LlamaSession:
use llama_cpp::LlamaModel;
// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf").expect("Could not load model");
// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session();
// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();
// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;
// `ctx.start_completing` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!
let mut completions = ctx.start_completing();
while let Some(next_token) = completions.next_token() {
    println!("{}", String::from_utf8_lossy(next_token.as_bytes()));
    decoded_tokens += 1;
    if decoded_tokens > max_tokens {
        break;
    }
}
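Since `next_token` yields raw bytes, completions can also be accumulated into a buffer rather than printed token by token. A minimal sketch reusing the same session and completion calls from the example above (the 4096-byte cutoff is arbitrary):
// Collect the generated bytes and decode them once at the end.
let mut output: Vec<u8> = Vec::new();
let mut completions = ctx.start_completing();
while let Some(next_token) = completions.next_token() {
    output.extend_from_slice(next_token.as_bytes());
    if output.len() >= 4096 {
        // Dropping `completions` at the end of this scope stops the worker thread.
        break;
    }
}
println!("{}", String::from_utf8_lossy(&output));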
Dependencies
This crate depends on (and builds atop) llama_cpp_sys, and builds llama.cpp from source. You’ll need libclang, cmake, and a C/C++ toolchain (clang is preferred) at a minimum. See llama_cpp_sys for more details.
The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr.
If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!
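As a minimal sketch, assuming the separate tracing-subscriber crate, a subscriber that surfaces the crate’s debug-level logs (including those forwarded from llama.cpp and GGML) could be installed like this:
use tracing_subscriber::filter::LevelFilter;

fn main() {
    // Install a formatting subscriber that prints events at DEBUG and above,
    // so the re-routed llama.cpp/GGML logs become visible.
    tracing_subscriber::fmt()
        .with_max_level(LevelFilter::DEBUG)
        .init();

    // ... load the model and run a session as shown above ...
}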
Undefined Behavior / Panic Safety
It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.
While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!
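Downstream code can stay panic-free as well by propagating this crate’s Result values instead of calling expect or unwrap. A minimal sketch reusing the calls from the example above, assuming the error types implement std::error::Error so they can be boxed:
use std::error::Error;

use llama_cpp::LlamaModel;

fn run(prompt: &str) -> Result<(), Box<dyn Error>> {
    // Propagate load and context errors to the caller rather than panicking.
    let model = LlamaModel::load_from_file("path_to_model.gguf")?;
    let mut ctx = model.create_session();
    ctx.advance_context(prompt)?;
    Ok(())
}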
Minimum Stable Rust Version (MSRV) Policy
This crate supports Rust 1.73.0 and above.
License
MIT or Apache 2.0 (the “Rust” license), at your option.
Structs
- A handle (and channel) to an ongoing completion job on an off thread.
- An intermediate token generated during an LLM completion.
- An error that occurred on the other side of the C FFI boundary.
- A llama.cpp model.
- An evaluation session for a llama.cpp model.
- A single token produced or consumed by a LlamaModel, without its associated context.
Enums
- An error raised while advancing the context in a LlamaSession.
- An error raised while loading a llama.cpp model.
- An error raised while tokenizing some input for a model.