High-level bindings to llama.cpp’s C API, providing a predictable, safe, and high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade hardware.
Along with llama.cpp, this crate is still in an early state, and breaking changes may occur between versions. The high-level API, however, is fairly stable.
To get started, create a LlamaModel and a LlamaSession:
use std::io::{self, Write};

use llama_cpp::{LlamaModel, LlamaParams, SessionParams};
use llama_cpp::standard_sampler::StandardSampler;

// Create a model from anything that implements `AsRef<Path>`:
let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default())
    .expect("Could not load model");

// A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
// several gigabytes large, a session is typically a few dozen to a hundred megabytes!
let mut ctx = model.create_session(SessionParams::default())
    .expect("Failed to create session");

// You can feed anything that implements `AsRef<[u8]>` into the model's context.
ctx.advance_context("This is the story of a man named Stanley.").unwrap();

// LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
let max_tokens = 1024;
let mut decoded_tokens = 0;

// `ctx.start_completing_with` creates a worker thread that generates tokens. When the completion
// handle is dropped, tokens stop generating!
let completions = ctx.start_completing_with(StandardSampler::default(), max_tokens).into_strings();

for completion in completions {
    print!("{completion}");
    let _ = io::stdout().flush();

    decoded_tokens += 1;

    if decoded_tokens > max_tokens {
        break;
    }
}
§Dependencies
This crate depends on (and builds atop) llama_cpp_sys, and builds llama.cpp from source. You’ll need at least libclang and a C/C++ toolchain (clang is preferred). See llama_cpp_sys for more details.
The bundled GGML and llama.cpp binaries are statically linked by default, and their logs are re-routed through tracing instead of stderr. If you’re getting stuck, setting up tracing for more debug information should be at the top of your troubleshooting list!
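For example, here is a minimal sketch of surfacing those forwarded logs, assuming the separate tracing and tracing-subscriber crates are added as dependencies (neither is part of this crate):

use llama_cpp::{LlamaModel, LlamaParams};

fn main() {
    // Assumed dependencies: `tracing` and `tracing-subscriber`.
    // Install a subscriber before touching the crate so that llama.cpp / GGML
    // log lines forwarded through `tracing` are printed to the terminal.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG) // surface debug-level output while troubleshooting
        .init();

    // Subsequent llama.cpp / GGML log lines now show up as `tracing` events.
    let _model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default())
        .expect("Could not load model");
}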
§Undefined Behavior / Panic Safety
It should be impossible to trigger undefined behavior from this crate, and any UB is considered a critical bug. UB triggered downstream in llama.cpp or ggml should have issues filed and mirrored in llama_cpp-rs’s issue tracker.
While panics are considered less critical, this crate should never panic, and any panic should be considered a bug. We don’t want your control flow!
§Building
Keep in mind that llama.cpp is very computationally heavy, meaning standard
debug builds (running just cargo build
/cargo run
) will suffer greatly from the lack of optimisations. Therefore, unless
debugging is really necessary, it is highly recommended to build and run using Cargo’s --release
flag.
§Minimum Stable Rust Version (MSRV) Policy
This crate supports Rust 1.73.0 and above.
§License
MIT or Apache 2.0 (the “Rust” license), at your option.
§Modules
- grammar - The grammar module contains the grammar parser and the grammar struct.
- standard_sampler - The standard sampler implementation.
§Structs
- CompletionHandle - A handle (and channel) to an ongoing completion job on an off thread.
- EmbeddingsParams - Embeddings inference specific parameters.
- LlamaInternalError - An error that occurred on the other side of the C FFI boundary.
- LlamaModel - A llama.cpp model.
- LlamaParams - Parameters for llama.
- LlamaSession - An evaluation session for a llama.cpp model.
- ResourceUsage - Memory requirements for something.
- SessionParams - Session-specific parameters.
- Token - A single token produced or consumed by a LlamaModel, without its associated context.
- TokensToBytes - A wrapper struct around an iterator or stream of tokens, yielding Vec<u8> byte pieces for each token.
- TokensToStrings - A wrapper struct around a CompletionHandle, yielding String tokens for each byte piece of the model’s output.
§Enums
- CacheType - The type of key or value in the cache.
- LlamaContextError - An error raised while advancing the context in a LlamaSession.
- LlamaLoadError - An error raised while loading a llama.cpp model.
- LlamaTokenizationError - An error raised while tokenizing some input for a model.
- PoolingType - Whether to pool (sum) embedding results by sequence ID (ignored if no pooling layer).
- RopeScaling - A rope scaling type.
- SplitMode - A policy to split the model across multiple GPUs.
§Traits
- Sampler - Something which selects a Token from the distribution output by a LlamaModel.