//! High-level bindings to [llama.cpp][llama.cpp]'s C API, providing a predictable, safe, and
//! high-performance medium for interacting with Large Language Models (LLMs) on consumer-grade
//! hardware.
//!
//! **Along with llama.cpp, this crate is still in an early state, and breaking changes may occur
//! between versions.** The high-level API, however, is fairly stable.
//!
//! To get started, create a [`LlamaModel`] and a [`LlamaSession`]:
//!
//! ```no_run
//! use std::io::{self, Write};
//!
//! use llama_cpp::{LlamaModel, LlamaParams, SessionParams};
//! use llama_cpp::standard_sampler::StandardSampler;
//!
//! // Create a model from anything that implements `AsRef<Path>`:
//! let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()).expect("Could not load model");
//!
//! // A `LlamaModel` holds the weights shared across many _sessions_; while your model may be
//! // several gigabytes large, a session is typically a few dozen to a hundred megabytes!
//! let mut ctx = model.create_session(SessionParams::default()).expect("Failed to create session");
//!
//! // You can feed anything that implements `AsRef<[u8]>` into the model's context.
//! ctx.advance_context("This is the story of a man named Stanley.").unwrap();
//!
//! // LLMs are typically used to predict the next word in a sequence. Let's generate some tokens!
//! let max_tokens = 1024;
//! let mut decoded_tokens = 0;
//!
//! // `ctx.start_completing_with` creates a worker thread that generates tokens. When the completion
//! // handle is dropped, tokens stop generating!
//! let completions = ctx.start_completing_with(StandardSampler::default(), max_tokens).into_strings();
//!
//! for completion in completions {
//!     print!("{completion}");
//!     let _ = io::stdout().flush();
//!
//!     decoded_tokens += 1;
//!
//!     if decoded_tokens >= max_tokens {
//!         break;
//!     }
//! }
//! ```
//!
//! ## Dependencies
//!
//! This crate depends on (and builds atop) [`llama_cpp_sys`], and builds llama.cpp from source.
//! You'll need at least `libclang` and a C/C++ toolchain (`clang` is preferred).
//! See [`llama_cpp_sys`] for more details.
//!
//! The bundled GGML and llama.cpp binaries are statically linked by default, and their logs
//! are re-routed through [`tracing`][tracing] instead of `stderr`.
//! If you're getting stuck, setting up [`tracing`][tracing] for more debug information should
//! be at the top of your troubleshooting list!
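//!
//! As a sketch (assuming you have added the separate `tracing-subscriber` crate as a
//! dependency), a minimal subscriber that surfaces those forwarded logs looks like this:
//!
//! ```no_run
//! // Install a global subscriber before loading any models, so the llama.cpp/GGML
//! // logs re-routed through `tracing` are actually printed somewhere visible.
//! tracing_subscriber::fmt()
//!     .with_max_level(tracing::Level::DEBUG)
//!     .init();
//! ```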
//!
//! ## Undefined Behavior / Panic Safety
//!
//! It should be **impossible** to trigger [undefined behavior][ub] from this crate, and any
//! UB is considered a critical bug. UB triggered downstream in [llama.cpp][llama.cpp] or
//! [`ggml`][ggml] should have issues filed and mirrored in `llama_cpp-rs`'s issue tracker.
//!
//! While panics are considered less critical, **this crate should never panic**, and any
//! panic should be considered a bug. We don't want your control flow!
//!
//! ## Building
//!
//! Keep in mind that [llama.cpp][llama.cpp] is very computationally heavy, so standard debug
//! builds (plain `cargo build`/`cargo run`) suffer greatly from the lack of optimisations.
//! Unless debugging is strictly necessary, it is highly recommended to build and run with
//! Cargo's `--release` flag.
//!
//! ## Minimum Supported Rust Version (MSRV) Policy
//!
//! This crate supports Rust 1.73.0 and above.
//!
//! ## License
//!
//! MIT or Apache 2.0 (the "Rust" license), at your option.
//!
//! [ub]: https://doc.rust-lang.org/reference/behavior-considered-undefined.html
//! [tracing]: https://docs.rs/tracing/latest/tracing/
//! [ggml]: https://github.com/ggerganov/ggml/
//! [llama.cpp]: https://github.com/ggerganov/llama.cpp/
use thiserror::Error;

pub use crate::model::*;
pub use crate::session::*;

/// The standard sampler implementation.
pub mod standard_sampler;
/// A single token produced or consumed by a [`LlamaModel`], without its associated context.
///
/// Due to the layout of llama.cpp, these can be _created_ from a [`LlamaModel`], but require a
/// [`LlamaSession`] to decode.
///
/// On its own, this isn't useful for anything other than being fed into
/// [`LlamaSession::advance_context_with_tokens`].
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Token(pub i32);
/// An error that occurred on the other side of the C FFI boundary.
///
/// GGML and llama.cpp typically log useful information before failing, which is forwarded to this
/// crate's [`tracing`][tracing] handler.
///
/// [tracing]: https://docs.rs/tracing/latest/tracing/
pub struct LlamaInternalError;
/// Something which selects a [`Token`] from the distribution output by a
/// [`LlamaModel`].
/// Estimated memory requirements for a session.
///
/// This is typically returned by [`LlamaModel::estimate_session_size`] and
/// [`LlamaModel::estimate_embeddings_session_size`] as an estimation of memory usage.