//! # rig-llama-cpp
//!
//! A [Rig](https://docs.rs/rig-core) provider that runs GGUF models locally
//! via [llama.cpp](https://github.com/ggml-org/llama.cpp), with optional GPU acceleration (Vulkan, CUDA, Metal, or ROCm).
//!
//! This crate implements Rig's [`rig::completion::CompletionModel`] and [`rig::embeddings::EmbeddingModel`] traits
//! so that any GGUF model can be used as a drop-in replacement for cloud-based providers. It supports:
//!
//! - **Completion and streaming** — both one-shot and token-by-token responses.
//! - **Tool calling** — models with OpenAI-compatible chat templates can invoke tools.
//! - **Reasoning / thinking** — extended thinking output is forwarded when the model supports it.
//! - **Configurable sampling** — top-p, top-k, min-p, temperature, presence and repetition penalties.
//! - **Embeddings** — generate text embeddings using GGUF embedding models.
//!
//! # Feature flags
//!
//! There is **no default GPU backend** — pick exactly the one that matches
//! your hardware. With no feature enabled the build is CPU-only.
//!
//! GPU backends (forwarded to `llama-cpp-2`):
//!
//! - `vulkan` — cross-vendor GPU (recommended on Linux/Windows when CUDA/ROCm aren't set up).
//! - `cuda` — NVIDIA GPUs with the CUDA toolkit installed.
//! - `metal` — Apple Silicon / macOS.
//! - `rocm` — AMD GPUs on Linux with the ROCm toolchain.
//!
//! Other:
//!
//! - `openmp` — OpenMP CPU threading; orthogonal to the GPU backends and may be combined with any of them.
//! - `mtmd` — multimodal (vision) inference; required for `Client::from_gguf_with_mmproj` and `ClientBuilder::mmproj` (see the sketch below).
//!
//! Examples:
//!
//! ```text
//! cargo build --features vulkan
//! cargo build --features cuda
//! cargo build --features "vulkan,mtmd"
//! ```
//!
//! Backend support depends on the corresponding `llama-cpp-2` feature and any required
//! native toolchain or system libraries being available on the host machine.
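//!
//! When built with `mtmd`, a vision model can be paired with its multimodal
//! projector file. A minimal sketch, assuming [`ClientBuilder::mmproj`] accepts the
//! projector's GGUF path (check its documentation for the exact signature):
//!
//! ```rust,ignore
//! // Hypothetical paths; `mmproj` is assumed here to take the projector file's path.
//! let client = rig_llama_cpp::Client::builder("path/to/model.gguf")
//!     .mmproj("path/to/mmproj.gguf")
//!     .build()?;
//! ```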
//!
//! # Quick start
//!
//! ```rust,no_run
//! use rig::client::CompletionClient;
//! use rig::completion::Prompt;
//!
//! # #[tokio::main]
//! # async fn main() -> Result<(), Box<dyn std::error::Error>> {
//! let client = rig_llama_cpp::Client::builder("path/to/model.gguf")
//!     .n_ctx(8192)
//!     .build()?;
//!
//! let agent = client
//!     .agent("local")
//!     .preamble("You are a helpful assistant.")
//!     .max_tokens(512)
//!     .build();
//!
//! let response = agent.prompt("Hello!").await?;
//! println!("{response}");
//! # Ok(())
//! # }
//! ```
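//!
//! # Embeddings
//!
//! Embedding generation follows the same pattern. The sketch below is illustrative:
//! it assumes the client is built from an embedding-capable GGUF and exposes Rig's
//! `EmbeddingsClient` trait, and that the model name passed to `embedding_model` is
//! arbitrary for a local model.
//!
//! ```rust,ignore
//! use rig::client::EmbeddingsClient;
//! use rig::embeddings::EmbeddingModel;
//!
//! # #[tokio::main]
//! # async fn main() -> Result<(), Box<dyn std::error::Error>> {
//! // Assumed path and model name; embedding models use their own GGUF file.
//! let client = rig_llama_cpp::Client::builder("path/to/embedding-model.gguf").build()?;
//! let model = client.embedding_model("local");
//!
//! let _embedding = model.embed_text("The quick brown fox").await?;
//! println!("model produces {}-dimensional embeddings", model.ndims());
//! # Ok(())
//! # }
//! ```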
// Crate-root re-exports. The module paths below are assumptions; the re-exported
// names (`Client`, `ClientBuilder`, `LoadError`) are the ones used in the docs above.
pub use client::{Client, ClientBuilder};
pub use error::LoadError;

/// Whether to forward llama.cpp's *C-side* logging to stderr.
///
/// This only controls log lines that originate inside the `llama-cpp-2` /
/// `llama-cpp-sys-2` C++ code (via `printf`-style writes that bypass Rust's
/// `log` facade). Library-level diagnostics from `rig-llama-cpp` itself go
/// through the [`log`] crate and are controlled by the consumer's logger
/// configuration (e.g. `RUST_LOG=rig_llama_cpp=debug`), not this env var.
/// Process-wide [`LlamaBackend`] initialised on first use and shared by every
/// worker (chat + embedding). The underlying llama.cpp backend is a global
/// singleton — calling `LlamaBackend::init()` twice in the same process
/// returns `BackendAlreadyInitialized`. Routing all callers through this
/// helper means a chat client and an embedding client can coexist without
/// racing on the C-side init flag.
///
/// Returns `Ok(&'static LlamaBackend)` once the backend is up; subsequent
/// calls are cheap (single `OnceLock::get`). On platforms where init can
/// fail (e.g. no Vulkan device) the error is sticky for the lifetime of
/// the process — there's no recovering anyway.
// Assumed reconstruction: the helper's name and error type are not specified above;
// `LlamaBackend` and `LLamaCppError` are expected to be in scope from `llama_cpp_2`.
pub fn shared_backend() -> Result<&'static LlamaBackend, &'static LLamaCppError> {
    static BACKEND: std::sync::OnceLock<Result<LlamaBackend, LLamaCppError>> =
        std::sync::OnceLock::new();
    // Initialise the global llama.cpp backend exactly once; storing the Result makes
    // any init failure sticky for the lifetime of the process.
    BACKEND.get_or_init(LlamaBackend::init).as_ref()
}