pub struct Client { /* private fields */ }
The llama.cpp completion client.
Client loads a GGUF model on a dedicated inference thread and exposes it
through Rig’s CompletionClient trait. Construct one with
Client::builder, or — for backward-compatible positional construction —
Client::from_gguf.
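For example, a minimal construction might look like the sketch below. It assumes ClientBuilder is finished with a build() method returning Result<Client, LoadError>; see ClientBuilder for the exact builder methods and defaults.

// Build a client with the builder's defaults for context size, sampling,
// fitting, KV cache, and checkpointing (path is a placeholder).
let client = Client::builder("models/my-model.gguf").build()?;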
§Lifecycle
The worker thread owns the LlamaModel, LlamaContext, and (when the
mtmd feature is on) the multimodal projector. It only releases that
memory when it exits, which happens in two cases:
- On Client::reload, the worker drops the old model and loads the new one in place; the Client itself is not dropped and the worker thread is reused. The caller blocks on the reload result.
- On Client::drop, the worker thread is signalled and joined. See impl Drop for Client for the exact semantics, including how a long in-flight generation is cancelled so the dropping thread does not have to wait for it to finish naturally.
Implementations§
impl Client
pub fn builder(model_path: impl Into<String>) -> ClientBuilder
Start a ClientBuilder for a GGUF model at model_path.
pub fn from_gguf(
model_path: impl Into<String>,
n_ctx: u32,
sampling_params: SamplingParams,
fit_params: FitParams,
kv_cache_params: KvCacheParams,
checkpoint_params: CheckpointParams,
) -> Result<Self, LoadError>
Load a GGUF model with automatic GPU/CPU layer fitting and start the inference worker thread.
llama.cpp will probe available device memory and determine the optimal layer distribution automatically.
Prefer Client::builder for new code — this constructor is kept for
backward compatibility with the positional 0.1.x API and forwards
directly to the builder.
§Arguments
- model_path — Path to a .gguf model file.
- n_ctx — Desired context window size in tokens.
- sampling_params — Sampling parameters for token generation.
- fit_params — Configuration for the fitting algorithm.
- kv_cache_params — KV cache data-type configuration (defaults to F16/F16).
- checkpoint_params — Tunables for the in-memory state-checkpoint cache used to preserve KV/recurrent state across chat turns for hybrid models.
§Errors
Returns a LoadError if the backend fails to initialise, automatic
fitting fails, or the model cannot be loaded.
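As a rough usage sketch (it assumes SamplingParams, FitParams, KvCacheParams, and CheckpointParams implement Default; the real defaults live on those types):

let client = Client::from_gguf(
    "models/my-model.gguf",        // placeholder path to a .gguf file
    8192,                          // n_ctx: desired context window in tokens
    SamplingParams::default(),
    FitParams::default(),
    KvCacheParams::default(),      // F16/F16 KV cache by default
    CheckpointParams::default(),
)?;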
pub fn from_gguf_with_mmproj(
model_path: impl Into<String>,
mmproj_path: impl Into<String>,
n_ctx: u32,
sampling_params: SamplingParams,
fit_params: FitParams,
kv_cache_params: KvCacheParams,
checkpoint_params: CheckpointParams,
) -> Result<Self, LoadError>
Load a GGUF vision model with a multimodal projector and automatic GPU/CPU layer fitting.
This constructor enables multimodal (vision) inference. The mmproj_path should point
to a GGUF multimodal projector file (mmproj) that corresponds to the vision model.
Prefer Client::builder with ClientBuilder::mmproj for new code.
§Arguments
- model_path — Path to a .gguf vision model file.
- mmproj_path — Path to the corresponding multimodal projector .gguf file.
- n_ctx — Desired context window size in tokens.
- sampling_params — Sampling parameters for token generation.
- fit_params — Configuration for the fitting algorithm.
- kv_cache_params — KV cache data-type configuration (defaults to F16/F16).
§Errors
Returns a LoadError if the backend fails to initialise, the model
cannot be loaded, or the multimodal projector cannot be initialised.
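A vision-model sketch under the same Default assumptions, with placeholder paths:

let client = Client::from_gguf_with_mmproj(
    "models/my-vision-model.gguf",
    "models/my-vision-model-mmproj.gguf", // projector matching the vision model
    8192,
    SamplingParams::default(),
    FitParams::default(),
    KvCacheParams::default(),
    CheckpointParams::default(),
)?;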
pub fn reload(
&self,
model_path: String,
mmproj_path: Option<String>,
n_ctx: u32,
sampling: SamplingParams,
fit_params: FitParams,
kv_cache_params: KvCacheParams,
checkpoint_params: CheckpointParams,
) -> Result<(), LoadError>
Reload the worker thread with a new model without destroying the backend.
This swaps the model in-place on the existing inference thread, avoiding the
LlamaBackend singleton re-initialization race that occurs when dropping and
recreating a Client.
§Errors
Returns LoadError::WorkerNotRunning if the inference worker is no
longer accepting commands, or any of the load-stage variants if the
new model fails to come up.
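A hot-swap sketch (placeholder paths; the parameter structs are again assumed to implement Default):

// Swap to a different model on the same worker thread. The call blocks
// until the new model is up or a LoadError is returned.
client.reload(
    "models/other-model.gguf".to_string(),
    None,                          // mmproj_path: no multimodal projector
    4096,
    SamplingParams::default(),
    FitParams::default(),
    KvCacheParams::default(),
    CheckpointParams::default(),
)?;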
Trait Implementations§
impl CompletionClient for Client
type CompletionModel = Model
fn completion_model(&self, model: impl Into<String>) -> Self::CompletionModel
fn agent(&self, model: impl Into<String>) -> AgentBuilder<Self::CompletionModel>
fn extractor<T>(
    &self,
    model: impl Into<String>,
) -> ExtractorBuilder<Self::CompletionModel, T>
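Through this impl the Client drops into Rig's usual high-level flow. A sketch, assuming Rig's standard AgentBuilder and Prompt APIs (import paths vary between rig versions, and the model name presumably serves only as a label here, since the GGUF file is chosen at construction time):

use rig::client::CompletionClient; // path may differ by rig version
use rig::completion::Prompt;

let agent = client
    .agent("local-gguf")                        // placeholder model label
    .preamble("You are a concise assistant.")
    .build();
let answer = agent.prompt("Hello!").await?;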
impl Drop for Client
fn drop(&mut self)
Tear down the worker thread synchronously.
Drop blocks until the worker thread has fully exited and the
LlamaModel / LlamaContext (and LlamaBackend device handles, plus
the multimodal projector when the mtmd feature is on) are released.
This is intentional: the caller almost always wants to allocate a
replacement Client immediately after dropping this one, and a
non-blocking drop would briefly hold 2× the model’s RAM/VRAM and risk
OOM. Client::reload reuses the same worker and avoids this whole
path; prefer it over drop-and-recreate when you can.
To keep the wait short even when a long generation is mid-flight,
Drop flips the shared cancel flag before signalling shutdown. The
worker polls the flag at every prompt-prefill chunk boundary and
every sampled token, so an in-flight Request returns within a
single decode step. The worst-case wait is therefore one decode step, not the rest of the generation.
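Conceptually, the polling side of that contract looks something like this sketch (names and structure are illustrative, not the crate's actual internals):

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Drop sets the flag; the worker's generation loop polls it at every
// prefill chunk boundary and every sampled token.
fn generate(cancel: &Arc<AtomicBool>, max_tokens: usize) {
    for _ in 0..max_tokens {
        if cancel.load(Ordering::Relaxed) {
            return; // bail within one decode step
        }
        // ... decode one chunk / sample one token ...
    }
}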
try_send(Shutdown) is best-effort: if the bounded command queue is
full at this instant, the Shutdown command isn’t enqueued — but the
in-flight request still bails on the cancel flag, and the worker’s
per-iteration cancel check at the top of its command loop also exits
the thread before pulling more queued commands.
Model clones outliving the Client keep the channel sender count
above zero; their send calls naturally fail with SendError once
the receiver is dropped on worker exit, so they don’t prevent
shutdown.