Struct Client 

Source
pub struct Client { /* private fields */ }

The llama.cpp completion client.

Client loads a GGUF model on a dedicated inference thread and exposes it through Rig’s CompletionClient trait. Construct one with Client::builder, or — for backward-compatible positional construction — Client::from_gguf.

§Lifecycle

The worker thread owns the LlamaModel, LlamaContext, and (when the mtmd feature is on) the multimodal projector. It only releases that memory when it exits, which happens in two cases:

  • On Client::reload, the worker drops the old model and loads the new one in place — the Client itself is not dropped, and the worker thread is reused. The caller blocks until the reload result is available.
  • On Client::drop, the worker thread is signalled and joined. See impl Drop for Client for the exact semantics — including how a long in-flight generation is cancelled so the dropping thread doesn’t have to wait for it to finish naturally.

Implementations§

Source§

impl Client

Source

pub fn builder(model_path: impl Into<String>) -> ClientBuilder

Start a ClientBuilder for a GGUF model at model_path.
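
A minimal usage sketch. Only Client::builder (and ClientBuilder::mmproj, documented below) are confirmed on this page; the `n_ctx` setter and `build` finaliser are assumed method names, mirroring the positional constructor's parameters.

```rust
// Hypothetical builder sketch: `n_ctx` and `build` are assumed names.
let client = Client::builder("models/model.gguf")
    .n_ctx(4096) // assumed setter, mirroring the `n_ctx` positional arg
    .build()?;   // assumed finaliser returning Result<Client, LoadError>
```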

Source

pub fn from_gguf(
    model_path: impl Into<String>,
    n_ctx: u32,
    sampling_params: SamplingParams,
    fit_params: FitParams,
    kv_cache_params: KvCacheParams,
    checkpoint_params: CheckpointParams,
) -> Result<Self, LoadError>

Load a GGUF model with automatic GPU/CPU layer fitting and start the inference worker thread.

llama.cpp will probe available device memory and determine the optimal layer distribution automatically.

Prefer Client::builder for new code — this constructor is kept for backward compatibility with the positional 0.1.x API and forwards directly to the builder.

§Arguments
  • model_path — Path to a .gguf model file.
  • n_ctx — Desired context window size in tokens.
  • sampling_params — Sampling parameters for token generation.
  • fit_params — Configuration for the fitting algorithm.
  • kv_cache_params — KV cache data-type configuration (defaults to F16/F16).
  • checkpoint_params — Tunables for the in-memory state-checkpoint cache used to preserve KV/recurrent state across chat turns for hybrid models.
§Errors

Returns a LoadError if the backend fails to initialise, automatic fitting fails, or the model cannot be loaded.
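
A positional call might look like the sketch below. The parameter types come from the signature above; the `::default()` constructors and the model path are illustrative assumptions, not confirmed by this page.

```rust
// Sketch only: `::default()` on the parameter structs is assumed.
let client = Client::from_gguf(
    "models/model.gguf",
    8192,
    SamplingParams::default(),
    FitParams::default(),
    KvCacheParams::default(),    // defaults to F16/F16 per the docs
    CheckpointParams::default(),
)?;
```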

Source

pub fn from_gguf_with_mmproj(
    model_path: impl Into<String>,
    mmproj_path: impl Into<String>,
    n_ctx: u32,
    sampling_params: SamplingParams,
    fit_params: FitParams,
    kv_cache_params: KvCacheParams,
    checkpoint_params: CheckpointParams,
) -> Result<Self, LoadError>

Load a GGUF vision model with a multimodal projector and automatic GPU/CPU layer fitting.

This constructor enables multimodal (vision) inference. The mmproj_path should point to a GGUF multimodal projector file (mmproj) that corresponds to the vision model.

Prefer Client::builder with ClientBuilder::mmproj for new code.

§Arguments
  • model_path — Path to a .gguf vision model file.
  • mmproj_path — Path to the corresponding multimodal projector .gguf file.
  • n_ctx — Desired context window size in tokens.
  • sampling_params — Sampling parameters for token generation.
  • fit_params — Configuration for the fitting algorithm.
  • kv_cache_params — KV cache data-type configuration (defaults to F16/F16).
  • checkpoint_params — Tunables for the in-memory state-checkpoint cache used to preserve KV/recurrent state across chat turns for hybrid models.
§Errors

Returns a LoadError if the backend fails to initialise, the model cannot be loaded, or the multimodal projector cannot be initialised.

Source

pub fn reload(
    &self,
    model_path: String,
    mmproj_path: Option<String>,
    n_ctx: u32,
    sampling: SamplingParams,
    fit_params: FitParams,
    kv_cache_params: KvCacheParams,
    checkpoint_params: CheckpointParams,
) -> Result<(), LoadError>

Reload the worker thread with a new model without destroying the backend.

This swaps the model in-place on the existing inference thread, avoiding the LlamaBackend singleton re-initialisation race that occurs when dropping and recreating a Client.

§Errors

Returns LoadError::WorkerNotRunning if the inference worker is no longer accepting commands, or any of the load-stage variants if the new model fails to come up.
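
A reload call might look like the sketch below; passing None for mmproj_path leaves the client text-only. The `::default()` constructors and the model path are illustrative assumptions.

```rust
// Swap in a new model on the existing worker thread.
client.reload(
    "models/other-model.gguf".to_string(),
    None, // no multimodal projector
    4096,
    SamplingParams::default(),
    FitParams::default(),
    KvCacheParams::default(),
    CheckpointParams::default(),
)?;
```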

Trait Implementations§

Source§

impl CompletionClient for Client

Source§

type CompletionModel = Model

The type of CompletionModel used by the client.
Source§

fn completion_model(&self, model: impl Into<String>) -> Self::CompletionModel

Create a completion model with the given model. Read more
Source§

fn agent(&self, model: impl Into<String>) -> AgentBuilder<Self::CompletionModel>

Create an agent builder with the given completion model. Read more
Source§

fn extractor<T>( &self, model: impl Into<String>, ) -> ExtractorBuilder<Self::CompletionModel, T>
where T: JsonSchema + for<'a> Deserialize<'a> + Serialize + Send + Sync,

Create an extractor builder with the given completion model.
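
Putting the trait methods together, a typical flow looks like the sketch below. `completion_model` and `agent` are the trait methods documented above; the `preamble`/`build` calls on the returned AgentBuilder follow Rig's usual builder API, and the "local" model name is a placeholder (for a local client the name is largely cosmetic).

```rust
// Sketch of CompletionClient usage; "local" is a placeholder name.
let model = client.completion_model("local");
let agent = client
    .agent("local")
    .preamble("You are a concise assistant.")
    .build();
```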
Source§

impl Drop for Client

Source§

fn drop(&mut self)

Tear down the worker thread synchronously.

Drop blocks until the worker thread has fully exited and the LlamaModel / LlamaContext (and LlamaBackend device handles, plus the multimodal projector when the mtmd feature is on) are released. This is intentional: the caller almost always wants to allocate a replacement Client immediately after dropping this one, and a non-blocking drop would briefly hold 2× the model’s RAM/VRAM and risk OOM. Client::reload reuses the same worker and avoids this whole path; prefer it over drop-and-recreate when you can.

To keep the wait short even when a long generation is mid-flight, Drop flips the shared cancel flag before signalling shutdown. The worker polls the flag at every prompt-prefill chunk boundary and every sampled token, so an in-flight Request returns within a single decode step. The worst-case wait is therefore one decode step, not the rest of the generation.

try_send(Shutdown) is best-effort: if the bounded command queue is full at this instant, the Shutdown command isn’t enqueued — but the in-flight request still bails on the cancel flag, and the worker’s per-iteration cancel check at the top of its command loop also exits the thread before pulling more queued commands.

Model clones outliving the Client keep the channel sender count above zero; their send calls naturally fail with SendError once the receiver is dropped on worker exit, so they don’t prevent shutdown.
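
The three mechanisms above — cancel flag polled each step, best-effort try_send on a bounded queue, and SendError once the receiver is dropped — can be modelled with a std-only sketch. This is an illustration of the pattern, not the crate's actual internals; all names below are local to the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;

// Miniature model of the teardown described above: a cancel flag polled
// every "decode step", a bounded command queue, and a best-effort Shutdown.
fn run_shutdown() -> (u32, bool) {
    let cancel = Arc::new(AtomicBool::new(false));
    let (tx, rx) = sync_channel::<&'static str>(1);

    let worker = {
        let cancel = Arc::clone(&cancel);
        thread::spawn(move || {
            let mut steps = 0u32;
            loop {
                steps += 1; // one "sampled token"
                if cancel.load(Ordering::Relaxed) {
                    break; // in-flight work bails within one step
                }
                thread::yield_now();
            }
            drop(rx); // worker exit drops the receiver
            steps
        })
    };

    // Drop-style teardown: flip the flag first, then best-effort Shutdown.
    cancel.store(true, Ordering::Relaxed);
    let _ = tx.try_send("shutdown"); // queue may be full; the flag still stops the worker
    let steps = worker.join().unwrap();

    // Once the receiver is gone, lingering senders fail with SendError
    // instead of blocking, so clones cannot prevent shutdown.
    let late_send_failed = tx.send("late").is_err();
    (steps, late_send_failed)
}

fn main() {
    let (steps, late_send_failed) = run_shutdown();
    assert!(steps >= 1 && late_send_failed);
    println!("worker exited after {steps} steps");
}
```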

Auto Trait Implementations§

§

impl !Freeze for Client

§

impl !RefUnwindSafe for Client

§

impl Send for Client

§

impl Sync for Client

§

impl Unpin for Client

§

impl UnsafeUnpin for Client

§

impl !UnwindSafe for Client

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> PolicyExt for T
where T: ?Sized,

Source§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow. Read more
Source§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

impl<T> WasmCompatSend for T
where T: Send,

Source§

impl<T> WasmCompatSync for T
where T: Sync,