pub struct LlamaModelParams { /* private fields */ }
A safe wrapper around llama_model_params.
Implementations
impl LlamaModelParams
pub fn kv_overrides<'a>(&'a self) -> KvOverrides<'a>
See KvOverrides
Examples
use llama_cpp_2::model::params::LlamaModelParams;

let params = Box::pin(LlamaModelParams::default());
let kv_overrides = params.kv_overrides();
let count = kv_overrides.into_iter().count();
assert_eq!(count, 0);

pub fn append_kv_override(
    self: Pin<&mut Self>,
    key: &CStr,
    value: ParamOverrideValue,
)
Appends a key-value override to the model parameters. It must be pinned as this creates a self-referential struct.
Examples

use std::ffi::CString;
use std::pin::pin;
use llama_cpp_2::model::params::{LlamaModelParams, ParamOverrideValue};

let mut params = pin!(LlamaModelParams::default());
let key = CString::new("key").expect("CString::new failed");
params.as_mut().append_kv_override(&key, ParamOverrideValue::Int(50));
let kv_overrides = params.kv_overrides().into_iter().collect::<Vec<_>>();
assert_eq!(kv_overrides.len(), 1);
let (k, v) = &kv_overrides[0];
assert_eq!(v, &ParamOverrideValue::Int(50));
assert_eq!(k.to_bytes(), b"key", "expected key to be 'key', was {:?}", k);

impl LlamaModelParams
pub fn add_cpu_moe_override(self: Pin<&mut Self>)
Adds buffer type overrides to move all mixture-of-experts layers to the CPU.
pub fn add_cpu_buft_override(self: Pin<&mut Self>, key: &CStr)
Appends a buffer type override to the model parameters, moving layers whose names match the given pattern to the CPU. It must be pinned as this creates a self-referential struct.
impl LlamaModelParams
pub fn fit_params(
self: Pin<&mut Self>,
model_path: &CStr,
cparams: &mut LlamaContextParams,
margins: &mut [usize],
n_ctx_min: u32,
log_level: ggml_log_level,
) -> Result<FitResult, FitError>
Automatically fit model parameters to available device memory.
Wraps llama.cpp’s llama_params_fit, which determines optimal n_gpu_layers,
tensor_split, and tensor_buft_overrides based on available VRAM. On success
the model and context params are updated in place.
Requirements

Per the C API docstring, only parameters that still hold their default value are modified. In practice this means:

- n_gpu_layers must be at its default (-1). Do not call with_n_gpu_layers before this.
- No tensor_buft_overrides may be set. Do not call add_cpu_buft_override or add_cpu_moe_override before this.
- cparams.n_ctx is only auto-selected if it is 0; otherwise it is left alone.
Arguments

- model_path — path to the GGUF model file.
- cparams — context parameters; n_ctx may be modified (see above).
- margins — memory margin per device in bytes. Must have at least llama_max_devices() elements.
- n_ctx_min — minimum context size to preserve when reducing memory usage.
- log_level — minimum log level for fitting output; lower levels are routed to the debug log.
Thread safety
This function is not thread safe: the underlying C call mutates the global llama logger state.
Errors
Returns FitError::Failure if no fitting allocation could be found, or
FitError::Error on a hard error (e.g. the model file could not be read).
impl LlamaModelParams
pub fn n_gpu_layers(&self) -> i32
Get the number of layers to offload to the GPU.
pub fn vocab_only(&self) -> bool

Whether to load only the vocabulary, without the weights.
pub fn split_mode(&self) -> Result<LlamaSplitMode, LlamaSplitModeParseError>

Gets the split mode.

Errors

Returns LlamaSplitModeParseError if an unknown split mode is encountered.
pub fn with_n_gpu_layers(self, n_gpu_layers: u32) -> Self

Sets the number of layers to offload to the GPU.
use llama_cpp_2::model::params::LlamaModelParams;

let params = LlamaModelParams::default();
let params = params.with_n_gpu_layers(1);
assert_eq!(params.n_gpu_layers(), 1);

pub fn with_main_gpu(self, main_gpu: i32) -> Self
Sets the main GPU.

This option takes effect only when split_mode is set to LlamaSplitMode::None (single-GPU mode).
pub fn with_vocab_only(self, vocab_only: bool) -> Self

Sets vocab_only.
pub fn with_use_mmap(self, use_mmap: bool) -> Self

Sets use_mmap.
pub fn with_use_mlock(self, use_mlock: bool) -> Self

Sets use_mlock.
pub fn with_split_mode(self, split_mode: LlamaSplitMode) -> Self

Sets split_mode.
pub fn with_devices(self, devices: &[usize]) -> Result<Self, LlamaCppError>

Sets the devices to use.
The devices are specified as indices that correspond to the ggml backend device indices.
The maximum number of devices is 16.
You don’t need to specify CPU or ACCEL devices.
Errors
Returns LlamaCppError::BackendDeviceNotFound if any device index is invalid.
pub fn with_no_alloc(self, no_alloc: bool) -> Self

Sets no_alloc.
If this parameter is true, no memory is allocated for the tensor data.
You can’t use no_alloc with use_mmap, so this also sets use_mmap to false.
Trait Implementations
impl Debug for LlamaModelParams
impl Default for LlamaModelParams

Default parameters for LlamaModel, as defined in llama.cpp by llama_model_default_params.
use llama_cpp_2::model::params::{LlamaModelParams, LlamaSplitMode};
let params = LlamaModelParams::default();
assert_eq!(params.n_gpu_layers(), -1, "n_gpu_layers should be -1");
assert_eq!(params.main_gpu(), 0, "main_gpu should be 0");
assert_eq!(params.vocab_only(), false, "vocab_only should be false");
assert_eq!(params.use_mmap(), true, "use_mmap should be true");
assert_eq!(params.use_mlock(), false, "use_mlock should be false");
assert_eq!(params.split_mode(), Ok(LlamaSplitMode::Layer), "split_mode should be LAYER");
assert_eq!(params.devices().len(), 0, "devices should be empty");
assert_eq!(params.no_alloc(), false, "no_alloc should be false");