Crate llama_cpp_4

Bindings to the llama.cpp library.

As llama.cpp is a fast-moving target, this crate does not attempt to build a stable, fully idiomatic Rust API. Instead it provides safe wrappers around nearly direct bindings to llama.cpp. This makes it easier to keep up with upstream changes, but it does mean the API is less ergonomic than it could be.

§Examples
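A minimal end-to-end sketch, pieced together from the module list below. The type and method names (`LlamaBackend::init`, `LlamaModel::load_from_file`, `LlamaModelParams`, `LlamaContextParams`, `new_context`) are assumptions modeled on typical llama.cpp bindings, not confirmed by this page; consult the individual module docs for the actual API.

```rust
// Hypothetical usage sketch; module paths come from the crate's module
// list, but the concrete type and method names are assumptions.
use llama_cpp_4::llama_backend::LlamaBackend;
use llama_cpp_4::model::LlamaModel;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize the backend once per process (llama_backend module).
    let backend = LlamaBackend::init()?;

    // Load a GGUF model from disk (model module); params type assumed.
    let model = LlamaModel::load_from_file(&backend, "model.gguf", &Default::default())?;

    // Create an inference context (context module); params type assumed.
    let mut ctx = model.new_context(&backend, Default::default())?;

    // From here: tokenize the prompt, decode batches (llama_batch module),
    // and sample tokens (sampling module).
    Ok(())
}
```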

§Feature Flags

  • cuda enables CUDA GPU support.
  • metal enables Apple Metal GPU support.
  • vulkan enables Vulkan GPU support (AMD / Intel / cross-platform).
  • native enables host-CPU optimisations (-march=native).
  • openmp enables OpenMP multi-core CPU parallelism (on by default).
  • rpc enables RPC backend support for distributed inference across multiple machines.
  • mtmd enables multimodal (image + audio) support via libmtmd.
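Backends are selected at build time through Cargo features. A sketch of enabling one of the flags above in `Cargo.toml` (the crates.io package name and version are placeholders; the feature names are taken from the list above):

```toml
[dependencies]
# Enable the CUDA backend alongside the default features
# (package name and version below are assumptions).
llama-cpp-4 = { version = "0.1", features = ["cuda"] }
```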

Modules§

common
Exposes common llama.cpp structures such as CommonParams.
context
Safe wrapper around llama_context.
ggml
Safe wrappers around core ggml graph computation APIs.
llama_backend
Representation of an initialized llama backend.
llama_batch
Safe wrapper around llama_batch.
model
A safe wrapper around llama_model.
mtmd
Safe wrappers for the libmtmd multimodal support library.
quantize
Quantization types and parameters for converting models to lower-bit precisions.
sampling
Safe wrapper around llama_sampler.
token
Safe wrappers around llama_token_data and llama_token_data_array.
token_type
Utilities for working with llama_token_type values.

Enums§

ApplyChatTemplateError
Failed to apply model chat template.
ChatTemplateError
There was an error while getting the chat template from a model.
DecodeError
Failed to decode a batch.
EmbeddingsError
An error that can occur when embedding-related functions fail.
EncodeError
Failed to encode a batch.
LLamaCppError
All errors that can occur in the llama-cpp crate.
LlamaContextLoadError
Failed to load a context.
LlamaLoraAdapterInitError
An error that can occur when initializing a LoRA adapter.
LlamaLoraAdapterRemoveError
An error that can occur when removing a LoRA adapter.
LlamaLoraAdapterSetError
An error that can occur when setting a LoRA adapter.
LlamaModelLoadError
An error that can occur when loading a model.
NewLlamaChatMessageError
An error that can occur when creating a new chat message.
StringFromModelError
Error retrieving a string from the model (e.g. description, metadata key/value).
StringToTokenError
Failed to convert a string to a token sequence.
TokenToStringError
An error that can occur when converting a token to a string.

Functions§

flash_attn_type_name
Get the name of a flash attention type.
ggml_time_us
Get the time in microseconds according to ggml.
llama_supports_mlock
Checks if mlock is supported.
llama_time_us
Get the time (in microseconds) according to llama.cpp.
log_get
Get the current log callback and user data.
log_set
Set the log callback.
max_devices
Get the maximum number of devices according to llama.cpp (generally CUDA devices).
max_parallel_sequences
Get the maximum number of parallel sequences supported.
max_tensor_buft_overrides
Get the maximum number of tensor buffer type overrides.
mlock_supported
Checks whether memory locking (mlock) is supported according to llama.cpp.
mmap_supported
Checks whether memory mapping (mmap) is supported according to llama.cpp.
model_meta_key_str
Get the string representation of a model metadata key.
model_quantize
Quantize a model file using typed [QuantizeParams].
model_quantize_default_params (deprecated)
Get default quantization parameters (raw sys type).
opt_epoch
Run one training epoch.
opt_init
Initialize optimizer state for fine-tuning.
opt_param_filter_all
Parameter filter that accepts all tensors (for use with opt_init).
params_fit
Auto-fit model and context parameters for available memory.
print_system_info
Get system information string.
supports_gpu_offload
Checks if GPU offload is supported.
supports_rpc
Checks if RPC backend is supported.

Type Aliases§

Result
A fallible result from a llama.cpp function.