Crate llama_cpp_4

Bindings to the llama.cpp library.

As llama.cpp is a fast-moving target, this crate does not attempt to build a stable, fully idiomatic Rust API. Instead it provides safe wrappers around nearly direct bindings to llama.cpp. This makes it easier to keep up with upstream changes, but it does mean the API is less ergonomic than it could be.

§Examples
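A minimal end-to-end sketch, pieced together from the module list below. The type and method names (`LlamaBackend::init`, `LlamaModel::load_from_file`, `LlamaModelParams`, `LlamaContextParams`, `new_context`) are assumptions modeled on typical llama.cpp bindings, not confirmed by this page; consult the individual module docs for the actual API.

```rust
// Hypothetical usage sketch; module paths come from the crate's module
// list, but the concrete type and method names are assumptions.
use llama_cpp_4::llama_backend::LlamaBackend;
use llama_cpp_4::model::LlamaModel;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize the backend once per process (llama_backend module).
    let backend = LlamaBackend::init()?;

    // Load a GGUF model from disk (model module); params type assumed.
    let model = LlamaModel::load_from_file(&backend, "model.gguf", &Default::default())?;

    // Create an inference context (context module); params type assumed.
    let mut ctx = model.new_context(&backend, Default::default())?;

    // From here: tokenize the prompt, decode batches (llama_batch module),
    // and sample tokens (sampling module).
    Ok(())
}
```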

§Feature Flags

  • cuda enables CUDA GPU support.
  • metal enables Apple Metal GPU support.
  • vulkan enables Vulkan GPU support (AMD / Intel / cross-platform).
  • native enables host-CPU optimisations (-march=native).
  • openmp enables OpenMP multi-core CPU parallelism (on by default).
  • rpc enables RPC backend support for distributed inference across multiple machines.
  • mtmd enables multimodal (image + audio) support via libmtmd.
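Backends are selected at build time through Cargo features. A sketch of enabling one of the flags above in `Cargo.toml` (the crates.io package name and version are placeholders; the feature names are taken from the list above):

```toml
[dependencies]
# Enable the CUDA backend alongside the default features
# (package name and version below are assumptions).
llama-cpp-4 = { version = "0.1", features = ["cuda"] }
```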

Modules§

common
Exposes common llama.cpp structures such as CommonParams.
context
Safe wrapper around llama_context.
ggml
Safe wrappers around core ggml graph computation APIs.
llama_backend
Representation of an initialized llama backend.
llama_batch
Safe wrapper around llama_batch.
model
A safe wrapper around llama_model.
mtmd
Safe wrappers for the libmtmd multimodal support library.
quantize
Quantization types and parameters for converting models to lower-bit precisions.
sampling
Safe wrapper around llama_sampler.
token
Safe wrappers around llama_token_data and llama_token_data_array.
token_type
Utilities for working with llama_token_type values.

Enums§

ApplyChatTemplateError
Failed to apply model chat template.
ChatTemplateError
There was an error while getting the chat template from a model.
DecodeError
Failed to decode a batch.
EmbeddingsError
An error that can occur when embedding-related functions fail.
EncodeError
Failed to encode a batch.
LLamaCppError
All errors that can occur in the llama-cpp crate.
LlamaContextLoadError
Failed to load a context.
LlamaLoraAdapterInitError
An error that can occur when initializing a LoRA adapter.
LlamaLoraAdapterRemoveError
An error that can occur when removing a LoRA adapter.
LlamaLoraAdapterSetError
An error that can occur when setting a LoRA adapter.
LlamaModelLoadError
An error that can occur when loading a model.
NewLlamaChatMessageError
An error that can occur when creating a new chat message.
StringFromModelError
Error retrieving a string from the model (e.g. description, metadata key/value).
StringToTokenError
Failed to convert a string to a token sequence.
TokenToStringError
An error that can occur when converting a token to a string.

Functions§

flash_attn_type_name
Get the name of a flash attention type.
ggml_time_us
Get the time in microseconds according to ggml.
llama_supports_mlock
Checks if mlock is supported.
llama_time_us
Get the time (in microseconds) according to llama.cpp.
log_get
Get the current log callback and user data.
log_set
Set the log callback.
max_devices
Get the maximum number of devices according to llama.cpp (generally CUDA devices).
max_parallel_sequences
Get the maximum number of parallel sequences supported.
max_tensor_buft_overrides
Get the maximum number of tensor buffer type overrides.
mlock_supported
Checks whether memory locking (mlock) is supported according to llama.cpp.
mmap_supported
Checks whether memory mapping (mmap) is supported according to llama.cpp.
model_meta_key_str
Get the string representation of a model metadata key.
model_quantize
Quantize a model file using typed [QuantizeParams].
model_quantize_default_params (deprecated)
Get default quantization parameters (raw sys type).
opt_epoch
Run one training epoch.
opt_init
Initialize optimizer state for fine-tuning.
opt_param_filter_all
Parameter filter that accepts all tensors (for use with opt_init).
params_fit
Auto-fit model and context parameters for available memory.
print_system_info
Get system information string.
supports_gpu_offload
Checks if GPU offload is supported.
supports_rpc
Checks if RPC backend is supported.

Type Aliases§

Result
A fallible result from a llama.cpp function.