Bindings to the llama.cpp library.
As llama.cpp is a fast-moving target, this crate does not attempt to provide a stable, fully idiomatic Rust API. Instead it provides safe wrappers around nearly direct bindings to llama.cpp. This makes it easier to keep up with changes in llama.cpp, but means the API is not as ergonomic as it could be.
§Examples
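A minimal end-to-end sketch of loading a model and tokenizing a prompt. The model path is a placeholder, and the method names below are assumptions inferred from the modules listed on this page; the crate tracks llama.cpp closely, so check the per-module docs for the exact signatures in your version.

```rust
// Sketch: initialize the backend, load a GGUF model, tokenize a prompt.
// NOTE: "model.gguf" is a placeholder path; type and method names are
// assumptions based on the `llama_backend`, `model`, and `context`
// modules listed below and may differ between crate versions.
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::{AddBos, LlamaModel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize the llama backend once per process.
    let backend = LlamaBackend::init()?;

    // Load a GGUF model from disk with default parameters.
    let model_params = LlamaModelParams::default();
    let model = LlamaModel::load_from_file(&backend, "model.gguf", &model_params)?;

    // Create an inference context and tokenize a prompt.
    let ctx_params = LlamaContextParams::default();
    let mut ctx = model.new_context(&backend, ctx_params)?;
    let tokens = model.str_to_token("Hello, world!", AddBos::Always)?;
    println!("prompt is {} tokens", tokens.len());

    let _ = &mut ctx; // batch decode / sampling loop omitted for brevity
    Ok(())
}
```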
§Feature Flags
- `cuda` enables CUDA GPU support.
- `metal` enables Apple Metal GPU support.
- `vulkan` enables Vulkan GPU support (AMD / Intel / cross-platform).
- `native` enables host-CPU optimisations (`-march=native`).
- `openmp` enables OpenMP multi-core CPU parallelism (on by default).
- `rpc` enables RPC backend support for distributed inference across multiple machines.
- `mtmd` enables multimodal (image + audio) support via `libmtmd`.
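GPU backends can be opted into from `Cargo.toml`. The crate name and version below are placeholders; enable only the features matching your hardware:

```toml
[dependencies]
# Enable exactly the backends you need; `openmp` is on by default.
llama-cpp-2 = { version = "*", features = ["cuda"] }
```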
§Modules
- `common`: Exposes common llama.cpp structures like `CommonParams`.
- `context`: Safe wrapper around `llama_context`.
- `ggml`: Safe wrappers around core ggml graph computation APIs.
- `llama_backend`: Representation of an initialized llama backend.
- `llama_batch`: Safe wrapper around `llama_batch`.
- `model`: A safe wrapper around `llama_model`.
- `mtmd`: Safe wrappers for the `libmtmd` multimodal support library.
- `sampling`: Safe wrapper around `llama_sampler`.
- `token`: Safe wrappers around `llama_token_data` and `llama_token_data_array`.
- `token_type`: Utilities for working with `llama_token_type` values.
§Enums
- `ApplyChatTemplateError`: Failed to apply the model chat template.
- `ChatTemplateError`: An error occurred while getting the chat template from a model.
- `DecodeError`: Failed to decode a batch.
- `EmbeddingsError`: An embedding-related function failed.
- `EncodeError`: Failed to encode a batch.
- `LLamaCppError`: All errors that can occur in the llama-cpp crate.
- `LlamaContextLoadError`: Failed to load a context.
- `LlamaLoraAdapterInitError`: An error that can occur when initializing a LoRA adapter.
- `LlamaLoraAdapterRemoveError`: An error that can occur when removing a LoRA adapter.
- `LlamaLoraAdapterSetError`: An error that can occur when setting a LoRA adapter.
- `LlamaModelLoadError`: An error that can occur when loading a model.
- `NewLlamaChatMessageError`: Failed to create a new chat message.
- `StringFromModelError`: Error retrieving a string from the model (e.g. description, metadata key/value).
- `StringToTokenError`: Failed to convert a string to a token sequence.
- `TokenToStringError`: An error that can occur when converting a token to a string.
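The concrete errors above all funnel into `LLamaCppError`, so fallible calls can either be bubbled up with `?` or handled at the call site. A hedged sketch, assuming the error enums are re-exported at the crate root and that `LlamaModel::load_from_file` returns `LlamaModelLoadError` on failure:

```rust
// Sketch: handling a model-load failure explicitly instead of
// propagating it. Paths and signatures are assumptions; the error
// enums are taken from the list above.
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::{params::LlamaModelParams, LlamaModel};

fn try_load(path: &str) -> Option<LlamaModel> {
    let backend = LlamaBackend::init().ok()?; // backend init may fail
    let params = LlamaModelParams::default();
    match LlamaModel::load_from_file(&backend, path, &params) {
        Ok(model) => Some(model),
        // LlamaModelLoadError: bad path, malformed GGUF, out of memory, ...
        Err(e) => {
            eprintln!("model load failed: {e}");
            None
        }
    }
}
```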
§Functions
- `flash_attn_type_name`: Get the name of a flash attention type.
- `ggml_time_us`: Get the time in microseconds according to ggml.
- `llama_supports_mlock`: Checks if mlock is supported.
- `llama_time_us`: Get the time (in microseconds) according to llama.cpp.
- `log_get` ⚠: Get the current log callback and user data.
- `log_set` ⚠: Set the log callback.
- `max_devices`: Get the maximum number of devices according to llama.cpp (generally CUDA devices).
- `max_parallel_sequences`: Get the maximum number of parallel sequences supported.
- `max_tensor_buft_overrides`: Get the maximum number of tensor buffer type overrides.
- `mlock_supported`: Checks whether memory locking is supported according to llama.cpp.
- `mmap_supported`: Checks whether memory mapping is supported according to llama.cpp.
- `model_meta_key_str`: Get the string representation of a model metadata key.
- `model_quantize`: Quantize a model file.
- `model_quantize_default_params`: Get default quantization parameters.
- `opt_epoch` ⚠: Run one training epoch.
- `opt_init` ⚠: Initialize optimizer state for fine-tuning.
- `opt_param_filter_all` ⚠: Parameter filter that accepts all tensors (for use with `opt_init`).
- `params_fit` ⚠: Auto-fit model and context parameters for available memory.
- `print_system_info`: Get system information string.
- `supports_gpu_offload`: Checks if GPU offload is supported.
- `supports_rpc`: Checks if the RPC backend is supported.
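The capability helpers above can be used to probe the runtime before loading anything. A small sketch, assuming these free functions are exported at the crate root:

```rust
// Sketch: query llama.cpp capabilities before committing to a
// configuration. Function names are taken from the list above; the
// crate-root re-export is an assumption.
use llama_cpp_2::{max_devices, mlock_supported, mmap_supported, supports_gpu_offload};

fn main() {
    println!("devices:     {}", max_devices());
    println!("mmap:        {}", mmap_supported());
    println!("mlock:       {}", mlock_supported());
    println!("gpu offload: {}", supports_gpu_offload());
}
```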
§Type Aliases
- `Result`: A fallible result from a llama.cpp function.