Crate llama_cpp_2

Bindings to the llama.cpp library.

As llama.cpp is a very fast-moving target, this crate does not attempt to create a stable, fully idiomatic Rust API. Instead it provides safe wrappers around nearly direct bindings to llama.cpp. This makes it easier to keep up with changes in llama.cpp, but it does mean the API is not as ergonomic as it could be.

Examples

Inference

use llama_cpp_2::model::LlamaModel;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::token::data_array::LlamaTokenDataArray;


// initialize GGML
let backend = LlamaBackend::init()?;

// load the model (this may be slow)
let model = LlamaModel::load_from_file(&backend, "path/to/model", &LlamaModelParams::default())?;
let prompt = "How do I kill a process on linux?";
let tokens = model.str_to_token(prompt, true)?;

// create a context and batch
let mut context = model.new_context(&backend, &LlamaContextParams::default())?;
let mut batch = LlamaBatch::new(512, 1);
let mut pos: i32 = 0;

// add the prompt to the batch
let last_index = i32::try_from(tokens.len())? - 1;
for token in tokens {
    batch.add(token, pos, &[0], pos == last_index);
    pos += 1;
}

let mut response = vec![];

// main loop: decode the batch, then greedily sample one new token at a time
for _ in 0..10 {
    context.decode(&mut batch)?;

    // sample from the logits of the last token in the batch (indices are 0-based)
    let candidates = LlamaTokenDataArray::from_iter(context.candidates_ith(batch.n_tokens() - 1), false);
    let token = context.sample_token_greedy(candidates);
    response.push(token);

    // feed the sampled token back in as the only member of the next batch
    batch.clear();
    batch.add(token, pos, &[0], true);
    pos += 1;
}

let response_str = model.tokens_to_str(&response)?;
println!("{}", response_str);
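
Note that the example uses the ? operator at the top level, so it must run inside a function that returns a Result. A minimal sketch of a suitable entry point, assuming the crate's error types implement std::error::Error (so ? can convert them into a boxed error):

use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // ... the inference example above goes here; every `?` propagates
    // a failure (backend init, model load, tokenization, decode) out of main
    Ok(())
}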

Modules

  • context — Safe wrapper around llama_context.
  • grammar — The grammar parser and the grammar struct (see the sketch after this list).
  • llama_backend — Representation of an initialized llama backend.
  • llama_batch — Safe wrapper around llama_batch.
  • model — Safe wrapper around llama_model.
  • timing — Safe wrapper around llama_timings.
  • token — Safe wrappers around llama_token_data and llama_token_data_array.
  • token_type — Utilities for working with llama_token_type values.
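
As an illustration of the grammar module, here is a sketch that constrains greedy sampling with a GBNF grammar. Treat the names as assumptions: LlamaGrammar::from_str, sample_grammar, and grammar_accept_token mirror llama.cpp's grammar API, but the exact signatures should be checked against the module docs for the version in use.

use std::str::FromStr;
use llama_cpp_2::grammar::LlamaGrammar;

// a GBNF grammar that only accepts the strings "yes" or "no"
let mut grammar = LlamaGrammar::from_str("root ::= \"yes\" | \"no\"")?;

// inside the sampling loop: filter the candidates through the grammar,
// sample greedily, then tell the grammar which token was accepted
// (sample_grammar and grammar_accept_token are assumed method names)
let mut candidates = LlamaTokenDataArray::from_iter(context.candidates_ith(batch.n_tokens() - 1), false);
context.sample_grammar(&mut candidates, &grammar);
let token = context.sample_token_greedy(candidates);
context.grammar_accept_token(&mut grammar, token);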

Enums

Functions

  • Get the time (in microseconds) according to llama.cpp.
  • Get the maximum number of devices according to llama.cpp (generally the number of CUDA devices).
  • Whether memory locking (mlock) is supported according to llama.cpp.
  • Whether memory mapping (mmap) is supported according to llama.cpp (all four helpers are used in the sketch below).
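
A minimal sketch that uses these helpers to report host capabilities before loading a model. The function names (llama_time_us, max_devices, mlock_supported, mmap_supported) are assumptions inferred from the descriptions above and may differ between crate versions:

use llama_cpp_2::{llama_time_us, max_devices, mlock_supported, mmap_supported};

// report what the linked llama.cpp build supports (function names assumed; see above)
println!("time (us):     {}", llama_time_us());
println!("max devices:   {}", max_devices());
println!("mlock support: {}", mlock_supported());
println!("mmap support:  {}", mmap_supported());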

Type Aliases

  • Result — A fallible result from a llama.cpp function.