Crate llama_cpp_2
Bindings to the llama.cpp library.
As llama.cpp is a very fast moving target, this crate does not attempt to create a stable API with all the Rust idioms. Instead, it provides safe wrappers around nearly direct bindings to llama.cpp. This makes it easier to keep up with changes in llama.cpp, but it does mean the API is not as ergonomic as it could be.
Examples
Inference
use llama_cpp_2::model::LlamaModel;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::token::data_array::LlamaTokenDataArray;
// initialize GGML
let backend = LlamaBackend::init()?;
// load the model (this may be slow)
let model = LlamaModel::load_from_file(&backend, "path/to/model", &LlamaModelParams::default())?;
let prompt = "How do I kill a process on linux?";
let tokens = model.str_to_token(prompt, true)?;
// create a context and batch
let mut context = model.new_context(&backend, &LlamaContextParams::default())?;
let mut batch = LlamaBatch::new(512, 1);
let mut pos: i32 = 0;
// add the prompt to the batch
let last_index = i32::try_from(tokens.len())? - 1;
for token in tokens {
    batch.add(token, pos, &[0], pos == last_index);
    pos += 1;
}
let mut response = vec![];
// decode the batch and sample the next 10 tokens
for _ in 0..10 {
    context.decode(&mut batch)?;
    // sample greedily from the logits of the last token in the batch
    // (logit indices run from 0 to n_tokens - 1)
    let candidates = context.candidates_ith(batch.n_tokens() - 1);
    let token = context.sample_token_greedy(LlamaTokenDataArray::from_iter(candidates, false));
    response.push(token);
    batch.clear();
    batch.add(token, pos, &[0], true);
    pos += 1;
}
let response_str = model.tokens_to_str(&response)?;
println!("{}", response_str);
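The `?` operators in the example assume an enclosing function that returns a compatible Result. A minimal sketch of such a wrapper, boxing the error for brevity (this assumes, as is usual, that the crate's error enums implement std::error::Error):

use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // ... the example body from above ...
    Ok(())
}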
Modules
- Safe wrapper around llama_context.
- The grammar module contains the grammar parser and the grammar struct.
- Representation of an initialized llama backend.
- Safe wrapper around llama_batch.
- A safe wrapper around llama_model.
- Safe wrapper around llama_timings.
- Safe wrappers around llama_token_data and llama_token_data_array.
- Utilities for working with llama_token_type values.
Enums
- Failed to decode a batch.
- All errors that can occur in the llama-cpp crate.
- Failed to load a context.
- An error that can occur when loading a model.
- Failed to convert a string to a token sequence.
- An error that can occur when converting a token to a string.
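This listing omits the enum names, so the sketch below avoids naming them: it handles a model-load failure by matching on the Result returned from LlamaModel::load_from_file and printing the error through its Display impl (assumed here, as is usual for error enums):

use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::LlamaModel;
use llama_cpp_2::model::params::LlamaModelParams;

fn main() {
    let backend = LlamaBackend::init().expect("failed to initialize backend");
    // handle the model-load error explicitly instead of propagating it with `?`
    match LlamaModel::load_from_file(&backend, "path/to/model", &LlamaModelParams::default()) {
        Ok(model) => {
            // use `model` as in the inference example
            let _ = model;
        }
        Err(e) => eprintln!("failed to load model: {e}"),
    }
}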
Functions
- Get the time (in microseconds) according to llama.cpp.
- Get the max number of devices according to llama.cpp (generally CUDA devices).
- Check whether memory locking is supported according to llama.cpp.
- Check whether memory mapping is supported according to llama.cpp.
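The function names are likewise omitted from this listing. Assuming crate-root identifiers along the lines of llama_time_us, max_devices, mlock_supported, and mmap_supported (these are assumptions; check the crate's function index for your version), usage is a plain free-function call:

fn main() {
    // NOTE: the function names below are assumed, not confirmed by this listing
    let now_us = llama_cpp_2::llama_time_us();  // time in microseconds
    let devices = llama_cpp_2::max_devices();   // max number of (generally CUDA) devices
    let mlock = llama_cpp_2::mlock_supported(); // memory locking support
    let mmap = llama_cpp_2::mmap_supported();   // memory mapping support
    println!("time={now_us}us, devices={devices}, mlock={mlock}, mmap={mmap}");
}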
Type Aliases
- A fallible result from a llama.cpp function.
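The alias name is also omitted here. Assuming it is a crate-level Result<T> whose error side is the crate-wide error enum, it can shorten the signatures of helpers that only surface llama.cpp errors:

// hypothetical helper; `llama_cpp_2::Result` and the return type of
// LlamaBackend::init matching it are assumptions
fn init_backend() -> llama_cpp_2::Result<llama_cpp_2::llama_backend::LlamaBackend> {
    llama_cpp_2::llama_backend::LlamaBackend::init()
}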