pub struct LlamaModel {
pub model: NonNull<llama_model>,
/* private fields */
}Expand description
A safe wrapper around llama_model.
Fields§
§model: NonNull<llama_model>Raw pointer to the underlying llama_model.
Implementations§
Source§impl LlamaModel
impl LlamaModel
Sourcepub fn vocab_ptr(&self) -> *const llama_vocab
pub fn vocab_ptr(&self) -> *const llama_vocab
Returns a raw pointer to the model’s vocabulary.
Sourcepub fn n_ctx_train(&self) -> Result<u32, TryFromIntError>
pub fn n_ctx_train(&self) -> Result<u32, TryFromIntError>
Get the number of tokens the model was trained on.
§Errors
Returns an error if the value returned by llama.cpp does not fit into a u32.
Sourcepub fn tokens(
&self,
decode_special: bool,
) -> impl Iterator<Item = (LlamaToken, Result<String, TokenToStringError>)> + '_
pub fn tokens( &self, decode_special: bool, ) -> impl Iterator<Item = (LlamaToken, Result<String, TokenToStringError>)> + '_
Get all tokens in the model.
Sourcepub fn token_bos(&self) -> LlamaToken
pub fn token_bos(&self) -> LlamaToken
Get the beginning of stream token.
Sourcepub fn token_eos(&self) -> LlamaToken
pub fn token_eos(&self) -> LlamaToken
Get the end of stream token.
Sourcepub fn token_nl(&self) -> LlamaToken
pub fn token_nl(&self) -> LlamaToken
Get the newline token.
Sourcepub fn is_eog_token(&self, token: &SampledToken) -> bool
pub fn is_eog_token(&self, token: &SampledToken) -> bool
Check if a token represents the end of generation (end of turn, end of sequence, etc.)
Sourcepub fn decode_start_token(&self) -> LlamaToken
pub fn decode_start_token(&self) -> LlamaToken
Get the decoder start token.
Sourcepub fn token_sep(&self) -> LlamaToken
pub fn token_sep(&self) -> LlamaToken
Get the separator token (SEP).
Sourcepub fn str_to_token(
&self,
str: &str,
add_bos: AddBos,
) -> Result<Vec<LlamaToken>, StringToTokenError>
pub fn str_to_token( &self, str: &str, add_bos: AddBos, ) -> Result<Vec<LlamaToken>, StringToTokenError>
Convert a string to a Vector of tokens.
§Errors
- if
strcontains a null byte - if an integer conversion fails during tokenization
use llama_cpp_bindings::model::LlamaModel;
use std::path::Path;
use llama_cpp_bindings::model::AddBos;
let backend = llama_cpp_bindings::llama_backend::LlamaBackend::init()?;
let model = LlamaModel::load_from_file(&backend, Path::new("path/to/model"), &Default::default())?;
let tokens = model.str_to_token("Hello, World!", AddBos::Always)?;Sourcepub fn token_attr(
&self,
LlamaToken: LlamaToken,
) -> Result<LlamaTokenAttrs, LlamaTokenAttrsFromIntError>
pub fn token_attr( &self, LlamaToken: LlamaToken, ) -> Result<LlamaTokenAttrs, LlamaTokenAttrsFromIntError>
Sourcepub fn token_to_piece(
&self,
token: &SampledToken,
decoder: &mut Decoder,
special: bool,
lstrip: Option<NonZeroU16>,
) -> Result<String, TokenToStringError>
pub fn token_to_piece( &self, token: &SampledToken, decoder: &mut Decoder, special: bool, lstrip: Option<NonZeroU16>, ) -> Result<String, TokenToStringError>
Convert a token to a string using the underlying llama.cpp llama_token_to_piece function.
This is the new default function for token decoding and provides direct access to the llama.cpp token decoding functionality without any special logic or filtering.
Decoding raw string requires using an decoder, tokens from language models may not always map to full characters depending on the encoding so stateful decoding is required, otherwise partial strings may be lost! Invalid characters are mapped to REPLACEMENT CHARACTER making the method safe to use even if the model inherently produces garbage.
§Errors
-
if the token type is unknown
-
if the returned size from llama.cpp does not fit into a
usize
Sourcepub fn token_to_piece_bytes(
&self,
token: LlamaToken,
buffer_size: usize,
special: bool,
lstrip: Option<NonZeroU16>,
) -> Result<Vec<u8>, TokenToStringError>
pub fn token_to_piece_bytes( &self, token: LlamaToken, buffer_size: usize, special: bool, lstrip: Option<NonZeroU16>, ) -> Result<Vec<u8>, TokenToStringError>
Raw token decoding to bytes, use if you want to handle the decoding model output yourself
Convert a token to bytes using the underlying llama.cpp llama_token_to_piece function. This is mostly
a thin wrapper around llama_token_to_piece function, that handles rust <-> c type conversions while
letting the caller handle errors. For a safer interface returning rust strings directly use token_to_piece instead!
§Errors
- if the token type is unknown
- the resultant token is larger than
buffer_size. - if an integer conversion fails
Sourcepub fn n_vocab(&self) -> i32
pub fn n_vocab(&self) -> i32
The number of tokens the model was trained on.
This returns a c_int for maximum compatibility. Most of the time it can be cast to an i32
without issue.
Sourcepub fn vocab_type(&self) -> Result<VocabType, VocabTypeFromIntError>
pub fn vocab_type(&self) -> Result<VocabType, VocabTypeFromIntError>
The type of vocab the model was trained on.
§Errors
Returns an error if llama.cpp emits a vocab type that is not known to this library.
Sourcepub fn n_embd(&self) -> c_int
pub fn n_embd(&self) -> c_int
This returns a c_int for maximum compatibility. Most of the time it can be cast to an i32
without issue.
Sourcepub fn is_recurrent(&self) -> bool
pub fn is_recurrent(&self) -> bool
Returns whether the model is a recurrent network (Mamba, RWKV, etc)
Sourcepub fn n_layer(&self) -> Result<u32, TryFromIntError>
pub fn n_layer(&self) -> Result<u32, TryFromIntError>
Returns the number of layers within the model.
§Errors
Returns an error if the layer count returned by llama.cpp does not fit into a u32.
Sourcepub fn n_head(&self) -> Result<u32, TryFromIntError>
pub fn n_head(&self) -> Result<u32, TryFromIntError>
Returns the number of attention heads within the model.
§Errors
Returns an error if the head count returned by llama.cpp does not fit into a u32.
Sourcepub fn n_head_kv(&self) -> Result<u32, TryFromIntError>
pub fn n_head_kv(&self) -> Result<u32, TryFromIntError>
Returns the number of KV attention heads.
§Errors
Returns an error if the KV head count returned by llama.cpp does not fit into a u32.
Sourcepub fn is_hybrid(&self) -> bool
pub fn is_hybrid(&self) -> bool
Returns whether the model is a hybrid network (Jamba, Granite, Qwen3xx, etc.)
Hybrid models have both attention layers and recurrent/SSM layers.
Sourcepub fn meta_val_str(&self, key: &str) -> Result<String, MetaValError>
pub fn meta_val_str(&self, key: &str) -> Result<String, MetaValError>
Get metadata value as a string by key name
§Errors
Returns an error if the key is not found or the value is not valid UTF-8.
Sourcepub fn meta_count(&self) -> i32
pub fn meta_count(&self) -> i32
Get the number of metadata key/value pairs
Sourcepub fn meta_key_by_index(&self, index: i32) -> Result<String, MetaValError>
pub fn meta_key_by_index(&self, index: i32) -> Result<String, MetaValError>
Get metadata key name by index
§Errors
Returns an error if the index is out of range or the key is not valid UTF-8.
Sourcepub fn meta_val_str_by_index(&self, index: i32) -> Result<String, MetaValError>
pub fn meta_val_str_by_index(&self, index: i32) -> Result<String, MetaValError>
Get metadata value as a string by index
§Errors
Returns an error if the index is out of range or the value is not valid UTF-8.
Sourcepub fn chat_template(
&self,
name: Option<&str>,
) -> Result<LlamaChatTemplate, ChatTemplateError>
pub fn chat_template( &self, name: Option<&str>, ) -> Result<LlamaChatTemplate, ChatTemplateError>
Get chat template from model by name. If the name parameter is None, the default chat template will be returned.
You supply this into Self::apply_chat_template to get back a string with the appropriate template
substitution applied to convert a list of messages into a prompt the LLM can use to complete
the chat.
You could also use an external jinja parser, like minijinja, to parse jinja templates not supported by the llama.cpp template engine.
§Errors
- If the model has no chat template by that name
§Panics
Panics if the C-returned chat template string contains interior null bytes (should never happen with valid model data).
Sourcepub fn load_from_file(
_: &LlamaBackend,
path: impl AsRef<Path>,
params: &LlamaModelParams,
) -> Result<Self, LlamaModelLoadError>
pub fn load_from_file( _: &LlamaBackend, path: impl AsRef<Path>, params: &LlamaModelParams, ) -> Result<Self, LlamaModelLoadError>
Loads a model from a file.
§Errors
See LlamaModelLoadError for more information.
§Panics
Panics if a valid UTF-8 path somehow contains interior null bytes (should never happen).
Sourcepub fn lora_adapter_init(
&self,
path: impl AsRef<Path>,
) -> Result<LlamaLoraAdapter, LlamaLoraAdapterInitError>
pub fn lora_adapter_init( &self, path: impl AsRef<Path>, ) -> Result<LlamaLoraAdapter, LlamaLoraAdapterInitError>
Sourcepub fn apply_chat_template(
&self,
tmpl: &LlamaChatTemplate,
chat: &[LlamaChatMessage],
add_ass: bool,
) -> Result<String, ApplyChatTemplateError>
pub fn apply_chat_template( &self, tmpl: &LlamaChatTemplate, chat: &[LlamaChatMessage], add_ass: bool, ) -> Result<String, ApplyChatTemplateError>
Apply the models chat template to some messages. See https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Unlike the llama.cpp apply_chat_template which just randomly uses the ChatML template when given
a null pointer for the template, this requires an explicit template to be specified. If you want to
use “chatml”, then just do LlamaChatTemplate::new("chatml") or any other model name or template
string.
Use Self::chat_template to retrieve the template baked into the model (this is the preferred
mechanism as using the wrong chat template can result in really unexpected responses from the LLM).
You probably want to set add_ass to true so that the generated template string ends with a the
opening tag of the assistant. If you fail to leave a hanging chat tag, the model will likely generate
one into the output and the output may also have unexpected output aside from that.
§Errors
There are many ways this can fail. See ApplyChatTemplateError for more information.
Sourcepub fn sampled_token_classifier(&self) -> SampledTokenClassifier<'_>
pub fn sampled_token_classifier(&self) -> SampledTokenClassifier<'_>
Build a streaming SampledTokenClassifier for this model.
At construction the bindings detect reasoning markers (via the autoparser, with a chunked-thinking fallback for templates that consume thoughts via content blocks), tool-call markers, and the trailing generation-prompt slice. The classifier then runs a state machine over the decoded token stream — no per-model branches.
If the model has no usable chat template the classifier is built in a
blind mode that classifies every token as
SampledToken::Undeterminable.
Sourcepub fn streaming_markers(
&self,
) -> Result<StreamingMarkers, MarkerDetectionError>
pub fn streaming_markers( &self, ) -> Result<StreamingMarkers, MarkerDetectionError>
Detect reasoning / tool-call markers (as token-ID sequences) and the
trailing generation-prompt slice for this model’s chat template. The
returned StreamingMarkers carry tokenised markers — never raw strings
— so the classifier matches by LlamaToken equality rather than text
scanning.
§Errors
Returns MarkerDetectionError when any underlying FFI call fails.
Sourcepub fn reasoning_markers(
&self,
) -> Result<Option<ReasoningMarkers>, MarkerDetectionError>
pub fn reasoning_markers( &self, ) -> Result<Option<ReasoningMarkers>, MarkerDetectionError>
§Errors
Returns MarkerDetectionError when the underlying FFI call fails.
Sourcepub fn tool_call_markers(&self) -> Option<ToolCallMarkers>
pub fn tool_call_markers(&self) -> Option<ToolCallMarkers>
Returns the rich tool-call marker bundle (open / separator / close /
optional value-quote pair) for this model’s chat template, sourced from
the wrapper’s per-template override registry. Returns None when no
registered override matches — callers in that case fall back to
llama.cpp’s autoparser via Self::parse_chat_message.
Sourcepub fn parse_chat_message(
&self,
tools_json: &str,
input: &str,
is_partial: bool,
) -> Result<ChatMessageParseOutcome, ParseChatMessageError>
pub fn parse_chat_message( &self, tools_json: &str, input: &str, is_partial: bool, ) -> Result<ChatMessageParseOutcome, ParseChatMessageError>
Parse the assistant’s output text into structured content, reasoning, and tool calls.
Two passes, in order:
- Duck-type the wrapper-side parsers across every known shape (Qwen XML, GLM key-value, Gemma paired-quote, Mistral bracketed-JSON). First match wins. The shapes are ordered so that more restrictive shapes run first, which keeps the duck-type pass safe for inputs that share an open marker but differ in inner structure.
- Delegate to llama.cpp’s
common_chat_parse. If it succeeds the result isRecognized; if it throwsParseExceptionthe result isUnrecognizedwith the raw input plus the FFI’s diagnostic, so the caller can pass the unstructured tokens to the client.
Empty tool-call id fields are filled with call_{index} before
returning, so callers always see well-formed identifiers.
tools_json is a JSON-array string of OpenAI-style tool definitions
(use "[]" when no tools are in scope). is_partial switches between
mid-stream (lenient) and final (strict) parses for the FFI step.
§Errors
Returns ParseChatMessageError when tools_json is not valid JSON,
the FFI returns a non-OK status other than ParseException, or
accessor strings are not valid UTF-8.
Sourcepub fn diagnose_tool_call_synthetic_renders(
&self,
) -> Result<(String, String), MarkerDetectionError>
pub fn diagnose_tool_call_synthetic_renders( &self, ) -> Result<(String, String), MarkerDetectionError>
Render the model’s chat template with the autoparser’s synthetic
no-tools and with-tools inputs. Returns (output_no_tools, output_with_tools). Either side can be empty when the template throws
during rendering. Useful for debugging tool-call marker detection.
§Errors
Returns MarkerDetectionError when the C++ analyzer throws or the FFI
returns a non-OK status.
Source§impl LlamaModel
impl LlamaModel
Sourcepub fn approximate_tok_env(&self) -> Arc<ApproximateTokEnv>
pub fn approximate_tok_env(&self) -> Arc<ApproximateTokEnv>
Returns a process-cached, approximate token environment built from this model’s vocabulary.
The first call iterates the full vocabulary and constructs the trie; subsequent calls
return the cached Arc without further FFI work.
Trait Implementations§
Source§impl Debug for LlamaModel
impl Debug for LlamaModel
Source§impl Drop for LlamaModel
impl Drop for LlamaModel
impl Send for LlamaModel
impl Sync for LlamaModel
Auto Trait Implementations§
impl !Freeze for LlamaModel
impl RefUnwindSafe for LlamaModel
impl Unpin for LlamaModel
impl UnsafeUnpin for LlamaModel
impl UnwindSafe for LlamaModel
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read more