Struct EmbeddingModel

pub struct EmbeddingModel { /* private fields */ }

Represents a loaded embedding model.

This struct encapsulates the llama_cpp_2::LlamaModel and LlamaContext and provides methods for generating embeddings from text input.

§Important

Due to the !Send nature of LlamaContext, instances of this struct cannot be safely sent between threads. Each thread must maintain its own instance.
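The one-instance-per-thread rule can be sketched with a stand-in type. `ThreadBoundModel` and `embed_on_workers` below are hypothetical names invented for illustration and are not part of embellama; the stand-in is made `!Send` with a `PhantomData<Rc<()>>` marker, and its "embedding" is a zero vector. Each worker constructs its own model inside the thread and sends back only the resulting vector, which is `Send`.

```rust
use std::marker::PhantomData;
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for `EmbeddingModel`; the `Rc` marker makes it `!Send`.
struct ThreadBoundModel {
    dims: usize,
    _not_send: PhantomData<Rc<()>>,
}

impl ThreadBoundModel {
    fn new(dims: usize) -> Self {
        Self { dims, _not_send: PhantomData }
    }

    /// Placeholder "embedding": a zero vector of the configured length.
    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; self.dims]
    }
}

/// Spawns one worker per input text. Every worker builds its own model on
/// its own thread, so no model instance ever crosses a thread boundary.
fn embed_on_workers(texts: Vec<String>, dims: usize) -> Vec<Vec<f32>> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for (index, text) in texts.into_iter().enumerate() {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let model = ThreadBoundModel::new(dims); // constructed on this thread
            tx.send((index, model.embed(&text))).expect("receiver is alive");
        }));
    }
    drop(tx); // close the channel once all workers have finished
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    // Restore input order, since workers may finish in any order.
    let mut results: Vec<(usize, Vec<f32>)> = rx.iter().collect();
    results.sort_by_key(|(index, _)| *index);
    results.into_iter().map(|(_, embedding)| embedding).collect()
}
```

Trying to move a `ThreadBoundModel` into `thread::spawn` instead would be rejected at compile time, which is the same guarantee the `!Send` bound on `LlamaContext` provides. A real application would more likely keep a small pool of long-lived workers than spawn a thread per input.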

§Example

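use embellama::model::EmbeddingModel;
use embellama::config::ModelConfig;
use llama_cpp_2::llama_backend::LlamaBackend;

// `new` takes the backend as its first argument, so initialize it first.
let backend = LlamaBackend::init()?;

let config = ModelConfig::builder()
    .with_model_path("path/to/model.gguf")
    .with_model_name("my-model")
    .build()?;

let model = EmbeddingModel::new(&backend, &config)?;
assert!(model.is_loaded());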

Implementations§

impl EmbeddingModel

pub fn new(backend: &LlamaBackend, config: &ModelConfig) -> Result<Self>

Creates a new embedding model from the given configuration.

§Arguments

backend - The llama backend to use for model loading
config - The model configuration containing path and parameters

§Returns

Returns a Result containing the initialized model or an error.

§Errors

This function will return an error if:

The model file cannot be loaded
The context creation fails
Invalid configuration parameters are provided

pub fn load(backend: &LlamaBackend, config: &ModelConfig) -> Result<Self>

Loads a model from disk.

This is an alternative way to create a model, useful when you want to explicitly separate the loading step.

§Arguments

backend - The llama backend to use for model loading
config - The model configuration

§Returns

Returns a Result containing the loaded model or an error.

§Errors

Returns an error if model loading fails.

pub fn unload(self)

Consumes the model and explicitly frees resources.

Note: This happens automatically when the model is dropped. This method exists mainly for explicit resource management.

pub fn is_loaded(&self) -> bool

Checks if the model is currently loaded and ready for inference.

§Returns

Returns true if the model is loaded, false otherwise.

pub fn embedding_dimensions(&self) -> usize

Returns the dimensionality of embeddings produced by this model.

§Returns

The number of dimensions in the embedding vectors.

pub fn max_sequence_length(&self) -> usize

Returns the maximum sequence length supported by this model.

§Returns

The maximum number of tokens that can be processed.

pub fn model_size(&self) -> Option<usize>

Returns the approximate memory footprint of the model in bytes.

§Returns

Estimated memory usage in bytes, or None if the size cannot be calculated (e.g., on 32-bit platforms with very large models).
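Because the size is optional, callers need a fallback for the `None` case. The helper below is a minimal caller-side sketch; `describe_size` is a hypothetical name, not an embellama function, and it only formats whatever `model_size()` returned.

```rust
/// Hypothetical helper: renders an optional byte count for logs or a UI.
/// `None` (size could not be calculated) is shown as "unknown".
fn describe_size(size: Option<usize>) -> String {
    match size {
        // Convert bytes to mebibytes for readability.
        Some(bytes) => format!("{:.1} MiB", bytes as f64 / (1024.0 * 1024.0)),
        None => "unknown".to_string(),
    }
}
```

With a loaded model this would be called as `describe_size(model.model_size())`.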

pub fn model_metadata(&self) -> (String, PathBuf, usize, usize)

Returns the model’s metadata.

§Returns

A tuple containing (model_name, model_path, vocab_size, n_params).

pub fn config(&self) -> &ModelConfig

Returns the model configuration.

pub fn name(&self) -> &str

Returns the model name.

pub fn path(&self) -> &PathBuf

Returns the path to the model file.

pub fn n_seq_max(&self) -> u32

Returns the maximum number of sequences for batch processing.

pub fn effective_max_tokens(&self) -> usize

Calculates the effective maximum number of tokens available per sequence in batch processing.

When batching multiple sequences, each sequence gets its own KV cache slot. The usable context (n_batch) is divided among sequences based on n_seq_max.

§Returns

The maximum number of input tokens per sequence that can be safely processed.

§Implementation Note

Each sequence slot size = (n_batch / n_seq_max) - 2

n_batch is the maximum usable context, shared across all sequences (defaults to context_size)
Dividing by n_seq_max accounts for parallel sequence processing
The 2-token overhead is for special tokens ([CLS], [SEP])

§Example

For a model with n_batch = 8192 and n_seq_max = 2:

Per-sequence size: 8192 / 2 = 4096
Overhead: 2 tokens ([CLS] and [SEP])
Effective max per sequence: 4096 - 2 = 4094 tokens
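The arithmetic above can be reproduced with a small free function. This is a sketch of the documented formula only, not the crate's implementation: it returns `Option<usize>` to make the edge cases explicit (`n_seq_max == 0`, or a slot smaller than the 2-token overhead), whereas the real method returns a plain `usize`.

```rust
/// Sketch of the documented formula: (n_batch / n_seq_max) - 2.
/// Returns `None` when `n_seq_max` is zero or when the per-sequence slot
/// is too small to hold the two special tokens.
fn effective_max_tokens(n_batch: usize, n_seq_max: usize) -> Option<usize> {
    const SPECIAL_TOKEN_OVERHEAD: usize = 2; // [CLS] and [SEP]
    n_batch
        .checked_div(n_seq_max)?
        .checked_sub(SPECIAL_TOKEN_OVERHEAD)
}
```

For the example values, `effective_max_tokens(8192, 2)` yields `Some(4094)`.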

pub fn tokenize(&self, text: &str) -> Result<Vec<LlamaToken>>

Tokenizes the input text.

§Arguments

text - The text to tokenize

§Returns

A vector of tokens.

§Errors

Returns an error if tokenization fails.

pub fn tokenize_cached(
    &self,
    text: &str,
    cache: Option<&TokenCache>,
) -> Result<Vec<LlamaToken>>

Tokenizes the input text with caching support.

§Arguments

text - The text to tokenize
cache - Optional token cache for caching tokenization results

§Returns

Returns a vector of tokens representing the tokenized text.

§Errors

Returns an error if tokenization fails.
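The idea behind cached tokenization is plain memoization: look the text up first and tokenize only on a miss. The sketch below illustrates that pattern with a hypothetical `SimpleTokenCache` backed by a `HashMap`; it is not embellama's `TokenCache`, whose API is not shown on this page, and it uses `i32` token IDs and a caller-supplied tokenizer closure as stand-ins.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory cache keyed by input text (not embellama's `TokenCache`).
#[derive(Default)]
struct SimpleTokenCache {
    entries: HashMap<String, Vec<i32>>,
    hits: usize,
}

impl SimpleTokenCache {
    /// Returns the cached tokens for `text`, or runs `tokenize` and stores the result.
    fn get_or_tokenize<F>(&mut self, text: &str, tokenize: F) -> Vec<i32>
    where
        F: FnOnce(&str) -> Vec<i32>,
    {
        if let Some(tokens) = self.entries.get(text).cloned() {
            self.hits += 1; // cache hit: skip tokenization entirely
            return tokens;
        }
        let tokens = tokenize(text);
        self.entries.insert(text.to_owned(), tokens.clone());
        tokens
    }
}
```

Since `tokenize_cached` takes `&self` and a shared `&TokenCache`, the real cache presumably relies on interior mutability or a concurrent map rather than `&mut self` as in this sketch.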

pub fn generate_embedding(&mut self, text: &str) -> Result<Vec<f32>>

Generates an embedding for the given text.

§Arguments

text - The input text to generate embeddings for

§Returns

Returns a vector of f32 values representing the embedding.

§Errors

This function will return an error if:

Tokenization fails
The input exceeds the maximum token limit
Model inference fails

pub fn generate_embedding_cached(
    &mut self,
    text: &str,
    token_cache: Option<&TokenCache>,
    truncate: TruncateTokens,
) -> Result<Vec<f32>>

Generates an embedding for the given text with optional token cache support.

§Arguments

text - The input text to generate embeddings for
token_cache - Optional token cache for caching tokenization results
truncate - Truncation strategy to apply

§Returns

Returns a vector of f32 values representing the embedding.

§Errors

This function will return an error if:

Tokenization fails
The input exceeds the maximum token limit (when truncation is disabled)
Model inference fails
Truncation limit exceeds model’s effective maximum

pub fn generate_multi_embedding(
    &mut self,
    text: &str,
    token_cache: Option<&TokenCache>,
    truncate: TruncateTokens,
) -> Result<Vec<Vec<f32>>>

Generates per-token (multi-vector) embeddings for the given text.

Returns one embedding vector per token, suitable for ColBERT-style late interaction reranking. Each vector is individually normalized according to the model’s normalization mode.

§Arguments

text - The input text to generate embeddings for
token_cache - Optional token cache for caching tokenization results
truncate - Truncation strategy to apply

§Returns

Returns a vector of embedding vectors, one per token.

§Errors

Returns an error if tokenization or model inference fails.
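To show how per-token vectors are typically consumed, here is a minimal sketch of ColBERT-style MaxSim scoring: for each query token vector, take the highest dot product against any document token vector, then sum those maxima. The function names `dot` and `max_sim` are illustrative and not part of embellama; the inputs are assumed to be the outputs of two `generate_multi_embedding` calls with equal dimensionality.

```rust
/// Dot product of two equally sized vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// ColBERT-style MaxSim: sum over query tokens of the best-matching
/// document token similarity. Returns 0.0 if either side is empty.
fn max_sim(query: &[Vec<f32>], document: &[Vec<f32>]) -> f32 {
    if query.is_empty() || document.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            document
                .iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

When the vectors are L2-normalized, each dot product is a cosine similarity, so the score is bounded by the number of query tokens.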

pub fn process_batch_tokens(
    &mut self,
    token_sequences: &[Vec<LlamaToken>],
    truncate: TruncateTokens,
) -> Result<Vec<Vec<f32>>>

Processes multiple token sequences as a batch through the model.

This method enables true batch processing by encoding multiple sequences in a single model pass using unique sequence IDs. If the number of sequences exceeds n_seq_max, it will automatically chunk them.

§Arguments

token_sequences - Slice of token sequences to process
truncate - Truncation strategy to apply to each sequence

§Returns

Returns a vector of embedding vectors, one for each input sequence.

§Errors

Returns an error if:

Context creation fails
Batch processing fails
Embedding extraction fails
Pooling or normalization operations fail
Truncation limit exceeds model’s effective maximum
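The automatic chunking described above can be pictured as splitting the input into groups of at most n_seq_max sequences and numbering the sequences within each group. The sketch below shows only that planning step, under the assumption that sequence IDs restart at 0 for every chunk; `plan_batches` is a hypothetical helper, not an embellama function, and it is generic over the token type so it runs without llama_cpp_2.

```rust
/// Splits `sequences` into chunks of at most `n_seq_max` and pairs each
/// sequence with an ID that is unique within its chunk.
///
/// Panics if `n_seq_max` is zero.
fn plan_batches<T>(sequences: &[Vec<T>], n_seq_max: usize) -> Vec<Vec<(usize, &[T])>> {
    assert!(n_seq_max > 0, "n_seq_max must be positive");
    sequences
        .chunks(n_seq_max)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .map(|(seq_id, tokens)| (seq_id, tokens.as_slice()))
                .collect()
        })
        .collect()
}
```

Five sequences with `n_seq_max = 2` produce three chunks of sizes 2, 2, and 1, each of which would correspond to one model pass.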

pub fn process_batch_tokens_multi(
    &mut self,
    token_sequences: &[Vec<LlamaToken>],
    truncate: TruncateTokens,
) -> Result<Vec<Vec<Vec<f32>>>>

Processes multiple token sequences as a batch, returning per-token (multi-vector) embeddings.

Each input sequence produces a Vec<Vec<f32>>, with one embedding per token. This is the batch equivalent of generate_multi_embedding for ColBERT-style late interaction.

§Arguments

token_sequences - Slice of token sequences to process
truncate - Truncation strategy to apply to each sequence

§Returns

Returns a vector of multi-vector embeddings, one per input sequence.

§Errors

Returns an error if batch processing, embedding extraction, or normalization fails.

pub fn process_tokens(&mut self, tokens: &[i32]) -> Result<Vec<f32>>

Processes a batch of tokens through the model.

This is a lower-level method used internally for batch processing.

§Arguments

tokens - The tokens to process

§Returns

Returns the processed embedding vector.

§Errors

Returns an error if:

Token processing fails
Pooling operation fails
Normalization fails (if enabled)

pub fn generate_rerank_score(
    &mut self,
    query: &str,
    document: &str,
    truncate: TruncateTokens,
) -> Result<f32>

Generates a reranking relevance score for a query-document pair.

The model encodes the concatenated query and document as a single sequence and returns a scalar relevance score via LlamaPoolingType::Rank.

§Arguments

query - The query text
document - The document text to score against the query
truncate - Truncation strategy for the combined input

§Returns

Returns the raw relevance score (f32). Apply sigmoid for [0,1] normalization.

§Errors

Returns an error if the model is not configured with PoolingStrategy::Rank, tokenization fails, or model inference fails.
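The sigmoid normalization mentioned under Returns is the standard logistic function. A minimal sketch follows; `sigmoid` is a local helper written for illustration, not a function exported by embellama.

```rust
/// Maps a raw relevance score to the open interval (0, 1)
/// using the logistic function 1 / (1 + e^(-x)).
fn sigmoid(raw_score: f32) -> f32 {
    1.0 / (1.0 + (-raw_score).exp())
}
```

A raw score of 0.0 maps to 0.5. Because the function is monotonic, normalizing does not change the relative ordering of documents.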

pub fn generate_rerank_scores_batch(
    &mut self,
    query: &str,
    documents: &[&str],
    truncate: TruncateTokens,
) -> Result<Vec<f32>>

Generates reranking scores for multiple documents against a single query.

Processes multiple query-document pairs in batches for efficiency.

§Arguments

query - The query text
documents - Slice of document texts to score
truncate - Truncation strategy for each combined input

§Returns

Returns a vector of raw relevance scores, one per document, in input order.

§Errors

Returns an error if the model is not configured with PoolingStrategy::Rank, tokenization fails, or model inference fails.
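Since the scores come back in input order, turning them into a ranking is left to the caller. One way to do that, sketched here with a hypothetical `rank_by_score` helper that is not part of embellama, is to sort document indices by descending score.

```rust
/// Returns document indices ordered from most to least relevant.
/// `f32::total_cmp` gives a total order, so NaN scores cannot cause a panic.
fn rank_by_score(scores: &[f32]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    order
}
```

The returned indices can be used to index back into the original `documents` slice.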

pub fn save_session_state(&self) -> Result<Vec<u8>>

Saves the current KV cache state to memory.

NOTE: This is for advanced prefix caching optimization.
PERFORMANCE ISSUE: Only beneficial for prefixes > 100 tokens.

§Errors

Returns an error if:

The context is empty (no state to save)
State copy operation fails

pub fn load_session_state(&mut self, state_data: &[u8]) -> Result<()>

Loads a previously saved KV cache state.

NOTE: The session must be from the same model version.
BUG: The session format may change between llama.cpp versions.

§Errors

Returns an error if:

State data is empty
State size check fails

pub fn generate_embedding_with_prefix(
    &mut self,
    text: &str,
    prefix_cache: Option<&PrefixCache>,
    token_cache: Option<&TokenCache>,
    truncate: TruncateTokens,
) -> Result<Vec<f32>>

Generates an embedding with prefix caching support.

This method checks if the text has a common prefix that’s been cached, and if so, loads that session state to avoid recomputing the KV cache for the prefix portion.

§Arguments

text - The input text to generate embeddings for
prefix_cache - Optional reference to the prefix cache
token_cache - Optional reference to the token cache
truncate - Truncation strategy to apply

§Returns

Returns the embedding vector.

§Errors

Returns an error if embedding generation fails or the truncation limit exceeds the model maximum.
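The prefix lookup step can be illustrated independently of the model. The sketch below assumes a cache keyed by token prefixes that stores opaque saved-state bytes, and finds the longest cached prefix of an input. `PrefixStore` and `longest_match` are hypothetical names; embellama's `PrefixCache` may be organized quite differently.

```rust
use std::collections::HashMap;

/// Hypothetical prefix store: maps a token prefix to opaque saved session state.
struct PrefixStore {
    states: HashMap<Vec<i32>, Vec<u8>>,
}

impl PrefixStore {
    /// Returns the length of the longest cached prefix of `tokens`
    /// together with its saved state, or `None` if no prefix is cached.
    fn longest_match(&self, tokens: &[i32]) -> Option<(usize, &[u8])> {
        (1..=tokens.len())
            .rev() // try the longest candidate prefix first
            .find_map(|len| {
                self.states
                    .get(&tokens[..len])
                    .map(|state| (len, state.as_slice()))
            })
    }
}
```

On a hit, only the tokens after the matched prefix would still need to be processed once the saved state has been loaded. This linear scan is for clarity; a trie would avoid hashing every candidate prefix.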

Trait Implementations§

impl Drop for EmbeddingModel

fn drop(&mut self)

Ensures proper cleanup of model resources.

Auto Trait Implementations§

impl UnwindSafe for EmbeddingModel

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
