Struct EmbeddingModel 

pub struct EmbeddingModel { /* private fields */ }

Represents a loaded embedding model.

This struct encapsulates the llama_cpp_2::LlamaModel and LlamaContext and provides methods for generating embeddings from text input.

§Important

Due to the !Send nature of LlamaContext, instances of this struct cannot be safely sent between threads. Each thread must maintain its own instance.
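
One way to honor the per-thread requirement is to keep the instance in thread-local storage. The sketch below uses a hypothetical stand-in type, LocalModel, made !Send with a raw-pointer marker; it is not the embellama type, and with_local_model and run_workers are names invented for illustration.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;
use std::thread;

/// Hypothetical stand-in for a `!Send` model handle (not the embellama type).
/// The raw-pointer marker makes the struct `!Send`.
struct LocalModel {
    name: String,
    _not_send: PhantomData<*const ()>,
}

impl LocalModel {
    fn load(name: &str) -> Self {
        LocalModel { name: name.to_string(), _not_send: PhantomData }
    }
}

thread_local! {
    // One lazily created instance per thread; it never crosses a thread boundary.
    static MODEL: RefCell<Option<LocalModel>> = RefCell::new(None);
}

/// Runs `f` against this thread's own model, loading it on first use.
fn with_local_model<R>(name: &str, f: impl FnOnce(&mut LocalModel) -> R) -> R {
    MODEL.with(|slot| {
        let mut slot = slot.borrow_mut();
        let model = slot.get_or_insert_with(|| LocalModel::load(name));
        f(model)
    })
}

/// Spawns `n` workers; each one builds and uses its own instance.
fn run_workers(n: usize) -> Vec<String> {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            thread::spawn(move || {
                with_local_model(&format!("model-{i}"), |m| m.name.clone())
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Only plain data (here a String) leaves each worker thread; the model handle itself stays where it was created.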

§Example

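use embellama::model::EmbeddingModel;
use embellama::config::ModelConfig;
use llama_cpp_2::llama_backend::LlamaBackend;

// `new` takes the backend as its first argument (see the signature below).
// Assumption: the backend is initialized directly through llama_cpp_2.
let backend = LlamaBackend::init()?;

let config = ModelConfig::builder()
    .with_model_path("path/to/model.gguf")
    .with_model_name("my-model")
    .build()?;

let model = EmbeddingModel::new(&backend, &config)?;
assert!(model.is_loaded());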

Implementations§

impl EmbeddingModel

pub fn new(backend: &LlamaBackend, config: &ModelConfig) -> Result<Self>

Creates a new embedding model from the given configuration.

§Arguments
  • backend - The llama backend to use for model loading
  • config - The model configuration containing path and parameters
§Returns

Returns a Result containing the initialized model or an error.

§Errors

This function will return an error if:

  • The model file cannot be loaded
  • The context creation fails
  • Invalid configuration parameters are provided

pub fn load(backend: &LlamaBackend, config: &ModelConfig) -> Result<Self>

Loads a model from disk.

This is an alternative way to create a model, useful when you want to explicitly separate the loading step.

§Arguments
  • backend - The llama backend to use for model loading
  • config - The model configuration
§Returns

Returns a Result containing the loaded model or an error.

§Errors

Returns an error if model loading fails.

pub fn unload(self)

Consumes the model and explicitly frees resources.

Note: This happens automatically when the model is dropped. This method exists mainly for explicit resource management.

pub fn is_loaded(&self) -> bool

Checks if the model is currently loaded and ready for inference.

§Returns

Returns true if the model is loaded, false otherwise.

pub fn embedding_dimensions(&self) -> usize

Returns the dimensionality of embeddings produced by this model.

§Returns

The number of dimensions in the embedding vectors.

pub fn max_sequence_length(&self) -> usize

Returns the maximum sequence length supported by this model.

§Returns

The maximum number of tokens that can be processed.

pub fn model_size(&self) -> Option<usize>

Returns the approximate memory footprint of the model in bytes.

§Returns

Estimated memory usage in bytes, or None if the size cannot be calculated (e.g., on 32-bit platforms with very large models).
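
The None case can be illustrated with a checked integer conversion. This is a minimal sketch that assumes the underlying size is reported as a u64; size_in_bytes and describe_size are hypothetical helpers, not part of the crate.

```rust
use std::convert::TryFrom;

/// Converts a byte count reported as `u64` (an assumption of this sketch)
/// into `usize`, yielding `None` when it does not fit the platform's
/// pointer width, as can happen on 32-bit targets.
fn size_in_bytes(raw: u64) -> Option<usize> {
    usize::try_from(raw).ok()
}

/// Formats an optional size for logging.
fn describe_size(size: Option<usize>) -> String {
    match size {
        Some(bytes) => format!("{:.1} MiB", bytes as f64 / (1024.0 * 1024.0)),
        None => "unknown".to_string(),
    }
}
```

Callers of model_size can handle its Option result the same way describe_size does, rather than unwrapping it.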

pub fn model_metadata(&self) -> (String, PathBuf, usize, usize)

Returns the model’s metadata.

§Returns

A tuple containing (model_name, model_path, vocab_size, n_params).

pub fn config(&self) -> &ModelConfig

Returns the model configuration.

pub fn name(&self) -> &str

Returns the model name.

pub fn path(&self) -> &PathBuf

Returns the path to the model file.

pub fn n_seq_max(&self) -> u32

Returns the maximum number of sequences for batch processing.

pub fn effective_max_tokens(&self) -> usize

Calculates the effective maximum number of tokens available per sequence in batch processing.

When batching multiple sequences, each sequence gets its own KV cache slot. The usable context (n_batch) is divided among sequences based on n_seq_max.

§Returns

The maximum number of input tokens per sequence that can be safely processed.

§Implementation Note

Each sequence slot size = n_batch / n_seq_max - 2

  • n_batch is the usable context that is divided among sequences (defaults to context_size)
  • The division accounts for parallel sequence processing
  • The 2-token overhead is for special tokens ([CLS], [SEP])
§Example

For a model with n_batch = 8192 and n_seq_max = 2:

  • Per-sequence size: 8192 / 2 = 4096
  • Overhead: 2 tokens ([CLS] and [SEP])
  • Effective max per sequence: 4096 - 2 = 4094 tokens
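
The arithmetic above can be written as a small standalone function. This is a sketch of the stated formula only; the saturating subtraction and the treatment of n_seq_max == 0 as 1 are assumptions added here to avoid underflow and division by zero.

```rust
/// Per-sequence token budget: n_batch / n_seq_max - 2, as in the implementation note.
/// Returns 0 instead of underflowing; treating n_seq_max == 0 as 1 is an
/// assumption of this sketch, not documented behavior.
fn effective_max_tokens(n_batch: usize, n_seq_max: usize) -> usize {
    const SPECIAL_TOKEN_OVERHEAD: usize = 2; // [CLS] and [SEP]
    let per_sequence = n_batch / n_seq_max.max(1);
    per_sequence.saturating_sub(SPECIAL_TOKEN_OVERHEAD)
}
```

With n_batch = 8192 and n_seq_max = 2, this gives the 4094 tokens worked out in the example.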

pub fn tokenize(&self, text: &str) -> Result<Vec<LlamaToken>>

Tokenizes the input text.

§Arguments
  • text - The text to tokenize
§Returns

A vector of tokens.

§Errors

Returns an error if tokenization fails.

pub fn tokenize_cached( &self, text: &str, cache: Option<&TokenCache>, ) -> Result<Vec<LlamaToken>>

Tokenizes the input text with caching support.

§Arguments
  • text - The text to tokenize
  • cache - Optional token cache for caching tokenization results
§Returns

Returns a vector of tokens representing the tokenized text.

§Errors

Returns an error if tokenization fails.
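
The caching idea is plain memoization keyed on the input text. The sketch below uses hypothetical stand-ins: SimpleTokenCache is not the crate's TokenCache, toy_tokenize is not a real tokenizer, and the cache is taken by mutable reference for simplicity, whereas the real method takes a shared reference.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a token cache: maps input text to its token IDs.
#[derive(Default)]
struct SimpleTokenCache {
    entries: HashMap<String, Vec<i32>>,
    hits: usize,
}

/// Toy tokenizer for illustration only: one ID per whitespace-separated
/// word, equal to the word's byte length.
fn toy_tokenize(text: &str) -> Vec<i32> {
    text.split_whitespace().map(|w| w.len() as i32).collect()
}

/// Tokenizes `text`, consulting the cache first when one is supplied.
fn tokenize_cached(text: &str, cache: Option<&mut SimpleTokenCache>) -> Vec<i32> {
    match cache {
        None => toy_tokenize(text),
        Some(cache) => {
            if let Some(tokens) = cache.entries.get(text) {
                cache.hits += 1;
                return tokens.clone();
            }
            let tokens = toy_tokenize(text);
            cache.entries.insert(text.to_string(), tokens.clone());
            tokens
        }
    }
}
```

Passing None skips the cache entirely, which mirrors the optional cache argument in the signature above.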

pub fn generate_embedding(&mut self, text: &str) -> Result<Vec<f32>>

Generates an embedding for the given text.

§Arguments
  • text - The input text to generate embeddings for
§Returns

Returns a vector of f32 values representing the embedding.

§Errors

This function will return an error if:

  • Tokenization fails
  • The input exceeds the maximum token limit
  • Model inference fails

pub fn generate_embedding_cached( &mut self, text: &str, token_cache: Option<&TokenCache>, truncate: TruncateTokens, ) -> Result<Vec<f32>>

Generates an embedding for the given text with optional token cache support.

§Arguments
  • text - The input text to generate embeddings for
  • token_cache - Optional token cache for caching tokenization results
  • truncate - Truncation strategy to apply
§Returns

Returns a vector of f32 values representing the embedding.

§Errors

This function will return an error if:

  • Tokenization fails
  • The input exceeds the maximum token limit (when truncation is disabled)
  • Model inference fails
  • Truncation limit exceeds model’s effective maximum
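
The interaction between the truncation setting and the error cases above can be sketched as follows. The Truncation enum is a hypothetical stand-in; the real TruncateTokens variants may differ, and apply_truncation is not a crate function.

```rust
/// Hypothetical truncation setting; the real `TruncateTokens` variants may differ.
#[derive(Clone, Copy)]
enum Truncation {
    Disabled,
    ToModelMax,
    Limit(usize),
}

/// Applies the truncation policy to a token sequence, given the model's
/// effective maximum number of tokens per sequence.
fn apply_truncation(
    mut tokens: Vec<i32>,
    truncate: Truncation,
    effective_max: usize,
) -> Result<Vec<i32>, String> {
    match truncate {
        // Truncation disabled: over-long input is an error.
        Truncation::Disabled if tokens.len() > effective_max => Err(format!(
            "input has {} tokens, limit is {}",
            tokens.len(),
            effective_max
        )),
        Truncation::Disabled => Ok(tokens),
        // Truncate to whatever the model can accept.
        Truncation::ToModelMax => {
            tokens.truncate(effective_max);
            Ok(tokens)
        }
        // An explicit limit must not exceed the model's effective maximum.
        Truncation::Limit(limit) if limit > effective_max => Err(format!(
            "truncation limit {limit} exceeds effective maximum {effective_max}"
        )),
        Truncation::Limit(limit) => {
            tokens.truncate(limit);
            Ok(tokens)
        }
    }
}
```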

pub fn generate_multi_embedding( &mut self, text: &str, token_cache: Option<&TokenCache>, truncate: TruncateTokens, ) -> Result<Vec<Vec<f32>>>

Generates per-token (multi-vector) embeddings for the given text.

Returns one embedding vector per token, suitable for ColBERT-style late interaction reranking. Each vector is individually normalized according to the model’s normalization mode.

§Arguments
  • text - The input text to generate embeddings for
  • token_cache - Optional token cache for caching tokenization results
  • truncate - Truncation strategy to apply
§Returns

Returns a vector of embedding vectors, one per token.

§Errors

Returns an error if tokenization or model inference fails.
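
For context, ColBERT-style late interaction typically scores a query against a document with a MaxSim sum over per-token vectors. The sketch below shows that scoring step on plain vectors; dot and max_sim are illustrative helpers, not crate functions, and the inputs are assumed to be already normalized.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim: for each query token, take the similarity of its best-matching
/// document token, then sum those maxima. Returns 0.0 for an empty document.
fn max_sim(query: &[Vec<f32>], document: &[Vec<f32>]) -> f32 {
    if document.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            document
                .iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

In practice, the query and document arguments would be the outputs of two generate_multi_embedding calls.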

pub fn process_batch_tokens( &mut self, token_sequences: &[Vec<LlamaToken>], truncate: TruncateTokens, ) -> Result<Vec<Vec<f32>>>

Processes multiple token sequences as a batch through the model.

This method enables true batch processing by encoding multiple sequences in a single model pass using unique sequence IDs. If the number of sequences exceeds n_seq_max, it will automatically chunk them.

§Arguments
  • token_sequences - Slice of token sequences to process
  • truncate - Truncation strategy to apply to each sequence
§Returns

Returns a vector of embedding vectors, one for each input sequence.

§Errors

Returns an error if:

  • Context creation fails
  • Batch processing fails
  • Embedding extraction fails
  • Pooling or normalization operations fail
  • Truncation limit exceeds model’s effective maximum
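
The automatic chunking described above can be pictured as splitting the input into groups of at most n_seq_max sequences and numbering the sequences within each group. This is a sketch of that planning step only; plan_batches is a hypothetical helper, and the real method's internals may differ.

```rust
/// Splits sequences into groups of at most `n_seq_max`, pairing each
/// sequence with an ID that is unique within its group.
/// Treating n_seq_max == 0 as 1 is an assumption of this sketch.
fn plan_batches<T>(sequences: &[T], n_seq_max: usize) -> Vec<Vec<(usize, &T)>> {
    sequences
        .chunks(n_seq_max.max(1))
        .map(|chunk| chunk.iter().enumerate().collect())
        .collect()
}
```

Five sequences with n_seq_max = 2 yield three groups of sizes 2, 2, and 1, each of which would go through the model in its own pass.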

pub fn process_batch_tokens_multi( &mut self, token_sequences: &[Vec<LlamaToken>], truncate: TruncateTokens, ) -> Result<Vec<Vec<Vec<f32>>>>

Processes multiple token sequences as a batch, returning per-token (multi-vector) embeddings.

Each input sequence produces a Vec<Vec<f32>> — one embedding per token. This is the batch equivalent of generate_multi_embedding for ColBERT-style late interaction.

§Arguments
  • token_sequences - Slice of token sequences to process
  • truncate - Truncation strategy to apply to each sequence
§Returns

Returns a vector of multi-vector embeddings, one per input sequence.

§Errors

Returns an error if batch processing, embedding extraction, or normalization fails.

pub fn process_tokens(&mut self, tokens: &[i32]) -> Result<Vec<f32>>

Processes a batch of tokens through the model.

This is a lower-level method used internally for batch processing.

§Arguments
  • tokens - The tokens to process
§Returns

Returns the processed embedding vector.

§Errors

Returns an error if:

  • Token processing fails
  • Pooling operation fails
  • Normalization fails (if enabled)
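
To make the pooling and normalization steps concrete, here is a minimal sketch of mean pooling followed by L2 normalization. These are common choices, assumed here for illustration; the strategy the model actually applies depends on its configuration.

```rust
/// Mean pooling: averages per-token vectors into a single vector.
/// Returns `None` for empty input.
fn mean_pool(token_vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = token_vectors.first()?;
    let mut pooled = vec![0.0f32; first.len()];
    for v in token_vectors {
        for (acc, x) in pooled.iter_mut().zip(v) {
            *acc += x;
        }
    }
    let n = token_vectors.len() as f32;
    pooled.iter_mut().for_each(|x| *x /= n);
    Some(pooled)
}

/// L2 normalization in place; an all-zero vector is left untouched.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}
```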

pub fn generate_rerank_score( &mut self, query: &str, document: &str, truncate: TruncateTokens, ) -> Result<f32>

Generates a reranking relevance score for a query-document pair.

The model encodes the concatenated query and document as a single sequence and returns a scalar relevance score via LlamaPoolingType::Rank.

§Arguments
  • query - The query text
  • document - The document text to score against the query
  • truncate - Truncation strategy for the combined input
§Returns

Returns the raw relevance score (f32). Apply sigmoid for [0,1] normalization.

§Errors

Returns an error if the model is not configured with PoolingStrategy::Rank, tokenization fails, or model inference fails.
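
The sigmoid normalization mentioned under Returns is a one-liner. The helper below is a sketch written for this page, not a function exported by the crate.

```rust
/// Logistic sigmoid: maps a raw relevance score into the 0..1 range.
fn sigmoid(raw: f32) -> f32 {
    1.0 / (1.0 + (-raw).exp())
}
```

A raw score of 0.0 maps to 0.5; large positive scores approach 1 and large negative scores approach 0.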

pub fn generate_rerank_scores_batch( &mut self, query: &str, documents: &[&str], truncate: TruncateTokens, ) -> Result<Vec<f32>>

Generates reranking scores for multiple documents against a single query.

Processes multiple query-document pairs in batches for efficiency.

§Arguments
  • query - The query text
  • documents - Slice of document texts to score
  • truncate - Truncation strategy for each combined input
§Returns

Returns a vector of raw relevance scores, one per document, in input order.

§Errors

Returns an error if the model is not configured with PoolingStrategy::Rank, tokenization fails, or model inference fails.
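
Because the scores come back in input order, ranking the documents means sorting indices by score. A minimal sketch, with rank_by_score as a hypothetical helper:

```rust
/// Returns (document index, score) pairs sorted by descending score.
/// The sort is stable, so tied scores keep their input order.
fn rank_by_score(scores: &[f32]) -> Vec<(usize, f32)> {
    let mut ranked: Vec<(usize, f32)> = scores.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}
```

The first element of the result is the index of the most relevant document in the original documents slice.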

pub fn save_session_state(&self) -> Result<Vec<u8>>

Saves the current KV cache state to memory.

NOTE: This is for advanced prefix caching optimization.

PERFORMANCE ISSUE: Only beneficial for prefixes longer than 100 tokens.

§Errors

Returns an error if:

  • The context is empty (no state to save)
  • State copy operation fails

pub fn load_session_state(&mut self, state_data: &[u8]) -> Result<()>

Loads a previously saved KV cache state.

NOTE: The session must come from the same model version.

BUG: The session format may change between llama.cpp versions.

§Errors

Returns an error if:

  • State data is empty
  • State size check fails

pub fn generate_embedding_with_prefix( &mut self, text: &str, prefix_cache: Option<&PrefixCache>, token_cache: Option<&TokenCache>, truncate: TruncateTokens, ) -> Result<Vec<f32>>

Generates an embedding with prefix caching support.

This method checks if the text has a common prefix that’s been cached, and if so, loads that session state to avoid recomputing the KV cache for the prefix portion.

§Arguments
  • text - The input text to generate embeddings for
  • prefix_cache - Optional reference to the prefix cache
  • token_cache - Optional reference to the token cache
  • truncate - Truncation strategy to apply
§Returns

Returns the embedding vector.

§Errors

Returns an error if embedding generation fails or truncation limit exceeds model maximum
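
The prefix lookup can be pictured as finding the longest cached token prefix that the input starts with. The sketch below works on plain token ID slices; common_prefix_len and best_cached_prefix are hypothetical helpers, and the crate's PrefixCache may match prefixes differently.

```rust
/// Length of the longest shared leading run of tokens.
fn common_prefix_len(a: &[i32], b: &[i32]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Picks the longest cached prefix that `tokens` fully starts with,
/// ignoring cached prefixes shorter than `min_len`.
fn best_cached_prefix<'a>(
    tokens: &[i32],
    cached: &'a [Vec<i32>],
    min_len: usize,
) -> Option<&'a [i32]> {
    cached
        .iter()
        .filter(|p| p.len() >= min_len && common_prefix_len(tokens, p) == p.len())
        .max_by_key(|p| p.len())
        .map(|p| p.as_slice())
}
```

A min_len threshold reflects the note on save_session_state that caching only pays off for prefixes longer than about 100 tokens.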

Trait Implementations§

impl Drop for EmbeddingModel

fn drop(&mut self)

Ensures proper cleanup of model resources.

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.

impl<A, B, T> HttpServerConnExec<A, B> for T
where B: Body,