pub struct EmbeddingModel { /* private fields */ }
Represents a loaded embedding model.
This struct encapsulates the llama_cpp_2::LlamaModel and LlamaContext
and provides methods for generating embeddings from text input.
§Important
Due to the !Send nature of LlamaContext, instances of this struct
cannot be safely sent between threads. Each thread must maintain its
own instance.
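One way to keep a separate instance per thread is a thread-local slot. The sketch below is illustrative only: `PerThreadModel` is a hypothetical stand-in for EmbeddingModel (made !Send with a raw-pointer marker), and `embed_on_this_thread` and `calls_on_this_thread` are assumed helper names, not part of this crate.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;

// Hypothetical stand-in for EmbeddingModel. The raw-pointer marker makes the
// type !Send and !Sync, mirroring the constraint described above.
struct PerThreadModel {
    calls: usize,
    _not_send: PhantomData<*const ()>,
}

impl PerThreadModel {
    fn new() -> Self {
        PerThreadModel { calls: 0, _not_send: PhantomData }
    }

    // Placeholder "embedding": a one-element vector holding the byte length.
    fn embed(&mut self, text: &str) -> Vec<f32> {
        self.calls += 1;
        vec![text.len() as f32]
    }
}

thread_local! {
    // Each thread lazily builds its own instance on first access.
    static MODEL: RefCell<PerThreadModel> = RefCell::new(PerThreadModel::new());
}

// Runs the stand-in model that belongs to the calling thread.
fn embed_on_this_thread(text: &str) -> Vec<f32> {
    MODEL.with(|m| m.borrow_mut().embed(text))
}

// Reports how many times the calling thread's instance has been used.
fn calls_on_this_thread() -> usize {
    MODEL.with(|m| m.borrow().calls)
}
```

Because the instance never leaves the thread that created it, the compiler does not need it to be Send.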
§Example
use embellama::model::EmbeddingModel;
use embellama::config::ModelConfig;
use llama_cpp_2::llama_backend::LlamaBackend;
let backend = LlamaBackend::init()?;
let config = ModelConfig::builder()
    .with_model_path("path/to/model.gguf")
    .with_model_name("my-model")
    .build()?;
let model = EmbeddingModel::new(&backend, &config)?;
assert!(model.is_loaded());
Implementations§
impl EmbeddingModel
pub fn new(backend: &LlamaBackend, config: &ModelConfig) -> Result<Self>
Creates a new embedding model from the given configuration.
§Arguments
- backend - The llama backend to use for model loading
- config - The model configuration containing path and parameters
§Returns
Returns a Result containing the initialized model or an error.
§Errors
This function will return an error if:
- The model file cannot be loaded
- The context creation fails
- Invalid configuration parameters are provided
pub fn load(backend: &LlamaBackend, config: &ModelConfig) -> Result<Self>
Loads a model from disk.
This is an alternative way to create a model, useful when you want to explicitly separate the loading step.
§Arguments
- backend - The llama backend to use for model loading
- config - The model configuration
§Returns
Returns a Result containing the loaded model or an error.
§Errors
Returns an error if model loading fails.
pub fn unload(self)
Consumes the model and explicitly frees resources.
Note: This happens automatically when the model is dropped. This method exists mainly for explicit resource management.
pub fn is_loaded(&self) -> bool
Checks if the model is currently loaded and ready for inference.
§Returns
Returns true if the model is loaded, false otherwise.
pub fn embedding_dimensions(&self) -> usize
Returns the dimensionality of embeddings produced by this model.
§Returns
The number of dimensions in the embedding vectors.
pub fn max_sequence_length(&self) -> usize
Returns the maximum sequence length supported by this model.
§Returns
The maximum number of tokens that can be processed.
pub fn model_size(&self) -> Option<usize>
Returns the approximate memory footprint of the model in bytes.
§Returns
Estimated memory usage in bytes, or None if the size cannot be calculated
(e.g., on 32-bit platforms with very large models).
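The None case can be pictured as a checked integer conversion. The sketch below assumes the underlying size is reported as a 64-bit count; the source does not say so, and `size_as_usize` and `format_size` are hypothetical helpers rather than crate functions.

```rust
use std::convert::TryFrom;

// Assumption: the raw size arrives as a u64. On a 32-bit target, values above
// usize::MAX do not fit, and the conversion yields None.
fn size_as_usize(size_bytes: u64) -> Option<usize> {
    usize::try_from(size_bytes).ok()
}

// Formats the optional size for display, falling back when it is unavailable.
fn format_size(size: Option<usize>) -> String {
    match size {
        Some(bytes) => format!("{:.1} MiB", bytes as f64 / (1024.0 * 1024.0)),
        None => String::from("unknown"),
    }
}
```

A caller would typically pass the result of model_size() straight to a helper like `format_size` rather than unwrapping it.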
pub fn model_metadata(&self) -> (String, PathBuf, usize, usize)
Returns the model’s metadata.
§Returns
A tuple containing (model_name, model_path, vocab_size, n_params).
pub fn config(&self) -> &ModelConfig
Returns the model configuration.
pub fn effective_max_tokens(&self) -> usize
Calculates the effective maximum number of tokens available per sequence in batch processing.
When batching multiple sequences, each sequence gets its own KV cache slot.
The usable context (n_batch) is divided among sequences based on n_seq_max.
§Returns
The maximum number of input tokens per sequence that can be safely processed.
§Implementation Note
Each sequence slot size = n_batch / n_seq_max - 2
- n_batch represents the max usable context per sequence (defaults to context_size)
- The division accounts for parallel sequence processing
- The 2-token overhead is for special tokens ([CLS], [SEP])
§Example
For a model with n_batch = 8192 and n_seq_max = 2:
- Per-sequence size: 8192 / 2 = 4096
- Overhead: 2 tokens ([CLS] and [SEP])
- Effective max per sequence: 4096 - 2 = 4094 tokens
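The arithmetic above can be reproduced as a free function. This is a sketch of the documented formula only, not the crate's implementation: it takes `n_batch` and `n_seq_max` as plain arguments and returns an Option so that a zero `n_seq_max` or a slot smaller than the overhead is reported instead of panicking, whereas the real method returns usize.

```rust
// Documented formula: n_batch / n_seq_max - 2.
// Returns None if n_seq_max is zero or the slot cannot hold the 2-token overhead.
fn effective_max_tokens(n_batch: usize, n_seq_max: usize) -> Option<usize> {
    // Two special tokens ([CLS], [SEP]) are reserved per sequence.
    const SPECIAL_TOKEN_OVERHEAD: usize = 2;
    n_batch
        .checked_div(n_seq_max)?
        .checked_sub(SPECIAL_TOKEN_OVERHEAD)
}
```

With n_batch = 8192 and n_seq_max = 2 this yields Some(4094), matching the worked example.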
pub fn tokenize_cached(
    &self,
    text: &str,
    cache: Option<&TokenCache>,
) -> Result<Vec<LlamaToken>>
pub fn generate_embedding(&mut self, text: &str) -> Result<Vec<f32>>
Generates an embedding for the given text.
§Arguments
- text - The input text to generate embeddings for
§Returns
Returns a vector of f32 values representing the embedding.
§Errors
This function will return an error if:
- Tokenization fails
- The input exceeds the maximum token limit
- Model inference fails
pub fn generate_embedding_cached(
    &mut self,
    text: &str,
    token_cache: Option<&TokenCache>,
    truncate: TruncateTokens,
) -> Result<Vec<f32>>
Generates an embedding for the given text with optional token cache support.
§Arguments
- text - The input text to generate embeddings for
- token_cache - Optional token cache for caching tokenization results
- truncate - Truncation strategy to apply
§Returns
Returns a vector of f32 values representing the embedding.
§Errors
This function will return an error if:
- Tokenization fails
- The input exceeds the maximum token limit (when truncation is disabled)
- Model inference fails
- Truncation limit exceeds model’s effective maximum
pub fn generate_multi_embedding(
    &mut self,
    text: &str,
    token_cache: Option<&TokenCache>,
    truncate: TruncateTokens,
) -> Result<Vec<Vec<f32>>>
Generates per-token (multi-vector) embeddings for the given text.
Returns one embedding vector per token, suitable for ColBERT-style late interaction reranking. Each vector is individually normalized according to the model’s normalization mode.
§Arguments
- text - The input text to generate embeddings for
- token_cache - Optional token cache for caching tokenization results
- truncate - Truncation strategy to apply
§Returns
Returns a vector of embedding vectors, one per token.
§Errors
Returns an error if tokenization or model inference fails.
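For context, ColBERT-style late interaction usually scores a query against a document with a MaxSim sum over the per-token vectors. The sketch below shows that scoring on plain `Vec<Vec<f32>>` values such as those returned by generate_multi_embedding. `dot` and `max_sim` are hypothetical helpers, not part of this crate, and the dot product equals cosine similarity only if the vectors were L2-normalized.

```rust
// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// MaxSim: for each query token vector, take its best match among the document
// token vectors, then sum those maxima over all query tokens.
fn max_sim(query: &[Vec<f32>], document: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            document
                .iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        // An empty document leaves NEG_INFINITY; skip it so the sum stays finite.
        .filter(|s| s.is_finite())
        .sum()
}
```

Documents can then be ordered by their `max_sim` value against the same query vectors.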
pub fn process_batch_tokens(
    &mut self,
    token_sequences: &[Vec<LlamaToken>],
    truncate: TruncateTokens,
) -> Result<Vec<Vec<f32>>>
Processes multiple token sequences as a batch through the model.
This method enables true batch processing by encoding multiple sequences
in a single model pass using unique sequence IDs. If the number of sequences
exceeds n_seq_max, it will automatically chunk them.
§Arguments
- token_sequences - Slice of token sequences to process
- truncate - Truncation strategy to apply to each sequence
§Returns
Returns a vector of embedding vectors, one for each input sequence.
§Errors
Returns an error if:
- Context creation fails
- Batch processing fails
- Embedding extraction fails
- Pooling or normalization operations fail
- Truncation limit exceeds model’s effective maximum
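The automatic chunking described for process_batch_tokens can be pictured as splitting the input slice into groups of at most n_seq_max sequences. The sketch below is an assumption about the shape of that step, not the crate's code: it uses `i32` in place of LlamaToken, and `chunk_sequences` is a hypothetical helper.

```rust
// Splits the sequences into consecutive groups of at most `n_seq_max`,
// preserving input order. `i32` stands in for LlamaToken.
fn chunk_sequences(token_sequences: &[Vec<i32>], n_seq_max: usize) -> Vec<&[Vec<i32>]> {
    // `chunks` panics on a chunk size of zero, so treat zero as one.
    token_sequences.chunks(n_seq_max.max(1)).collect()
}
```

Each group would go through one model pass, and concatenating the per-group results in order gives one embedding per input sequence.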
pub fn process_batch_tokens_multi(
    &mut self,
    token_sequences: &[Vec<LlamaToken>],
    truncate: TruncateTokens,
) -> Result<Vec<Vec<Vec<f32>>>>
Processes multiple token sequences as a batch, returning per-token (multi-vector) embeddings.
Each input sequence produces a Vec<Vec<f32>> — one embedding per token. This is the
batch equivalent of generate_multi_embedding for ColBERT-style late interaction.
§Arguments
- token_sequences - Slice of token sequences to process
- truncate - Truncation strategy to apply to each sequence
§Returns
Returns a vector of multi-vector embeddings, one per input sequence.
§Errors
Returns an error if batch processing, embedding extraction, or normalization fails.
pub fn process_tokens(&mut self, tokens: &[i32]) -> Result<Vec<f32>>
Processes a batch of tokens through the model.
This is a lower-level method used internally for batch processing.
§Arguments
- tokens - The tokens to process
§Returns
Returns the processed embedding vector.
§Errors
Returns an error if:
- Token processing fails
- Pooling operation fails
- Normalization fails (if enabled)
pub fn generate_rerank_score(
    &mut self,
    query: &str,
    document: &str,
    truncate: TruncateTokens,
) -> Result<f32>
Generates a reranking relevance score for a query-document pair.
The model encodes the concatenated query and document as a single sequence
and returns a scalar relevance score via LlamaPoolingType::Rank.
§Arguments
- query - The query text
- document - The document text to score against the query
- truncate - Truncation strategy for the combined input
§Returns
Returns the raw relevance score (f32). Apply sigmoid for [0,1] normalization.
§Errors
Returns an error if the model is not configured with PoolingStrategy::Rank,
tokenization fails, or model inference fails.
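The sigmoid normalization mentioned under Returns for generate_rerank_score is the standard logistic function. A minimal sketch follows; `sigmoid` is a hypothetical helper, not a function exported by this crate.

```rust
// Maps a raw relevance score to the [0, 1] range with the logistic function
// 1 / (1 + e^(-x)). A raw score of 0 maps to 0.5.
fn sigmoid(raw_score: f32) -> f32 {
    1.0 / (1.0 + (-raw_score).exp())
}
```

Since the function is monotonic, applying it changes the scale of the scores but not their ranking.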
pub fn generate_rerank_scores_batch(
    &mut self,
    query: &str,
    documents: &[&str],
    truncate: TruncateTokens,
) -> Result<Vec<f32>>
Generates reranking scores for multiple documents against a single query.
Processes multiple query-document pairs in batches for efficiency.
§Arguments
- query - The query text
- documents - Slice of document texts to score
- truncate - Truncation strategy for each combined input
§Returns
Returns a vector of raw relevance scores, one per document, in input order.
§Errors
Returns an error if the model is not configured with PoolingStrategy::Rank,
tokenization fails, or model inference fails.
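Because generate_rerank_scores_batch returns its scores in input order, a caller that wants a ranking has to sort the document indices itself. The sketch below shows one way to do that; `rank_by_score` is a hypothetical helper, not part of this crate.

```rust
// Returns document indices ordered from highest to lowest score.
// `total_cmp` gives a total order over f32, so NaN values cannot cause a panic.
fn rank_by_score(scores: &[f32]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // Stable sort: documents with equal scores keep their input order.
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    order
}
```

The resulting indices can be used to reorder the original `documents` slice.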
pub fn save_session_state(&self) -> Result<Vec<u8>>
Saves the current KV cache state to memory.
NOTE: This is for advanced prefix caching optimization.
PERFORMANCE ISSUE: Only beneficial for prefixes > 100 tokens.
§Errors
Returns an error if:
- The context is empty (no state to save)
- State copy operation fails
pub fn load_session_state(&mut self, state_data: &[u8]) -> Result<()>
Loads a previously saved KV cache state.
NOTE: Session must be from the same model version.
BUG: Session format may change between llama.cpp versions.
§Errors
Returns an error if:
- State data is empty
- State size check fails
pub fn generate_embedding_with_prefix(
    &mut self,
    text: &str,
    prefix_cache: Option<&PrefixCache>,
    token_cache: Option<&TokenCache>,
    truncate: TruncateTokens,
) -> Result<Vec<f32>>
Generates an embedding with prefix caching support.
This method checks if the text has a common prefix that’s been cached, and if so, loads that session state to avoid recomputing the KV cache for the prefix portion.
§Arguments
- text - The input text to generate embeddings for
- prefix_cache - Optional reference to the prefix cache
- token_cache - Optional reference to the token cache
- truncate - Truncation strategy to apply
§Returns
Returns the embedding vector.
§Errors
Returns an error if embedding generation fails or truncation limit exceeds model maximum
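How PrefixCache finds a match is not shown here. As a rough illustration of the idea, the sketch below measures how many leading tokens two sequences share and reuses a cached state only above a threshold. `common_prefix_len` and `should_reuse_prefix` are hypothetical helpers, `i32` stands in for LlamaToken, and the threshold of 100 used in the usage note is taken from the note on save_session_state.

```rust
// Counts how many leading tokens the cached sequence and the new input share.
fn common_prefix_len(cached: &[i32], tokens: &[i32]) -> usize {
    cached
        .iter()
        .zip(tokens)
        .take_while(|(a, b)| a == b)
        .count()
}

// Reuses a cached state only when the shared prefix is longer than the
// threshold, since restoring state has a cost of its own.
fn should_reuse_prefix(cached: &[i32], tokens: &[i32], min_prefix_tokens: usize) -> bool {
    common_prefix_len(cached, tokens) > min_prefix_tokens
}
```

A caller might use `should_reuse_prefix(cached, tokens, 100)` to decide between load_session_state and a full recomputation.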
Trait Implementations§
Auto Trait Implementations§
impl Freeze for EmbeddingModel
impl RefUnwindSafe for EmbeddingModel
impl !Send for EmbeddingModel
impl !Sync for EmbeddingModel
impl Unpin for EmbeddingModel
impl UnsafeUnpin for EmbeddingModel
impl UnwindSafe for EmbeddingModel
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.