Trait VectorEncoder 

pub trait VectorEncoder: Send + Sync {
    // Required methods
    fn embed_root(
        &self,
        root: &Path,
        cfg: &SearchConfig,
        profiler: &Profiler,
    ) -> Result<(Vec<CodeChunk>, Vec<Vec<f32>>)>;
    fn hidden_dim(&self) -> usize;
    fn identity(&self) -> &str;
}
Trait that abstracts text/chunks → embedding vectors.

Implementations own their full pipeline (walk, chunk, tokenize, encode) since transformer-family and static-table encoders have fundamentally different compute shapes (see module-level docs).

§Object safety

VectorEncoder is object-safe, so dyn VectorEncoder can be constructed. Every method takes &self and returns a concrete type; the trait has no associated types and no generic methods.
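As an illustration of what object safety allows, the sketch below picks an encoder at runtime and hands it back as a Box<dyn VectorEncoder>. It is a minimal sketch, not the crate's code: CodeChunk, SearchConfig, Profiler and the Result alias are empty stand-ins so the example compiles on its own, and NullEncoder, pick_encoder and the "example/null-encoder" identity are hypothetical names.

```rust
use std::path::Path;

// Hypothetical stand-ins for the crate's real types, so the sketch compiles alone.
pub struct CodeChunk {
    pub text: String,
}
pub struct SearchConfig;
pub struct Profiler;
pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

// Same shape as the trait documented on this page.
pub trait VectorEncoder: Send + Sync {
    fn embed_root(
        &self,
        root: &Path,
        cfg: &SearchConfig,
        profiler: &Profiler,
    ) -> Result<(Vec<CodeChunk>, Vec<Vec<f32>>)>;
    fn hidden_dim(&self) -> usize;
    fn identity(&self) -> &str;
}

// Hypothetical encoder that finds no chunks and emits no embeddings.
pub struct NullEncoder {
    dim: usize,
}

impl VectorEncoder for NullEncoder {
    fn embed_root(
        &self,
        _root: &Path,
        _cfg: &SearchConfig,
        _profiler: &Profiler,
    ) -> Result<(Vec<CodeChunk>, Vec<Vec<f32>>)> {
        Ok((Vec::new(), Vec::new()))
    }
    fn hidden_dim(&self) -> usize {
        self.dim
    }
    fn identity(&self) -> &str {
        "example/null-encoder"
    }
}

// Callers only see the trait object, never the concrete encoder type.
pub fn pick_encoder(dim: usize) -> Box<dyn VectorEncoder> {
    Box::new(NullEncoder { dim })
}

fn main() {
    let enc = pick_encoder(256);
    println!("{} ({} dims)", enc.identity(), enc.hidden_dim());
}
```

Because no method is generic and none returns Self, the compiler can build a vtable for the trait, which is what makes the Box<dyn VectorEncoder> return type legal.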

§Thread safety

Send + Sync is required because the encoder is shared across the indexing pipeline’s rayon and channel-based workers.
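The effect of the Send + Sync bound can be sketched with plain std::thread workers standing in for the pipeline's rayon and channel-based workers. The trait here is a reduced stand-in with a single method, and FixedDim and dims_seen_by_workers are hypothetical names used only for this example.

```rust
use std::sync::Arc;
use std::thread;

// Reduced stand-in for the trait: only the Send + Sync bound and one method matter here.
pub trait VectorEncoder: Send + Sync {
    fn hidden_dim(&self) -> usize;
}

// Hypothetical encoder that only reports a fixed dimension.
pub struct FixedDim(pub usize);

impl VectorEncoder for FixedDim {
    fn hidden_dim(&self) -> usize {
        self.0
    }
}

// Spawn `workers` threads that all read from the same shared encoder.
pub fn dims_seen_by_workers(encoder: Arc<dyn VectorEncoder>, workers: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let enc = Arc::clone(&encoder);
            // Moving the Arc into the thread compiles only because the
            // trait object is Send + Sync.
            thread::spawn(move || enc.hidden_dim())
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    let encoder: Arc<dyn VectorEncoder> = Arc::new(FixedDim(768));
    let dims = dims_seen_by_workers(encoder, 4);
    println!("{dims:?}");
}
```

Without the supertrait bound, Arc<dyn VectorEncoder> would not be Send, and the thread::spawn call would be rejected at compile time.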

Required Methods§

fn embed_root(&self, root: &Path, cfg: &SearchConfig, profiler: &Profiler) -> Result<(Vec<CodeChunk>, Vec<Vec<f32>>)>

Walk root, chunk every supported file, and embed every chunk.

Returns the chunks and their embeddings in parallel order: chunk i has embedding embeddings[i]. Implementations choose their own chunker: the BERT path uses ripvec’s tree-sitter chunker, while the ripvec engine path uses ripvec’s AST-merge chunker. The two emit different chunk shapes, and both are projected onto CodeChunk.

cfg carries pipeline tuning (batch size, token caps, walk filters). Static encoders ignore the transformer-specific fields (batch_size, max_tokens) but still consult walk-related fields.
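The parallel-order contract can be made concrete with a small check that a caller might run on the returned pair. This is a sketch under assumptions: CodeChunk is a one-field stand-in for the crate's type, and check_parallel is a hypothetical helper, not part of the documented API.

```rust
// Hypothetical stand-in for the crate's chunk type.
pub struct CodeChunk {
    pub text: String,
}

/// Verify that `chunks[i]` pairs with `embeddings[i]` and that every
/// embedding has `hidden_dim` components.
pub fn check_parallel(
    chunks: &[CodeChunk],
    embeddings: &[Vec<f32>],
    hidden_dim: usize,
) -> Result<(), String> {
    if chunks.len() != embeddings.len() {
        return Err(format!(
            "{} chunks but {} embeddings",
            chunks.len(),
            embeddings.len()
        ));
    }
    if let Some(i) = embeddings.iter().position(|e| e.len() != hidden_dim) {
        return Err(format!(
            "embedding {i} has {} dims, expected {hidden_dim}",
            embeddings[i].len()
        ));
    }
    Ok(())
}

fn main() {
    let chunks = vec![
        CodeChunk { text: "fn a() {}".into() },
        CodeChunk { text: "fn b() {}".into() },
    ];
    let embeddings = vec![vec![0.0_f32; 4], vec![1.0_f32; 4]];
    check_parallel(&chunks, &embeddings, 4).expect("contract holds");

    // Parallel order means the two vectors can simply be zipped.
    for (chunk, emb) in chunks.iter().zip(&embeddings) {
        println!("{} -> {} dims", chunk.text, emb.len());
    }
}
```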

§Errors

Returns an error if file walking, chunking, tokenization, or inference fails.

fn hidden_dim(&self) -> usize

Hidden dimension of the emitted embeddings.

Used by SearchIndex to shape the embedding matrix, and by the cache layer to refuse cross-family loads (a 256-dim semble index cannot be searched with a 768-dim ModernBERT query vector).
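Both uses can be sketched in a few lines. The helpers below are hypothetical and do not reflect how SearchIndex or the cache layer is actually written; the row-major layout in matrix_len is an assumption made for the example.

```rust
/// Hypothetical guard: refuse to load a cached index whose dimension
/// differs from what the current encoder emits.
pub fn check_dims(index_dim: usize, encoder_dim: usize) -> Result<(), String> {
    if index_dim == encoder_dim {
        Ok(())
    } else {
        Err(format!(
            "dimension mismatch: index has {index_dim}, encoder emits {encoder_dim}"
        ))
    }
}

/// Number of f32 values in an embedding matrix of shape
/// (n_chunks, hidden_dim), assuming a flat row-major buffer.
pub fn matrix_len(n_chunks: usize, hidden_dim: usize) -> usize {
    n_chunks * hidden_dim
}

fn main() {
    // A 256-dim index paired with a 768-dim encoder is rejected.
    match check_dims(256, 768) {
        Ok(()) => println!("compatible"),
        Err(e) => println!("refused: {e}"),
    }
    println!("buffer length: {}", matrix_len(10, 256));
}
```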

fn identity(&self) -> &str

Stable identifier used as the cache-manifest key.

For HuggingFace-backed encoders, this is the model repo string (e.g. "nomic-ai/modernbert-embed-base", "minishlab/potion-code-16M"). The ripvec engine path does not write the cache, but its identity is still consulted for logging and diagnostics.
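One way to picture a cache manifest keyed by identity is a map from the identity string to a cached index entry. The Manifest type, its methods and the file name below are hypothetical; the real manifest format is not described on this page.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory manifest: encoder identity -> cached index file name.
pub struct Manifest {
    entries: HashMap<String, String>,
}

impl Manifest {
    pub fn new() -> Self {
        Manifest {
            entries: HashMap::new(),
        }
    }

    /// Record which index file was built by the encoder with this identity.
    pub fn record(&mut self, identity: &str, index_file: &str) {
        self.entries
            .insert(identity.to_string(), index_file.to_string());
    }

    /// Look up the cached index for an encoder, if one was recorded.
    pub fn lookup(&self, identity: &str) -> Option<&str> {
        self.entries.get(identity).map(String::as_str)
    }
}

fn main() {
    let mut manifest = Manifest::new();
    manifest.record("nomic-ai/modernbert-embed-base", "modernbert.idx");

    // A different encoder identity does not see the other encoder's cache.
    println!("{:?}", manifest.lookup("nomic-ai/modernbert-embed-base"));
    println!("{:?}", manifest.lookup("minishlab/potion-code-16M"));
}
```

Keying on a stable string is what lets the lookup survive across runs; an identifier that changed between runs would make every lookup miss.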

Implementors§