pub trait VectorEncoder: Send + Sync {
    // Required methods
    fn embed_root(
        &self,
        root: &Path,
        cfg: &SearchConfig,
        profiler: &Profiler,
    ) -> Result<(Vec<CodeChunk>, Vec<Vec<f32>>)>;
    fn hidden_dim(&self) -> usize;
    fn identity(&self) -> &str;
}
Trait that abstracts text/chunks → embedding vectors.
Implementations own their full pipeline (walk, chunk, tokenize, encode) since transformer-family and static-table encoders have fundamentally different compute shapes (see module-level docs).
§Object safety
dyn VectorEncoder is constructible. Methods take &self and use only
concrete return types — no associated types or generic methods.
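As a minimal sketch of what object safety allows, the block below selects an encoder at runtime behind Box<dyn VectorEncoder>. The trait here is a reduced stand-in that keeps only hidden_dim and identity; the two encoder structs and pick_encoder are hypothetical names, and the dimensions are illustrative values rather than facts about those models.

```rust
// Reduced stand-in for the trait: `embed_root` is omitted because its
// argument types (SearchConfig, Profiler, CodeChunk) are not shown here.
pub trait VectorEncoder: Send + Sync {
    fn hidden_dim(&self) -> usize;
    fn identity(&self) -> &str;
}

// Hypothetical encoder types, named for illustration only.
pub struct TransformerEncoder;
pub struct StaticTableEncoder;

impl VectorEncoder for TransformerEncoder {
    // Illustrative dimension.
    fn hidden_dim(&self) -> usize {
        768
    }
    fn identity(&self) -> &str {
        "nomic-ai/modernbert-embed-base"
    }
}

impl VectorEncoder for StaticTableEncoder {
    // Illustrative dimension.
    fn hidden_dim(&self) -> usize {
        256
    }
    fn identity(&self) -> &str {
        "minishlab/potion-code-16M"
    }
}

// Both branches return the same trait-object type, which only compiles
// because the trait has no generic methods or associated types.
pub fn pick_encoder(use_static: bool) -> Box<dyn VectorEncoder> {
    if use_static {
        Box::new(StaticTableEncoder)
    } else {
        Box::new(TransformerEncoder)
    }
}
```

Callers can then hold the result without knowing the concrete encoder family, for example `let enc = pick_encoder(true);` followed by `enc.hidden_dim()`.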
§Thread safety
Send + Sync is required because the encoder is shared across the
indexing pipeline’s rayon and channel-based workers.
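The following sketch illustrates why the Send + Sync supertraits matter, using std::thread in place of rayon so the example needs no external crates. The trait is again a reduced stand-in, and DummyEncoder and dims_from_workers are hypothetical names introduced for this example.

```rust
use std::sync::Arc;
use std::thread;

// Reduced stand-in for the trait; only one method is kept.
pub trait VectorEncoder: Send + Sync {
    fn hidden_dim(&self) -> usize;
}

// Hypothetical encoder used only for this illustration.
pub struct DummyEncoder;

impl VectorEncoder for DummyEncoder {
    fn hidden_dim(&self) -> usize {
        256
    }
}

// Share one encoder across several worker threads. `thread::spawn` requires
// the closure to be `Send`, which holds for `Arc<dyn VectorEncoder>` only
// because the trait demands `Send + Sync` from every implementation.
pub fn dims_from_workers(encoder: Arc<dyn VectorEncoder>, workers: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let enc = Arc::clone(&encoder);
            thread::spawn(move || enc.hidden_dim())
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Dropping either supertrait from the stand-in trait makes the `thread::spawn` call fail to compile, which is the guarantee the real pipeline relies on.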
Required Methods§
fn embed_root(
    &self,
    root: &Path,
    cfg: &SearchConfig,
    profiler: &Profiler,
) -> Result<(Vec<CodeChunk>, Vec<Vec<f32>>)>
Walk root, chunk every supported file, and embed every chunk.
Returns the chunks and their embeddings in parallel order: chunk i
has embedding embeddings[i]. Implementations choose their own
chunker (the BERT path uses ripvec’s tree-sitter chunker, while the
ripvec engine path uses ripvec’s AST-merge chunker; the two emit
different chunk shapes, both projected onto CodeChunk).
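A caller consuming the returned pair might zip the two vectors by index, as sketched below. The CodeChunk struct here is a hypothetical stand-in with made-up fields (the real type's layout is not shown on this page), and pair_chunks is an illustrative helper, not part of the crate.

```rust
// Hypothetical stand-in for CodeChunk; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct CodeChunk {
    pub file_path: String,
    pub content: String,
}

// Pair chunk i with embeddings[i], checking the parallel-order contract
// (equal lengths) before zipping.
pub fn pair_chunks<'a>(
    chunks: &'a [CodeChunk],
    embeddings: &'a [Vec<f32>],
) -> Result<Vec<(&'a CodeChunk, &'a [f32])>, String> {
    if chunks.len() != embeddings.len() {
        return Err(format!(
            "length mismatch: {} chunks but {} embeddings",
            chunks.len(),
            embeddings.len()
        ));
    }
    Ok(chunks
        .iter()
        .zip(embeddings.iter().map(Vec::as_slice))
        .collect())
}
```

The length check is defensive: an implementation that honors the contract never triggers it, but a mismatch would otherwise silently truncate the zip.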
cfg carries pipeline tuning (batch size, token caps, walk filters).
Static encoders ignore the transformer-specific fields (batch_size,
max_tokens) but still consult walk-related fields.
§Errors
Returns an error if file walking, chunking, tokenization, or inference fails.
fn hidden_dim(&self) -> usize
Hidden dimension of the emitted embeddings.
Used by SearchIndex for the embedding
matrix shape, and by the cache layer to refuse cross-family loads
(a 256-dim semble index cannot be queried by a 768-dim ModernBERT
query).
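The two uses of the hidden dimension described above can be sketched as follows. EmbeddingMatrix and DimMismatch are hypothetical types written for this example; they do not reflect how SearchIndex or the cache layer are actually implemented.

```rust
// Hypothetical error type for a dimension disagreement.
#[derive(Debug, PartialEq)]
pub struct DimMismatch {
    pub expected: usize,
    pub found: usize,
}

// Hypothetical row-major matrix of shape (rows, dim), where `dim` would
// come from the encoder's `hidden_dim()`.
pub struct EmbeddingMatrix {
    data: Vec<f32>,
    rows: usize,
    dim: usize,
}

impl EmbeddingMatrix {
    // Flatten per-chunk embeddings, rejecting any row of the wrong width.
    pub fn from_rows(rows: &[Vec<f32>], hidden_dim: usize) -> Result<Self, DimMismatch> {
        let mut data = Vec::with_capacity(rows.len() * hidden_dim);
        for row in rows {
            if row.len() != hidden_dim {
                return Err(DimMismatch { expected: hidden_dim, found: row.len() });
            }
            data.extend_from_slice(row);
        }
        Ok(Self { data, rows: rows.len(), dim: hidden_dim })
    }

    pub fn shape(&self) -> (usize, usize) {
        (self.rows, self.dim)
    }

    // Borrow row `i` as a slice of length `dim`.
    pub fn row(&self, i: usize) -> &[f32] {
        &self.data[i * self.dim..(i + 1) * self.dim]
    }

    // Refuse a query vector produced by an encoder of a different width.
    pub fn check_query(&self, query: &[f32]) -> Result<(), DimMismatch> {
        if query.len() == self.dim {
            Ok(())
        } else {
            Err(DimMismatch { expected: self.dim, found: query.len() })
        }
    }
}
```

With a 256-wide matrix, a 768-wide query is rejected up front instead of producing a meaningless similarity score.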
fn identity(&self) -> &str
Stable identifier used as the cache-manifest key.
For HuggingFace-backed encoders, this is the model repo string (e.g.
"nomic-ai/modernbert-embed-base", "minishlab/potion-code-16M").
The ripvec engine path does not write the cache, but the identifier is
still consulted for logging and diagnostics.
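One way a cache manifest keyed by identity could look is sketched below. CacheManifest, ManifestEntry, and the index_file field are all assumptions made for illustration; the actual manifest format is not described on this page, and the dimension values are illustrative.

```rust
use std::collections::HashMap;

// Hypothetical manifest entry; the fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ManifestEntry {
    pub hidden_dim: usize,
    pub index_file: String,
}

// Hypothetical manifest mapping an encoder's `identity()` string to the
// cached index it produced.
#[derive(Default)]
pub struct CacheManifest {
    entries: HashMap<String, ManifestEntry>,
}

impl CacheManifest {
    pub fn record(&mut self, identity: &str, entry: ManifestEntry) {
        self.entries.insert(identity.to_string(), entry);
    }

    // Return the cached entry only if both the identity key and the
    // hidden dimension match the encoder asking for it.
    pub fn lookup(&self, identity: &str, hidden_dim: usize) -> Option<&ManifestEntry> {
        self.entries
            .get(identity)
            .filter(|entry| entry.hidden_dim == hidden_dim)
    }
}
```

Because the key must stay stable across runs, an encoder that changed its identity string would simply miss the cache and trigger a rebuild in a design like this one.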