Module encoder

Encoder abstraction above EmbedBackend.

VectorEncoder hides the difference between transformer and static-table encoders behind one interface, so downstream search code (CLI dispatch, HybridIndex, cache layer) does not branch on encoder family.

§Two implementations

  • BertEncoder (P0.3): wraps Vec<Box<dyn EmbedBackend>> plus a tokenizer. Used for --model bert and --model modernbert. Owns the existing walk/chunk/tokenize/embed streaming pipeline.

  • StaticEncoder (P1.5): wraps model2vec::Model2Vec. Used for --model ripvec. CPU-only, with no batching or ring buffer, because a table-lookup encoder is memory-bound rather than compute-bound.
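
The mapping from --model values to encoder families described above can be sketched as a single match. This is a minimal illustration only: the EncoderFamily enum and the family_for_model function are hypothetical names that do not appear in the crate, and the real CLI dispatch may be structured differently.

```rust
/// Encoder families named in the module docs.
/// Hypothetical enum, introduced only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncoderFamily {
    /// Transformer path (served by `BertEncoder`).
    Bert,
    /// Static-table path (served by `StaticEncoder`).
    Static,
}

/// Map a `--model` value to its encoder family.
/// Returns `None` for names this sketch does not know about.
pub fn family_for_model(model: &str) -> Option<EncoderFamily> {
    match model {
        "bert" | "modernbert" => Some(EncoderFamily::Bert),
        "ripvec" => Some(EncoderFamily::Static),
        _ => None,
    }
}
```

Keeping this decision in one place is what lets the rest of the search code stay free of per-family branches.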

§Design rationale

Each implementation owns its full pipeline because transformer and static encoders have fundamentally different compute shapes:

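|            | BERT                                   | static                 |
|------------|----------------------------------------|------------------------|
| Tokenizer  | HuggingFace BPE/WordPiece              | model2vec internal     |
| Inference  | multi-layer attention + GEMM           | embedding-table lookup |
| Scheduler  | rayon clones (CPU) / ring buffer (GPU) | single-threaded encode |
| Hidden dim | 384 / 768                              | 256                    |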
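
To make the "embedding-table lookup" entry concrete, here is a toy static encoder. It is a sketch under stated assumptions, not the model2vec implementation: StaticTable is a hypothetical type, and mean pooling over the looked-up rows is assumed because it is a common choice for static embeddings. The source does not say how model2vec pools.

```rust
/// Toy static embedding table: one row of `dim` floats per token id.
/// Hypothetical type; it is not part of model2vec or this crate.
pub struct StaticTable {
    dim: usize,
    rows: Vec<f32>, // row-major, rows.len() == vocab_size * dim
}

impl StaticTable {
    pub fn new(dim: usize, rows: Vec<f32>) -> Self {
        assert!(dim > 0 && rows.len() % dim == 0, "rows must be a multiple of dim");
        Self { dim, rows }
    }

    /// Mean-pool the table rows for `token_ids` (assumed pooling strategy).
    /// Out-of-vocabulary ids are skipped; with no hits the result is all zeros.
    pub fn encode(&self, token_ids: &[usize]) -> Vec<f32> {
        let vocab = self.rows.len() / self.dim;
        let mut out = vec![0.0f32; self.dim];
        let mut hits = 0usize;
        for &id in token_ids {
            if id >= vocab {
                continue;
            }
            // The whole "inference" step is a slice into the table.
            let row = &self.rows[id * self.dim..(id + 1) * self.dim];
            for (acc, v) in out.iter_mut().zip(row) {
                *acc += *v;
            }
            hits += 1;
        }
        if hits > 0 {
            let n = hits as f32;
            for acc in &mut out {
                *acc /= n;
            }
        }
        out
    }
}
```

There is no matrix multiply anywhere in this path, only memory reads and additions, which is why batching and a ring buffer would add little for this encoder family.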

Forcing a uniform “tokenize then encode” abstraction would either lie about static encoders (no real tokens to expose) or impose transformer ceremony on a lookup table. VectorEncoder instead abstracts at the repo→(chunks, embeddings) boundary, where the shapes naturally agree.
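
A trait drawn at that boundary might look like the sketch below. Everything except the trait name is an assumption made for illustration: the Chunk struct, the dim and embed_root methods, and the FixedEncoder stand-in are hypothetical, and the actual trait signature is not shown on this page.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical chunk record; the real chunk type is not shown in these docs.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub path: PathBuf,
    pub text: String,
}

/// Sketch of an encoder trait at the repo -> (chunks, embeddings) boundary.
/// Method names and signatures are assumptions.
pub trait VectorEncoder {
    /// Width of every embedding this encoder returns.
    fn dim(&self) -> usize;
    /// Produce the chunks under `root` and one embedding per chunk.
    fn embed_root(&self, root: &Path) -> io::Result<(Vec<Chunk>, Vec<Vec<f32>>)>;
}

/// Stand-in encoder for the example: serves fixed texts instead of walking
/// the filesystem, and embeds each chunk as a constant vector.
pub struct FixedEncoder {
    dim: usize,
    texts: Vec<String>,
}

impl FixedEncoder {
    pub fn new(dim: usize, texts: Vec<String>) -> Self {
        Self { dim, texts }
    }
}

impl VectorEncoder for FixedEncoder {
    fn dim(&self) -> usize {
        self.dim
    }

    fn embed_root(&self, root: &Path) -> io::Result<(Vec<Chunk>, Vec<Vec<f32>>)> {
        let chunks: Vec<Chunk> = self
            .texts
            .iter()
            .enumerate()
            .map(|(i, t)| Chunk {
                path: root.join(format!("chunk{i}.txt")),
                text: t.clone(),
            })
            .collect();
        // Placeholder embedding: the text length repeated `dim` times.
        let embeddings = chunks
            .iter()
            .map(|c| vec![c.text.len() as f32; self.dim])
            .collect();
        Ok((chunks, embeddings))
    }
}

/// Downstream code sees only the trait, so it never branches on encoder family.
pub fn index_repo(encoder: &dyn VectorEncoder, root: &Path) -> io::Result<usize> {
    let (chunks, embeddings) = encoder.embed_root(root)?;
    assert_eq!(chunks.len(), embeddings.len());
    assert!(embeddings.iter().all(|e| e.len() == encoder.dim()));
    Ok(chunks.len())
}
```

Because the trait exposes no tokens, a static encoder never has to invent any, and a transformer encoder can keep its tokenizer and scheduler private to its own implementation.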

See docs/PLAN.md cluster P0 for the broader port architecture.

Re-exports§

pub use bert::BertEncoder;

Modules§

bert
BERT-family encoder: wraps the existing embed_all pipeline.
ripvec
Ripvec retrieval pipeline ported into Rust.

Traits§

VectorEncoder
Trait that abstracts text/chunks → embedding vectors.