Module encoder

Encoder abstraction above EmbedBackend.

VectorEncoder hides the difference between transformer and static-table encoders behind one interface, so downstream search code (CLI dispatch, HybridIndex, cache layer) does not branch on encoder family.

§Two implementations

BertEncoder (P0.3): wraps Vec<Box<dyn EmbedBackend>> and a tokenizer. Used for --model bert and --model modernbert. Owns the existing walk/chunk/tokenize/embed streaming pipeline.
StaticEncoder (P1.5): wraps model2vec::Model2Vec. Used for --model ripvec. CPU-only, with no batching or ring buffer, because a table-lookup encoder is memory-bound rather than compute-bound.
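
To make the "table-lookup" shape concrete, here is a minimal sketch of a static encoder. It is not the model2vec API: ToyStaticTable, its whitespace tokenization, the mean pooling step, and the choice to skip unknown tokens are all assumptions made for illustration. The point is only that encoding reduces to row fetches and an average, with no batched inference to schedule.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a static embedding table.
/// This is NOT the model2vec::Model2Vec API.
struct ToyStaticTable {
    dim: usize,
    rows: HashMap<String, Vec<f32>>,
}

impl ToyStaticTable {
    fn new(dim: usize) -> Self {
        Self { dim, rows: HashMap::new() }
    }

    /// Register one token row; the row length must equal the table dimension.
    fn insert(&mut self, token: &str, row: Vec<f32>) {
        assert_eq!(row.len(), self.dim, "row must match the table dimension");
        self.rows.insert(token.to_string(), row);
    }

    /// Encode by looking up each whitespace-separated token and averaging
    /// the rows that were found. Unknown tokens are skipped (an assumption);
    /// text with no known tokens yields the zero vector.
    fn encode(&self, text: &str) -> Vec<f32> {
        let mut sum = vec![0.0f32; self.dim];
        let mut hits = 0usize;
        for token in text.split_whitespace() {
            if let Some(row) = self.rows.get(token) {
                for (acc, value) in sum.iter_mut().zip(row) {
                    *acc += *value;
                }
                hits += 1;
            }
        }
        if hits > 0 {
            for acc in sum.iter_mut() {
                *acc /= hits as f32;
            }
        }
        sum
    }
}
```

Every step in encode is a hash lookup or an element-wise add, so the cost is dominated by memory access rather than arithmetic.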

§Design rationale

Each implementation owns its full pipeline because transformer and static encoders have fundamentally different compute shapes:

	BERT	static
Tokenizer	HuggingFace BPE/WordPiece	model2vec internal
Inference	multi-layer attention + GEMM	embedding-table lookup
Scheduler	rayon clones (CPU) / ring buffer (GPU)	single-threaded encode
Hidden dim	384 / 768	256

Forcing a uniform “tokenize then encode” abstraction would either lie about static encoders (no real tokens to expose) or impose transformer ceremony on a lookup table. VectorEncoder instead abstracts at the repo→(chunks, embeddings) boundary, where the shapes naturally agree.
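
The boundary described above can be sketched as follows. The real VectorEncoder signature is not shown on this page, so the method names (dim, encode_repo), the Chunk type, and both toy encoders are hypothetical, and the embeddings are placeholders rather than real inference. The sketch only illustrates that callers see chunks and vectors and never tokens.

```rust
/// Hypothetical chunk type; the crate's real chunk struct is not shown here.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    path: String,
    text: String,
}

/// Sketch of an encoder abstracted at the repo -> (chunks, embeddings)
/// boundary. Method names are assumptions, not the actual trait.
trait VectorEncoder {
    /// Length of every embedding this encoder produces.
    fn dim(&self) -> usize;
    /// Takes (path, contents) pairs and returns chunks with one embedding each.
    fn encode_repo(&self, files: &[(&str, &str)]) -> (Vec<Chunk>, Vec<Vec<f32>>);
}

/// Toy chunker shared by both encoders: one chunk per file.
fn one_chunk_per_file(files: &[(&str, &str)]) -> Vec<Chunk> {
    files
        .iter()
        .map(|(path, text)| Chunk { path: path.to_string(), text: text.to_string() })
        .collect()
}

/// Stands in for a transformer encoder (hidden dim 384 in the table above).
struct ToyTransformer;

/// Stands in for a static-table encoder (hidden dim 256 in the table above).
struct ToyStatic;

impl VectorEncoder for ToyTransformer {
    fn dim(&self) -> usize {
        384
    }
    fn encode_repo(&self, files: &[(&str, &str)]) -> (Vec<Chunk>, Vec<Vec<f32>>) {
        let chunks = one_chunk_per_file(files);
        // Placeholder vectors: a real implementation would tokenize, batch,
        // and run model inference here.
        let embeddings = chunks.iter().map(|c| vec![c.text.len() as f32; self.dim()]).collect();
        (chunks, embeddings)
    }
}

impl VectorEncoder for ToyStatic {
    fn dim(&self) -> usize {
        256
    }
    fn encode_repo(&self, files: &[(&str, &str)]) -> (Vec<Chunk>, Vec<Vec<f32>>) {
        let chunks = one_chunk_per_file(files);
        // Placeholder vectors: a real implementation would do table lookups.
        let embeddings = chunks.iter().map(|c| vec![c.path.len() as f32; self.dim()]).collect();
        (chunks, embeddings)
    }
}

/// Downstream code takes any encoder and never branches on its family.
/// Returns the number of chunks that received a correctly sized embedding.
fn index_repo(encoder: &dyn VectorEncoder, files: &[(&str, &str)]) -> usize {
    let (chunks, embeddings) = encoder.encode_repo(files);
    assert_eq!(chunks.len(), embeddings.len());
    embeddings.iter().filter(|e| e.len() == encoder.dim()).count()
}
```

A caller such as index_repo works unchanged with either encoder, which is the property the search code depends on.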

See docs/PLAN.md cluster P0 for the broader port architecture.

Re-exports§

pub use bert::BertEncoder;

Modules§

bert: BERT-family encoder: wraps the existing embed_all pipeline.
ripvec: Ripvec retrieval pipeline ported into Rust.

Traits§

VectorEncoder: Trait that abstracts text/chunks → embedding vectors.
