//! Encoder abstraction above [`EmbedBackend`](crate::backend::EmbedBackend).
//!
//! [`VectorEncoder`] hides the difference between transformer and static-table
//! encoders behind one interface, so downstream search code (CLI dispatch,
//! [`HybridIndex`](crate::hybrid::HybridIndex), cache layer) does not branch
//! on encoder family.
//!
//! ## Two implementations
//!
//! - [`BertEncoder`] (P0.3): wraps `Vec<Box<dyn EmbedBackend>>` plus a tokenizer.
//!   Used for `--model bert` and `--model modernbert`. Owns the existing
//!   walk/chunk/tokenize/embed streaming pipeline.
//!
//! - [`StaticEncoder`](crate::encoder::ripvec::dense::StaticEncoder) (P1.5):
//!   wraps [`model2vec::Model2Vec`]. Used for `--model ripvec`. CPU-only, with
//!   no batching or ring buffer (a table-lookup encoder is memory-bound, not
//!   compute-bound).
//!
//! ## Design rationale
//!
//! Each implementation owns its full pipeline because transformer and static
//! encoders have fundamentally different compute shapes:
//!
//! | | BERT | static |
//! |---|---|---|
//! | Tokenizer | HuggingFace BPE/WordPiece | model2vec internal |
//! | Inference | multi-layer attention + GEMM | embedding-table lookup |
//! | Scheduler | rayon clones (CPU) / ring buffer (GPU) | single-threaded encode |
//! | Hidden dim | 384 / 768 | 256 |
//!
//! Forcing a uniform "tokenize then encode" abstraction would either lie
//! about static encoders (no real tokens to expose) or impose transformer
//! ceremony on a lookup table. `VectorEncoder` instead abstracts at the
//! repo→(chunks, embeddings) boundary, where the shapes naturally agree.
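//!
//! A minimal sketch of that boundary, under stated assumptions: the
//! `RepoEncoder` trait, the `Chunk` type, and both toy encoders below are
//! hypothetical stand-ins rather than this crate's API, and the "repo" is an
//! in-memory list of `(path, contents)` pairs instead of a directory walk.
//! It only illustrates how two encoders with different chunking and different
//! output widths can sit behind one call that returns chunks and embeddings.
//!
//! ```rust
//! /// Hypothetical stand-in for the crate's chunk type.
//! #[derive(Debug, Clone, PartialEq)]
//! struct Chunk {
//!     file: String,
//!     text: String,
//! }
//!
//! /// Hypothetical boundary trait: "repo" in, (chunks, embeddings) out.
//! /// The real trait's methods may differ.
//! trait RepoEncoder: Send + Sync {
//!     fn dim(&self) -> usize;
//!     fn embed_repo(&self, files: &[(&str, &str)]) -> (Vec<Chunk>, Vec<Vec<f32>>);
//! }
//!
//! /// Toy "transformer-like" encoder: one chunk per non-empty line,
//! /// followed by a fake dense pass producing 2 features.
//! struct LineEncoder;
//!
//! impl RepoEncoder for LineEncoder {
//!     fn dim(&self) -> usize {
//!         2
//!     }
//!     fn embed_repo(&self, files: &[(&str, &str)]) -> (Vec<Chunk>, Vec<Vec<f32>>) {
//!         let mut chunks = Vec::new();
//!         let mut vecs = Vec::new();
//!         for &(file, body) in files {
//!             for line in body.lines().filter(|l| !l.trim().is_empty()) {
//!                 chunks.push(Chunk { file: file.to_string(), text: line.to_string() });
//!                 let words = line.split_whitespace().count();
//!                 vecs.push(vec![line.len() as f32, words as f32]);
//!             }
//!         }
//!         (chunks, vecs)
//!     }
//! }
//!
//! /// Toy "static-table" encoder: one chunk per file, embedding is the mean
//! /// of per-word table rows. Assumes a non-empty table.
//! struct TableEncoder {
//!     table: Vec<[f32; 3]>,
//! }
//!
//! impl RepoEncoder for TableEncoder {
//!     fn dim(&self) -> usize {
//!         3
//!     }
//!     fn embed_repo(&self, files: &[(&str, &str)]) -> (Vec<Chunk>, Vec<Vec<f32>>) {
//!         let mut chunks = Vec::new();
//!         let mut vecs = Vec::new();
//!         for &(file, body) in files {
//!             let mut sum = [0.0f32; 3];
//!             let mut n = 0usize;
//!             for word in body.split_whitespace() {
//!                 // Word length stands in for a vocabulary id.
//!                 let row = self.table[word.len() % self.table.len()];
//!                 for (s, r) in sum.iter_mut().zip(row.iter()) {
//!                     *s += *r;
//!                 }
//!                 n += 1;
//!             }
//!             let denom = n.max(1) as f32;
//!             chunks.push(Chunk { file: file.to_string(), text: body.to_string() });
//!             vecs.push(sum.iter().map(|s| *s / denom).collect());
//!         }
//!         (chunks, vecs)
//!     }
//! }
//!
//! /// Downstream code sees only the trait, so it never branches on family.
//! fn index(encoder: &dyn RepoEncoder, files: &[(&str, &str)]) -> usize {
//!     let (chunks, vecs) = encoder.embed_repo(files);
//!     assert_eq!(chunks.len(), vecs.len());
//!     assert!(vecs.iter().all(|v| v.len() == encoder.dim()));
//!     chunks.len()
//! }
//!
//! fn main() {
//!     let files = [("a.rs", "fn a() {}\nfn b() {}"), ("b.rs", "struct S;")];
//!     let encoders: Vec<Box<dyn RepoEncoder>> = vec![
//!         Box::new(LineEncoder),
//!         Box::new(TableEncoder { table: vec![[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]] }),
//!     ];
//!     for enc in &encoders {
//!         println!("{} chunks, dim {}", index(enc.as_ref(), &files), enc.dim());
//!     }
//! }
//! ```
//!
//! Neither toy encoder exposes tokens, and `index` only relies on the chunk
//! and embedding counts matching and on every vector having `dim()` entries.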
//!
//! See `docs/PLAN.md` cluster P0 for the broader port architecture.
use std::path::Path;
use crateCodeChunk;
use crateSearchConfig;
use crateProfiler;
pub use BertEncoder;
/// Trait that abstracts text/chunks → embedding vectors.
///
/// Implementations own their full pipeline (walk, chunk, tokenize, encode)
/// since transformer-family and static-table encoders have fundamentally
/// different compute shapes (see module-level docs).
///
/// # Object safety
///
/// The trait is object-safe, so it can be used as `dyn VectorEncoder` (for
/// example behind a `Box` or `Arc`). Methods take `&self` and use only
/// concrete return types, with no associated types or generic methods.
///
/// # Thread safety
///
/// `Send + Sync` is required because the encoder is shared across the
/// indexing pipeline's rayon and channel-based workers.
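///
/// The two properties above can be checked in isolation. The following is a
/// minimal sketch using a stand-in trait: `SketchEncoder`, `Zeros`,
/// `assert_send_sync`, and `encode_parallel` are hypothetical names, and the
/// method set is an assumption, not the real `VectorEncoder` signature. It
/// uses plain `std::thread` in place of rayon to stay dependency-free.
///
/// ```rust
/// use std::sync::Arc;
/// use std::thread;
///
/// /// Stand-in trait with the same supertrait bounds as described above.
/// trait SketchEncoder: Send + Sync {
///     fn dim(&self) -> usize;
///     fn encode(&self, text: &str) -> Vec<f32>;
/// }
///
/// /// Trivial encoder that returns a zero vector of the configured width.
/// struct Zeros(usize);
///
/// impl SketchEncoder for Zeros {
///     fn dim(&self) -> usize {
///         self.0
///     }
///     fn encode(&self, _text: &str) -> Vec<f32> {
///         vec![0.0; self.0]
///     }
/// }
///
/// /// Compiles only if `T` is `Send + Sync`.
/// fn assert_send_sync<T: Send + Sync + ?Sized>() {}
///
/// /// Shares one trait object across worker threads, one thread per text.
/// fn encode_parallel(enc: Arc<dyn SketchEncoder>, texts: Vec<String>) -> Vec<Vec<f32>> {
///     let handles: Vec<_> = texts
///         .into_iter()
///         .map(|t| {
///             let enc = Arc::clone(&enc);
///             thread::spawn(move || enc.encode(&t))
///         })
///         .collect();
///     // Joining in spawn order keeps outputs aligned with inputs.
///     handles.into_iter().map(|h| h.join().unwrap()).collect()
/// }
///
/// fn main() {
///     // Object safety: `dyn SketchEncoder` is a valid type.
///     // Thread safety: the supertrait bounds carry over to the trait object.
///     assert_send_sync::<dyn SketchEncoder>();
///
///     let enc: Arc<dyn SketchEncoder> = Arc::new(Zeros(4));
///     let out = encode_parallel(enc, vec!["a".into(), "b".into()]);
///     assert_eq!(out.len(), 2);
/// }
/// ```
///
/// Without the `Send + Sync` supertraits, `thread::spawn` would reject the
/// closure that captures `Arc<dyn SketchEncoder>`.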