//! Cross-encoder reranker for top-K refinement.
//!
//! ## Why this module exists
//!
//! ripvec's bi-encoder retrieval (BERT or semble) embeds the query and
//! the documents into a shared vector space and ranks by cosine
//! similarity. That is cheap to scale, but because each side is
//! encoded independently, the model cannot express cross-token
//! interactions between query and document. On natural-language and
//! prose corpora this caps quality.
//!
//! A cross-encoder concatenates the pair as `[CLS] query [SEP] doc [SEP]`
//! and runs full attention across both, producing a single relevance
//! score. Quality is meaningfully higher, but every candidate needs
//! its own forward pass, so cost is O(candidates) per query and the
//! model is used only as a reranker on the bi-encoder's top-K.
//!
//! ## Architecture
//!
//! This module is a thin orchestrator: it tokenizes `(query, doc)`
//! pairs and delegates scoring to a
//! [`RerankBackend`](crate::backend::RerankBackend). Only the CPU
//! backend, [`crate::backend::cpu::CpuRerankBackend`], is wired today.
//! It uses the same BERT trunk as the bi-encoder plus a
//! `Linear(hidden -> 1)` classifier head followed by a sigmoid.
//!
//! Adding a GPU reranker later would require implementing
//! `RerankBackend` for the target device, mirroring
//! `load_reranker_cpu` in `backend/mod.rs`, and routing through
//! `Reranker::from_pretrained`.
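/*!

As a rough illustration of the two-stage flow described above, the
sketch below reranks the bi-encoder's first `k` hits with a pairwise
scorer. The names `Scored`, `sigmoid`, and `rerank_top_k` are
hypothetical and are not ripvec APIs; the `pair_logit` closure is an
assumed stand-in for the cross-encoder forward pass.

```rust
// Hypothetical scored candidate; not a ripvec type.
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub doc_index: usize,
    pub score: f32,
}

// Logistic sigmoid: maps a raw classifier logit into (0, 1).
pub fn sigmoid(logit: f32) -> f32 {
    1.0 / (1.0 + (-logit).exp())
}

// Rerank the first `k` entries of `hits` (document indices in
// bi-encoder order, best first) using `pair_logit(query, doc)`.
pub fn rerank_top_k<F>(
    query: &str,
    docs: &[&str],
    hits: &[usize],
    k: usize,
    pair_logit: F,
) -> Vec<Scored>
where
    F: Fn(&str, &str) -> f32,
{
    let mut scored: Vec<Scored> = hits
        .iter()
        .take(k) // one scorer call per candidate, so cost grows with k
        .map(|&i| Scored {
            doc_index: i,
            score: sigmoid(pair_logit(query, docs[i])),
        })
        .collect();
    // Highest relevance first; the sort is stable, so ties keep the
    // bi-encoder order.
    scored.sort_by(|a, b| b.score.total_cmp(&a.score));
    scored
}
```

Only the first `k` hits ever reach the scorer, which is why the
candidate cap directly bounds rerank latency.
*/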
use anyhow;
use ;
use crate;
/// Default cross-encoder model.
///
/// `cross-encoder/ms-marco-TinyBERT-L-2-v2` (~5 MB, 2 layers,
/// distilled from BERT-base) replaced the prior MiniLM-L-12-v2
/// default after a model sweep on the gutenberg prose benchmark
/// (15 NL queries) showed identical NDCG@10 and recall@10 while
/// running about 20x faster on the warm-query path:
///
/// ```text
/// model                           NDCG@10  recall@10  p50
/// ms-marco-MiniLM-L-12-v2 (old)   1.0000   1.000      671 ms
/// ms-marco-MiniLM-L-6-v2          1.0000   1.000      344 ms
/// ms-marco-MiniLM-L-2-v2          0.9508   1.000      125 ms  <- quality drop
/// ms-marco-TinyBERT-L-2-v2 (new)  1.0000   1.000       33 ms
/// ```
///
/// The distinction is distillation: TinyBERT-L-2 was trained with
/// teacher distillation to preserve the larger model's behavior at
/// 2 layers, whereas plain MiniLM-L-2 sheds layers without that
/// regularization and loses precision. Going from twelve layers to
/// two cuts inference cost by ~6x; combined with the smaller
/// embedding dimension, the measured speedup is ~20x (671 ms vs
/// 33 ms p50). Override via the CLI flag or
/// `Reranker::from_pretrained` directly when a corpus needs more
/// capacity (e.g. fine-grained domain reranking).
pub const DEFAULT_RERANK_MODEL: &str = "cross-encoder/ms-marco-TinyBERT-L-2-v2";
/// Default cap on candidates passed to the reranker.
///
/// Cost is linear in candidates. The retrieve-then-rerank literature
/// suggests 100 as a safe upper bound, but on the gutenberg prose
/// benchmark with the L-12 ms-marco cross-encoder, NDCG@10 is
/// identical from K=100 all the way down to K=20. Recall stays at
/// 1.000: the bi-encoder + ranking layer already puts the relevant
/// doc at rank 1 in every test query, so the reranker's job is
/// confirmation rather than reordering. 50 gives a ~2x speedup over
/// the literature default while leaving headroom for corpora where
/// the bi-encoder is less confident; users on high-confidence
/// corpora can drop further (CLI: `--candidates 30`).
///
/// Bench (gutenberg, 15 NL queries, scope=docs, NDCG=1.000 throughout):
///
/// ```text
/// K=100  p50  1335 ms
/// K=50   p50   676 ms
/// K=30   p50   418 ms
/// K=20   p50   275 ms
/// ```
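/**

The linear-cost claim can be checked against the bench rows above with
a small worked calculation. This is only a sketch: `BENCH_P50_MS`,
`mean_ms_per_candidate`, and `estimate_p50_ms` are hypothetical names,
and the model assumes no fixed per-query overhead, which real
measurements will include.

```rust
// p50 latencies copied from the bench table: (K, milliseconds).
pub const BENCH_P50_MS: [(usize, f64); 4] =
    [(100, 1335.0), (50, 676.0), (30, 418.0), (20, 275.0)];

// Mean per-candidate cost across the rows, in milliseconds.
// Returns NaN for an empty slice.
pub fn mean_ms_per_candidate(rows: &[(usize, f64)]) -> f64 {
    let total: f64 = rows.iter().map(|&(k, ms)| ms / k as f64).sum();
    total / rows.len() as f64
}

// Linear estimate of p50 latency for an arbitrary candidate cap `k`.
pub fn estimate_p50_ms(k: usize, rows: &[(usize, f64)]) -> f64 {
    k as f64 * mean_ms_per_candidate(rows)
}
```

On these four rows the per-candidate cost stays between roughly 13.4
and 13.9 ms, which is consistent with cost scaling linearly in K.
*/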
pub const DEFAULT_RERANK_CANDIDATES: usize = 50;
/// Cross-encoder reranker orchestrator.
///
/// Owns a `RerankBackend` (model trunk + classifier head) and the
/// tokenizer that produces the encodings the backend expects.
///
/// Construct via [`Self::from_pretrained`]. Use [`score_pairs`] to
/// rank candidate `(query, doc)` text pairs.
///
/// ## cfg-gating
///
/// The `backend` field type is cfg-gated by the `collapse-rerank-trait`
/// Cargo feature, following the pattern mandated in
/// `docs/surgery/backend_trait_microbench.md` Section 4
/// (Lampson (1983), "Hints for Computer System Design"):
///
/// - **Variant T, the default** (`collapse-rerank-trait` off):
///   `Box<dyn RerankBackend>`, a heap-allocated trait object with
///   vtable dispatch; future GPU rerankers slot in here.
/// - **Variant C** (`collapse-rerank-trait` on):
///   [`crate::backend::cpu::CpuRerankBackend`] held directly, giving
///   monomorphic static dispatch that LLVM may inline through.
///
/// The call site `self.backend.score_batch(...)` is identical in source for
/// both variants; the compiler generates an indirect vtable call for the
/// default `Box<dyn RerankBackend>` field (Variant T) and a direct call for
/// Variant C. This structural difference is what the microbench measures.
/// Anti-patterns (enum wrapping, a type alias, two parallel structs) are
/// explicitly avoided per Section 4 so that LLVM cannot collapse both
/// variants to zero overhead by construction.
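/**

A minimal sketch of the cfg-gated field pattern described above, using
stand-in types. `Scorer`, `LengthScorer`, and `Holder` are hypothetical
names introduced for illustration; the real `RerankBackend` trait and
its `score_batch` signature may differ.

```rust
// Hypothetical stand-in for the backend trait.
pub trait Scorer {
    fn score_batch(&self, pairs: &[(&str, &str)]) -> Vec<f32>;
}

// Toy backend: scores a pair by its combined byte length.
pub struct LengthScorer;

impl Scorer for LengthScorer {
    fn score_batch(&self, pairs: &[(&str, &str)]) -> Vec<f32> {
        pairs.iter().map(|(q, d)| (q.len() + d.len()) as f32).collect()
    }
}

pub struct Holder {
    // Feature off: trait object, so calls go through a vtable.
    #[cfg(not(feature = "collapse-rerank-trait"))]
    backend: Box<dyn Scorer>,
    // Feature on: concrete type, so the call is resolved statically.
    #[cfg(feature = "collapse-rerank-trait")]
    backend: LengthScorer,
}

impl Holder {
    pub fn new() -> Self {
        #[cfg(not(feature = "collapse-rerank-trait"))]
        let backend: Box<dyn Scorer> = Box::new(LengthScorer);
        #[cfg(feature = "collapse-rerank-trait")]
        let backend = LengthScorer;
        Self { backend }
    }

    // Same source line for both variants; only the field type differs.
    pub fn score(&self, pairs: &[(&str, &str)]) -> Vec<f32> {
        self.backend.score_batch(pairs)
    }
}
```

Keeping one struct with a gated field, rather than two parallel
structs, leaves the call site textually identical across variants.
*/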
///
/// [`score_pairs`]: Self::score_pairs