//! Backend abstraction layer.
//!
//! Post-v3.0.0 the only backend is the CPU cross-encoder reranker
//! ([`cpu::CpuRerankBackend`], backed by [`cpu::CpuBertModel`]). The
//! [`RerankBackend`] trait, [`Encoding`] input type, and [`BackendKind`]
//! discriminant survive; the [`EmbedBackend`] trait and bi-encoder
//! `load_backend` / `detect_backends` entry points were removed along
//! with the transformer engines.
// `cpu` covers CpuBertModel + CpuRerankBackend (both keep-anchors per
// the surgery's backend_split.md §3). The CpuBackend wrapper struct
// was removed with the bi-encoder backends; the trunk + reranker survived.
// Gate widened from `cfg(feature = "cpu")` so the macOS default build
// (which uses `cpu-accelerate`) gets the reranker.
/// Pre-tokenized encoding ready for inference.
///
/// Token IDs, attention mask, and token type IDs must all have the same length.
/// Token count is capped at `MODEL_MAX_TOKENS` (512) by the tokenizer before
/// reaching the backend.
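// A minimal sketch of the invariants described above, assuming `i64` token
// fields and a constructor that validates them. `EncodingSketch`,
// `EncodingError`, and the field names are hypothetical and are not this
// crate's definitions; only `MODEL_MAX_TOKENS` (512) comes from the doc
// comment.

```rust
/// Token cap quoted in the doc comment above.
pub const MODEL_MAX_TOKENS: usize = 512;

/// Hypothetical stand-in for a pre-tokenized encoding.
#[derive(Debug, Clone, PartialEq)]
pub struct EncodingSketch {
    pub input_ids: Vec<i64>,
    pub attention_mask: Vec<i64>,
    pub token_type_ids: Vec<i64>,
}

/// Hypothetical error type for the two invariants.
#[derive(Debug, PartialEq)]
pub enum EncodingError {
    /// The three vectors do not share one length.
    LengthMismatch { ids: usize, mask: usize, types: usize },
    /// The sequence exceeds `MODEL_MAX_TOKENS`.
    TooLong { len: usize },
}

impl EncodingSketch {
    /// Builds an encoding only if all three vectors have the same length
    /// and that length does not exceed `MODEL_MAX_TOKENS`.
    pub fn new(
        input_ids: Vec<i64>,
        attention_mask: Vec<i64>,
        token_type_ids: Vec<i64>,
    ) -> Result<Self, EncodingError> {
        let (ids, mask, types) =
            (input_ids.len(), attention_mask.len(), token_type_ids.len());
        if ids != mask || ids != types {
            return Err(EncodingError::LengthMismatch { ids, mask, types });
        }
        if ids > MODEL_MAX_TOKENS {
            return Err(EncodingError::TooLong { len: ids });
        }
        Ok(Self { input_ids, attention_mask, token_type_ids })
    }
}
```

// Checking at construction time would let a backend index all three vectors
// with one length; whether the real type validates or simply trusts the
// tokenizer is not stated in the source.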
/// Trait for cross-encoder rerank backends.
///
/// Parallel to the former `EmbedBackend` trait (since removed), but the
/// forward pass terminates in a scalar relevance score per pair instead
/// of a pooled vector. Used by the retrieve-then-rerank pipeline: a
/// bi-encoder retrieves top-K cheaply, then [`RerankBackend`] re-scores
/// those K candidates with the cross-encoder's higher-quality
/// cross-attention over the concatenated `[CLS] query [SEP] doc [SEP]`
/// sequence.
///
/// # Why a separate trait
///
/// Cross-encoders share BERT's trunk with bi-encoders, but the head and
/// pooling differ: bi-encoder = CLS pool + L2-normalize, cross-encoder
/// = CLS pool + linear(hidden -> 1) + sigmoid. The two return shapes are
/// incompatible (`Vec<Vec<f32>>` vs `Vec<f32>`), so unifying them under
/// a single trait would force every caller to handle an awkward sum
/// type. Sibling traits keep both call sites direct.
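// A minimal sketch of the two heads and return shapes contrasted above,
// assuming both start from an already-computed CLS vector per input.
// `EmbedSketch`, `RerankSketch`, `LinearHead`, and the method names are
// hypothetical; they are not the signatures of this crate's traits.

```rust
/// Bi-encoder head: L2-normalize the CLS vector.
pub fn l2_normalize(cls: &[f32]) -> Vec<f32> {
    let norm = cls.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        // Avoid dividing by zero; return the vector unchanged.
        return cls.to_vec();
    }
    cls.iter().map(|x| x / norm).collect()
}

/// Cross-encoder head: linear(hidden -> 1) followed by a sigmoid.
pub fn rerank_score(cls: &[f32], weight: &[f32], bias: f32) -> f32 {
    let logit = cls.iter().zip(weight).map(|(x, w)| x * w).sum::<f32>() + bias;
    1.0 / (1.0 + (-logit).exp())
}

/// Hypothetical bi-encoder trait: one vector per input.
pub trait EmbedSketch {
    fn embed_batch(&self, cls_vectors: &[Vec<f32>]) -> Vec<Vec<f32>>;
}

/// Hypothetical cross-encoder trait: one scalar per (query, doc) pair.
pub trait RerankSketch {
    fn score_batch(&self, cls_vectors: &[Vec<f32>]) -> Vec<f32>;
}

/// Hypothetical classifier head holding the linear(hidden -> 1) weights.
pub struct LinearHead {
    pub weight: Vec<f32>,
    pub bias: f32,
}

impl RerankSketch for LinearHead {
    fn score_batch(&self, cls_vectors: &[Vec<f32>]) -> Vec<f32> {
        cls_vectors
            .iter()
            .map(|cls| rerank_score(cls, &self.weight, self.bias))
            .collect()
    }
}
```

// With the shapes kept apart like this, a caller that only reranks receives
// a plain `Vec<f32>` and never has to match on an enum of "vectors or
// scores".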
/// Detect available backends and load them.
///
/// The `CpuBackend` wrapper was removed with the bi-encoder backends; the embedding path is
/// excised (B6 will prune `embed.rs` and `cache/reindex.rs`). This function
/// now always returns an error. Retained here until B6 removes the `server.rs`
/// caller at line 463.
///
/// # Errors
///
/// Always returns an error, since no embedding backend remains to load.
///
/// Load a cross-encoder rerank model for CPU inference.
///
/// MS-MARCO family rerankers (the default
/// `cross-encoder/ms-marco-MiniLM-L-6-v2`) are ClassicBert-shaped, so
/// they route through [`cpu::CpuRerankBackend`] - same trunk as the
/// bi-encoder, plus a `Linear(hidden -> 1)` classifier head.
///
/// Not feature-gated like the (now-deleted) embedding backends: the rerank
/// path is load-bearing for the document-search use case (cacheless prose
/// queries) and must work in the default build. The underlying
/// `CpuRerankBackend` uses the same ndarray BLAS setup as the former
/// `CpuBackend`, so it works wherever the CPU embedding backend did -
/// `feature = "cpu"` or `feature = "cpu-accelerate"`.
///
/// # Errors
///
/// Returns an error if the model cannot be downloaded, if it lacks a
/// classifier head (i.e., the caller pointed at a bi-encoder by
/// mistake), or if the weights fail to parse.
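// A minimal sketch of the "lacks a classifier head" check described above,
// assuming the checkpoint's tensors are exposed as a name -> shape map and
// that the head is stored as `classifier.weight` / `classifier.bias` (the
// naming used by common BERT sequence-classification checkpoints). The
// error enum, function name, and map type are hypothetical and are not
// this crate's loader.

```rust
use std::collections::HashMap;

/// Hypothetical error type mirroring the three failure cases listed above.
#[derive(Debug, PartialEq)]
pub enum LoadRerankError {
    /// The model could not be downloaded.
    Download(String),
    /// No classifier head: the caller probably pointed at a bi-encoder.
    MissingClassifierHead,
    /// The weights were present but malformed.
    Parse(String),
}

/// Returns the hidden size if the tensor map contains a
/// `Linear(hidden -> 1)` head, or an error describing what is missing.
pub fn check_classifier_head(
    tensors: &HashMap<String, Vec<usize>>,
) -> Result<usize, LoadRerankError> {
    let weight = tensors
        .get("classifier.weight")
        .ok_or(LoadRerankError::MissingClassifierHead)?;
    if !tensors.contains_key("classifier.bias") {
        return Err(LoadRerankError::MissingClassifierHead);
    }
    match weight.as_slice() {
        // A single output row: one relevance logit per pair.
        [1, hidden] => Ok(*hidden),
        other => Err(LoadRerankError::Parse(format!(
            "expected classifier.weight of shape [1, hidden], got {:?}",
            other
        ))),
    }
}
```

// Separating the missing-head case from a parse failure lets the caller
// report "this looks like a bi-encoder" instead of a generic load error.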