//! Vector (semantic) search index — a hand-written **exact-flat** ANN over
//! `f32` vectors, keyed by stable `u64` ids.
//!
//! Design priorities (per `plan.md`): **maximum precision** and speed, 100%
//! Rust, no C/FFI, self-contained airgapped binary.
//!
//! - **Exact, not approximate.** Every query scores against every stored
//! vector, so recall is 100% — no quantization loss, no graph-traversal
//! miss. The embedding model already spends storage on precision
//! (`jina-v2-base-code`, 768-dim); the index does not throw that away.
//! - **Cosine similarity.** Vectors are L2-normalized on insert and the query
//! is normalized per search, so the score is a plain dot product (cosine).
//! Higher score = closer.
//! - **SIMD, runtime-detected.** The per-vector dot product dispatches once
//! per search to the best kernel the *running* CPU supports: AVX-512F →
//! AVX2+FMA → scalar. The binary builds and runs everywhere; it just goes
//! faster where the silicon allows (e.g. AVX-512 on Zen 4).
//! - **int8 quantization + VNNI (G2).** Because stored vectors are
//! L2-normalized (components in `[-1,1]`), they quantize to `i8` (scale 127)
//! with little loss: the int8 cosine matches the f32 cosine to ~1e-2. The
//! int8 kernel runs on a single AVX-512 **VNNI** `vpdpbusd` (64 int8
//! MACs/instr vs 16 f32 FMA lanes) and quarters the bytes/row that the
//! memory-bound scoring loop streams.
//! See [`score_i8_batch`] / [`VectorIndex::search_i8`]; both fall back to a
//! scalar int8 kernel where VNNI is absent. [`bench_kernels`] times every
//! path and is exercised by `nornir vector bench`.
//! - **Multicore.** For large corpora the scoring loop is split across cores
//! via scoped threads (no `Arc`, no new dependency), each computing a local
//! top-k that is merged into the global top-k.
//!
//! Ids map back to warehouse rows (chunk id → `{repo, git_sha, model, file,
//! span, excerpt}`), so the index stays a pure derived artifact — the same
//! shape as the Tantivy full-text index, and snapshot/restore-able the same
//! way. The embedding model that produces the `f32` vectors (Candle, feature
//! `embed-tract` / `embed-ort`) is a separate layer; this module is
//! model-agnostic and only cares about dimensionality.
//!
//! Cargo feature: `vector`.
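
// A minimal sketch of the exact-flat search described in the module docs. It
// assumes nothing about the real `VectorIndex` API: rows are normalized up
// front, every row is scored with a plain dot product, and scoped threads each
// keep a local top-k that is merged at the end. All names below (`normalize`,
// `dot`, `keep_top_k`, `search_exact`) are hypothetical, and the kernel is
// scalar only; the SIMD dispatch is left out.

```rust
use std::thread;

/// L2-normalize in place; a zero vector is left unchanged.
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Scalar dot product; equals the cosine when both inputs are unit-length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Sort by score, highest first, and keep the best `k`.
fn keep_top_k(mut hits: Vec<(u64, f32)>, k: usize) -> Vec<(u64, f32)> {
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(k);
    hits
}

/// Exact search: every row is scored, so nothing can be missed.
/// `rows` must already be normalized; the query is normalized here.
/// The corpus is split into contiguous slices, one per thread; each thread
/// builds a local top-k and the locals are merged into the global top-k.
fn search_exact(
    rows: &[(u64, Vec<f32>)],
    query: &[f32],
    k: usize,
    threads: usize,
) -> Vec<(u64, f32)> {
    let mut q = query.to_vec();
    normalize(&mut q);
    let q = &q;
    let chunk = rows.len().div_ceil(threads.max(1)).max(1);
    let locals: Vec<Vec<(u64, f32)>> = thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = rows
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    let scored = part.iter().map(|(id, v)| (*id, dot(v, q))).collect();
                    keep_top_k(scored, k)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    keep_top_k(locals.into_iter().flatten().collect(), k)
}
```

// Scoped threads borrow `rows` and the query directly, which is why no `Arc`
// is needed. Merging is safe because any row in the global top-k is also in
// the top-k of its own slice.
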
// The interchangeable embedding-model registry (plan item G3). Pure-std, no
// deps; also `include!`d by `build.rs` so the model it fetches matches the one
// the embedder loads. Available whenever `vector` is on (the CLI reports the
// selected model), not just under an embed backend.

// Shared embedder support, compiled when either backend is enabled.

// Embedder backends (the `store::Embedder` trait is the interface). Both run
// the same jina code ONNX model; pick by Cargo feature:
// - tract-onnx: CPU, pure Rust
// - ort / ONNX Runtime: GPU (CUDA/ROCm) or CPU

// Runtime CUDA-lib + onnxruntime discovery so the ort NVIDIA EP "just works".

// Runtime ROCm-lib discovery + probe for the ort AMD EP (G1).
// Runtime backend selector (#9). Active only when BOTH the CPU (tract) and GPU
// (ort) embedders are compiled — the default `cargo install nornir` build — so
// the single binary picks CUDA/ROCm/CPU at runtime. With only one backend
// compiled, `load_embedder` selects it directly (below).
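
// A rough sketch of the probe order described above (NVIDIA, then AMD, then
// CPU). It is illustrative only: `Backend`, `Probe`, `pick_backend` and
// `backend_name` are hypothetical names, and the probe results are passed in
// as booleans instead of being discovered from the machine.

```rust
/// Hypothetical backend choices for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cuda,
    Rocm,
    Cpu,
}

/// Hypothetical probe results; a real build would query the running machine.
struct Probe {
    nvidia_ok: bool,
    amd_ok: bool,
}

/// Prefer NVIDIA, then AMD, and fall back to CPU when neither is usable.
fn pick_backend(p: &Probe) -> Backend {
    if p.nvidia_ok {
        Backend::Cuda
    } else if p.amd_ok {
        Backend::Rocm
    } else {
        Backend::Cpu
    }
}

/// Human-readable name of the chosen backend, for diagnostics.
fn backend_name(b: Backend) -> &'static str {
    match b {
        Backend::Cuda => "CUDA",
        Backend::Rocm => "ROCm",
        Backend::Cpu => "CPU",
    }
}
```

// Keeping the decision in one pure function makes the fallback order easy to
// test without any GPU present.
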
/// Load the default embedder as a trait object, choosing the best available
/// backend **at runtime** (#9): when both backends are compiled (the default
/// build) the [`select`] module probes the box (NVIDIA → AMD → CPU) and loads
/// the right one; otherwise the single compiled backend is loaded directly.
/// Both backends produce vectors with the same `model_profile`, so they
/// interoperate in the warehouse.
// The explicit `return`s disambiguate the cfg branches.
/// Human-readable name of the backend [`load_embedder`] selects. In the default
/// (both-backend) build this **probes the running machine** and reports the
/// runtime choice (CUDA / ROCm / CPU); with a single backend it names it.
/// `id` of the active embedding model (from `$NORNIR_EMBED_MODEL` or the
/// registry default). Available whenever `vector` is on, for diagnostics — the
/// model is selectable independently of the backend. See
/// [`embed_registry::selected`].
/// Human-readable `"<model-name> (<dim>-dim)"` of the active model, for CLI /
/// diagnostics. Reports the registry default (jina-v2-base-code, 768-dim)
/// unless `$NORNIR_EMBED_MODEL` selects another.
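
// A small sketch of how a model could be resolved from `$NORNIR_EMBED_MODEL`
// with a registry default, and rendered as `"<model-name> (<dim>-dim)"`. This
// is not the real `embed_registry`: `ModelSpec`, `REGISTRY`, `resolve_model`
// and `describe` are hypothetical, the ids and the second entry are invented,
// and falling back to the default on an unknown id is an assumption.

```rust
/// Hypothetical registry entry.
struct ModelSpec {
    id: &'static str,
    name: &'static str,
    dim: usize,
}

/// Stand-in registry; the first entry is treated as the default.
const REGISTRY: &[ModelSpec] = &[
    ModelSpec { id: "jina-v2-base-code", name: "jina-v2-base-code", dim: 768 },
    // Invented entry, present only to show a non-default selection.
    ModelSpec { id: "example-small", name: "example-small", dim: 384 },
];

/// Resolve an optional override to an entry. An unset or unknown id falls
/// back to the default (assumed behavior).
fn resolve_model(requested: Option<&str>) -> &'static ModelSpec {
    requested
        .and_then(|id| REGISTRY.iter().find(|m| m.id == id))
        .unwrap_or(&REGISTRY[0])
}

/// Read the override from the environment and resolve it.
fn selected_from_env() -> &'static ModelSpec {
    resolve_model(std::env::var("NORNIR_EMBED_MODEL").ok().as_deref())
}

/// Render the `"<model-name> (<dim>-dim)"` form used for diagnostics.
fn describe(m: &ModelSpec) -> String {
    format!("{} ({}-dim)", m.name, m.dim)
}
```

// Because the lookup depends only on the id, the model can be chosen
// independently of which backend ends up running it.
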
// ----- vector-ANN engine ------------------------------------------------------
//
// The pure compute engine — the exact-flat `VectorIndex`, the SIMD/int8 cosine
// kernels, and the `bench_kernels` sweep — now lives in `znippy_zoomies::vann`
// (extracted verbatim; nornir already depended on znippy-zoomies for gatling).
// It is re-exported here so every existing `nornir::vector::…` path still
// resolves unchanged (`VectorIndex`, `quantize_i8`, `score_i8_batch`, …). The
// embedder glue above (chunk/store/embed*/registry) stays in nornir.
pub use ;
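
// A minimal sketch of the int8 path named in the module docs: quantize a
// unit-length vector with scale 127, take an integer dot product, and undo
// both scale factors to approximate the f32 cosine. The helpers below
// (`quantize`, `dot_i8`, `cosine_i8`) are hypothetical scalar stand-ins and do
// not reflect the signatures of `quantize_i8` or `score_i8_batch`.

```rust
/// Quantize a unit-length vector to i8 with scale 127, rounding to nearest.
fn quantize(v: &[f32]) -> Vec<i8> {
    v.iter()
        .map(|x| (x * 127.0).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

/// Scalar int8 dot product, accumulated in i32 so products cannot overflow.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| i32::from(*x) * i32::from(*y))
        .sum()
}

/// Approximate cosine: divide out the scale applied to each operand.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    dot_i8(a, b) as f32 / (127.0 * 127.0)
}
```

// For `[0.6, 0.8]` against `[1.0, 0.0]` the quantized rows are `[76, 102]` and
// `[127, 0]`, so the integer dot product is 9652 and the recovered cosine is
// 9652 / 16129, within 1e-2 of the exact 0.6. A 768-dim row takes 768 bytes as
// `i8` instead of 3072 as `f32`.
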