//! Free-function helpers for [`CodeIndexer`].
//!
//! Why: the original `mod.rs` bundled a mix of constant/env readers, codec
//! helpers, and score-adjustment free functions alongside the struct
//! definition and constructors. Extracting them here reduces `mod.rs` below
//! the 500-line cap while keeping each helper easy to find by concern.
//! What: env readers (`embedding_cache_cap`, `idle_evict_secs`,
//! `max_chunks_per_index`, `embed_batch_size`), codec helpers
//! (`hash_query`, `build_compact_snippet`, `resolve_chunk_file`,
//! `raw_to_code_chunk`, `populate_virtual_terms`), and score helpers
//! (`file_type_score_multiplier`, `is_struct_definition_chunk_type`,
//! `is_function_definition_chunk_type`, `definition_boost_query_tokens`,
//! `compute_match_reason`).
//! Test: see `indexer::tests` — every function here is exercised transitively
//! by the search and ingest integration tests; several have dedicated unit
//! tests (`test_embed_batch_size_env_clamp`,
//! `idle_evict_secs_default_and_env_override`, etc.).
use crateRawChunk;
use crateRawEntity;
use CodeChunk;
// ─── Batch / cache sizing ────────────────────────────────────────────────────
/// Default LRU capacity for the per-indexer chunk embedding cache.
///
/// Each entry is `dim × 4` bytes (384-dim f32 = 1 536 B), so 1 000 entries ≈
/// 1.5 MB of RAM per index. Evicted entries are simply re-embedded on demand
/// (MMR rerank gracefully falls back when an embedding is missing). Lowered
/// from 10 000 → 1 000 (issue #79) after a daemon was observed at 43.9 GB RSS;
/// the cache was a meaningful contributor on multi-index hosts. Override
/// at runtime via `TRUSTY_EMBEDDING_CACHE`.
const DEFAULT_EMBEDDING_CACHE_CAP: usize = 1_000;
/// Read the embedding-cache LRU cap from the environment, with a sane default.
///
/// Why: lets operators tune the in-memory embedding LRU without a recompile.
/// What: reads `TRUSTY_EMBEDDING_CACHE` as a positive usize; falls back to
/// [`DEFAULT_EMBEDDING_CACHE_CAP`] when unset, zero, or unparseable.
/// Test: covered indirectly by every test that constructs a `CodeIndexer`.
pub
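// Illustrative sketch only, not the crate's actual body: one way the
// "positive usize, else default" rule described above could be written.
// `parse_cap` and `embedding_cache_cap_sketch` are hypothetical names, and
// `DEFAULT_CAP` simply mirrors `DEFAULT_EMBEDDING_CACHE_CAP`.

```rust
use std::env;

/// Hypothetical stand-in for `DEFAULT_EMBEDDING_CACHE_CAP`.
const DEFAULT_CAP: usize = 1_000;

/// Turn a raw env value into a positive cap. Missing, zero, and
/// unparseable values all fall back to the default.
fn parse_cap(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_CAP)
}

/// Thin wrapper over the environment lookup (assumed shape).
fn embedding_cache_cap_sketch() -> usize {
    parse_cap(env::var("TRUSTY_EMBEDDING_CACHE").ok().as_deref())
}
```

// Keeping the parse step separate from the env lookup is a design choice of
// this sketch: it lets the fallback rules be checked without mutating the
// process environment.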
/// Default idle window (seconds) after which a durably-backed index's
/// in-memory `chunks` HashMap is evicted to reclaim heap.
pub const DEFAULT_CHUNKS_IDLE_EVICT_SECS: u64 = 300;
/// Resolve the in-memory-chunks idle-eviction window (in seconds) from the
/// environment, falling back to [`DEFAULT_CHUNKS_IDLE_EVICT_SECS`].
///
/// Why: operators on memory-constrained hosts may want a tighter window
/// (evict sooner) while large-corpus hosts that re-query frequently may want
/// to disable eviction entirely.
/// What: reads `TRUSTY_CHUNKS_IDLE_EVICT_SECS` as `u64` seconds. A value of
/// `0` **disables** idle eviction. Unset / unparseable falls back to default.
/// Test: `idle_evict_secs_default_and_env_override`.
pub
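// Illustrative sketch only: the "0 disables eviction" rule from the doc
// comment above, under the assumption that callers want an optional window.
// `parse_idle_evict_secs` and `idle_window` are hypothetical helpers, and
// `DEFAULT_SECS` mirrors `DEFAULT_CHUNKS_IDLE_EVICT_SECS`.

```rust
use std::time::Duration;

/// Hypothetical stand-in for `DEFAULT_CHUNKS_IDLE_EVICT_SECS`.
const DEFAULT_SECS: u64 = 300;

/// Parse the raw env value as whole seconds. Unset or unparseable input
/// falls back to the default; `0` is passed through unchanged.
fn parse_idle_evict_secs(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(DEFAULT_SECS)
}

/// Interpret the resolved value: `0` means "never evict", anything else is
/// the idle window after which eviction may run.
fn idle_window(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}
```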
/// Default hard cap on chunks per index.
const DEFAULT_MAX_CHUNKS_PER_INDEX: usize = 200_000;
/// Read the per-index chunk cap from the environment, with a sane default.
///
/// Why: limits RSS growth on large monorepos.
/// What: reads `TRUSTY_MAX_CHUNKS` as a positive usize; falls back to
/// [`DEFAULT_MAX_CHUNKS_PER_INDEX`] when unset, zero, or unparseable.
/// Test: covered indirectly by every ingest test.
pub
/// Default safety-net batch size when `TRUSTY_MAX_BATCH_SIZE` is unset.
const DEFAULT_EMBED_BATCH_SIZE: usize = 64;
/// Floor for env-clamped batch size.
const EMBED_BATCH_MIN: usize = 32;
/// Ceiling for env-clamped batch size.
const EMBED_BATCH_MAX: usize = 512;
/// Read the embedding batch size from `TRUSTY_MAX_BATCH_SIZE`, clamped to
/// `[EMBED_BATCH_MIN, EMBED_BATCH_MAX]`. Falls back to
/// `DEFAULT_EMBED_BATCH_SIZE` when unset or unparseable.
///
/// Why: large repos can exhaust process memory if batches grow unbounded.
/// What: parses env, clamps via `.clamp()`.
/// Test: see `tests::test_embed_batch_size_env_clamp`.
pub
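// Illustrative sketch only: the parse-then-clamp behaviour described above.
// `clamp_batch_size` is a hypothetical name; the three constants mirror
// `DEFAULT_EMBED_BATCH_SIZE`, `EMBED_BATCH_MIN`, and `EMBED_BATCH_MAX`.

```rust
const DEFAULT_BATCH: usize = 64;
const BATCH_MIN: usize = 32;
const BATCH_MAX: usize = 512;

/// Clamp a parsed value into `[BATCH_MIN, BATCH_MAX]`. Unset or unparseable
/// input yields the default, which already lies inside the range.
fn clamp_batch_size(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.parse::<usize>().ok())
        .map(|n| n.clamp(BATCH_MIN, BATCH_MAX))
        .unwrap_or(DEFAULT_BATCH)
}
```

// For example, a raw value of "1" would resolve to 32 and "100000" to 512,
// while "128" passes through unchanged.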
// ─── Codec helpers ───────────────────────────────────────────────────────────
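/// Deterministic u64 hash of a query string. Used as the LRU cache key so the
/// cache holds only the embedding payload, not a second copy of the query
/// text. `DefaultHasher`'s algorithm is not guaranteed to stay the same across
/// Rust releases, so the key is suitable for in-memory use only.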
///
/// Why: avoids keeping two copies of the query text in the cache.
/// What: `DefaultHasher::finish()` over `query`.
/// Test: covered indirectly by every search that hits the embedding cache.
pub
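// Illustrative sketch only: hashing a query with the standard library's
// `DefaultHasher`, as the "What" line above describes. `hash_query_sketch`
// is a hypothetical name for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Feed the query into a fresh `DefaultHasher` and return its 64-bit digest.
/// A hasher built with `DefaultHasher::new()` uses fixed keys, so equal
/// queries produce equal digests within one build of the program.
fn hash_query_sketch(query: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    query.hash(&mut hasher);
    hasher.finish()
}
```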
/// Build a snippet of at most 7 lines, taken from the top of the chunk
/// content, for token-efficient output.
///
/// Why: long chunks are expensive in LLM prompts; a 7-line header gives enough
/// context to identify the construct without burning tokens.
/// What: returns the first 7 lines when content exceeds 7 lines; otherwise
/// returns `content` verbatim.
/// Test: covered indirectly by every search test that sets `compact: true`.
pub
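// Illustrative sketch only: the "first 7 lines, else verbatim" rule above.
// `compact_snippet_sketch` and `SNIPPET_LINES` are hypothetical names, and
// joining with "\n" is an assumption about the output format.

```rust
const SNIPPET_LINES: usize = 7;

/// Return the first `SNIPPET_LINES` lines when the content is longer than
/// that; otherwise return the content unchanged.
fn compact_snippet_sketch(content: &str) -> String {
    if content.lines().count() > SNIPPET_LINES {
        content
            .lines()
            .take(SNIPPET_LINES)
            .collect::<Vec<_>>()
            .join("\n")
    } else {
        content.to_string()
    }
}
```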
/// Resolve a stored chunk `file` string to an absolute path string.
///
/// Why (issue #402): newly indexed chunks store `file` relative to
/// `root_path`. Older indexes still carry absolute paths. This helper
/// normalises both forms.
/// What: if `raw_file` starts with the OS path separator it is returned
/// as-is; otherwise `root_path.join(raw_file)` is returned.
/// Test: `tests::resolve_chunk_file_relative_becomes_absolute` and
/// `tests::resolve_chunk_file_absolute_passthrough`.
pub
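// Illustrative sketch only: the separator check and join described above.
// `resolve_chunk_file_sketch` is a hypothetical name, and the lossy UTF-8
// conversion of the joined path is an assumption of this example.

```rust
use std::path::{Path, MAIN_SEPARATOR};

/// Pass through values that already start with the OS path separator;
/// otherwise treat the value as relative to `root_path`.
fn resolve_chunk_file_sketch(root_path: &Path, raw_file: &str) -> String {
    if raw_file.starts_with(MAIN_SEPARATOR) {
        raw_file.to_string()
    } else {
        root_path.join(raw_file).to_string_lossy().into_owned()
    }
}
```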
/// Materialize a `RawChunk` into a `CodeChunk` with the given score, match
/// reason, and optional compact snippet.
///
/// Why: four call sites used to inline the same 18-field struct literal.
/// Consolidating removes ~60 lines of duplication.
/// What: clones every metadata field and derives `chunk_depth` (clamped to
/// u8). Resolves `raw.file` to absolute via [`resolve_chunk_file`].
/// Test: covered indirectly by every search/materialization test.
pub
/// Populate `virtual_terms` on each chunk from entities whose source line
/// falls within the chunk's `[start_line, end_line]` range.
///
/// Why: two call sites used the same dedupe-by-entity-text loop. Extracting
/// prevents drift.
/// What: for each chunk, walks `entities` once, inserting each entity's text
/// at most once into a fresh `virtual_terms` vector.
/// Test: covered by `test_virtual_terms_populated_from_entities`.
pub
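// Illustrative sketch only: the range check and dedupe loop described
// above. `Chunk` and `Entity` are hypothetical, pared-down stand-ins for the
// crate's real types, and their field names are assumptions.

```rust
/// Hypothetical minimal chunk: just the fields this example needs.
struct Chunk {
    start_line: usize,
    end_line: usize,
    virtual_terms: Vec<String>,
}

/// Hypothetical minimal entity: a source line and its text.
struct Entity {
    line: usize,
    text: String,
}

/// For each chunk, rebuild `virtual_terms` from the entities whose line lies
/// in `[start_line, end_line]`, keeping each distinct text once.
fn populate_virtual_terms_sketch(chunks: &mut [Chunk], entities: &[Entity]) {
    for chunk in chunks.iter_mut() {
        let mut terms: Vec<String> = Vec::new();
        for entity in entities {
            let in_range =
                entity.line >= chunk.start_line && entity.line <= chunk.end_line;
            if in_range && !terms.contains(&entity.text) {
                terms.push(entity.text.clone());
            }
        }
        chunk.virtual_terms = terms;
    }
}
```

// The linear `contains` check keeps first-seen order and is adequate when a
// chunk carries only a handful of entities; a `HashSet` would be the usual
// alternative if that assumption does not hold.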
// ─── Score helpers ───────────────────────────────────────────────────────────
/// Score multiplier applied to a chunk for Definition-intent queries (issue
/// #92).
///
/// Why: Definition queries should surface the canonical declaration, not doc
/// files that mention the symbol many times.
/// What: returns `0.5` for known doc/config extensions, `1.0` otherwise.
/// Test: covered by `test_file_type_multiplier_demotes_docs`.
pub
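// Illustrative sketch only: a 0.5 / 1.0 multiplier keyed on file extension.
// The extension list below is an assumption made for this example; the
// source does not say which doc/config extensions the real function knows.

```rust
use std::path::Path;

/// Assumed doc/config extensions (hypothetical list).
const DOC_CONFIG_EXTS: &[&str] = &["md", "rst", "txt", "toml", "yaml", "yml", "json"];

/// Return 0.5 for files whose extension is in the assumed list, else 1.0.
/// Files without an extension are left at 1.0.
fn file_type_multiplier_sketch(file: &str) -> f32 {
    let ext = Path::new(file)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext {
        Some(e) if DOC_CONFIG_EXTS.contains(&e.as_str()) => 0.5,
        _ => 1.0,
    }
}
```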
/// Structural-definition score boost for Definition-intent queries (issue
/// #117).
///
/// Why: queries with struct-name tokens were under-firing; a 2.0× multiplier
/// surfaces the canonical declaration without drowning other boosts.
/// What: a flat `2.0` multiplier applied in `apply_score_adjustments`.
/// Test: `test_struct_definition_boost_surfaces_struct_over_usage`.
pub const STRUCT_DEFINITION_BOOST: f32 = 2.0;
/// Decide whether `chunk_type` participates in the Definition-intent
/// structural boost for type declarations (issue #117).
///
/// Why: only chunks that ARE the declaration of a type are eligible.
/// What: returns `true` for `Struct`, `Enum`, `Class`, `Trait`, and
/// `TypeAlias`; `false` for everything else.
/// Test: covered indirectly by
/// `test_struct_definition_boost_surfaces_struct_over_usage`.
pub
/// Decide whether `chunk_type` participates in the Definition-intent
/// function-definition boost (issue #122).
///
/// Why: function-name queries returned usage sites at rank 1 instead of
/// the canonical declaration. Extending the boost to function-like chunks
/// closes that gap.
/// What: returns `true` for `Function` and `Method`; `false` for everything
/// else. `Constant` is excluded to avoid boosting string-literal occurrences.
/// Test: covered by
/// `test_function_definition_boost_surfaces_function_over_string_literal_usage`.
pub
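// Illustrative sketch only: the two chunk-type predicates documented above,
// written against a hypothetical `ChunkType` enum. The variant names come
// from the doc comments; the enum itself and the `Other` variant are
// assumptions of this example.

```rust
/// Hypothetical stand-in for the crate's chunk-type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChunkType {
    Struct,
    Enum,
    Class,
    Trait,
    TypeAlias,
    Function,
    Method,
    Constant,
    Other,
}

/// True only for chunks that declare a type.
fn is_struct_definition_sketch(t: ChunkType) -> bool {
    matches!(
        t,
        ChunkType::Struct
            | ChunkType::Enum
            | ChunkType::Class
            | ChunkType::Trait
            | ChunkType::TypeAlias
    )
}

/// True only for function-like chunks; `Constant` is deliberately excluded.
fn is_function_definition_sketch(t: ChunkType) -> bool {
    matches!(t, ChunkType::Function | ChunkType::Method)
}
```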
/// Lowercase the meaningful query tokens for the Definition-intent structural
/// boost (issue #117).
///
/// Why: the boost only fires when a chunk's `function_name` literally matches
/// one of the query tokens. Tokenising the same way at boost-decision time
/// keeps the rule predictable and unit-testable.
/// What: splits on whitespace, drops tokens shorter than 2 characters, and
/// lowercases each remaining token.
/// Test: covered indirectly by
/// `test_struct_definition_boost_surfaces_struct_over_usage`.
pub
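// Illustrative sketch only: the split / filter / lowercase steps described
// above. `definition_boost_query_tokens_sketch` is a hypothetical name, and
// measuring token length in `char`s rather than bytes is an assumption.

```rust
/// Split on whitespace, drop tokens shorter than 2 characters, and
/// lowercase what remains.
fn definition_boost_query_tokens_sketch(query: &str) -> Vec<String> {
    query
        .split_whitespace()
        .filter(|token| token.chars().count() >= 2)
        .map(|token| token.to_lowercase())
        .collect()
}
```

// With this sketch, the query "struct CodeIndexer a" would yield the tokens
// "struct" and "codeindexer"; the single-character "a" is dropped.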
/// Map (`in_hnsw`, `in_bm25`, `in_kg`) booleans to a stable `match_reason`
/// label.
///
/// Why: lifted out of `search` to keep the materialization loop short and to
/// make the precedence rules unit-testable in isolation.
/// What: direct hits (HNSW and/or BM25) take precedence over KG-only paths.
/// `(false,false,false)` returns `"fallback:ripgrep"` for the grep lane.
/// Test: covered indirectly by `test_kg_expansion_marks_neighbours_with_hybrid_kg`
/// and `test_compute_match_reason_fallback_label`.
pub
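// Illustrative sketch only: the precedence rule described above, where a
// direct hit outranks a KG-only path. Only "fallback:ripgrep" is taken from
// the doc comment; the other four labels are placeholders invented for this
// example and are not the crate's real strings.

```rust
/// Map the three lane flags to a label. Direct hits ignore `in_kg`.
fn match_reason_sketch(in_hnsw: bool, in_bm25: bool, in_kg: bool) -> &'static str {
    match (in_hnsw, in_bm25, in_kg) {
        (true, true, _) => "hybrid",        // placeholder label
        (true, false, _) => "semantic",     // placeholder label
        (false, true, _) => "lexical",      // placeholder label
        (false, false, true) => "kg",       // placeholder label
        (false, false, false) => "fallback:ripgrep",
    }
}
```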