1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
//! Query-embedding alignment for hybrid rule retrieval.
//!
//! Split out of `rules.rs`: the logic that decides HOW to embed the recall
//! query so the resulting vector can actually match the persisted index —
//! including the cold-start retry and the sustained-outage fast path. The
//! orchestration in `rules.rs` calls [`embed_query_aligned_to_index`] via
//! `super::query_embed::…`.
use crate;
use crateindex_db;
use SqlitePool;
use Duration;
/// Embed the recall query in a profile the persisted index can actually match.
///
/// When the project index's vector lane is dead relative to the active embedder
/// (`vector_lane_available == false` — e.g. a local SHA1 index while the active
/// embedder is cloud/BYOK), embedding the query with the active remote provider
/// is futile: the resulting vector lives in a different space (usually a
/// different dimension) than every indexed chunk, so the cosine pass below drops
/// them all (`query_emb.len() != c.embedding.len()`) and recall silently
/// degrades to FTS-only — *worse* than the local-hash + FTS hybrid the SHA1
/// index supports. It also forces a doomed network round-trip and prints
/// provider-failure warnings on every recall, contradicting the `status` line
/// that already reports the lane as paused.
///
/// In that state, embed locally with SHA1 instead: it matches the SHA1 index
/// dimension (restoring the hybrid lane), makes no network call, and stays
/// quiet. The same applies when there is no usable lane at all — a not-yet-built
/// or empty index returns nothing regardless of the query vector, so a remote
/// embed there is pure latency (the no-data / fresh-repo recall otherwise hangs
/// ~2.5s on a doomed cloud call before reporting "no memories").
///
/// This keys off the STATIC lane check only, deliberately. A *cloud-profiled*
/// index whose provider is briefly down (`profile_match == true`) still issues a
/// remote query embed that times out and falls back to SHA1 per-call — a bounded
/// per-query cost while genuinely down, and immediate recovery to full semantic
/// ranking the moment the provider returns. Gating this on the strict
/// recent-fallback window instead would force SHA1 (FTS-only ranking) for the
/// whole window on an otherwise-healthy lane after a single transient blip, so
/// the build-path freshness skip — not the query path — owns that window.
///
/// Cold-start recovery (`cold_start_retry`): the *first* query after an idle
/// period embeds against a cloud provider whose upstream model connection is
/// cold. That first remote embed routinely exceeds the latency-sensitive base
/// budget (e.g. the local cloud dialling OpenAI for the first time); with a bare
/// ~2.5s timeout the call is aborted and recall falls back to SHA1 even though
/// the lane is healthy — the user sees "semantic vectors paused" on a cold call
/// and full semantic ranking only on the warm follow-up. The user-initiated
/// `recall`/`search` path fixes this in two parts, gated on a healthy lane (we
/// only reach the remote embed when `vector_lane_available && !outage`). First,
/// it raises the initial attempt's ceiling to the cold-absorbing budget
/// ([`COLD_RETRY_EMBEDDING_TIMEOUT`]) so a cold round-trip completes in one shot
/// instead of being aborted at ~2.5s and re-paid; a warm query is unaffected,
/// since the budget is a ceiling and it still returns the instant the provider
/// responds. Second, if that attempt still falls back, it retries once to ride
/// out a transient transport flap (a dropped cold connection) before settling
/// for SHA1, which stays the last resort used only after both fail.
///
/// `cold_start_retry` is OFF for the latency-critical hook (800ms) and MCP
/// (1500ms) paths: those deliberately fast-degrade to lexical so a cold provider
/// never blocks the agent's edit hook or tool call. Only the human-waiting CLI
/// path opts in, where a few extra seconds to keep semantic ranking is the right
/// trade.
pub async
/// Per-query budget for the single cold-start retry in
/// [`embed_query_aligned_to_index`]. The first remote embed after an idle period
/// pays the provider's cold-connection cost (e.g. the local cloud dialling its
/// upstream model), which routinely overruns the ~2.5s interactive base budget
/// the base attempt aborts on. The retry budget is sized to clear a genuinely
/// cold single-text OpenAI embed in one shot while staying well under the 45s
/// embedding HTTP-client cap, and is bounded so a truly unreachable provider
/// still terminates one recall in seconds: the on-disk outage gate
/// (`cloud_embed_outage_active`) trips after a short run of transient failures
/// and routes subsequent queries straight to SHA1, so this larger budget is only
/// ever paid on the FIRST cold call, never repeatedly.
pub const COLD_RETRY_EMBEDDING_TIMEOUT: Duration = from_secs;