use extract_scope;
// What: Minimum byte length of an extracted regex prefix. Anything
// shorter is dropped from the unified AC index because short
// prefixes (like "a" or "to") fire on every file and defeat
// the gate's whole purpose.
// Why: The AC gate is meant to skip work on no-match files. A
// 1-byte "prefix" matches almost everywhere, queueing the
// full regex `find_all` for nothing.
// TS map: `const MIN_PREFIX_LEN = 3;`.
//
// In TS you'd write (pseudocode):
// ```ts
// const MIN_PREFIX_LEN = 3;
// ```
// 2026-05-03: lowered from 4 -> 3 after bench. Drains 13 of 28 residual
// rules whose leading literal is exactly 3 chars (`xox`, `pat`, `sat`,
// `ghu`/`ghs`, `r8_`, `hf_`, `SG.`, `EAA`, `.ey`, `A3-`, `A3T`). The
// trade-off is more spurious AC fires on files that contain those 3-byte
// substrings (e.g. `xox` occurs inside `xxxoxxx`), and each fire
// enqueues a `find_all`. But `find_all` on a clean file costs 5-10 us
// per rule, and these 3-byte substrings are rare in non-secret
// content. Net win: ~13 fewer unconditional residual scans per file,
// while the AC build / per-file scan cost grows negligibly. Two-byte
// prefixes (`SK`, `s.`) are NOT drained because they're common enough in
// real code (`static`, `sk`, `s.something`) that the spurious-AC-fire
// cost exceeds the residual-scan saving.
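//
// Sketch of the length filter this constant drives (illustrative only;
// `keep_prefix` is a hypothetical helper, not a function in this crate):
// ```rust
// fn keep_prefix(prefix: &str) -> bool {
//     // Byte length, not char count: the AC index matches raw bytes.
//     prefix.len() >= MIN_PREFIX_LEN
// }
// assert!(keep_prefix("xox"));  // 3 bytes: registered in the AC index
// assert!(!keep_prefix("SK"));  // 2 bytes: dropped from the AC index
// ```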
pub const MIN_PREFIX_LEN: usize = 3;
// What: `pub fn extract_gating_substrings(src: &str) -> Option<Vec<(String, bool)>>`
// returns a Vec of (substring, ci) pairs such that ANY successful
// regex match must contain AT LEAST ONE of them. The `ci` flag
// is per-substring -- determined by the scoped-flag context
// active at the point of extraction. A `(?i:body)` scope
// tags its substrings ci=true; a `(?-i:body)` scope tags
// them ci=false; with no scoped flag, a substring inherits the
// outer context set by the rule's leading `(?i)`, if any (default false).
// Returns `None` if the regex cannot be soundly
// gated -- e.g. a top-level alternation where one branch has
// no required substring at all, or the longest substring per
// branch falls below `MIN_PREFIX_LEN`.
// Why: The previous "single longest required prefix" walker missed
// the betterleaks rule shape `(?i)[\w.-]{0,50}(?:cohere|CO_API_KEY)...`,
// where the body of a required group is itself a literal
// alternation. With multi-substring gating, EACH alternation
// branch contributes its own AC pattern; all of them are
// registered against the SAME `rule_pos`. AC firing for any
// one of them queues the rule's full `find_all`. The "rule
// fires if any AC pattern in its set matches" semantics
// drains alternation-shape rules out of the residual gate
// and onto the AC fast path. See PERF.md, "Open opportunities".
// TS map: `function extractGatingSubstrings(src: string): Array<{ sub: string; ci: boolean }> | null`.
//
// In TS you'd write (pseudocode):
// ```ts
// function extractGatingSubstrings(src: string): Array<{ sub: string; ci: boolean }> | null {
//   // 1. Strip leading `(?flags)`; record `ci` as the outer-scope context.
//   // 2. Strip leading anchors `^`, `\b`, `\A`.
//   // 3. Recurse via extractScope on the remainder, threading `ci` through
//   //    so scoped-flag groups can override it for their bodies.
//   // 4. Reject if any returned substring is shorter than MIN_PREFIX_LEN.
// }
// ```
//
// Soundness contract: every returned substring must be valid UTF-8
// whose bytes match exactly what the original regex source would
// expect to find in file content. A regex literal `—` (em-dash,
// `\xe2\x80\x94`) MUST yield a substring whose `.as_bytes()` is
// `[0xe2, 0x80, 0x94]` -- NOT mojibake from per-byte casts.
// Aho-Corasick searches the file's raw bytes; if the registered
// pattern doesn't byte-for-byte match what the regex would match,
// the AC gate becomes a one-way trap-door: AC never fires, the
// regex's `find_all` is never invoked, and the rule is silently
// disabled while still appearing on the "fast path" (because a
// non-empty extraction excludes the rule from the residual gate).
// See `walk_literal_bytes` in `atom.rs` for the walker that must
// uphold this contract.
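//
// Sketch of the failure mode (illustrative only, not code from `atom.rs`):
// ```rust
// let lit = "\u{2014}"; // em-dash
// // Sound: the registered pattern carries the literal's UTF-8 bytes.
// assert_eq!(lit.as_bytes(), &[0xe2u8, 0x80, 0x94][..]);
// // Unsound: casting each byte to `char` re-encodes it as its own code
// // point, so the pattern's bytes no longer occur in the file.
// let mojibake: String = lit.bytes().map(|b| b as char).collect();
// assert_eq!(mojibake.as_bytes(), &[0xc3u8, 0xa2, 0xc2, 0x80, 0xc2, 0x94][..]);
// ```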