// What: Unit tests for the five `scan_format` helpers
// (`build_line_index`, `line_and_col_indexed`,
// `end_in_line_indexed`, `format_hit`, `emit_hit`). This sidecar
// file is pulled in by
// `#[cfg(test)] #[path = "scan_format_tests.rs"] mod tests;` at
// the bottom of `scan_format.rs`, so it compiles only under
// `cargo nextest run` / `cargo test` and reaches the helpers via
// `super::`.
// Why: The four primitives and the composer each have invariants
// documented in their Gotcha blocks. The format-hit shape is a
// security-sensitive contract: the redacted output channel MUST
// NEVER include the matched substring (see README "Redacted
// output"). A silent change to any helper would either
// reintroduce BUG-5-shaped behaviour (wrong line/col reports) or,
// worse, change the format string so that the matched substring
// leaks. These tests pin both shapes.
// TS map: `describe("scan_format helpers", () => { ... })` in
// `scan_format.unit.test.ts`.
use super::build_line_index;
use super::emit_hit;
use super::end_in_line_indexed;
use super::format_hit;
use super::line_and_col_indexed;
// What: `build_line_index` on an empty input must still return
// `[0]`. The comment promises "first entry is always 0";
// every consumer (`line_and_col_indexed`,
// `end_in_line_indexed`) reads `line_starts[0]` without a
// bounds check.
// Why: Guards the partition_point/saturating_sub invariant.
// Remove the leading `starts.push(0)` and every other
// helper panics on first call.
// What: "abc\ndef" -- one `\n` at byte 3, so the index is
// [0, 4]. Length 2, NOT 3, because the file ends without
// a trailing newline.
// Why: Explicit "Gotcha" in `build_line_index`: the vec length
// is `1 + count(\n)`, not the visible line count. Pair
// this with the trailing-newline test to catch any future
// refactor that "fixes" the asymmetry.
// What: "abc\ndef\n" -- two `\n`, so the index is [0, 4, 8] and
// the final entry equals `content.len()`. Downstream
// lookups must tolerate that (see `end_in_line_indexed`'s
// `next_line_start > 0` guard).
// Why: The "trailing-newline produces len-equal entry" branch
// is what motivates the > 0 guard; locking this in stops
// a future refactor from removing the guard on the
// assumption that the last entry never equals len.
// What: `build_line_index` counts only `\n`. "abc\r\ndef" has
// one `\n` at byte 4, so the index is [0, 5]; the `\r`
// sits inside line 1 and column counting includes it.
// Why: Lock in current CRLF semantics. The scanner reports raw
// byte columns -- it does not normalise CRLF -- so a CRLF
// file's line-2-col-1 is the byte after `\n`. If we ever
// switch to CRLF-aware indexing (count `\r\n` as one
// separator), this test breaks and forces a deliberate
// update.
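// Sketch: the four `build_line_index` cases above, written against a
// stand-in helper. The real function lives in `scan_format.rs` and is
// not shown here, so the body, the `&str -> Vec<usize>` signature and
// the module name `line_index_sketch` are assumptions reconstructed
// from the comments, not the actual implementation.

```rust
mod line_index_sketch {
    /// Hypothetical stand-in: byte offset at which each line starts.
    /// The first entry is always 0; one more entry per `\n` byte.
    pub fn build_line_index(content: &str) -> Vec<usize> {
        let mut starts = vec![0];
        for (i, b) in content.bytes().enumerate() {
            if b == b'\n' {
                starts.push(i + 1);
            }
        }
        starts
    }

    #[test]
    fn empty_input_yields_single_zero() {
        assert_eq!(build_line_index(""), vec![0]);
    }

    #[test]
    fn no_trailing_newline_gives_one_plus_newline_count() {
        assert_eq!(build_line_index("abc\ndef"), vec![0, 4]);
    }

    #[test]
    fn trailing_newline_adds_entry_equal_to_len() {
        let content = "abc\ndef\n";
        let index = build_line_index(content);
        assert_eq!(index, vec![0, 4, 8]);
        assert_eq!(*index.last().unwrap(), content.len());
    }

    #[test]
    fn crlf_counts_only_the_line_feed() {
        // `\r` is byte 3 and stays inside line 1; `\n` is byte 4.
        assert_eq!(build_line_index("abc\r\ndef"), vec![0, 5]);
    }
}
```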
// What: Pins the saturating_sub invariant. `build_line_index`
// always pushes 0 first, so for any valid offset the
// predicate `|&s| s <= offset` is true at index 0,
// partition_point >= 1, and saturating_sub is just
// `result - 1`. The saturation branch is effectively
// unreachable.
// Why: If a future refactor removes the leading `starts.push(0)`,
// partition_point at offset 0 returns 0, saturating_sub
// pins it at 0, then `line_starts[0]` panics on the empty
// slice. This test fails first by returning a wrong shape
// before the panic path; pairs with the empty-input test
// above.
// What: Offset 3 is the `\n` byte itself in "abc\ndef". By the
// "line owns the byte at offset" rule (partition_point on
// `s <= offset` with line_starts=[0,4] at offset 3 yields
// 1, saturating_sub gives line_idx 0), the newline byte
// belongs to line 1.
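// Why: Pins which line owns the newline byte: it stays on line 1
// rather than opening line 2. (A predicate of `s < offset`
// gives the same answer at offset 3; that regression would
// surface at offset 4, the first byte of line 2.)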
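// Sketch: the two `line_and_col_indexed` cases above, plus the first
// byte of line 2 for contrast. The helper body, its argument order and
// the module name `line_col_sketch` are assumptions; only the
// partition_point/saturating_sub shape and the 1-based (line, col)
// result come from the comments in this file.

```rust
mod line_col_sketch {
    /// Hypothetical stand-in: 1-based (line, column) of a byte offset.
    /// `line_starts` must begin with 0, as `build_line_index` promises.
    pub fn line_and_col_indexed(line_starts: &[usize], offset: usize) -> (usize, usize) {
        let idx = line_starts
            .partition_point(|&s| s <= offset)
            .saturating_sub(1);
        (idx + 1, offset - line_starts[idx] + 1)
    }

    #[test]
    fn offset_zero_is_line_one_col_one() {
        assert_eq!(line_and_col_indexed(&[0, 4], 0), (1, 1));
    }

    #[test]
    fn newline_byte_belongs_to_the_line_it_ends() {
        // Offset 3 is the `\n` in "abc\ndef".
        assert_eq!(line_and_col_indexed(&[0, 4], 3), (1, 4));
    }

    #[test]
    fn byte_after_newline_opens_line_two() {
        assert_eq!(line_and_col_indexed(&[0, 4], 4), (2, 1));
    }
}
```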
// What: Match span [0, 6) in "abc\ndef" would cross the `\n` at
// byte 3. `end_in_line_indexed` clamps to the newline
// position itself, returning 3 (one-past-end of the
// reported portion on line 1).
// Why: Multi-line matches are reported as their first-line
// portion; this is the core invariant that keeps
// `path:line:col_start..col_end` referring to a single
// source line.
// What: Empty span (start == end). The function must return
// `end` cleanly. `emit_hit`'s Gotcha says the three regex
// emission sites in scan.rs guard `start == end` BEFORE
// calling emit_hit, but the helper itself should neither
// panic nor return a wrapped (underflowed) value if a guard
// is ever missed.
// Why: Defence in depth around the redacted-output contract;
// an empty span that produced a wrong col_end could
// silently widen reported columns.
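// Sketch: the clamp and empty-span cases for `end_in_line_indexed`.
// The signature used here (line index, content length, span start,
// span end) and the module name `end_in_line_sketch` are assumptions;
// the real helper may take the content itself. Only the behaviour
// (clamp to the `\n` position, `next_line_start > 0` guard) is taken
// from the comments above.

```rust
mod end_in_line_sketch {
    /// Hypothetical stand-in: clamp `end` so that the span stays on the
    /// line that contains `start`.
    pub fn end_in_line_indexed(
        line_starts: &[usize],
        content_len: usize,
        start: usize,
        end: usize,
    ) -> usize {
        let idx = line_starts
            .partition_point(|&s| s <= start)
            .saturating_sub(1);
        let line_end = match line_starts.get(idx + 1) {
            // The byte just before the next line start is the `\n`.
            Some(&next_line_start) if next_line_start > 0 => next_line_start - 1,
            // Last line: nothing to clamp against except end of content.
            _ => content_len,
        };
        end.min(line_end)
    }

    #[test]
    fn multi_line_span_is_clamped_to_the_newline() {
        // "abc\ndef": span [0, 6) crosses the `\n` at byte 3.
        assert_eq!(end_in_line_indexed(&[0, 4], 7, 0, 6), 3);
    }

    #[test]
    fn empty_span_returns_end_unchanged() {
        assert_eq!(end_in_line_indexed(&[0, 4], 7, 2, 2), 2);
    }

    #[test]
    fn span_on_last_line_is_left_alone() {
        assert_eq!(end_in_line_indexed(&[0, 4], 7, 4, 7), 7);
    }
}
```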
// What: The full format contract: `path:line:col_start..col_end
// rule=N`. The matched substring MUST NOT appear; this
// test pins the exact shape every consumer relies on.
// Why: Output format is a security-sensitive contract (see
// README "Redacted output"). A regression that interpolated
// even one byte of matched content into the format string
// would turn the CI log into a leak surface; pinning the
// five-field shape detects any such accidental widening.
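// Sketch: pinning the `path:line:col_start..col_end rule=N` shape.
// The parameter list and the module name `format_hit_sketch` are
// assumptions. The point of the sketch is that a formatter which never
// receives the matched text cannot leak it; whether the real
// `format_hit` has exactly this signature is not shown in this file.

```rust
mod format_hit_sketch {
    /// Hypothetical stand-in: five fields, none of them file content.
    pub fn format_hit(
        path: &str,
        line: usize,
        col_start: usize,
        col_end: usize,
        rule: usize,
    ) -> String {
        format!("{}:{}:{}..{} rule={}", path, line, col_start, col_end, rule)
    }

    #[test]
    fn output_has_exactly_the_five_field_shape() {
        assert_eq!(format_hit("f.txt", 2, 1, 3, 3), "f.txt:2:1..3 rule=3");
    }

    #[test]
    fn matched_text_never_appears() {
        // Illustrative secret; it is never passed to the formatter.
        let matched = "hunter2";
        let out = format_hit("f.txt", 1, 1, matched.len(), 7);
        assert!(!out.contains(matched));
    }
}
```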
// What: `emit_hit` composes line_and_col_indexed (twice) and
// end_in_line_indexed with format_hit. For "abc\ndef\nghi"
// and match [4, 7) (the literal "def" on line 2), the
// expected output is `f.txt:2:1..3 rule=3`.
// Why: Bugs in the composition (e.g. the `if end_in_line > 0`
// branch, the off-by-one on `end_in_line - 1`) survive
// unit tests of the underlying primitives. emit_hit is
// the function scan.rs actually calls; it deserves
// coverage in its own right.
// What: Span [0, 7) in "abc\ndef" would cross the `\n` at byte
// 3. emit_hit must clamp col_end to within line 1 (3
// characters: cols 1..3).
// Why: Locks in the multi-line clamp through the full
// composition path, not just `end_in_line_indexed` in
// isolation.
// What: `emit_hit`'s `if end_in_line > 0 { end_in_line - 1 } else
// { 0 }` branch matters when the match starts at offset
// 0 and is a single byte (end == 1). end_in_line is 1;
// the subtraction yields 0; line_and_col_indexed at
// offset 0 returns (1, 1).
// Why: Off-by-one regressions in the composition would shift
// col_end by 1 here, producing `f.txt:1:1..0` (nonsense)
// or `f.txt:1:1..2` (overshoot).
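// Sketch: the three `emit_hit` cases above, run through the full
// composition. Every function body here is a stand-in reconstructed
// from the comments in this file. The signatures, the choice to return
// a `String` rather than write to a stream, and the module name
// `emit_hit_sketch` are all assumptions.

```rust
mod emit_hit_sketch {
    /// Hypothetical stand-in: line-start offsets, first entry always 0.
    pub fn build_line_index(content: &str) -> Vec<usize> {
        let mut starts = vec![0];
        for (i, b) in content.bytes().enumerate() {
            if b == b'\n' {
                starts.push(i + 1);
            }
        }
        starts
    }

    fn line_and_col_indexed(line_starts: &[usize], offset: usize) -> (usize, usize) {
        let idx = line_starts
            .partition_point(|&s| s <= offset)
            .saturating_sub(1);
        (idx + 1, offset - line_starts[idx] + 1)
    }

    fn end_in_line_indexed(line_starts: &[usize], len: usize, start: usize, end: usize) -> usize {
        let idx = line_starts
            .partition_point(|&s| s <= start)
            .saturating_sub(1);
        let line_end = match line_starts.get(idx + 1) {
            Some(&next) if next > 0 => next - 1,
            _ => len,
        };
        end.min(line_end)
    }

    fn format_hit(path: &str, line: usize, cs: usize, ce: usize, rule: usize) -> String {
        format!("{}:{}:{}..{} rule={}", path, line, cs, ce, rule)
    }

    /// Hypothetical composer: report only the first-line portion of
    /// the span, with an inclusive 1-based end column.
    pub fn emit_hit(
        path: &str,
        content: &str,
        line_starts: &[usize],
        start: usize,
        end: usize,
        rule: usize,
    ) -> String {
        let (line, col_start) = line_and_col_indexed(line_starts, start);
        let end_in_line = end_in_line_indexed(line_starts, content.len(), start, end);
        // Step back from one-past-end to the last reported byte.
        let last = if end_in_line > 0 { end_in_line - 1 } else { 0 };
        let (_, col_end) = line_and_col_indexed(line_starts, last);
        format_hit(path, line, col_start, col_end, rule)
    }

    #[test]
    fn single_line_match_on_line_two() {
        let content = "abc\ndef\nghi";
        let idx = build_line_index(content);
        assert_eq!(emit_hit("f.txt", content, &idx, 4, 7, 3), "f.txt:2:1..3 rule=3");
    }

    #[test]
    fn multi_line_match_is_clamped_to_line_one() {
        let content = "abc\ndef";
        let idx = build_line_index(content);
        assert_eq!(emit_hit("f.txt", content, &idx, 0, 7, 1), "f.txt:1:1..3 rule=1");
    }

    #[test]
    fn single_byte_match_at_offset_zero() {
        let content = "abc\ndef";
        let idx = build_line_index(content);
        assert_eq!(emit_hit("f.txt", content, &idx, 0, 1, 1), "f.txt:1:1..1 rule=1");
    }
}
```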