forbidden-strings 0.1.6

Out-of-band scanner for forbidden literal strings and regex patterns. Gitignore-aware, fast, dependency-light: built for CI deny-listing of leaked credentials and banned tokens.
// What:     `use rayon::prelude::*;` brings rayon's parallel-iterator
//           traits into scope. We use it for the per-rule parallel
//           `find_all` pass on the violation slow path.
// Why:      Each regex rule has its own `Mutex<RegexInner>`, so
//           parallelizing across rules (different mutexes) is genuine
//           multi-core work, not contention.
// TS map:   No equivalent.
//
// In TS you'd write (pseudocode):
// ```ts
// // No equivalent.
// ```
use rayon::prelude::*;

// What:     `use std::collections::BTreeSet;` imports an ordered set
//           backed by a B-tree. Insertions and lookups are O(log n).
//           `BTreeSet<usize>` here holds rule positions that need full
//           `find_all` after the AC pass.
// Why:      `BTreeSet` deduplicates rule positions encountered via
//           multiple AC hits AND iterates in sorted order, giving
//           deterministic per-file output ordering.
// TS map:   `new Set<number>()` -- TS sets keep insertion order, so
//           matching BTreeSet's sorted iteration in TS means sorting
//           the values manually.
//
// In TS you'd write (pseudocode):
// ```ts
// const seenRulePositions = new Set<number>();
// ```
use std::collections::BTreeSet;

// What:     `use std::sync::OnceLock;` brings the thread-safe "set
//           exactly once" cell into scope. `OnceLock<T: Send + Sync>`
//           is itself `Sync`, so a single instance can be shared by
//           reference across rayon worker threads. Concurrent callers
//           of `get_or_init` race only on the first init: exactly one
//           closure runs, the losers wait and their closures are
//           dropped unrun, and every caller observes the same `&T`
//           afterward.
// Why:      The line-start index is built only when the first hit
//           fires (most files have zero hits and pay nothing). Once
//           built, it must be visible to AC literal-hit emission,
//           prefix-matched par_iter, and residual-shard par_iter --
//           all on the same file. OnceLock holds the index for the
//           whole `scan_content` call without making any caller pay
//           if no hit ever fires.
// TS map:   No direct equivalent. Closest pattern in TS is a lazy
//           getter that memoises into a closure variable -- TS has
//           no shared-mutable-state-with-races primitive because
//           there are no real threads.
//
// In TS you'd write (pseudocode):
// ```ts
// // Lazy memo; in single-threaded TS no synchronisation needed.
// let lineIndex: number[] | null = null;
// function getLineIndex(): number[] {
//   if (!lineIndex) lineIndex = buildLineIndex(content);
//   return lineIndex;
// }
// ```
use std::sync::OnceLock;

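// What:     `use crate::rules::{is_word_byte, AcMeta, ResidualShard, RuleSet};`
//           imports the top-level rules container, the per-AC-pattern
//           metadata tag, the residual-shard enum, and the
//           word-character classifier from the sibling `rules.rs`
//           module. `{...}` is a list import.
// Why:      `scan_content` dispatches on `AcMeta` to decide whether an
//           AC hit emits a literal-rule violation directly or queues a
//           regex rule for full evaluation; `is_word_byte` is the
//           file-side half of the conditional word-boundary check
//           (literal-side half is precomputed into `AcMeta::Literal`);
//           `ResidualShard` is matched on in the residual-bucket loop
//           at the end of `scan_content`.
// TS map:   `import { isWordByte, AcMeta, ResidualShard, type RuleSet } from "./rules";`.
//
// In TS you'd write (pseudocode):
// ```ts
// import { isWordByte, AcMeta, ResidualShard, type RuleSet } from "./rules";
// ```
use crate::rules::{is_word_byte, AcMeta, ResidualShard, RuleSet};

// What:     `use crate::scan_format::{build_line_index, emit_hit};`
//           imports the two output helpers from the `scan_format`
//           module.
// Why:      `build_line_index` produces the line-start index that is
//           built lazily on the first hit; `emit_hit` turns that index,
//           the path, the match span, and the rule index into one
//           redacted hit string.
// TS map:   `import { buildLineIndex, emitHit } from "./scan_format";`.
use crate::scan_format::{build_line_index, emit_hit};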

// What:     `pub fn scan_content(path: &str, content: &[u8], rs: &RuleSet) -> Vec<String>`
//           scans one file's contents against the full ruleset and
//           returns an owned `Vec` of redacted hit strings. Empty Vec
//           means clean.
// Why:      Pure function (no side effects, no I/O), one file in -> one
//           Vec out. Pure shape lets callers compose it under any
//           parallel iterator without sharing mutable state.
// TS map:   `function scanContent(path: string, content: Uint8Array, rs: RuleSet): string[]`.
//
// In TS you'd write (pseudocode):
// ```ts
// function scanContent(path: string, content: Uint8Array, rs: RuleSet): string[] {
//   const hits: string[] = [];
//   // (no binary-skip; every file is scanned)
//   ...
//   return hits;
// }
// ```
pub fn scan_content(path: &str, content: &[u8], rs: &RuleSet) -> Vec<String> {
    // What:     The previous binary-skip heuristic (`is_likely_binary`
    //           short-circuit on a NUL byte in the first 8 KiB) has been
    //           removed. Aho-Corasick scans raw bytes content-agnostic,
    //           and the redacted output format means "binary blob leaks
    //           secret" is a useful signal -- exactly the shape a CI
    //           deny-list scanner should catch (lockfile sidecars,
    //           bundled artifacts, accidentally-committed images).
    // Why:      Closes BUG 5. Pre-fix, a file whose content was
    //           `SECRET_NEEDLE\0...` exited the scanner with zero hits
    //           even though the literal appeared BEFORE the NUL byte;
    //           the heuristic produced silent false negatives.
    // TS map:   the removal is in-place; no remaining check.
    //
    // In TS you'd write (pseudocode):
    // ```ts
    // // (no binary-skip; scan all files unconditionally)
    // ```
    let mut hits: Vec<String> = Vec::new();

    // What:     `let mut prefix_matched: BTreeSet<usize> = BTreeSet::new();`
    //           accumulates indices into `rs.regex_rules` whose
    //           required-literal prefix was hit by the unified AC pass.
    //           BTreeSet dedupes (a prefix may appear many times in one
    //           file) and iterates in sorted order.
    // Why:      In the 99%-clean case this set stays empty and no
    //           resharp `find_all` runs. When the AC pass DOES fire a
    //           prefix hit, we run `find_all` exactly once per matching
    //           rule -- not once per AC hit position.
    // TS map:   `const prefixMatched = new Set<number>();`.
    //
    // In TS you'd write (pseudocode):
    // ```ts
    // const prefixMatched = new Set<number>();
    // ```
    let mut prefix_matched: BTreeSet<usize> = BTreeSet::new();

    // What:     `let line_index: OnceLock<Vec<usize>> = OnceLock::new();`
    //           creates an empty thread-safe one-shot cell. Calling
    //           `line_index.get_or_init(closure)` initialises the cell
    //           on first call (running `closure`), and on every later
    //           call returns the stored `&Vec<usize>` without rerunning.
    //           Concurrent racing get_or_init's on multiple rayon
    //           workers all observe the same `&Vec<usize>` afterward.
    // Why:      Build-line-index is O(file size); pay it at most once
    //           per file, only on the first hit, and share across the
    //           AC literal-emit path, the prefix-matched par_iter, and
    //           every residual-shard par_iter.
    // TS map:   No 1:1; closest is a memoised lazy getter.
    //
    // In TS you'd write (pseudocode):
    // ```ts
    // let lineIndex: number[] | null = null;
    // const getLineIndex = () => lineIndex ??= buildLineIndex(content);
    // ```
    let line_index: OnceLock<Vec<usize>> = OnceLock::new();

    // Unified AC: scans for literal rules AND required-literal prefixes
    // of regex rules in a single linear pass. AC's Standard match kind
    // exposes `find_overlapping_iter` so a longer literal at the same
    // position as a shorter regex-prefix doesn't suppress the prefix
    // hit -- without overlapping, a regex rule whose prefix coincides
    // with a literal rule's full text would never trigger.
    if let Some(ac) = &rs.ac {
        // What:     `for m in ac.find_overlapping_iter(content) { ... }`
        //           iterates EVERY (pattern, position) pair in the
        //           content where a pattern matches, regardless of
        //           overlap. `m.pattern().as_usize()` is the AC-internal
        //           id assigned at build time, used here to index
        //           `rs.ac_meta`.
        // Why:      We need both literal-rule emissions AND regex-prefix
        //           queueing to fire from the same scan pass.
        // TS map:   `for (const m of ac.findOverlappingIter(content)) { ... }`.
        //
        // In TS you'd write (pseudocode):
        // ```ts
        // for (const m of ac.findOverlappingIter(content)) {
        //   const meta = rs.acMeta[m.pattern];
        //   if (meta.kind === "literal") {
        //     hits.push(formatHit(path, ..., meta.idx));
        //   } else {
        //     prefixMatched.add(meta.rulePos);
        //   }
        // }
        // ```
        for m in ac.find_overlapping_iter(content) {
            let pid = m.pattern().as_usize();
            match &rs.ac_meta[pid] {
                AcMeta::Literal { idx, bound_left, bound_right } => {
                    // What:     Conditional word-boundary check (mirrors
                    //           `grep -w`). Each side is enforced ONLY
                    //           when the literal's edge byte is itself a
                    //           word character. The check passes when
                    //           the file context on that side is either
                    //           absent (start/end of file) or non-word.
                    //           So a short alpha-only acronym rejects a
                    //           hit when either neighbouring byte is a
                    //           word character, but a path-shaped
                    //           literal like `/etc/passwd` still matches
                    //           inside `cat /etc/passwd` because the
                    //           literal's left edge is `/` (non-word)
                    //           so no left-side boundary is enforced;
                    //           the trailing space/EOF satisfies the
                    //           right-side boundary against the `d`
                    //           edge byte.
                    //           Boundaries are pre-computed at load
                    //           time and ALSO disabled entirely when
                    //           the literal is at least
                    //           `SUBSTRING_THRESHOLD` bytes long --
                    //           long literals are distinctive enough
                    //           that coincidental substring match is
                    //           negligible (math in `rules.rs`).
                    // Why:      The original "any AC hit fires" semantics
                    //           false-positived on coincidental
                    //           substrings inside base64 blobs and
                    //           similar high-entropy noise.
                    // TS map:   `if (boundLeft && start > 0 && isWordByte(content[start - 1])) continue;` etc.
                    //
                    // In TS you'd write (pseudocode):
                    // ```ts
                    // if (boundLeft && m.start > 0 && isWordByte(content[m.start - 1])) continue;
                    // if (boundRight && m.end < content.length && isWordByte(content[m.end])) continue;
                    // ```
                    if *bound_left
                        && m.start() > 0
                        && is_word_byte(content[m.start() - 1])
                    {
                        continue;
                    }
                    if *bound_right
                        && m.end() < content.len()
                        && is_word_byte(content[m.end()])
                    {
                        continue;
                    }
                    let li = line_index.get_or_init(|| build_line_index(content));
                    hits.push(emit_hit(li, path, m.start(), m.end(), *idx));
                }
                AcMeta::RegexPrefix { rule_pos } => {
                    prefix_matched.insert(*rule_pos);
                }
            }
        }
    }

    // What:     `if let Some(ac_ci) = &rs.ac_ci { ... }` runs the
    //           companion case-insensitive AC pass (a second sequential
    //           scan alongside the case-sensitive one above). Each hit
    //           is a regex-rule prefix (literal rules never live here),
    //           so we only queue rule positions; no direct literal
    //           emission.
    // Why:      Case-insensitive prefixes from `(?i)`-flagged regex
    //           rules ride this AC. Without it, those rules would fall
    //           into the residual sharded gate and serialize through
    //           a shared `Mutex<RegexInner>` per shard on every file.
    //           See PERF.md: 145 leading-(?i) betterleaks rules
    //           dominate the residual cost on this corpus.
    // TS map:   `for (const m of rs.acCi.findOverlappingIter(content)) { ... }`.
    //
    // In TS you'd write (pseudocode):
    // ```ts
    // if (rs.acCi) {
    //   for (const m of rs.acCi.findOverlappingIter(content)) {
    //     const meta = rs.acMetaCi[m.pattern];
    //     prefixMatched.add(meta.rulePos);
    //   }
    // }
    // ```
    if let Some(ac_ci) = &rs.ac_ci {
        for m in ac_ci.find_overlapping_iter(content) {
            let pid = m.pattern().as_usize();
            match &rs.ac_meta_ci[pid] {
                AcMeta::Literal { .. } => {
                    // unreachable: literal rules never enter the ci AC.
                    // Conservative no-op rather than panic.
                }
                AcMeta::RegexPrefix { rule_pos } => {
                    prefix_matched.insert(*rule_pos);
                }
            }
        }
    }

    // For each regex rule whose prefix fired, run its full `find_all`.
    // `prefix_matched` is small: empty for about 99% of files, and on a
    // matching file just the few rules whose literal prefix appeared.
    if !prefix_matched.is_empty() {
        // What:     `prefix_matched.iter().copied().collect::<Vec<usize>>()`
        //           materializes the BTreeSet into a Vec so we can
        //           parallelize over it with rayon. `copied()` turns
        //           the iterator of `&usize` into one of `usize`.
        // Why:      `BTreeSet::par_iter` exists, but a `Vec` is an
        //           indexed source that rayon can split evenly.
        //           `par_iter()` on the Vec still yields `&usize`; the
        //           `|&pos|` closure pattern below copies each item out
        //           to an owned `usize`.
        // TS map:   `[...prefixMatched]` -- arrays fan out via
        //           Promise.all (concurrent, not multi-threaded).
        //
        // In TS you'd write (pseudocode):
        // ```ts
        // const positions = [...prefixMatched];
        // const regexHits = (await Promise.all(
        //   positions.map((pos) => scanOneRule(rs.regexRules[pos], content, path))
        // )).flat();
        // ```
        let positions: Vec<usize> = prefix_matched.iter().copied().collect();
        let regex_hits: Vec<String> = positions
            .par_iter()
            .flat_map_iter(|&pos| {
                let rr = &rs.regex_rules[pos];
                let mut local: Vec<String> = Vec::new();
                // What:     `match rr.re.find_all(content) { Ok(...) =>
                //           ..., Err(()) => synthetic-hit }`. BUG 7 fix:
                //           the engine layer returns `Result<_, ()>` so
                //           callers can detect refusal. On `Err` push a
                //           per-rule synthetic hit (`rule=N engine
                //           error`) into the local hits so the file
                //           cannot exit clean when the engine refused
                //           to evaluate.
                // Why:      Pre-fix `if let Ok(...)` silently dropped the
                //           `Err` arm, so a rule that hit a resharp
                //           runtime limit reported zero hits -- fail-
                //           open against a secret-scanning tool.
                // TS map:   `const r = rr.re.findAll(content); if (r.ok)
                //           { ... } else { local.push(...) }`.
                match rr.re.find_all(content) {
                    Ok(matches) => {
                        let li = line_index.get_or_init(|| build_line_index(content));
                        for m in matches {
                            if m.start == m.end {
                                continue;
                            }
                            local.push(emit_hit(li, path, m.start, m.end, rr.idx));
                        }
                    }
                    Err(()) => {
                        local.push(format!(
                            "{}: rule={} engine error",
                            path, rr.idx
                        ));
                    }
                }
                local
            })
            .collect();
        hits.extend(regex_hits);
    }

    // Residual bucket: regex rules whose gating substrings could NOT be
    // extracted. Sharded so each shard's combined-alternation Regex
    // stays under resharp's parse/algebra cliff (see
    // `rules.rs::build_residual_shards`). The shard variants:
    //
    // - `Single { rule_pos }`: the rule's own Regex IS the gate -- skip
    //   the redundant gate.is_match and call find_all directly on the
    //   rule's compiled Regex from `regex_rules`. find_all on a clean
    //   file is similar cost to is_match; on a matching file it's the
    //   work we'd do anyway. Net: ~half the per-file scan cost vs the
    //   original "gate.is_match then rule.find_all" pair.
    //
    // - `Combined { gate, positions }`: keep the gate.is_match short-
    //   circuit so a multi-rule shard fans out to find_all only when
    //   the gate fires, saving N-1 is_match probes.
    for shard in &rs.residual_shards {
        match shard {
            ResidualShard::Single { rule_pos } => {
                let rr = &rs.regex_rules[*rule_pos];
                // What:     Same Result-pattern as the prefix-matched
                //           loop above. On `Err` emit a synthetic hit so
                //           the file cannot exit clean when the engine
                //           refused to evaluate this rule.
                // Why:      BUG 7: a Single shard whose rule errored out
                //           silently produced zero hits under the
                //           pre-fix `if let Ok(...)` arm.
                match rr.re.find_all(content) {
                    Ok(matches) => {
                        if !matches.is_empty() {
                            let li = line_index.get_or_init(|| build_line_index(content));
                            for m in matches {
                                if m.start == m.end {
                                    continue;
                                }
                                hits.push(emit_hit(li, path, m.start, m.end, rr.idx));
                            }
                        }
                    }
                    Err(()) => {
                        hits.push(format!(
                            "{}: rule={} engine error",
                            path, rr.idx
                        ));
                    }
                }
            }
            ResidualShard::Combined { gate, positions } => {
                // What:     The Combined-shard gate's `is_match` now
                //           returns `Result<bool, ()>`. `Ok(true)` fans
                //           out to per-member `find_all`; `Ok(false)`
                //           short-circuits the shard (no member can
                //           match if the union does not). `Err(())` is
                //           the new BUG 7 fallback: if the gate refused
                //           to evaluate, we cannot trust the short-
                //           circuit, so run every member's `find_all`
                //           individually. The synthetic-hit path inside
                //           the per-member loop already handles any
                //           per-member errors.
                // Why:      Without the fallback, an errored gate would
                //           silently skip the entire shard -- exactly
                //           the fail-open shape the bug describes.
                let gate_result = gate.is_match(content);
                let should_evaluate = matches!(gate_result, Ok(true) | Err(()));
                if should_evaluate {
                    let regex_hits: Vec<String> = positions
                        .par_iter()
                        .flat_map_iter(|&pos| {
                            let rr = &rs.regex_rules[pos];
                            let mut local: Vec<String> = Vec::new();
                            match rr.re.find_all(content) {
                                Ok(matches) => {
                                    let li = line_index.get_or_init(|| build_line_index(content));
                                    for m in matches {
                                        if m.start == m.end {
                                            continue;
                                        }
                                        local.push(emit_hit(li, path, m.start, m.end, rr.idx));
                                    }
                                }
                                Err(()) => {
                                    local.push(format!(
                                        "{}: rule={} engine error",
                                        path, rr.idx
                                    ));
                                }
                            }
                            local
                        })
                        .collect();
                    hits.extend(regex_hits);
                }
            }
        }
    }

    hits
}
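
The conditional word-boundary check inside the `AcMeta::Literal` arm above can be tried in isolation. The following is a minimal standalone sketch, not the crate's code: `Bounds`, `bounds_for`, `boundary_ok`, and `accepted_hits` are hypothetical names, the local `is_word_byte` assumes "word byte" means ASCII letters, digits, and underscore (the real classifier lives in `rules.rs` and may differ), a naive substring search stands in for the Aho-Corasick pass, and the `SUBSTRING_THRESHOLD` override is left out.

```rust
/// ASSUMPTION: word bytes are ASCII letters, digits, and `_`.
/// The crate's real `is_word_byte` may use a different definition.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical stand-in for the flags precomputed into `AcMeta::Literal`.
struct Bounds {
    left: bool,
    right: bool,
}

/// A side needs a boundary only when the literal's edge byte on that side
/// is itself a word byte. (The length-based override is omitted here.)
fn bounds_for(literal: &[u8]) -> Bounds {
    Bounds {
        left: literal.first().map_or(false, |&b| is_word_byte(b)),
        right: literal.last().map_or(false, |&b| is_word_byte(b)),
    }
}

/// True when the match at `content[start..end]` survives the check:
/// each enforced side must touch start/end of file or a non-word byte.
fn boundary_ok(content: &[u8], start: usize, end: usize, b: &Bounds) -> bool {
    let left_blocked = b.left && start > 0 && is_word_byte(content[start - 1]);
    let right_blocked = b.right && end < content.len() && is_word_byte(content[end]);
    !left_blocked && !right_blocked
}

/// Start offsets of every occurrence of `literal` that passes the check.
/// Naive O(n*m) search, used only to keep the sketch dependency-free.
fn accepted_hits(content: &[u8], literal: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    if literal.is_empty() || literal.len() > content.len() {
        return out;
    }
    let b = bounds_for(literal);
    for start in 0..=content.len() - literal.len() {
        let end = start + literal.len();
        if &content[start..end] == literal && boundary_ok(content, start, end, &b) {
            out.push(start);
        }
    }
    out
}

fn main() {
    // Left edge `/` is non-word, so no left boundary is enforced.
    println!("{:?}", accepted_hits(b"cat /etc/passwd", b"/etc/passwd"));
    // The first occurrence is embedded in word bytes and is rejected.
    println!("{:?}", accepted_hits(b"xACMEy ACME", b"ACME"));
}
```

Under these assumptions the path-shaped literal is accepted at offset 4, while the acronym is accepted only at offset 7, where it is delimited by a space on the left and end of input on the right.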