forbidden-strings 0.1.9

# forbidden-strings

Linear-time deny-list scanner for Git repos. ~9 ms cold start, ~1 GiB/s wall throughput,
~20x faster per byte than betterleaks v1.1.2 on the same content. Sub-10 ms startup fits
inside a pre-commit budget; ~57 ms full `--all` on this repo (3,471 files, 57 MiB) fits
inside a pre-push budget.

Rules split into a committed baseline (`forbidden-strings.local.example.txt`) and a
per-repo appendix (`forbidden-strings.append.local.txt`, gitignored) or CI-only secret
(`FORBIDDEN_STRINGS_LIST`). The runtime rules file is concatenated from those sources by
file-enforcer. Matched substrings, the surrounding line, and the rule pattern are never
printed in failure output, so a rule body that would itself leak if committed (a customer
name, an unreleased project codename, a pre-disclosure partner ID) can live as an appendix
or CI secret without exposure on public CI logs.

## What's different

- **Sub-10 ms startup, ~1 GiB/s wall.** Single dated block (2026-05-16
  post-emit-hit-consolidation, hyperfine 1.20.0, AMD Ryzen 7 8700F, 16 threads): 9.4 ms cold
  start on this repo, 9.8 ms on the Linux kernel corpus, 56.6 ms full `--all` on this repo,
  1.989 s full `--all` on the kernel. Native Rust binary with `lto = true`,
  `codegen-units = 1`, `opt-level = 3`, `panic = "unwind"`, `overflow-checks = true`,
  `strip = true`; no Node startup, no WASM init, no per-invocation TOML parse. On clean
  files the dual Aho-Corasick gate short-circuits before the regex engine runs. Betterleaks
  starts in ~174 ms.
- **Linear-time matching.** Resharp is derivative-based with no backtracking; Aho-Corasick
  gates the regex engine via extracted literal prefixes. A pathological rule combination
  cannot exhibit catastrophic-backtracking behaviour.
- **Resharp set-algebra rules.** `A&B` (intersection) and `~(A)` (complement) are
  first-class. Express "match X but not Y" without lookaround. Engines without set
  operators (those behind gitleaks, trufflehog, secretlint, and plain RE2) cannot express
  this in the pattern itself; the workaround in those tools is per-rule allowlists, which
  scale badly.
- **Sensitive rules can live out-of-band.** The committed baseline holds non-sensitive
  rules; the gitignored appendix and the CI-only `FORBIDDEN_STRINGS_LIST` secret hold
  sensitive rules. Failure output never prints the matched substring, the surrounding line,
  or the rule pattern, so a rule body itself can be a secret.

## When to pick something else

`forbidden-strings` deliberately omits features other scanners ship as core capabilities:

- **CEL-based post-match filtering** (entropy thresholds, BPE token efficiency, git-author
  predicates, file-path globs, string allowlists). Helps cut false positives when the rule
  corpus is broad. No equivalent here.
- **Async HTTP validation**. No way to call a provider API to confirm a detected secret is
  live. The scanner reports literal matches; staleness review is on you.
- **Git history scanning**. The walker enumerates working-tree files only. No equivalent of
  `gitleaks git` or `betterleaks git` that scans every diff in every commit.
- **SARIF / JSON / CSV output**. Hits go to stderr as plain text. No machine-readable
  format for GitHub code-scanning upload or CI dashboards.
- **Per-rule path scoping**. Every rule runs against every (non-skipped) file. The scanner
  cannot apply rule X only to YAML files.
- **Per-rule allowlists**. No way to say "rule X but skip when it matches in path Y."
- **Streaming or stdin input**. Files only. The walker enumerates from disk; there is
  no `--stdin` mode.

If you need any of those, betterleaks or gitleaks is the right tool. Otherwise
`forbidden-strings` is faster and more expressive (set-algebra, out-of-band rules,
redacted output, native binary startup).

## Prerequisites

- **Rust toolchain**. Install via mise: `mise install rust`.
- **mise** itself, since build commands are `mise run` tasks.
- **For local git hooks**: `hk` (the hook runner) and `pkl` (its config language). Both
  are available via mise / aqua: `mise install 'aqua:jdx/hk' 'aqua:apple/pkl'`.

## Build

```sh
mise run //packages/cli/forbidden-strings:build
```

The release binary lands at `packages/cli/forbidden-strings/target/release/forbidden-strings`.
`hk.pkl` invokes that path directly; nothing needs to be on `$PATH`.

## Setup

The scanner needs exactly one rules file at scan time. How you produce it is up to you.

### Without file-enforcer (most consumers)

Put one rule per line in a file named `forbidden-strings.local.txt` at the repo root, or
pass `--rules <PATH>` / set `FORBIDDEN_STRINGS_RULES=<PATH>` to point at any other path.
That is the whole setup. Add the file to `.gitignore` if the rules themselves are
sensitive; otherwise commit it. The "Rule file format" section below describes the line
syntax. In CI, materialise the file from a secret (see "GitHub Actions" below) so the
rule bodies never enter version control.

### With file-enforcer (this monorepo's workflow)

Inside the Monochromatic monorepo, the runtime file is composed from two source files by
the `file-enforcer` task so the committed baseline and the gitignored sensitive appendix
stay separated on disk:

- `forbidden-strings.local.example.txt` — committed baseline (betterleaks port plus any
  non-sensitive rules). Regenerated by
  `packages/cli/forbidden-strings/src/mise.port-betterleaks.ts`; edit the generator, not
  the output.
- `forbidden-strings.append.local.txt` — per-repo additions. Gitignored, free-form, edited
  by hand. Place sensitive literals (codenames, customer names, partner IDs) here.
- `forbidden-strings.local.txt` — runtime file consumed by the scanner. Generated by
  file-enforcer concatenating the previous two. Do not edit directly.

Run `mise run file-enforcer` after editing either source to regenerate the runtime file.
The generator lives at `file-enforcer.config.ts:56-83`. If you fork this scanner into a
project that doesn't use file-enforcer, drop the example/append split and follow the
single-file workflow above.

## Usage

```sh
# scan a specific file list (uses ./forbidden-strings.local.txt by default)
forbidden-strings path/to/file other/file

# scan every working-tree file (.gitignore respected; .git/.jj skipped)
forbidden-strings --all
```

The rules path is resolved in this order: `--rules <PATH>` flag (highest), then
`FORBIDDEN_STRINGS_RULES` env var, then `./forbidden-strings.local.txt` in the current
working directory.

```sh
# explicit path
forbidden-strings --rules ./other-rules.txt --all

# via env var (CI-friendly: materialize from a secret, then run)
FORBIDDEN_STRINGS_RULES=./materialized.txt forbidden-strings --all

# print version and exit
forbidden-strings --version    # or -V
```
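The resolution order above can be sketched in a few lines of Rust. This is a minimal illustration, not the scanner's actual code: `resolve_rules_path` is a hypothetical helper, and treating an empty env value as "set" is an assumption.

```rust
use std::path::PathBuf;

/// Default rules file, relative to the current working directory.
const DEFAULT_RULES: &str = "./forbidden-strings.local.txt";

/// Hypothetical helper: picks the rules path from the `--rules` flag value,
/// then the `FORBIDDEN_STRINGS_RULES` env value, then the default.
fn resolve_rules_path(flag: Option<&str>, env: Option<&str>) -> PathBuf {
    flag.or(env)
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(DEFAULT_RULES))
}

fn main() {
    let env = std::env::var("FORBIDDEN_STRINGS_RULES").ok();
    // No flag parsing in this sketch; pretend `--rules` was not passed.
    let path = resolve_rules_path(None, env.as_deref());
    println!("{}", path.display());
}
```

The flag wins whenever it is present, so a stale `FORBIDDEN_STRINGS_RULES` in the environment cannot override an explicit `--rules`.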

`--all` and positional files are mutually exclusive in practice: if both are passed, the
walker output silently overwrites the positional list. Use one or the other.

## Rule file format

One rule per line. Two shapes:

- A bare line is a case-sensitive literal. Match semantics depend on length:
  - **Length below 7 bytes**: conditional word-boundary check (`grep -w` semantics).
    A boundary is required at any end whose edge byte is a word character (`[A-Za-z0-9_]`);
    the file context on that side must be either start/end of file or a non-word byte.
    A short alpha-only acronym matches a standalone occurrence in normal prose but
    does **not** match coincidentally as a substring of a longer identifier or inside
    random base64 noise. A short path-shaped literal like `/tmp` still matches inside
    `ls /tmp/cache` because the leading `/` is non-word, so no left-side boundary
    is enforced, and the byte after the trailing `p` is a non-word `/`.
  - **Length 7 bytes or more**: pure case-sensitive substring match, no boundary check.
    A long literal matches anywhere it appears, including glued mid-identifier.
    Distinctiveness from sheer length makes coincidental substring match negligible.
    If a phrase exists in two written forms (with and without internal whitespace),
    add both as separate rules so each matches its respective form.
- A line of the shape `/PATTERN/FLAGS` is a regex. The first `/` and last `/` delimit the
  pattern; `FLAGS` is zero or more lowercase letters and is rewritten to a resharp
  inline-flag prefix (e.g. `/foo/i` becomes `(?i)foo`). Use this form to opt into
  substring-anywhere semantics for short literals (write the literal between the slashes),
  or to ban literals matching `^/.+/[a-z]*$` (escape the slashes, e.g. ban the literal
  `/etc/passwd` as `/\/etc\/passwd/`).

Empty lines are ignored. Lines starting with `#` are comments.

The 7-byte threshold has a coincidence-rate justification; see Architecture below
for the derivation and `SUBSTRING_THRESHOLD` in `src/rules/types.rs` for the constant.

**One known regression** under these semantics: a short literal rule will not match a
plural or suffixed form (a 3-letter acronym does not match the same acronym followed
immediately by `s`, because the trailing `s` is a word char and the boundary fails). If
plural matching is needed, express the rule as a regex with an optional trailing class:
`/ACRONYMs?/`.
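The length-dependent literal semantics can be modelled as below. This is a sketch under stated assumptions: `literal_matches` and `is_word` are hypothetical names, the search is a naive window scan rather than the Aho-Corasick automaton the scanner uses, and only the boundary logic is meant to mirror the description above.

```rust
/// Threshold named in this README; the real constant is `SUBSTRING_THRESHOLD`.
const SUBSTRING_THRESHOLD: usize = 7;

/// Word characters are `[A-Za-z0-9_]`.
fn is_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Illustrative only: true if `needle` occurs in `haystack` under the
/// length-dependent semantics (naive scan, not Aho-Corasick).
fn literal_matches(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() || needle.len() > haystack.len() {
        return false;
    }
    let first = needle[0];
    let last = needle[needle.len() - 1];
    for start in 0..=haystack.len() - needle.len() {
        let end = start + needle.len();
        if &haystack[start..end] != needle {
            continue;
        }
        if needle.len() >= SUBSTRING_THRESHOLD {
            return true; // long literal: plain substring match
        }
        // Short literal: a side needs a boundary only if its edge byte is a word char.
        let left_ok = !is_word(first) || start == 0 || !is_word(haystack[start - 1]);
        let right_ok = !is_word(last) || end == haystack.len() || !is_word(haystack[end]);
        if left_ok && right_ok {
            return true;
        }
    }
    false
}

fn main() {
    println!("{}", literal_matches(b"see the ABC doc", b"ABC"));
    println!("{}", literal_matches(b"two ABCs", b"ABC"));
}
```

The second call shows the plural case: the trailing `s` is a word byte, so the right-side boundary fails and the short literal does not match.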

### Rule-file quirks

- **Whitespace and comments.** Lines are `trim()`'d before parsing
  (`src/rules/parse.rs:64`). A line containing only whitespace is ignored. A line whose
  first non-whitespace byte is `#` is a comment (`:78`). A `#` mid-line is part of the
  rule.
- **No deduplication.** Two identical rules both load and both fire; you see two hits with
  two different `rule=N` indices.
- **Uppercase-flag fallthrough (silent foot-gun).** `/foo/i` is a regex with the `i` flag.
  `/foo/I` is a *literal rule* that matches the exact substring `/foo/I`. The classifier
  rejects flag strings containing any non-`[a-z]` character (`parse.rs:150`) and silently
  falls through to literal handling (`:209`). A rule author who writes `/PAT/I` expecting
  case-insensitive matching instead gets a literal scan for the six-byte string `/PAT/I`.
  The same applies to any uppercase or non-`[a-z]` flag character. There is no error or
  warning at load time.
- **Empty regex.** `//` parses as the regex `(?-flags:)`, which matches the empty string
  at every position. Foot-gun; do not write a bare `//` as a rule.
- **Missing or empty rules file.** `--rules /no/such/file` exits 2 with a read error. An
  empty rules file (or one that is all comments) exits 2 with `no rules loaded`.
- **UTF-8 BOM.** Not stripped. If a rules file begins with a BOM, the first rule line
  begins with `\u{FEFF}` and the rule body contains those bytes.
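The classification rules and the uppercase-flag fallthrough can be illustrated with a small model. The `Rule` enum and `classify` function are hypothetical, not the contents of `parse.rs`. Passing a flagless pattern through unchanged is an assumption, and the scanner's real rewritten form may differ.

```rust
#[derive(Debug, PartialEq)]
enum Rule {
    Literal(String),
    Regex(String),
}

/// Hypothetical classifier for one rules-file line.
/// Returns `None` for blank lines and comments.
fn classify(line: &str) -> Option<Rule> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    if line.len() >= 2 && line.starts_with('/') {
        // The first `/` and the last `/` delimit the pattern.
        if let Some(last) = line.rfind('/') {
            if last > 0 {
                let pattern = &line[1..last];
                let flags = &line[last + 1..];
                // Only lowercase flag letters qualify as a regex rule.
                if flags.bytes().all(|b| b.is_ascii_lowercase()) {
                    let src = if flags.is_empty() {
                        pattern.to_string()
                    } else {
                        format!("(?{flags}){pattern}")
                    };
                    return Some(Rule::Regex(src));
                }
            }
        }
    }
    // Anything else, including `/foo/I`, falls through to a literal.
    Some(Rule::Literal(line.to_string()))
}

fn main() {
    for line in ["/foo/i", "/foo/I", "  # comment", "a#b"] {
        println!("{line:?} -> {:?}", classify(line));
    }
}
```

Because the fallthrough is silent, a quick way to audit a rules file is to look for lines that start and end like a regex but classify as literals.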

### Set-algebra operators

Resharp extends standard regex with two top-level set operators that pure-PCRE engines lack:

- `A&B`: intersection. Matches strings matched by both `A` and `B`.
- `~(A)`: complement. Matches strings that do NOT match `A`.

Combined, these express "match X but not Y" without lookaround. Example: ban any
five-digit key except the all-zeros placeholder:

```text
/key\_[0-9]{5}&~(key\_0{5})/
```

This flags `key_12345` and `key_99999` but lets `key_00000` through. Class-level forms
`[A&&B]` (intersection) and `[A~~B]` (symmetric difference) are also available inside
character classes.

Underscore is a resharp meta character. Unescaped `_` is the top pattern, which matches
any single codepoint. Escape a literal underscore as `\_`, including inside algebra
operands such as `ghp\_...&~(ghp\_0{36})`.
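To make the meaning of `A&~(B)` concrete, the sketch below models the two operands of the five-digit key example as plain predicates over a whole string. It does not call resharp, and it ignores substring scanning, so it only illustrates the set semantics. All function names are hypothetical.

```rust
/// Operand A: `key_` followed by exactly five ASCII digits.
fn is_key(s: &str) -> bool {
    match s.strip_prefix("key_") {
        Some(rest) => rest.len() == 5 && rest.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

/// Operand B (the complement body): the all-zeros placeholder.
fn is_placeholder(s: &str) -> bool {
    s == "key_00000"
}

/// A & ~(B): the string must be in A and must not be in B.
fn is_forbidden(s: &str) -> bool {
    is_key(s) && !is_placeholder(s)
}

fn main() {
    for s in ["key_12345", "key_99999", "key_00000"] {
        println!("{s}: forbidden = {}", is_forbidden(s));
    }
}
```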

The scanner extracts required literal bytes from regex rules and folds them into a
shared Aho-Corasick gate so the regex engine only runs on files that contain a required
substring. For set-algebra rules, intersection `&` is a transparent separator and
complement `~(...)` bodies never contribute gates because they describe excluded strings,
not required bytes. A pattern that starts with literal bytes (`key\_[0-9]{5}&~(...)`
extracts `key_`) stays on the fast path. A pattern that starts with `~(...)` or another
metacharacter falls into a smaller residual gate, still correct, just slower per file.
Extracted prefixes preserve the regex source's original UTF-8 bytes verbatim, so a rule
whose leading literal contains non-ASCII characters (em-dash `—`, smart quotes, ellipsis,
emoji) gates correctly against file content holding the same bytes; a walker that
mojibake'd those bytes during extraction would silently disable the rule by registering
a pattern AC could never match.
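A deliberately conservative model of leading-literal extraction is sketched below. `leading_literal` is a hypothetical function, not the scanner's extractor: it gives up on any pattern containing `|`, treats every unescaped metacharacter as the end of the literal run, and returns an empty string (the residual path) whenever it cannot prove a required prefix.

```rust
/// Hypothetical extractor: returns the literal text a pattern must start with,
/// or an empty string when no leading literal can be proven.
fn leading_literal(pattern: &str) -> String {
    // Conservative: with alternation present, no single prefix is required.
    // An escaped `\|` also trips this check; that costs speed, not correctness.
    if pattern.contains('|') {
        return String::new();
    }
    let mut out = String::new();
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.peek().copied() {
                // Escaped punctuation such as `\_` or `\.` is a literal character.
                Some(next) if next.is_ascii_punctuation() => {
                    out.push(next);
                    chars.next();
                }
                // Class shorthands (`\d`, `\s`, `\b`, ...) end the literal run.
                _ => break,
            },
            // A quantifier that may allow zero repeats makes the previous character optional.
            '*' | '?' | '{' => {
                out.pop();
                break;
            }
            // Any other metacharacter, including resharp's `_`, `&`, and `~`, ends the run.
            '.' | '[' | ']' | '(' | ')' | '+' | '^' | '$' | '&' | '~' | '_' | '}' => break,
            // Ordinary characters are kept verbatim, so non-ASCII text survives as UTF-8.
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{:?}", leading_literal(r"key\_[0-9]{5}&~(key\_0{5})"));
    println!("{:?}", leading_literal("~(foo)bar"));
}
```

Erring toward an empty prefix is the safe direction: a rule that lands in the residual gate is slower, while a wrongly extracted prefix would silently skip files.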

#### Complement-body limitations (resharp 0.5.x through 0.6.x)

Resharp 0.5.x through 0.6.x cannot reverse a complement whose body contains a
lookaround. The parser rewrites several surface atoms to internal lookarounds,
so the following shapes fail at compile time:

- `\b` inside a `~(...)` body. Rewritten to negative-lookahead /
  negative-lookbehind by the parser, then refused. Workaround: replace `\b` with
  `\W` (consumes a character on each side) or with literal whitespace, or move
  the boundary check outside the complement.
- `\B` inside a `~(...)` body. Refused at parse time when the neighbours are
  unclassifiable. No in-place rewrite; restructure the rule.
- Unescaped `^` or `$` inside a `~(...)` body. Rewritten to lookbehind /
  lookahead in default-multiline mode and then refused. Workaround: use `\A` /
  `\z` for whole-content anchors, or move the anchor outside the complement.
  Inline `(?-m)` and group-scoped `(?-m:^foo$)` do NOT propagate into the
  complement body, so neither works as a workaround.
- User-explicit lookarounds (`(?=`, `(?!`, `(?<=`, `(?<!`) inside a `~(...)`
  body. Refused for the same reason as the rewritten cases. Lift the lookaround
  outside the complement.

`forbidden-strings` detects every shape above at rule load time and reports the
specific trigger:

```text
forbidden-strings: rule on line 42 (resharp): complement body contains \b;
resharp 0.5.x through 0.6.x rewrites it to an internal lookaround which the
reverse pass refuses. Replace with \W ... See TROUBLESHOOTING.resharp.md
for workarounds.
```

The doc at `TROUBLESHOOTING.resharp.md` in the repository root has the full
trace, more workarounds, and the upstream-issue draft.

#### Additional pre-validators (May 2026)

A handful of resharp shapes provoke compile-time blowups or release-build
soundness bugs rather than clean parser refusals. The scanner rejects each one
at rule load with an explicit error naming the source line and the upstream
issue:

- **Nested complements `~(~(...))`.** Rejected pre-compile; the reverse pass
  cannot reverse two nested complements without exponential blowup.
- **Stacked quantifiers `(a+)+`, `(a*)*`, etc.** Rejected pre-compile.
- **Algebra hang shapes.** Intersection of a quantifier and a complement
  (`a+&~(...)`) and alt-lookaround sibling shapes (`(a|b(?=c))`) are rejected
  with explicit error messages naming the source line and the resharp issue.
- **Nested-lookahead overflow.** Specific shape `(?=...(?=...(?=...)))` rejected;
  resharp's reverse pass overflows past three nesting levels.
- **Intersection plus lookbehind.** Rejected by `intersection_with_lookbehind` in
  `src/rules/engine.rs`. The underlying resharp shape silently returns wrong
  matches in release builds (the debug-asserted bound is OFF in release), so
  the pre-validator is load-bearing for correctness, not just performance.

Even when a pre-validator misses a new known-bad shape, `compile_rule_src`
catches the resharp panic via `std::panic::catch_unwind` and emits
`PATH: rule=N engine error` to stderr instead of aborting; the file still scans
against every other rule. This is what the `panic = "unwind"` and
`overflow-checks = true` release-profile settings buy. See `Cargo.toml:49-97`
for the full rationale.
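The panic-to-report conversion can be sketched with the standard library alone. `scan_with_guard` is a hypothetical wrapper, not `compile_rule_src` itself, and the `matcher` closure stands in for the real engine call. The sketch only works when the binary is built with `panic = "unwind"`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs one rule's matcher and converts a panic into the synthetic
/// `engine error` report line instead of aborting the process.
/// `matcher` is a stand-in for the real engine call.
fn scan_with_guard<F>(path: &str, rule: usize, matcher: F) -> Result<usize, String>
where
    F: FnOnce() -> usize,
{
    panic::catch_unwind(AssertUnwindSafe(matcher))
        .map_err(|_| format!("{path}: rule={rule} engine error"))
}

fn main() {
    // A well-behaved rule reports its hit count.
    println!("{:?}", scan_with_guard("src/a.rs", 3, || 0));

    // A panicking rule becomes a report line; other rules can still run.
    // The default panic hook also prints the panic message to stderr.
    let bad = scan_with_guard("src/a.rs", 7, || -> usize { panic!("engine bug") });
    if let Err(line) = bad {
        eprintln!("{line}");
    }
}
```

With `overflow-checks = true`, an arithmetic overflow inside the engine raises a panic that this wrapper can catch, rather than wrapping silently and producing wrong offsets.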

### Perl-class shorthand semantics

The scanner compiles rules in byte mode for speed (`regex::bytes` with
`unicode(false)`), which would normally make every Perl-class shorthand
ASCII-only. Two semantics survive that mode:

- **`\s`: Unicode-aware.** Matches every Unicode whitespace code point's
  UTF-8 bytes: ASCII whitespace (`\t`, `\n`, `\v`, `\f`, `\r`, and space), NBSP (U+00A0),
  ogham space (U+1680), Mongolian vowel separator (U+180E), en-quad
  through hair space (U+2000..U+200A), line/paragraph separator
  (U+2028..U+2029), narrow NBSP (U+202F), medium math space (U+205F),
  ideographic space (U+3000), zero-width NBSP (U+FEFF). Realised by
  expanding the rule source so each `\s` becomes a non-capturing
  alternation of ASCII whitespace and the multi-byte UTF-8 sequences.
  A rule like `(?i)adafruit[\s]+=` correctly matches
  `adafruit<NBSP>=` in JS/TS files.
- **`\S`, `\w`, `\W`, `\d`, `\D`, `\b`, `\B`: byte-level (ASCII).**
  Match the PCRE default (ASCII subset). For secret patterns these
  semantics match author intent: `\d{16}` for a credit card means
  ASCII digits, `\b(pat_...)` boundaries against literal prefixes
  fire on ASCII context, `[\w.-]{0,N}` optional prefixes never
  block a match. Authors who need genuinely Unicode-aware behaviour
  for these atoms can opt in with the `(?u)` flag, which routes the
  rule to the slower full-Unicode compile path.

The asymmetry between `\s` and the rest is pragmatic: `\s` has a
real bug repro (NBSP in JS/TS files) with a tractable byte-alternation
expansion, while `\W`/`\D`/`\B` have zero uses in the betterleaks
corpus and `\S`/`\w`/`\d`/`\b` are all used in shapes where
byte-level semantics produce no silent miss. See PERF.md for the
per-atom analysis.
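The byte-alternation idea behind `\s` can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the output format (`\xHH` escapes inside a non-capturing group) is one plausible spelling, and the sketch covers only a bare `\s`, not a `\s` inside a character class such as `[\s]`.

```rust
/// Code points this README lists as whitespace beyond ASCII.
fn unicode_spaces() -> Vec<char> {
    let mut v = vec!['\u{00A0}', '\u{1680}', '\u{180E}'];
    v.extend('\u{2000}'..='\u{200A}');
    v.extend(['\u{2028}', '\u{2029}', '\u{202F}', '\u{205F}', '\u{3000}', '\u{FEFF}']);
    v
}

/// Builds a byte-mode alternation that could stand in for a bare `\s`.
/// Each multi-byte code point is spelled as `\xHH` escapes of its UTF-8 bytes.
fn whitespace_alternation() -> String {
    let mut alts = vec![String::from(r"[\t\n\x0B\x0C\r ]")];
    for c in unicode_spaces() {
        let mut buf = [0u8; 4];
        let escaped: String = c
            .encode_utf8(&mut buf)
            .bytes()
            .map(|b| format!(r"\x{b:02X}"))
            .collect();
        alts.push(escaped);
    }
    format!("(?:{})", alts.join("|"))
}

fn main() {
    println!("{}", whitespace_alternation());
}
```

Because every alternative is a fixed byte sequence, the expanded pattern stays compatible with a byte-mode, non-Unicode compile.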

### Supported regex flags

The flag string accepts these lowercase letters, applied via resharp's inline-flag group:

- `i`: case-insensitive.
- `m`: multiline (`^`/`$` match at line boundaries).
- `s`: dot-matches-newline.
- `u`: toggle Unicode `\w`/`\d` semantics.
- `x`: ignore whitespace and `#` comments inside the pattern.

Resharp's parser also recognises `U` (swap greed) and `R` (CRLF line terminators),
but the validator deliberately rejects uppercase flags. Both are useless in this scanner:
`U` only affects match span length (not whether something matched), and the rare pattern
that needs CRLF-aware anchors can write `\r?$` directly. If you ever need them locally
inside one pattern, use the inline form: `(?U)foo` or `(?R)bar`.

## Integration

### Local (hk)

`hk` replaces husky for this repo. Wire git hooks once per machine:

```sh
hk install --global   # recommended; needs Git 2.54+
# or, per-repo:
hk install
```

`hk.pkl` registers `forbidden-strings` for the `pre-commit`, `pre-push`, and `check`
hooks, so every commit, every push, and every explicit `hk check` runs the scanner
against the relevant files.

### GitHub Actions

Materialize the runtime rules file from the committed baseline plus the optional
repository secret, then dispatch by event type. The shape below mirrors
`.github/workflows/forbidden-strings.yml`:

```yaml
- name: Build scanner
  run: mise run //packages/cli/forbidden-strings:build

- name: Materialize deny-list
  env:
    FORBIDDEN_STRINGS_LIST: ${{ secrets.FORBIDDEN_STRINGS_LIST }}
  run: |
    cp forbidden-strings.local.example.txt forbidden-strings.local.txt
    if [ -n "$FORBIDDEN_STRINGS_LIST" ]; then
      printenv FORBIDDEN_STRINGS_LIST >> forbidden-strings.local.txt
    fi

- name: Scan (PR / merge_group)
  if: github.event_name != 'push'
  run: mise exec -- hk check --from-ref origin/main

- name: Scan (push to main)
  if: github.event_name == 'push'
  run: |
    packages/cli/forbidden-strings/target/release/forbidden-strings \
      --rules forbidden-strings.local.txt --all
```

Pipe via `printenv >> file` rather than interpolating the secret into a `run:` block;
shell expansion in the latter can leak the value to the log even with GitHub's masking.
The committed baseline always runs; the optional secret extends it. The same precedence
applies as locally: `--rules` > `FORBIDDEN_STRINGS_RULES` > `./forbidden-strings.local.txt`.
The full workflow at `.github/workflows/forbidden-strings.yml` runs `hk check` against
changed files for PR / merge_group events and additionally runs `--all` on push to main.

## Output

For each violation:

```text
PATH:LINE:COL_START..COL_END rule=N
```

Columns are 1-based byte offsets within the matched line.
**The matched substring is never printed.** Only the path, line number, column range,
and the opaque rule index appear in failure output; otherwise a failing CI log
becomes a leak surface. A contributor wanting to know which rule fired looks up the
index against their local rule file.

- **Hits go to stderr, not stdout.** Redirecting `2>/dev/null` silently loses the report.
- **Read errors are synthetic hits.** A file that cannot be opened (broken symlink,
  permission denied, deleted during scan) produces a single line
  `PATH: read error: <reason>` on stderr and contributes to the exit-1 count
  (`src/lib.rs:907-910`).
- **Engine errors are synthetic hits.** A rule that panics inside resharp at scan time
  produces `PATH: rule=N engine error` on stderr and contributes to the exit-1 count.
  Three emission points, one per phase: AC-prefix-matched par_iter, residual Single shard,
  residual Combined par_iter (`src/scan.rs:332`, `:383`, `:424`).
- **Ordering.** Within a file, hits are emitted in match order. Across files, ordering is
  rayon-scheduler-determined and stable on a given input but not alphabetic.

Exit codes:

- `0`: no violations.
- `1`: one or more violations (real hits, read errors, or engine errors).
- `2`: usage error or rules-file error.
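The report line and exit-code mapping can be sketched as below. `format_hit` and `exit_code` are hypothetical helpers. Treating `COL_END` as the 1-based column just past the last matched byte is an assumption of this sketch, as is letting a usage or rules-file error take precedence over violations.

```rust
/// Formats one hit as `PATH:LINE:COL_START..COL_END rule=N`.
/// `start..end` is a byte range into `content`; the matched bytes are never printed.
fn format_hit(path: &str, content: &[u8], start: usize, end: usize, rule: usize) -> String {
    // Line number: count the newlines before the match, then make it 1-based.
    let line = content[..start].iter().filter(|&&b| b == b'\n').count() + 1;
    // Byte offset where the matched line begins.
    let line_start = content[..start]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    let col_start = start - line_start + 1;
    let col_end = end - line_start + 1; // assumption: exclusive end column
    format!("{path}:{line}:{col_start}..{col_end} rule={rule}")
}

/// Maps the scan outcome to the documented exit codes.
fn exit_code(violations: usize, usage_or_rules_error: bool) -> i32 {
    if usage_or_rules_error {
        2
    } else if violations > 0 {
        1
    } else {
        0
    }
}

fn main() {
    let content = b"ok\nxx SECRET yy\n";
    // Bytes 6..12 cover the match on the second line.
    eprintln!("{}", format_hit("f.txt", content, 6, 12, 0));
    std::process::exit(exit_code(1, false));
}
```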

## Walker behaviour

- **`--all` semantics.** Walks the working tree via `ignore::WalkBuilder`
  (`src/walk.rs:217-220`): `.hidden(false)` (dotfiles like `.github/`, `.npmrc` ARE
  scanned), `.ignore(false)` (the `.ignore` file is NOT consulted). The `.gitignore` file
  remains enabled — `ignore(false)` only disables the `.ignore` source; `git_ignore` is a
  separate setting. Files force-added past `.gitignore` (`git add -f`) are recovered via an
  in-process `gix-index` read of `.git/index` (`walk.rs:394-518`); no git subprocess.
- **`.git/` and `.jj/` skipped.** Internal VCS state is never scanned (filter at
  `walk.rs:220`).
- **Symlinks NOT followed.** `WalkBuilder`'s default `follow_links` is false and the
  project does not override it. Symlinked directories are not descended; symlinked files
  are visited but, on a broken target, surface as a read-error synthetic hit.
- **Non-UTF-8 paths silently dropped.** Index entries that are not valid UTF-8 are
  excluded from the walk (`walk.rs:518`); no error or warning.
- **Per-entry walker errors silently skipped.** A directory the walker cannot enter does
  not surface; only file-read errors after the walker hands off the path get reported via
  the read-error synthetic-hit path.
- **Binary-file 8 KiB tail cap.** Files whose first 8 KiB contains a NUL byte are scanned
  only in the first 8 KiB. The leading window always runs; secrets there fire. The tail
  past 8 KiB is skipped (recovers binary-scan cost from BUG 5's full-scan fix while
  preserving leading-window soundness). Constant `BIN_PROBE_SIZE = 8192` at
  `src/lib.rs:291`; logic at `:332-352`.
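- **Read errors as hits.** A file the walker hands off but the scanner cannot read is
  reported as a `PATH: read error: <reason>` line and counts toward exit 1; see "Output"
  above.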
- **Self-skip set.** During `--all`, four canonical paths are auto-skipped so rule bodies
  do not self-match:
  - the materialised rules file (whatever `--rules` / env var / default resolves to)
  - `packages/cli/forbidden-strings/data/betterleaks-default-config.toml`
  - `packages/cli/forbidden-strings/src/port-betterleaks-relaxations.ts`
  - `packages/cli/forbidden-strings/forbidden-strings.local.example.txt`

  The three generated-source paths are package-anchored (NOT root-anchored). Skip is via
  `std::fs::canonicalize`; paths that fail to canonicalize from the current cwd are
  silently dropped from the set. Explicit positional arguments bypass the skip entirely —
  though note that passing `--all` overwrites positional arguments, so the bypass only
  applies to the no-`--all` invocation.

  The root `forbidden-strings.local.example.txt` is NOT in the package-anchored list. It
  is normally also the materialised rules-file source path in the CI workflow (the `cp`
  step), so it ends up scanned-or-not depending on whether it canonicalises to the
  materialised file.
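The binary-file cap described in the list above reduces to a small decision about which slice of a file to scan. The sketch below assumes the whole file is already in memory and uses a hypothetical `scan_window` helper. The real logic in `src/lib.rs` may read incrementally.

```rust
/// Probe size named in this README (`BIN_PROBE_SIZE`).
const BIN_PROBE_SIZE: usize = 8192;

/// Returns the part of `content` that would be scanned: the whole file for
/// text, only the leading window when that window contains a NUL byte.
fn scan_window(content: &[u8]) -> &[u8] {
    let probe = &content[..content.len().min(BIN_PROBE_SIZE)];
    if probe.contains(&0) {
        probe
    } else {
        content
    }
}

fn main() {
    let mut bin = vec![b'a'; 10_000];
    bin[100] = 0; // NUL inside the probe window marks the file as binary
    println!("scanned {} of {} bytes", scan_window(&bin).len(), bin.len());
}
```

A NUL byte that first appears after the probe window does not trigger the cap, so such a file is scanned in full.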

## Performance

Measured on an AMD Ryzen 7 8700F (16 threads). Full bench methodology and per-version
regression history are in `PERF.md`. If you change these, change `PERF.md` too.

Post-emit-hit-consolidation, 2026-05-16, hyperfine 1.20.0.

### Cold startup

```text
this repo (3,471 files, 57 MiB)         9.4 ms ± 0.8 ms
Linux kernel (93,696 files, 2.0 GiB)    9.8 ms ± 0.4 ms
```

### Full `--all`

```text
this repo                               56.6 ms ± 3.1 ms   (~6.3x parallelism)
Linux kernel                            1.989 s ± 0.246 s  (~12.2x parallelism, ~1.05 GiB/s wall)
```

### vs betterleaks v1.1.2 (same content, `--all` vs `dir`; 2026-05-03)

```text
startup ratio                           ~24x
this repo, same content                 ~20x (28 ms vs 557 ms)
Linux kernel                            ~3.3x (1.6 s vs 5.3 s)
```

### vs betterleaks v1.1.2 (full tree, default modes; 2026-05-03)

```text
this repo                               ~2000x (43 ms vs 86.5 s)
                                        dominated by .gitignore respect:
                                        21 MiB scanned vs 4.28 GB scanned
```

Three architectural choices account for most of the per-byte gap:

1. **Dual Aho-Corasick gate with lazy regex dispatch.** On clean files, both AC passes
   short-circuit before any regex engine runs. RE2 (betterleaks' engine) also
   keyword-prefilters, but its hit path verifies against the full DFA;
   `forbidden-strings` only queues `find_all` when an AC prefix is seen.
2. **Hybrid engine dispatch.** 257 of 259 ported rules compile via the `regex` crate,
   which applies memchr / Teddy literal-prefix acceleration per-rule. RE2 compiles all
   rules into a shared DFA that cannot apply per-rule fast paths.
3. **Native binary startup.** Rust LTO + `codegen-units = 1` + `opt-level = 3` +
   `panic = "unwind"` + `overflow-checks = true` + `strip = true`. Binary starts in
   ~9 ms. Go binary starts in ~174 ms (GC init, goroutine scheduler, config parse). For
   pre-commit hooks with sub-100 ms budgets, the startup gap alone disqualifies
   betterleaks. The unwind + overflow-checks pair is required for the resharp-panic
   safety wrapper to fail closed on engine corruption. Rust's default release profile
   already unwinds on panic but disables overflow checks, so the profile pins both
   explicitly; switching to `panic = "abort"` or turning overflow checks off leaves the
   scanner with a silent fail-open against a corrupt rule. See `Cargo.toml:49-97`.

The speed gap is not free; see "When to pick something else" for the capabilities
betterleaks ships that `forbidden-strings` deliberately omits.

## Debug

Three env vars print phase / bucket diagnostics to stderr; none affect output correctness,
so they are safe to enable in CI when investigating slow scans.

- `FORBIDDEN_STRINGS_DEBUG_TIMING=1`
  Per-phase wall time: `read_rules_file`, `classify+regex_compile`,
  `extract_gating_substrings`, `ac_build`, `residual_shards`.
- `FORBIDDEN_STRINGS_DEBUG_BUCKETS=1`
  Counts of literal rules, case-sensitive regex prefixes, case-insensitive regex
  prefixes, and residual rules (rules without an extractable literal prefix). Useful
  when tuning rule patterns to land more rules on the AC fast path.
- `FORBIDDEN_STRINGS_DEBUG_RESIDUAL_LIST=1`
  Implies `BUCKETS`. Adds the line number of every residual rule so you can look up
  which rules are paying the slower per-file scan.

## Fuzzing

Coverage-guided fuzzing for the scanner's regex routing, AC-gate extractor,
walker helpers, residual-shard partitioner, and hit formatter lives in its own
package, [`packages/fuzz/forbidden-strings`](../../fuzz/forbidden-strings/), so a
scoped nightly toolchain does not force this published crate onto nightly.
Targets are exercised locally and on demand only; CI integration is deferred.
See that package's [README](../../fuzz/forbidden-strings/README.md) for
prerequisites, the seven-target invariant list, mise commands, the
bounded-container wrapper, corpus and artifact policy, crash reproduction
guidance, and the soundness-by-revert validation step.

## Architecture

- **Two-phase pipeline.** Rule loading (regex compile + AC build) and file walking
  (gitignore-aware enumeration) run concurrently via `rayon::join` since they share no
  state. After both complete, files fan out across the rayon thread pool for parallel
  scan.
- **Aho-Corasick literal gate.** Every literal rule and every regex rule's extracted
  literal prefix joins a single AC automaton. Per file, the AC pass either finds zero
  hits (regex engine skipped entirely) or queues a follow-up regex evaluation for each
  prefix hit.
- **Residual-shard regex fallback.** Regex rules without an extractable literal prefix
  (those starting with `~(...)`, a metacharacter, or a class) fall into a smaller
  residual gate that runs unconditionally. Slower per file than the AC path but still
  linear-time.
- **Self-skip for own rule files.** `--all` walks skip a small set of paths
  unconditionally so rule bodies that match their own literal text do not
  self-flag: the materialized rules file plus four canonical
  self-match paths
  (`packages/cli/forbidden-strings/data/betterleaks-default-config.toml`,
  `packages/cli/forbidden-strings/src/port-betterleaks-relaxations.ts`,
  `forbidden-strings.local.example.txt` at repo root,
  and the rules-engine test-fixture file
  `packages/cli/forbidden-strings/src/rules/algebra_tests.rs` which
  documents an example match for the bundled set-algebra demo rule).
  Skip is path-anchored via `std::fs::canonicalize`, not basename-anchored,
  so an unrelated file named `forbidden-strings.local.txt` in a subdirectory
  is still scanned. Explicit positional arguments bypass the skip entirely.
  See `build_skip_set` / `is_walker_skipped` in `src/lib.rs`.
- **`ignore` crate walker + in-process gix-index union.** `--all` uses
  `ignore::WalkBuilder` (which honours `.gitignore`, `.git/info/exclude`, and
  global excludes) and then unions the result with an in-process
  `gix_index::File` read of `.git/index` (no git subprocess) so files that
  were force-added past `.gitignore` (`git add -f`) are still discovered.
  See `src/walk.rs:394-518`.
- **Bundled `data/betterleaks-default-config.toml`.** Upstream-vendored provenance for
  the betterleaks port. The committed `forbidden-strings.local.example.txt` is derived
  from it; `port-betterleaks-relaxations.ts` records the lossy translations applied during
  the port.
- **The 7-byte coincidence-rate threshold.** A length-L literal in a case-sensitive
  alphabet of size A scanned over N random bytes has expected coincidence count
  ~= N * A^(-L). At L = 7, in 1 GB of dense base64 (A = 64) or random alphanumeric
  (A = 62) noise, the expected coincidence per rule is ~2.3e-4 and ~2.8e-4 respectively,
  comfortably under 1 across realistic repo sizes and noise types. At L = 6 the same
  calculation gives ~0.015 / ~0.018, which becomes borderline once a repo has multiple
  GB of dense content or 100+ deny-list rules. The constant `SUBSTRING_THRESHOLD` lives
  in `src/rules/types.rs`.
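The coincidence estimate `N * A^(-L)` is easy to evaluate directly. The sketch below assumes uniformly random noise and takes 1 GB as 10^9 bytes. `expected_coincidences` is a hypothetical helper, not part of the scanner.

```rust
/// Expected number of coincidental matches of one length-`len` literal in
/// `n_bytes` of uniformly random text over an alphabet of `alphabet` symbols.
fn expected_coincidences(n_bytes: f64, alphabet: f64, len: i32) -> f64 {
    n_bytes * alphabet.powi(-len)
}

fn main() {
    let n = 1e9; // 1 GB of noise, taken as 10^9 bytes
    for len in [6, 7] {
        println!(
            "L={len}: base64 {:.2e}, alphanumeric {:.2e}",
            expected_coincidences(n, 64.0, len),
            expected_coincidences(n, 62.0, len),
        );
    }
}
```

Multiplying the per-rule figure by the number of short rules and the gigabytes of dense content gives a rough false-positive budget for a given repo.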