# forbidden-strings
A linear-time deny-list scanner for Git repos, designed for the case where the rules themselves are sensitive.
Most secret-scanning tools (gitleaks, trufflehog, secretlint) ship rules in committed config files and rely on `keywords` prefilters.
That breaks down when the forbidden literals would themselves leak if committed:
a customer name, a partner identifier, an internal codename, a pre-disclosure project name.
`forbidden-strings` keeps the rule file out of the repo entirely,
accepts it via env var or `--rules`,
and never prints the matched substring in failure output.
## What's different
- **Resharp set-algebra rules**. `A&B` (intersection) and `~(A)` (complement) are first-class.
Express "match X but not Y" without lookaround. The regex engines behind gitleaks, trufflehog,
and secretlint, as well as plain RE2, have no intersection or complement operators;
the workaround in those tools is per-rule allowlists, which scale badly.
- **Out-of-band rules**. `--rules <path>` or `FORBIDDEN_STRINGS_RULES`.
CI materializes the rule file from a repository secret;
contributors keep a gitignored copy locally. No rule ever lives in a committed file.
- **Redacted output**. Only `path:line:cols rule=N` is printed.
The matched substring, surrounding line, and rule pattern never appear in the failure log,
so a public CI log is not itself a leak surface.
- **Linear-time matching**. Resharp is derivative-based with no backtracking;
Aho-Corasick gates the regex engine via extracted literal prefixes.
A pathological rule combination cannot exhibit catastrophic-backtracking behavior.
- **Native Rust binary, no startup tax**. Full-repo scan over 2,700 files in ~16 ms;
1,000-rule synthetic in ~27 ms; pre-commit hot path well under 5 ms.
No Node startup, no WASM init, no per-invocation TOML parse.
## When to pick something else
If the rule set can ship in the repo and set-algebra is unnecessary, gitleaks or betterleaks
is probably a better fit: larger ecosystem, SARIF output, GitHub-native code-scanning upload,
allowlist files keyed on path globs.
The niche `forbidden-strings` fills is "rules that cannot be committed AND the rule grammar
needs operators PCRE doesn't have"; pick it when one of those two is the binding constraint.
## Build
```sh
mise run //packages/cli/forbidden-strings:build
```
The release binary lands at `packages/cli/forbidden-strings/target/release/forbidden-strings`.
`hk.pkl` invokes that path directly; nothing needs to be on `$PATH`.
## Usage
```sh
# scan a specific file list (uses ./forbidden-strings.local.txt by default)
forbidden-strings path/to/file other/file
# scan every git-tracked file (respects .gitignore via git ls-files)
forbidden-strings --all
```
The rules path is resolved in this order: `--rules <PATH>` flag, then
`FORBIDDEN_STRINGS_RULES` env var, then `./forbidden-strings.local.txt`
in the current working directory.
```sh
# explicit path
forbidden-strings --rules ./other-rules.txt --all
# via env var (CI-friendly: materialize from a secret, then run)
FORBIDDEN_STRINGS_RULES=./materialized.txt forbidden-strings --all
```
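The precedence above can be sketched in a few lines of Python. This is an illustration, not the scanner's actual code: `resolve_rules_path` and `DEFAULT_RULES_FILE` are hypothetical names, and treating an empty flag or env value as "unset" is an assumption.

```python
from pathlib import Path
from typing import Mapping, Optional

# Default file name taken from the text above.
DEFAULT_RULES_FILE = "forbidden-strings.local.txt"


def resolve_rules_path(
    flag_value: Optional[str],
    env: Mapping[str, str],
    cwd: Path,
) -> Path:
    """Pick the rules path: --rules flag, then env var, then the cwd default."""
    if flag_value:  # assumption: an empty string counts as "not given"
        return Path(flag_value)
    env_value = env.get("FORBIDDEN_STRINGS_RULES")
    if env_value:
        return Path(env_value)
    return cwd / DEFAULT_RULES_FILE
```

Passing the environment and working directory as arguments (for example `os.environ` and `Path.cwd()`) keeps the function easy to exercise without touching process state.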
## Local hook setup
`hk` replaces husky for this repo. Wire git hooks once per machine:
```sh
hk install --global # recommended; needs Git 2.54+
# or, per-repo:
hk install
```
Then `git commit` runs the `pre-commit` hook defined in `hk.pkl`,
which in turn runs this scanner against the staged files.
## CI integration (GitHub Actions)
Materialize the rule file from a repository secret at job-start;
the workflow YAML never echoes the secret value:
```yaml
- name: Materialize rule file
  env:
    RULES: ${{ secrets.FORBIDDEN_STRINGS_LIST }}
  run: printenv RULES > forbidden-strings.local.txt
- name: Build scanner
  run: mise run //packages/cli/forbidden-strings:build
- name: Scan
  run: hk check --from-ref origin/main
```
Pass the secret through `env:` and redirect `printenv` into the file rather than interpolating
the secret into a `run:` block; shell expansion in the latter can leak the value to the log
even with GitHub's masking.
## Rule file format
One rule per line. Two shapes:
- A bare line is a case-sensitive literal. Match semantics depend on length:
  - **Length below 7 bytes**: conditional word-boundary check (`grep -w` semantics).
    A boundary is required at any end whose edge byte is a word character (`[A-Za-z0-9_]`);
    the file context on that side must be either start/end of file or a non-word byte.
    So a short alpha-only acronym matches a standalone occurrence in normal prose but
    does **not** match coincidentally as a substring of a longer identifier or inside
    random base64 noise. Short path-shaped literals like `/etc` still match inside
    `cat /etc/passwd`: the leading `/` is non-word, so no left-side boundary is enforced,
    and the trailing `c` is followed by the non-word `/`.
  - **Length 7 bytes or more**: pure case-sensitive substring match, no boundary check.
    A long literal matches anywhere it appears, including glued mid-identifier.
    Distinctiveness from sheer length makes coincidental substring match negligible.

  If a phrase exists in two written forms (with and without internal whitespace),
  add both as separate rules so each matches its respective form.
- A line of the shape `/PATTERN/FLAGS` is a regex. The first `/` and last `/` delimit the pattern;
`FLAGS` is zero or more lowercase letters and is rewritten to a resharp inline-flag prefix
(e.g. `/foo/i` becomes `(?i)foo`). Use this form to opt into substring-anywhere semantics
for short literals (write the literal between the slashes), or to ban literals matching
`^/.+/[a-z]*$` (escape the slashes, e.g. ban the literal `/etc/passwd` as `/\/etc\/passwd/`).
Empty lines are ignored. Lines starting with `#` are comments.
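
As a rough illustration of the line shapes described above, here is a small Python classifier. It is a sketch, not the scanner's parser: `parse_rule_line` is a hypothetical name, and rejecting any flag letter outside `imsux` with an error is an assumption about how invalid flags are handled.

```python
import re
from typing import Optional, Tuple

ALLOWED_FLAGS = set("imsux")
# First "/" and last "/" delimit the pattern; the greedy ".+" finds the last
# slash that is followed only by lowercase letters.
REGEX_LINE = re.compile(r"/(.+)/([a-z]*)", re.DOTALL)


def parse_rule_line(line: str) -> Optional[Tuple[str, str]]:
    """Return ("literal", text), ("regex", source), or None for skipped lines."""
    text = line.rstrip("\r\n")
    if text == "" or text.startswith("#"):
        return None  # empty line or comment
    match = REGEX_LINE.fullmatch(text)
    if match is None:
        return ("literal", text)  # bare line: case-sensitive literal
    pattern, flags = match.group(1), match.group(2)
    unsupported = set(flags) - ALLOWED_FLAGS
    if unsupported:
        # Assumption: unknown flag letters are a rules-file error.
        raise ValueError(f"unsupported flags: {''.join(sorted(unsupported))}")
    prefix = f"(?{flags})" if flags else ""
    return ("regex", prefix + pattern)
```

Under this sketch a bare `/etc/passwd` line parses as pattern `etc` with flags `passwd` and is rejected, which is why the escaped form `/\/etc\/passwd/` is needed to ban that literal.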
The 7-byte threshold for word-bounded vs substring-anywhere comes from a coincidence-rate
calculation: a length-L literal in a case-sensitive alphabet of size A scanned over N random
bytes has expected coincidence count ~= N * A^(-L). At L = 7, in 1 GB (N = 10^9) of dense
base64 (A = 64) or random alphanumeric (A = 62) noise, the expected coincidence per rule is
~2.3e-4 and ~2.8e-4 respectively, comfortably under 1 across realistic repo sizes and noise types.
At L = 6 the same calculation gives ~0.015 / ~0.018, which becomes borderline once a repo
has multiple GB of dense content or 100+ deny-list rules. See `SUBSTRING_THRESHOLD` in
`src/rules.rs` for the threshold constant and derivation.
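
The estimate is easy to reproduce. The snippet below evaluates N * A^(-L) for both alphabets at L = 6 and L = 7, taking 1 GB as 10^9 bytes (an assumption; using 2^30 bytes raises each figure by about 7%). `expected_coincidences` is a hypothetical helper name.

```python
def expected_coincidences(n_bytes: float, alphabet_size: int, length: int) -> float:
    """Expected number of chance matches of one length-`length` literal."""
    return n_bytes * alphabet_size ** (-length)


if __name__ == "__main__":
    one_gb = 1e9  # assumption: decimal gigabyte
    for length in (6, 7):
        for name, size in (("base64", 64), ("alphanumeric", 62)):
            value = expected_coincidences(one_gb, size, length)
            print(f"L={length} {name:>12}: {value:.2e}")
```

The count scales linearly with both the volume scanned and the number of rules, so 100 rules over 3 GB multiplies each per-rule figure by 300.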
**One known regression** under the new semantics: a short literal rule will not match a
plural or suffixed form (e.g. a 3-letter acronym does not match the same acronym followed
immediately by `s`, because the trailing `s` is a word char and the boundary fails). If
plural matching is needed, express the rule as a regex with an optional trailing class:
`/ACRONYMs?/`.
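
The literal-matching semantics, including the suffix behaviour just described, can be sketched as follows. This is a plain-Python model under stated assumptions, not the scanner's implementation: `literal_hits`, `boundaries_ok`, and `is_word_byte` are hypothetical names, and the search uses `bytes.find` rather than Aho-Corasick.

```python
from typing import List

SUBSTRING_THRESHOLD = 7  # bytes; value taken from the text above


def is_word_byte(byte: int) -> bool:
    """True for bytes in [A-Za-z0-9_]."""
    return (
        byte == ord("_")
        or ord("0") <= byte <= ord("9")
        or ord("A") <= byte <= ord("Z")
        or ord("a") <= byte <= ord("z")
    )


def boundaries_ok(literal: bytes, content: bytes, start: int) -> bool:
    """Require a boundary only on sides where the literal's edge byte is a word byte."""
    end = start + len(literal)
    left_ok = (
        not is_word_byte(literal[0])
        or start == 0
        or not is_word_byte(content[start - 1])
    )
    right_ok = (
        not is_word_byte(literal[-1])
        or end == len(content)
        or not is_word_byte(content[end])
    )
    return left_ok and right_ok


def literal_hits(literal: bytes, content: bytes) -> List[int]:
    """Return the start offsets at which `literal` counts as a match."""
    if not literal:
        raise ValueError("empty literal")
    hits = []
    start = content.find(literal)
    while start != -1:
        if len(literal) >= SUBSTRING_THRESHOLD or boundaries_ok(literal, content, start):
            hits.append(start)
        start = content.find(literal, start + 1)
    return hits
```

With a made-up acronym, `literal_hits(b"ABC", b"ABCs")` returns an empty list because the trailing `s` is a word byte, while `literal_hits(b"ABC", b"see ABC here")` reports the standalone occurrence.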
### Set-algebra operators
Resharp extends standard regex with two top-level set operators that pure-PCRE engines lack:
- `A&B` -- intersection: matches strings matched by both `A` and `B`
- `~(A)` -- complement: matches strings that do NOT match `A`
Combined, these express "match X but not Y" without lookaround. Example: ban any
five-digit key except the all-zeros placeholder:
```
/key_[0-9]{5}&~(key_0{5})/
```
This flags `key_12345` and `key_99999` but lets `key_00000` through. Class-level forms
`[A&&B]` (intersection) and `[A~~B]` (symmetric difference) are also available inside
character classes.
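
Python's `re` module has neither operator, so the sketch below only emulates the meaning of the example rule by combining two whole-string checks: a candidate is flagged when it is in `A` and not in `B`. The function names are hypothetical, and filtering the matches of `A` is only equivalent to a real intersection here because `A` has a fixed length.

```python
import re
from typing import List

KEY = re.compile(r"key_[0-9]{5}")      # A
PLACEHOLDER = re.compile(r"key_0{5}")  # B


def matches_rule(candidate: str) -> bool:
    """Emulate A & ~(B) on a whole candidate string."""
    in_a = KEY.fullmatch(candidate) is not None
    in_complement_of_b = PLACEHOLDER.fullmatch(candidate) is None
    return in_a and in_complement_of_b


def find_violations(text: str) -> List[str]:
    """Collect spans of A in `text` that are not also matched by B."""
    return [m.group(0) for m in KEY.finditer(text) if matches_rule(m.group(0))]
```

This is essentially the per-rule allowlist workaround written out by hand; the set operators let the same intent live inside a single pattern.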
The scanner extracts a leading literal prefix from each regex rule and folds it into a
shared Aho-Corasick gate so the regex engine only runs on files that contain the prefix.
A pattern that starts with literal bytes (`key_[0-9]{5}&~(...)` extracts `key_`) stays on
the fast path. A pattern that starts with `~(...)` or another metacharacter falls into a
smaller residual gate, still correct, just slower per file.
Extracted prefixes preserve the regex source's original UTF-8 bytes verbatim, so a rule
whose leading literal contains non-ASCII characters (em-dash `—`, smart quotes, ellipsis,
emoji) gates correctly against file content holding the same bytes; a walker that
mojibake'd those bytes during extraction would silently disable the rule by registering
a pattern AC could never match.
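
A deliberately conservative sketch of prefix extraction and gating is shown below. It is not the scanner's walker: the function names are hypothetical, the metacharacter set is an assumption, any pattern containing `|` is sent to the residual gate, and a plain `in` test stands in for the shared Aho-Corasick automaton.

```python
METACHARS = set(".[](){}*+?|^$\\&~")
QUANTIFIERS = set("*+?{")


def leading_literal_prefix(pattern: str) -> bytes:
    """Return the literal bytes every match must start with (may be empty)."""
    if "|" in pattern:
        return b""  # alternation branches may start differently
    chars = []
    for ch in pattern:
        if ch in METACHARS:
            if ch in QUANTIFIERS and chars:
                chars.pop()  # the quantified character may be optional
            break
        chars.append(ch)
    # Encode the source text as UTF-8 unchanged so non-ASCII prefixes
    # compare equal to the same bytes in file content.
    return "".join(chars).encode("utf-8")


def may_contain_match(prefix: bytes, content: bytes) -> bool:
    """Gate: only run the regex engine on content holding the prefix."""
    return prefix in content  # an empty prefix always passes
```

An empty prefix never excludes a file, which models the slower residual path for patterns that begin with `~(...)` or another metacharacter.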
#### Complement-body limitations (resharp 0.5.x)
Resharp 0.5.x cannot reverse a complement whose body contains a lookaround. The
parser rewrites several surface atoms to internal lookarounds, so the following
shapes fail at compile time:
- `\b` inside a `~(...)` body. Rewritten to negative-lookahead /
negative-lookbehind by the parser, then refused. Workaround: replace `\b` with
`\W` (consumes a character on each side) or with literal whitespace, or move
the boundary check outside the complement.
- `\B` inside a `~(...)` body. Refused at parse time when the neighbours are
unclassifiable. No in-place rewrite; restructure the rule.
- Unescaped `^` or `$` inside a `~(...)` body. Rewritten to lookbehind /
lookahead in default-multiline mode and then refused. Workaround: use `\A` /
`\z` for whole-content anchors, or move the anchor outside the complement.
Inline `(?-m)` and group-scoped `(?-m:^foo$)` do NOT propagate into the
complement body, so neither works as a workaround in resharp 0.5.x.
- User-explicit lookarounds (`(?=`, `(?!`, `(?<=`, `(?<!`) inside a `~(...)`
body. Refused for the same reason as the rewritten cases. Lift the lookaround
outside the complement.
`forbidden-strings` detects every shape above at rule load time and reports the
specific trigger:
```
forbidden-strings: rule on line 42 (resharp): complement body contains \b;
resharp 0.5.x rewrites it to an internal lookaround which the reverse pass
refuses. Replace with \W ... See TROUBLESHOOTING.resharp.md for workarounds.
```
The doc at `TROUBLESHOOTING.resharp.md` in the repository root has the full
trace, more workarounds, and the upstream-issue draft.
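
For intuition, a load-time check of this kind could look roughly like the Python sketch below. It is a simplified model, not the scanner's detector: all function names are hypothetical, parentheses inside character classes are not handled when locating the complement body, and character-class skipping is approximate.

```python
from typing import List, Optional

LOOKAROUND_OPENERS = ("(?=", "(?!", "(?<=", "(?<!")


def complement_bodies(pattern: str) -> List[str]:
    """Return the text inside each top-level `~(...)` group."""
    bodies = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # skip the escaped character
            continue
        if pattern.startswith("~(", i):
            depth, j = 1, i + 2
            while j < len(pattern) and depth > 0:
                if pattern[j] == "\\":
                    j += 2
                    continue
                if pattern[j] == "(":
                    depth += 1
                elif pattern[j] == ")":
                    depth -= 1
                j += 1
            bodies.append(pattern[i + 2 : j - 1] if depth == 0 else pattern[i + 2 :])
            i = j
            continue
        i += 1
    return bodies


def first_trigger(body: str) -> Optional[str]:
    """Return the first atom in `body` that the text above lists as refused."""
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            escape = body[i : i + 2]
            if escape in ("\\b", "\\B"):
                return escape
            i += 2
            continue
        if ch == "[":
            # Skip the class so `[^a]` is not mistaken for an anchor.
            close = body.find("]", i + 2)
            i = len(body) if close == -1 else close + 1
            continue
        for opener in LOOKAROUND_OPENERS:
            if body.startswith(opener, i):
                return opener
        if ch in "^$":
            return ch
        i += 1
    return None


def first_complement_trigger(pattern: str) -> Optional[str]:
    """Return the first refused atom found in any complement body, else None."""
    for body in complement_bodies(pattern):
        trigger = first_trigger(body)
        if trigger is not None:
            return trigger
    return None
```

Atoms outside the complement are left alone, so `\bfoo&~(bar)` passes while `foo&~(\bbar)` is reported.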
### Supported regex flags
The flag string accepts these lowercase letters, applied via resharp's inline-flag group:
- `i` -- case-insensitive
- `m` -- multiline (`^`/`$` match at line boundaries)
- `s` -- dot-matches-newline
- `u` -- toggle Unicode `\w`/`\d` semantics
- `x` -- ignore whitespace and `#` comments inside the pattern
Resharp's parser also recognises `U` (swap greed) and `R` (CRLF line terminators),
but the validator deliberately rejects uppercase flags. Both are useless in this scanner:
`U` only affects match span length (not whether something matched), and the rare pattern
that needs CRLF-aware anchors can write `\r?$` directly. If you ever need them locally
inside one pattern, use the inline form: `(?U)foo` or `(?R)bar`.
## Output
For each violation:
```
PATH:LINE:COL_START..COL_END rule=N
```
Columns are 1-based byte offsets within the matched line.
**The matched substring is never printed.** Only the path, line number, column range,
and the opaque rule index appear in failure output; otherwise a failing CI log
becomes a leak surface. A contributor wanting to know which rule fired looks up the
index against their local rule file.
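
Tooling that consumes this output can split a violation line back into its fields. The parser below is a sketch based only on the format shown above; `Violation` and `parse_violation` are hypothetical names, and the sample paths in the usage note are made up.

```python
import re
from typing import NamedTuple


class Violation(NamedTuple):
    path: str
    line: int
    col_start: int
    col_end: int
    rule: int


# The greedy path group lets paths that contain ":" parse correctly.
VIOLATION_LINE = re.compile(
    r"(?P<path>.+):(?P<line>\d+):(?P<start>\d+)\.\.(?P<end>\d+) rule=(?P<rule>\d+)"
)


def parse_violation(text: str) -> Violation:
    """Parse one `PATH:LINE:COL_START..COL_END rule=N` line."""
    match = VIOLATION_LINE.fullmatch(text.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a violation line: {text!r}")
    return Violation(
        match["path"],
        int(match["line"]),
        int(match["start"]),
        int(match["end"]),
        int(match["rule"]),
    )
```

For example, `parse_violation("docs/notes.md:12:5..9 rule=3")` yields path `docs/notes.md`, line 12, columns 5 and 9, and rule index 3.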
Exit codes:
- `0` -- no violations
- `1` -- one or more violations
- `2` -- usage error or rules-file error
## Performance
Measured against this repo (2,700 files, 21 MiB) on an AMD Ryzen 7 8700F:
- Realistic ruleset (17 rules), startup-only: **1.2 ms**
- Realistic ruleset, full-repo `--all`: **15.5 ms** (~1.4 GiB/s end-to-end)
- Synthetic 1,000-rule (500 literals + 500 prefixed regex), full-repo `--all`: **27 ms**
- Pathological case: 1,000,000 hits in a single 43 MB file: **1.5 s wall**
Detailed benchmarks, methodology, and the rationale for rejected optimizations
(mmap, extension pre-filtering, concat-and-scan) live in `PERF.md`.