forbidden-strings
Linear-time deny-list scanner for Git repos, designed for the case where the rules themselves are sensitive.
Most secret-scanning tools (gitleaks, trufflehog, secretlint, betterleaks) ship rules in committed config files.
That breaks down when the forbidden literals would themselves leak if committed:
a customer name, a partner identifier, an internal codename, a pre-disclosure project name.
forbidden-strings keeps the rule file out of the repo entirely,
accepts it via env var or --rules,
and never prints the matched substring in failure output.
It is also ~20x faster than betterleaks on a same-content scan of this repo
(~3.3x per byte at Linux-kernel scale),
with sub-10 ms startup that fits inside a pre-commit budget.
What's different
- Resharp set-algebra rules. A&B (intersection) and ~(A) (complement) are first-class. Express "match X but not Y" without lookaround. PCRE-family engines (gitleaks, trufflehog, secretlint, plain RE2) cannot do this; the workaround in those tools is per-rule allowlists, which scale badly.
- Out-of-band rules. --rules <path> or FORBIDDEN_STRINGS_RULES. CI materializes the rule file from a repository secret; contributors keep a gitignored copy locally. No rule ever lives in a committed file.
- Redacted output. Only path:line:cols rule=N is printed. The matched substring, surrounding line, and rule pattern never appear in the failure log, so a public CI log is not itself a leak surface.
- Linear-time matching. Resharp is derivative-based with no backtracking; Aho-Corasick gates the regex engine via extracted literal prefixes. A pathological rule combination cannot exhibit catastrophic-backtracking behavior.
- Pre-commit-budget startup. ~7 ms cold start (Rust LTO + panic = "abort" + stripped binary, no Node startup, no WASM init, no per-invocation TOML parse). On clean files the dual Aho-Corasick gate short-circuits before the regex engine runs at all. Betterleaks starts in ~174 ms, which exceeds typical pre-commit budgets on its own. See Performance.
When to pick something else
forbidden-strings deliberately omits features other scanners ship as core capabilities:
- CEL-based post-match filtering (entropy thresholds, BPE token efficiency, git-author predicates, file-path globs, string allowlists). Helps cut false positives when the rule corpus is broad. No equivalent here.
- Async HTTP validation. No way to call a provider API to confirm a detected secret is live. The scanner reports literal matches; staleness review is on you.
- Git history scanning. The walker enumerates working-tree files only. No equivalent of gitleaks git or betterleaks git that scans every diff in every commit.
- SARIF / JSON / CSV output. Hits go to stderr as plain text. No machine-readable format for GitHub code-scanning upload or CI dashboards.
- Per-rule path scoping. Every rule runs against every (non-skipped) file. The scanner cannot apply rule X only to YAML files.
- Per-rule allowlists. No way to say "rule X but skip when it matches in path Y."
If you need any of those, betterleaks or gitleaks is the right tool. Otherwise
forbidden-strings is faster and more expressive (set-algebra, out-of-band rules,
redacted output, native binary startup).
Prerequisites
- Rust toolchain. Install via mise: mise install rust.
- mise itself, since build commands are mise run tasks.
- For local git hooks: hk (the hook runner) and pkl (its config language). Both are available via mise / aqua: mise install 'aqua:jdx/hk' 'aqua:apple/pkl'.
Build
The release binary lands at packages/cli/forbidden-strings/target/release/forbidden-strings.
hk.pkl invokes that path directly; nothing needs to be on $PATH.
Usage
First-run setup: copy the committed example deny-list to the conventional local filename so the scanner finds rules without further configuration:
Then:
# scan a specific file list (uses ./forbidden-strings.local.txt by default)
# scan every git-tracked file (respects .gitignore)
The rules path is resolved in this order: --rules <PATH> flag, then
FORBIDDEN_STRINGS_RULES env var, then ./forbidden-strings.local.txt
in the current working directory.
# explicit path
# via env var (CI-friendly: materialize from a secret, then run)
FORBIDDEN_STRINGS_RULES=./materialized.txt
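The resolution order above can be sketched in a few lines of Rust. This is only an illustration of the documented precedence, not the scanner's actual code: the function name resolve_rules_path and its parameters are hypothetical, and treating an empty environment variable as "set" is an assumption.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not from the real source tree).
/// `flag` models the value passed to --rules, `env` the value of
/// FORBIDDEN_STRINGS_RULES. Precedence: flag, then env, then the default file.
fn resolve_rules_path(flag: Option<&str>, env: Option<&str>) -> PathBuf {
    flag.or(env)
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("./forbidden-strings.local.txt"))
}

fn main() {
    // Read the env var the way a CLI entry point might; no flag given here.
    let env = std::env::var("FORBIDDEN_STRINGS_RULES").ok();
    let path = resolve_rules_path(None, env.as_deref());
    println!("{}", path.display());
}
```

With neither input present the sketch falls back to the conventional local filename in the current working directory.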
Rule file format
One rule per line. Two shapes:
- A bare line is a case-sensitive literal. Match semantics depend on length:
  - Length below 7 bytes: conditional word-boundary check (grep -w semantics). A boundary is required at any end whose edge byte is a word character ([A-Za-z0-9_]); the file context on that side must be either start/end of file or a non-word byte. A short alpha-only acronym matches a standalone occurrence in normal prose but does not match coincidentally as a substring of a longer identifier or inside random base64 noise. Path-shaped literals like /etc/passwd still match inside cat /etc/passwd because the leading / is non-word so no left-side boundary is enforced.
  - Length 7 bytes or more: pure case-sensitive substring match, no boundary check. A long literal matches anywhere it appears, including glued mid-identifier. Distinctiveness from sheer length makes coincidental substring match negligible. If a phrase exists in two written forms (with and without internal whitespace), add both as separate rules so each matches its respective form.
- A line of the shape /PATTERN/FLAGS is a regex. The first / and last / delimit the pattern; FLAGS is zero or more lowercase letters and is rewritten to a resharp inline-flag prefix (e.g. /foo/i becomes (?i)foo). Use this form to opt into substring-anywhere semantics for short literals (write the literal between the slashes), or to ban literals matching ^/.+/[a-z]*$ (escape the slashes, e.g. ban the literal /etc/passwd as /\/etc\/passwd/).
Empty lines are ignored. Lines starting with # are comments.
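A minimal sketch of the line classification described above, assuming the shape test is exactly ^/.+/[a-z]*$ and that only truly empty lines count as empty. The RuleLine enum and classify function are hypothetical names for illustration; the real loader may differ.

```rust
#[derive(Debug, PartialEq)]
enum RuleLine<'a> {
    /// Empty line or # comment.
    Skip,
    /// Bare line: case-sensitive literal.
    Literal(&'a str),
    /// /PATTERN/FLAGS shape, split at the first and last slash.
    Regex { pattern: &'a str, flags: &'a str },
}

/// Hypothetical classifier for one line of the rule file.
fn classify(line: &str) -> RuleLine<'_> {
    if line.is_empty() || line.starts_with('#') {
        return RuleLine::Skip;
    }
    if let Some(rest) = line.strip_prefix('/') {
        if let Some(last) = rest.rfind('/') {
            let (pattern, flags) = (&rest[..last], &rest[last + 1..]);
            // Pattern must be non-empty; flags must be lowercase letters only.
            if !pattern.is_empty() && flags.bytes().all(|b| b.is_ascii_lowercase()) {
                return RuleLine::Regex { pattern, flags };
            }
        }
    }
    RuleLine::Literal(line)
}

fn main() {
    for line in ["# comment", "ACME", "/foo/i", "/etc/passwd", r"/\/etc\/passwd/"] {
        println!("{line:?} -> {:?}", classify(line));
    }
}
```

Note how the unescaped /etc/passwd is classified as a regex with pattern etc and flags passwd, which is why the text recommends escaping the slashes when the literal itself has that shape.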
The 7-byte threshold has a coincidence-rate justification; see Architecture below
for the derivation and SUBSTRING_THRESHOLD in src/rules/types.rs for the constant.
One known regression under these semantics: a short literal rule will not match a
plural or suffixed form (a 3-letter acronym does not match the same acronym followed
immediately by s, because the trailing s is a word char and the boundary fails). If
plural matching is needed, express the rule as a regex with an optional trailing class:
/ACRONYMs?/.
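The length-dependent literal semantics can be modelled with a naive scan. This is a sketch under stated assumptions, not the scanner's implementation (which gates through Aho-Corasick): literal_hits and is_word are hypothetical helpers, and only the constant name and its value of 7 come from the text.

```rust
/// Value and name taken from the text; the real constant lives in src/rules/types.rs.
const SUBSTRING_THRESHOLD: usize = 7;

fn is_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical helper: byte offsets in `hay` where `lit` counts as a match.
fn literal_hits(hay: &[u8], lit: &[u8]) -> Vec<usize> {
    let mut hits = Vec::new();
    if lit.is_empty() || lit.len() > hay.len() {
        return hits;
    }
    for start in 0..=hay.len() - lit.len() {
        let end = start + lit.len();
        if &hay[start..end] != lit {
            continue;
        }
        if lit.len() < SUBSTRING_THRESHOLD {
            // A boundary is only required on a side whose edge byte is a word char.
            let left_ok = !is_word(lit[0]) || start == 0 || !is_word(hay[start - 1]);
            let right_ok =
                !is_word(lit[lit.len() - 1]) || end == hay.len() || !is_word(hay[end]);
            if !(left_ok && right_ok) {
                continue;
            }
        }
        hits.push(start);
    }
    hits
}

fn main() {
    println!("{:?}", literal_hits(b"the NSA said", b"NSA"));
    println!("{:?}", literal_hits(b"two NSAs", b"NSA"));
    println!("{:?}", literal_hits(b"fooSECRETKEYbar", b"SECRETKEY"));
}
```

The second call returns no hits, which is the plural limitation described above: the trailing s is a word byte, so the right-side boundary fails.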
Set-algebra operators
Resharp extends standard regex with two top-level set operators that pure-PCRE engines lack:
- A&B: intersection. Matches strings matched by both A and B.
- ~(A): complement. Matches strings that do NOT match A.
Combined, these express "match X but not Y" without lookaround. Example: ban any five-digit key except the all-zeros placeholder:
/key\_[0-9]{5}&~(key\_0{5})/
This flags key_12345 and key_99999 but lets key_00000 through. Class-level forms
[A&&B] (intersection) and [A~~B] (symmetric difference) are also available inside
character classes.
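To make the semantics of intersection with a complement concrete, here is a sketch that models the five-digit key rule with plain Rust predicates. It does not use resharp or any regex engine; is_key, is_placeholder, and flagged are hypothetical stand-ins for the two operands and their combination, each treated as a full-string match.

```rust
/// Stand-in for operand A: "key_" followed by exactly five ASCII digits.
fn is_key(s: &str) -> bool {
    s.strip_prefix("key_")
        .map_or(false, |d| d.len() == 5 && d.bytes().all(|b| b.is_ascii_digit()))
}

/// Stand-in for operand B: the all-zeros placeholder.
fn is_placeholder(s: &str) -> bool {
    s == "key_00000"
}

/// A & ~(B): in A, and not in B.
fn flagged(s: &str) -> bool {
    is_key(s) && !is_placeholder(s)
}

fn main() {
    for s in ["key_12345", "key_99999", "key_00000", "key_1234"] {
        println!("{s}: {}", flagged(s));
    }
}
```

The point of the operators is that this "and not" logic lives inside a single rule, so no separate allowlist mechanism is needed.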
Underscore is a resharp meta character. Unescaped _ is the top pattern, which matches
any single codepoint. Escape a literal underscore as \_, including inside algebra
operands such as ghp\_...&~(ghp\_0{36}).
The scanner extracts required literal bytes from regex rules and folds them into a
shared Aho-Corasick gate so the regex engine only runs on files that contain a required
substring. For set-algebra rules, intersection & is a transparent separator and
complement ~(...) bodies never contribute gates because they describe excluded strings,
not required bytes. A pattern that starts with literal bytes (key\_[0-9]{5}&~(...)
extracts key_) stays on the fast path. A pattern that starts with ~(...) or another
metacharacter falls into a smaller residual gate, still correct, just slower per file.
Extracted prefixes preserve the regex source's original UTF-8 bytes verbatim, so a rule
whose leading literal contains non-ASCII characters (em-dash —, smart quotes, ellipsis,
emoji) gates correctly against file content holding the same bytes; a walker that
mojibake'd those bytes during extraction would silently disable the rule by registering
a pattern AC could never match.
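The idea of extracting a required leading literal can be sketched as follows. This is a deliberately conservative illustration, not the scanner's extraction logic: leading_literal and has_alternation are hypothetical names, the metacharacter list is an assumption, and any unescaped | makes the sketch give up entirely because an alternative branch could bypass the prefix.

```rust
/// True if `pattern` contains an unescaped '|'.
fn has_alternation(pattern: &str) -> bool {
    let mut escaped = false;
    for c in pattern.chars() {
        if escaped {
            escaped = false;
        } else if c == '\\' {
            escaped = true;
        } else if c == '|' {
            return true;
        }
    }
    false
}

/// Hypothetical helper: the literal text every match must start with,
/// or an empty string when no such prefix can be claimed safely.
fn leading_literal(pattern: &str) -> String {
    let mut out = String::new();
    if has_alternation(pattern) {
        return out;
    }
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // Escaped punctuation (such as \_ or \/) is a literal character;
            // escaped letters and digits (\d, \w, \b) are classes or assertions.
            '\\' => match chars.peek().copied() {
                Some(e) if !e.is_alphanumeric() => {
                    chars.next();
                    out.push(e);
                }
                _ => break,
            },
            // The previous character is optional, so it is not required.
            '?' | '*' | '{' => {
                out.pop();
                break;
            }
            // Any other metacharacter ends the prefix; unescaped '_' is the
            // any-codepoint wildcard described above.
            '+' | '.' | '[' | ']' | '(' | ')' | '}' | '~' | '&' | '^' | '$' | '_' => break,
            // Pushing whole chars keeps the original UTF-8 bytes intact.
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{:?}", leading_literal(r"key\_[0-9]{5}&~(key\_0{5})"));
    println!("{:?}", leading_literal("~(foo)bar"));
}
```

A pattern that yields an empty prefix here corresponds to the residual case in the text: it cannot join the literal gate and has to be evaluated without one.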
Complement-body limitations (resharp 0.5.x)
Resharp 0.5.x cannot reverse a complement whose body contains a lookaround. The parser rewrites several surface atoms to internal lookarounds, so the following shapes fail at compile time:
- \b inside a ~(...) body. Rewritten to negative-lookahead / negative-lookbehind by the parser, then refused. Workaround: replace \b with \W (consumes a character on each side) or with literal whitespace, or move the boundary check outside the complement.
- \B inside a ~(...) body. Refused at parse time when the neighbours are unclassifiable. No in-place rewrite; restructure the rule.
- Unescaped ^ or $ inside a ~(...) body. Rewritten to lookbehind / lookahead in default-multiline mode and then refused. Workaround: use \A / \z for whole-content anchors, or move the anchor outside the complement. Inline (?-m) and group-scoped (?-m:^foo$) do NOT propagate into the complement body, so neither works as a workaround in resharp 0.5.x.
- User-explicit lookarounds ((?=, (?!, (?<=, (?<!) inside a ~(...) body. Refused for the same reason as the rewritten cases. Lift the lookaround outside the complement.
forbidden-strings detects every shape above at rule load time and reports the
specific trigger:
forbidden-strings: rule on line 42 (resharp): complement body contains \b;
resharp 0.5.x rewrites it to an internal lookaround which the reverse pass
refuses. Replace with \W ... See TROUBLESHOOTING.resharp.md for workarounds.
The doc at TROUBLESHOOTING.resharp.md in the repository root has the full
trace, more workarounds, and the upstream-issue draft.
Supported regex flags
The flag string accepts these lowercase letters, applied via resharp's inline-flag group:
- i: case-insensitive.
- m: multiline (^ / $ match at line boundaries).
- s: dot-matches-newline.
- u: toggle Unicode \w / \d semantics.
- x: ignore whitespace and # comments inside the pattern.
Resharp's parser also recognises U (swap greed) and R (CRLF line terminators),
but the validator deliberately rejects uppercase flags. Both are useless in this scanner:
U only affects match span length (not whether something matched), and the rare pattern
that needs CRLF-aware anchors can write \r?$ directly. If you ever need them locally
inside one pattern, use the inline form: (?U)foo or (?R)bar.
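A sketch of the flag validation and rewrite described above, assuming the accepted set is exactly the five letters listed and that an empty flag string produces no prefix. The function name flags_to_prefix is hypothetical, and how the real validator treats repeated letters is not covered here.

```rust
/// Hypothetical helper: turns the FLAGS part of /PATTERN/FLAGS into an
/// inline-flag prefix, or returns the first rejected character.
fn flags_to_prefix(flags: &str) -> Result<String, char> {
    if let Some(bad) = flags.chars().find(|c| !"imsux".contains(*c)) {
        return Err(bad);
    }
    Ok(if flags.is_empty() {
        String::new()
    } else {
        format!("(?{flags})")
    })
}

fn main() {
    println!("{:?}", flags_to_prefix("i"));
    println!("{:?}", flags_to_prefix("U"));
}
```

Prepending the returned prefix to the pattern gives the rewritten form, for example (?i) followed by foo for the rule /foo/i.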
Integration
Local (hk)
hk replaces husky for this repo. Wire git hooks once per machine:
# or, per-repo:
hk.pkl registers forbidden-strings for the pre-commit, pre-push, and check
hooks, so every commit, every push, and every explicit hk check runs the scanner
against the relevant files.
GitHub Actions
Materialize the rule file from a repository secret at job-start; the workflow YAML never echoes the secret value:
- name: Materialize rule file
  env:
    RULES: ${{ secrets.FORBIDDEN_STRINGS_LIST }}
  run: |
    cp forbidden-strings.local.example.txt forbidden-strings.local.txt
    if [ -n "$RULES" ]; then
      printenv RULES >> forbidden-strings.local.txt
    fi
- name: Build scanner
  run: mise run //packages/cli/forbidden-strings:build
- name: Scan (PR / merge_group)
  if: github.event_name != 'push'
  run: mise exec -- hk check --from-ref origin/main
- name: Scan (push to main)
  if: github.event_name == 'push'
  run: |
    packages/cli/forbidden-strings/target/release/forbidden-strings \
      --rules forbidden-strings.local.txt --all
Append via printenv >> file rather than interpolating the secret into a run: block;
shell expansion in the latter can leak the value to the log even with GitHub's masking.
The full workflow lives at .github/workflows/forbidden-strings.yml; PR / merge_group
events run hk check against changed files, push-to-main additionally runs --all.
Output
For each violation:
PATH:LINE:COL_START..COL_END rule=N
Columns are 1-based byte offsets within the matched line. The matched substring is never printed. Only the path, line number, column range, and the opaque rule index appear in failure output; otherwise a failing CI log becomes a leak surface. A contributor wanting to know which rule fired looks up the index against their local rule file.
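Since the output carries only an opaque rule index, a contributor or CI helper may want to parse a failure line before looking the rule up locally. The following is a minimal sketch that assumes the line has exactly the PATH:LINE:COL_START..COL_END rule=N shape shown above; the Hit struct and parse_hit function are hypothetical and not part of the tool. It makes no assumption about whether the rule index is zero-based or whether COL_END is inclusive.

```rust
#[derive(Debug, PartialEq)]
struct Hit {
    path: String,
    line: usize,
    col_start: usize,
    col_end: usize,
    rule: usize,
}

/// Hypothetical parser for one violation line. Splits from the right so
/// that colons inside the path do not confuse the field boundaries.
fn parse_hit(s: &str) -> Option<Hit> {
    let (loc, rule) = s.rsplit_once(" rule=")?;
    let (rest, cols) = loc.rsplit_once(':')?;
    let (path, line) = rest.rsplit_once(':')?;
    let (start, end) = cols.split_once("..")?;
    Some(Hit {
        path: path.to_string(),
        line: line.parse().ok()?,
        col_start: start.parse().ok()?,
        col_end: end.parse().ok()?,
        rule: rule.parse().ok()?,
    })
}

fn main() {
    println!("{:?}", parse_hit("src/lib.rs:12:5..9 rule=3"));
}
```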
Exit codes:
- 0: no violations.
- 1: one or more violations.
- 2: usage error or rules-file error.
Performance
Measured on an AMD Ryzen 7 8700F (16 threads). Full bench methodology and per-version
regression history are in PERF.md.
This repo (Monochromatic)
2,860 git-tracked files, 19.8 MiB total. 30 runs, hyperfine 1.20.0:
startup-only 9.0 ms ± 0.7 ms
--all 47.3 ms ± 2.9 ms
Linux kernel scale
Fresh shallow clone of torvalds/linux. 93,697 git-tracked files, 1.48 GiB. 5 runs:
startup-only 8.9 ms ± 0.7 ms
--all 2.250 s ± 0.253 s (~660 MiB/s wall, 11x parallelism)
vs betterleaks v1.1.2
Betterleaks is the upstream source for the bundled rule corpus
(data/betterleaks-default-config.toml), making it the most relevant baseline.
Same-content scan, --all (forbidden-strings) vs dir (betterleaks) on identical file sets:
metric              forbidden-strings   betterleaks   ratio
startup             7.3 ms              174 ms        ~24x
Mono same-content   28 ms               557 ms        ~20x
Linux kernel        1.6 s               5.3 s         ~3.3x
per-byte (kernel)   ~1.0 GB/s           ~0.3 GB/s     ~3.3x
Real-world Monochromatic (forbidden-strings respects .gitignore; betterleaks dir
walks the full tree including node_modules/, target/, vendored content):
forbidden-strings --all 43 ms (21 MiB scanned)
betterleaks dir 86.5 s (4.28 GB scanned)
The 2000x wall-clock ratio is data-volume-dominated, but the data-volume difference is
real and user-observable: any workflow that scans the working tree has to choose between
walking the whole filesystem or honouring .gitignore, and forbidden-strings makes the
latter choice by default.
Three architectural choices account for most of the per-byte gap:
- Dual Aho-Corasick gate with lazy regex dispatch. On clean files, both AC passes short-circuit before any regex engine runs. RE2 (betterleaks' engine) also keyword-prefilters, but its hit path verifies against the full DFA; forbidden-strings only queues find_all when an AC prefix is seen.
- Hybrid engine dispatch. 257 of 259 ported rules compile via the regex crate, which applies memchr / Teddy literal-prefix acceleration per-rule. RE2 compiles all rules into a shared DFA that cannot apply per-rule fast paths.
- Native binary startup. Rust LTO + panic = "abort" + stripped binary starts in ~7 ms. Go binary starts in ~174 ms (GC init, goroutine scheduler, config parse). For pre-commit hooks with sub-100 ms budgets, the startup gap alone disqualifies betterleaks.
The speed gap is not free; see "When to pick something else" for the capabilities
betterleaks ships that forbidden-strings deliberately omits.
Architecture
- Two-phase pipeline. Rule loading (regex compile + AC build) and file walking (gitignore-aware enumeration) run concurrently via rayon::join since they share no state. After both complete, files fan out across the rayon thread pool for parallel scan.
- Aho-Corasick literal gate. Every literal rule and every regex rule's extracted literal prefix joins a single AC automaton. Per file, the AC pass either finds zero hits (regex engine skipped entirely) or queues a follow-up regex evaluation for each prefix hit.
- Residual-shard regex fallback. Regex rules without an extractable literal prefix (those starting with ~(...), a metacharacter, or a class) fall into a smaller residual gate that runs unconditionally. Slower per file than the AC path but still linear-time.
- Self-skip for own rule files. --all walks skip five basenames unconditionally so rule bodies that match their own literal text do not self-flag: forbidden-strings.local.txt, forbidden-strings.local.example.txt, forbidden-strings.append.local.txt, data/betterleaks-default-config.toml, port-betterleaks-relaxations.ts. See is_skipped_file in src/main.rs.
- ignore crate walker. --all uses ignore::WalkBuilder (which honours .gitignore, .git/info/exclude, and global excludes) rather than shelling out to git ls-files. Same semantics, lower process overhead. See src/walk.rs.
- Bundled data/betterleaks-default-config.toml. Upstream-vendored provenance for the betterleaks port. The committed forbidden-strings.local.example.txt is derived from it; port-betterleaks-relaxations.ts records the lossy translations applied during the port.
- The 7-byte coincidence-rate threshold. A length-L literal in a case-sensitive alphabet of size A scanned over N random bytes has expected coincidence count ~= N * A^(-L). At L = 7, in 1 GB (N = 10^9 bytes) of dense base64 (A = 64) or random alphanumeric (A = 62) noise, the expected coincidence count per rule is ~2.3e-4 and ~2.8e-4 respectively, comfortably under 1 across realistic repo sizes and noise types. At L = 6 the same calculation gives ~0.015 / ~0.018, which becomes borderline once a repo has multiple GB of dense content or 100+ deny-list rules. The constant SUBSTRING_THRESHOLD lives in src/rules/types.rs.
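The coincidence estimate is easy to reproduce. The sketch below evaluates N * A^(-L) for the two alphabets and the two candidate lengths, taking N = 10^9 bytes as an assumption for "1 GB"; expected_coincidences is a hypothetical helper name, not something from the source tree.

```rust
/// Expected number of chance occurrences of one length-`len` literal in
/// `n_bytes` of uniformly random text over an alphabet of size `alphabet`.
fn expected_coincidences(n_bytes: f64, alphabet: f64, len: i32) -> f64 {
    n_bytes * alphabet.powi(-len)
}

fn main() {
    let n = 1e9; // assumption: 1 GB means 10^9 bytes
    for len in [6, 7] {
        for a in [64.0, 62.0] {
            println!("L={len} A={a}: {:.2e}", expected_coincidences(n, a, len));
        }
    }
}
```

The figure is per rule, so multiplying by the number of short-literal rules and by the repo size in GB gives a rough total for a whole deny-list.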