forbidden-strings
Ready to publish.
A linear-time deny-list scanner for Git repos, designed for the case where the rules themselves are sensitive.
Most secret-scanning tools (gitleaks, trufflehog, secretlint) ship rules in committed config files and rely on keyword prefilters.
That breaks down when the forbidden literals would themselves leak if committed:
a customer name, a partner identifier, an internal codename, a pre-disclosure project name.
forbidden-strings keeps the rule file out of the repo entirely,
accepts it via env var or --rules,
and never prints the matched substring in failure output.
What's different
- Resharp set-algebra rules. A&B (intersection) and ~(A) (complement) are first-class. Express "match X but not Y" without lookaround. The PCRE- and RE2-style engines behind gitleaks, trufflehog, and secretlint have no intersection or complement operators; the workaround in those tools is per-rule allowlists, which scale badly.
- Out-of-band rules. --rules <path> or FORBIDDEN_STRINGS_RULES. CI materializes the rule file from a repository secret; contributors keep a gitignored copy locally. No rule ever lives in a committed file.
- Redacted output. Only path:line:cols rule=N is printed. The matched substring, surrounding line, and rule pattern never appear in the failure log, so a public CI log is not itself a leak surface.
- Linear-time matching. Resharp is derivative-based with no backtracking; Aho-Corasick gates the regex engine via extracted literal prefixes. A pathological rule combination cannot exhibit catastrophic-backtracking behavior.
- Native Rust binary, no startup tax. Full-repo scan over 2,700 files in ~16 ms; 1,000-rule synthetic in ~27 ms; pre-commit hot path well under 5 ms. No Node startup, no WASM init, no per-invocation TOML parse.
When to pick something else
If the rule set can ship in the repo and set-algebra is unnecessary, gitleaks or betterleaks
is probably a better fit: larger ecosystem, SARIF output, GitHub-native code-scanning upload,
allowlist files keyed on path globs.
The niche forbidden-strings fills is "rules that cannot be committed" or "the rule grammar
needs operators PCRE doesn't have"; pick it when either of those is the binding constraint.
Build
The release binary lands at packages/cli/forbidden-strings/target/release/forbidden-strings.
hk.pkl invokes that path directly; nothing needs to be on $PATH.
Usage
# scan a specific file list (uses ./forbidden-strings.local.txt by default)
# scan every git-tracked file (respects .gitignore via git ls-files)
The rules path is resolved in this order: --rules <PATH> flag, then
FORBIDDEN_STRINGS_RULES env var, then ./forbidden-strings.local.txt
in the current working directory.
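The precedence above can be sketched as a small function. This is an illustration only: resolve_rules_path and its parameters are hypothetical names, not taken from the scanner's source, and how the real binary treats an empty env var is not covered here.

```rust
use std::path::PathBuf;

/// Hypothetical helper: picks the rules path using the documented precedence.
/// `flag` stands in for the parsed --rules value and `env` for the value of
/// FORBIDDEN_STRINGS_RULES; both names are assumptions for this sketch.
fn resolve_rules_path(flag: Option<&str>, env: Option<&str>) -> PathBuf {
    flag.or(env) // the flag wins over the env var
        .map(PathBuf::from)
        // fall back to the default file in the current working directory
        .unwrap_or_else(|| PathBuf::from("./forbidden-strings.local.txt"))
}
```

A caller would typically pass std::env::var("FORBIDDEN_STRINGS_RULES").ok().as_deref() as the second argument.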
# explicit path
# via env var (CI-friendly: materialize from a secret, then run)
FORBIDDEN_STRINGS_RULES=./materialized.txt
Local hook setup
hk replaces husky for this repo. Wire git hooks once per machine:
# or, per-repo:
Then git commit runs the pre-commit hook defined in hk.pkl,
which in turn runs this scanner against the staged files.
CI integration (GitHub Actions)
Materialize the rule file from a repository secret at job-start; the workflow YAML never echoes the secret value:
- name: Materialize rule file
  env:
    RULES: ${{ secrets.FORBIDDEN_STRINGS_LIST }}
  run: printenv RULES > forbidden-strings.local.txt
- name: Build scanner
  run: mise run //packages/cli/forbidden-strings:build
- name: Scan
  run: hk check --from-ref origin/main
Pipe via printenv > file rather than interpolating the secret into a run: block;
shell expansion in the latter can leak the value to the log even with GitHub's masking.
Rule file format
One rule per line. Two shapes:
- A bare line is a case-sensitive literal. Match semantics depend on length:
  - Length below 7 bytes: conditional word-boundary check (grep -w semantics). A boundary is required at any end whose edge byte is a word character ([A-Za-z0-9_]); the file context on that side must be either start/end of file or a non-word byte. So a short alpha-only acronym matches a standalone occurrence in normal prose but does not match coincidentally as a substring of a longer identifier or inside random base64 noise. Path-shaped literals like /etc/passwd still match inside cat /etc/passwd because the leading / is non-word so no left-side boundary is enforced.
  - Length 7 bytes or more: pure case-sensitive substring match, no boundary check. A long literal matches anywhere it appears, including glued mid-identifier. Distinctiveness from sheer length makes coincidental substring match negligible. If a phrase exists in two written forms (with and without internal whitespace), add both as separate rules so each matches its respective form.
- A line of the shape /PATTERN/FLAGS is a regex. The first / and last / delimit the pattern; FLAGS is zero or more lowercase letters and is rewritten to a resharp inline-flag prefix (e.g. /foo/i becomes (?i)foo). Use this form to opt into substring-anywhere semantics for short literals (write the literal between the slashes), or to ban literals matching ^/.+/[a-z]*$ (escape the slashes, e.g. ban the literal /etc/passwd as /\/etc\/passwd/).
Empty lines are ignored. Lines starting with # are comments.
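A minimal sketch of how one rule line could be classified under the format described above. The Rule enum and parse_rule_line are hypothetical names for illustration; the sketch assumes the line arrives without its trailing newline and leaves flag validation to a later step.

```rust
/// Hypothetical in-memory shape for one parsed rule line.
#[derive(Debug, PartialEq)]
enum Rule {
    /// Bare literal shorter than 7 bytes: word-boundary checked.
    ShortLiteral(String),
    /// Bare literal of 7 bytes or more: plain substring match.
    LongLiteral(String),
    /// /PATTERN/FLAGS rewritten to an inline-flag prefixed pattern.
    Regex(String),
}

/// Mirrors the 7-byte threshold described in this README.
const SUBSTRING_THRESHOLD: usize = 7;

fn parse_rule_line(line: &str) -> Option<Rule> {
    // Empty lines and comments produce no rule.
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Regex shape ^/.+/[a-z]*$: first and last '/' delimit the pattern.
    if line.starts_with('/') {
        if let Some(last) = line.rfind('/') {
            let flags = &line[last + 1..];
            // `last > 1` guarantees a non-empty pattern between the slashes.
            if last > 1 && flags.chars().all(|c| c.is_ascii_lowercase()) {
                let pattern = &line[1..last];
                let rewritten = if flags.is_empty() {
                    pattern.to_string()
                } else {
                    format!("(?{}){}", flags, pattern)
                };
                return Some(Rule::Regex(rewritten));
            }
        }
    }
    // Everything else is a literal; byte length decides the semantics.
    if line.len() < SUBSTRING_THRESHOLD {
        Some(Rule::ShortLiteral(line.to_string()))
    } else {
        Some(Rule::LongLiteral(line.to_string()))
    }
}
```

Note that an unescaped /etc/passwd line takes the regex branch in this sketch (pattern etc, flags passwd), which is why the escaped form is needed to ban it as a literal.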
The 7-byte threshold for word-bounded vs substring-anywhere comes from a coincidence-rate
calculation: a length-L literal in a case-sensitive alphabet of size A scanned over N random
bytes has expected coincidence count ~= N * A^(-L). At L = 7, in 1 GB (N = 1e9) of dense base64 (A = 64)
or random alphanumeric (A = 62) noise, the expected coincidence count per rule is ~2.3e-4 and
~2.8e-4 respectively, comfortably under 1 across realistic repo sizes and noise types.
At L = 6 the same calculation gives ~0.015 / ~0.018, which becomes borderline once a repo
has multiple GB of dense content or 100+ deny-list rules. See SUBSTRING_THRESHOLD in
src/rules.rs for the threshold constant and derivation.
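The estimate is easy to reproduce. The function below is a standalone sketch of the N * A^(-L) formula, not the code in src/rules.rs; expected_coincidences is a hypothetical name, and it assumes uniformly random content with N taken as 1e9 bytes for "1 GB".

```rust
/// Expected number of coincidental matches of one length-`len` literal in
/// `n_bytes` of uniformly random text over an alphabet of `alphabet` symbols.
/// Uses the approximation N * A^(-L), ignoring the small (L - 1) edge term.
fn expected_coincidences(n_bytes: f64, alphabet: f64, len: i32) -> f64 {
    n_bytes * alphabet.powi(-len)
}
```

With n_bytes = 1e9 and alphabet = 64.0, len = 7 gives roughly 2.3e-4 and len = 6 gives roughly 0.015. Multiplying by the number of rules shows why 6 is borderline: 100 rules at len = 6 already put the expected count above 1.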
One known regression under the new semantics: a short literal rule will not match a
plural or suffixed form (e.g. a 3-letter acronym does not match the same acronym followed
immediately by s, because the trailing s is a word char and the boundary fails). If
plural matching is needed, express the rule as a regex with an optional trailing class:
/ACRONYMs?/.
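The conditional boundary rule, and the plural miss it causes, can be sketched as follows. This is an illustrative reimplementation under the semantics described in this README, not the scanner's actual matcher; is_word and short_literal_matches are hypothetical names, and ABC is a placeholder acronym.

```rust
/// Word characters as defined above: [A-Za-z0-9_].
fn is_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if `lit` occurs in `hay` under the conditional boundary rule:
/// a boundary is required only on a side whose edge byte of `lit` is a
/// word character.
fn short_literal_matches(hay: &[u8], lit: &[u8]) -> bool {
    if lit.is_empty() || lit.len() > hay.len() {
        return false;
    }
    let (first, last) = (lit[0], lit[lit.len() - 1]);
    hay.windows(lit.len()).enumerate().any(|(start, window)| {
        if window != lit {
            return false;
        }
        let end = start + lit.len();
        // Left side: fine if the literal starts with a non-word byte, or the
        // match sits at start of file, or the preceding byte is non-word.
        let left_ok = !is_word(first) || start == 0 || !is_word(hay[start - 1]);
        // Right side: the mirror image of the left-side check.
        let right_ok = !is_word(last) || end == hay.len() || !is_word(hay[end]);
        left_ok && right_ok
    })
}
```

Under this sketch, short_literal_matches(b"the ABC team", b"ABC") is true while short_literal_matches(b"ABCs are here", b"ABC") is false, which is the plural miss described above.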
Set-algebra operators
Resharp extends standard regex with two top-level set operators that pure-PCRE engines lack:
- A&B -- intersection: matches strings matched by both A and B
- ~(A) -- complement: matches strings that do NOT match A
Combined, these express "match X but not Y" without lookaround. Example: ban any five-digit key except the all-zeros placeholder:
/key_[0-9]{5}&~(key_0{5})/
This flags key_12345 and key_99999 but lets key_00000 through. Class-level forms
[A&&B] (intersection) and [A~~B] (symmetric difference) are also available inside
character classes.
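To make the semantics of the example concrete, the sketch below emulates A&~(B) with plain Rust predicates over a whole candidate string. It does not call resharp and is not how the scanner evaluates rules; is_key, is_placeholder, and is_forbidden are hypothetical names.

```rust
/// A: "key_" followed by exactly five ASCII digits.
fn is_key(s: &str) -> bool {
    match s.strip_prefix("key_") {
        Some(rest) => rest.len() == 5 && rest.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

/// B: the all-zeros placeholder.
fn is_placeholder(s: &str) -> bool {
    s == "key_00000"
}

/// A & ~(B): in set A and not in set B.
fn is_forbidden(s: &str) -> bool {
    is_key(s) && !is_placeholder(s)
}
```

The point of the operators is that this conjunction lives inside a single pattern, so no per-rule allowlist is needed to carve out the placeholder.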
The scanner extracts a leading literal prefix from each regex rule and folds it into a
shared Aho-Corasick gate so the regex engine only runs on files that contain the prefix.
A pattern that starts with literal bytes (key_[0-9]{5}&~(...) extracts key_) stays on
the fast path. A pattern that starts with ~(...) or another metacharacter falls into a
smaller residual gate, still correct, just slower per file.
Extracted prefixes preserve the regex source's original UTF-8 bytes verbatim, so a rule
whose leading literal contains non-ASCII characters (em-dash —, smart quotes, ellipsis,
emoji) gates correctly against file content holding the same bytes; a walker that
mojibake'd those bytes during extraction would silently disable the rule by registering
a pattern AC could never match.
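A deliberately simplified sketch of leading-literal extraction, to show the idea behind the gate. The real scanner presumably walks the parsed regex; this version only scans the pattern text, and leading_literal and its metacharacter list are assumptions made for illustration. It returns a slice of the original pattern, so non-ASCII bytes pass through untouched.

```rust
/// Returns the literal text a pattern must start with, or "" if none can be
/// claimed safely. Conservative: it gives up on any alternation and on
/// escapes rather than trying to interpret them.
fn leading_literal(pattern: &str) -> &str {
    // An alternation could make the first branch optional; bail out.
    if pattern.contains('|') {
        return "";
    }
    // Characters treated as "not a plain literal" in this sketch.
    const META: &str = r"\.[]()*+?{}^$&~";
    let end = pattern
        .find(|c: char| META.contains(c))
        .unwrap_or(pattern.len());
    let mut prefix = &pattern[..end];
    // A quantifier that may allow zero repetitions makes the last literal
    // character optional, so drop it from the prefix.
    if matches!(pattern[end..].chars().next(), Some('?' | '*' | '{')) {
        if let Some((idx, _)) = prefix.char_indices().last() {
            prefix = &prefix[..idx];
        }
    }
    prefix
}
```

For key_[0-9]{5}&~(key_0{5}) this yields key_, and for a pattern starting with ~( it yields the empty string, which corresponds to the residual gate described above.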
Complement-body limitations (resharp 0.5.x)
Resharp 0.5.x cannot reverse a complement whose body contains a lookaround. The parser rewrites several surface atoms to internal lookarounds, so the following shapes fail at compile time:
- \b inside a ~(...) body. Rewritten to negative-lookahead / negative-lookbehind by the parser, then refused. Workaround: replace \b with \W (consumes a character on each side) or with literal whitespace, or move the boundary check outside the complement.
- \B inside a ~(...) body. Refused at parse time when the neighbours are unclassifiable. No in-place rewrite; restructure the rule.
- Unescaped ^ or $ inside a ~(...) body. Rewritten to lookbehind / lookahead in default-multiline mode and then refused. Workaround: use \A / \z for whole-content anchors, or move the anchor outside the complement. Inline (?-m) and group-scoped (?-m:^foo$) do NOT propagate into the complement body, so neither works as a workaround in resharp 0.5.x.
- User-explicit lookarounds ((?=, (?!, (?<=, (?<!) inside a ~(...) body. Refused for the same reason as the rewritten cases. Lift the lookaround outside the complement.
forbidden-strings detects every shape above at rule load time and reports the
specific trigger:
forbidden-strings: rule on line 42 (resharp): complement body contains \b;
resharp 0.5.x rewrites it to an internal lookaround which the reverse pass
refuses. Replace with \W ... See TROUBLESHOOTING.resharp.md for workarounds.
The doc at TROUBLESHOOTING.resharp.md in the repository root has the full
trace, more workarounds, and the upstream-issue draft.
Supported regex flags
The flag string accepts these lowercase letters, applied via resharp's inline-flag group:
- i -- case-insensitive
- m -- multiline (^/$ match at line boundaries)
- s -- dot-matches-newline
- u -- toggle Unicode \w/\d semantics
- x -- ignore whitespace and # comments inside the pattern
Resharp's parser also recognises U (swap greed) and R (CRLF line terminators),
but the validator deliberately rejects uppercase flags. Both are useless in this scanner:
U only affects match span length (not whether something matched), and the rare pattern
that needs CRLF-aware anchors can write \r?$ directly. If you ever need them locally
inside one pattern, use the inline form: (?U)foo or (?R)bar.
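The flag check can be pictured as a simple allowlist over the five supported letters. This is a sketch of the documented behaviour, not the validator's actual code; validate_flags and its error shape are hypothetical.

```rust
/// Accepts only the five documented flags; returns the first rejected
/// character otherwise (uppercase U and R included).
fn validate_flags(flags: &str) -> Result<(), char> {
    match flags.chars().find(|c| !"imsux".contains(*c)) {
        Some(bad) => Err(bad),
        None => Ok(()),
    }
}
```

An empty flag string passes, matching the "zero or more" wording of the rule format.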
Output
For each violation:
PATH:LINE:COL_START..COL_END rule=N
Columns are 1-based byte offsets within the matched line. The matched substring is never printed. Only the path, line number, column range, and the opaque rule index appear in failure output; otherwise a failing CI log becomes a leak surface. A contributor wanting to know which rule fired looks up the index against their local rule file.
Exit codes:
- 0 -- no violations
- 1 -- one or more violations
- 2 -- usage error or rules-file error
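For tooling that consumes the failure output, a line in the PATH:LINE:COL_START..COL_END rule=N shape can be parsed along these lines. The Violation struct and parse_violation are hypothetical helpers for illustration; the sketch makes no assumption about how the rule index N maps onto lines of the rule file.

```rust
/// Hypothetical parsed form of one violation line.
#[derive(Debug, PartialEq)]
struct Violation {
    path: String,
    line: usize,
    col_start: usize,
    col_end: usize,
    rule: usize,
}

fn parse_violation(s: &str) -> Option<Violation> {
    let (location, rule) = s.rsplit_once(" rule=")?;
    // Split from the right so that a path containing ':' stays intact.
    let (rest, cols) = location.rsplit_once(':')?;
    let (path, line) = rest.rsplit_once(':')?;
    let (col_start, col_end) = cols.split_once("..")?;
    Some(Violation {
        path: path.to_string(),
        line: line.parse().ok()?,
        col_start: col_start.parse().ok()?,
        col_end: col_end.parse().ok()?,
        rule: rule.parse().ok()?,
    })
}
```

Any line that does not fit the shape yields None, so unrelated log output can be skipped safely.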
Performance
Measured against this repo (2,700 files, 21 MiB) on an AMD Ryzen 7 8700F:
- Realistic ruleset (17 rules), startup-only: 1.2 ms
- Realistic ruleset, full-repo --all: 15.5 ms (~1.4 GiB/s end-to-end)
- Synthetic 1,000-rule (500 literals + 500 prefixed regex), full-repo --all: 27 ms
- Pathological case: 1,000,000 hits in a single 43 MB file: 1.5 s wall
Detailed benchmarks, methodology, and the rationale for rejected optimizations
(mmap, extension pre-filtering, concat-and-scan) live in PERF.md.