forbidden-strings 0.1.0

Out-of-band scanner for forbidden literal strings and regex patterns. Gitignore-aware, fast, dependency-light: built for CI deny-listing of leaked credentials and banned tokens.

forbidden-strings

Ready to publish.

A linear-time deny-list scanner for Git repos, designed for the case where the rules themselves are sensitive.

Most secret-scanning tools (gitleaks, trufflehog, secretlint) ship rules in committed config files and rely on keyword prefilters. That breaks down when the forbidden literals would themselves leak if committed: a customer name, a partner identifier, an internal codename, a pre-disclosure project name. forbidden-strings keeps the rule file out of the repo entirely, accepts it via env var or --rules, and never prints the matched substring in failure output.

What's different

  • Resharp set-algebra rules. A&B (intersection) and ~(A) (complement) are first-class. Express "match X but not Y" without lookaround. The regex engines behind gitleaks, trufflehog, and secretlint (PCRE-style or plain RE2) cannot do this; the workaround in those tools is per-rule allowlists, which scale badly.
  • Out-of-band rules. --rules <path> or FORBIDDEN_STRINGS_RULES. CI materializes the rule file from a repository secret; contributors keep a gitignored copy locally. No rule ever lives in a committed file.
  • Redacted output. Only path:line:cols rule=N is printed. The matched substring, surrounding line, and rule pattern never appear in the failure log, so a public CI log is not itself a leak surface.
  • Linear-time matching. Resharp is derivative-based with no backtracking; Aho-Corasick gates the regex engine via extracted literal prefixes. A pathological rule combination cannot exhibit catastrophic-backtracking behavior.
  • Native Rust binary, no startup tax. Full-repo scan over 2,700 files in ~16 ms; 1,000-rule synthetic in ~27 ms; pre-commit hot path well under 5 ms. No Node startup, no WASM init, no per-invocation TOML parse.

When to pick something else

If the rule set can ship in the repo and set-algebra is unnecessary, gitleaks or betterleaks is probably a better fit: larger ecosystem, SARIF output, GitHub-native code-scanning upload, allowlist files keyed on path globs. The niche forbidden-strings fills is "rules that cannot be committed" and "the rule grammar needs operators PCRE doesn't have"; pick it when either of those is the binding constraint.

Build

mise run //packages/cli/forbidden-strings:build

The release binary lands at packages/cli/forbidden-strings/target/release/forbidden-strings. hk.pkl invokes that path directly; nothing needs to be on $PATH.

Usage

# scan a specific file list (uses ./forbidden-strings.local.txt by default)
forbidden-strings path/to/file other/file

# scan every git-tracked file (respects .gitignore via git ls-files)
forbidden-strings --all

The rules path is resolved in this order: --rules <PATH> flag, then FORBIDDEN_STRINGS_RULES env var, then ./forbidden-strings.local.txt in the current working directory.
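The resolution order above can be sketched as a small pure function. This is an illustration, not the scanner's actual code: the function name and the choice to pass the env value in as a parameter are assumptions, and how an empty FORBIDDEN_STRINGS_RULES value is treated is not covered here.

```rust
use std::path::PathBuf;

/// Default rules file, relative to the current working directory.
const DEFAULT_RULES_FILE: &str = "./forbidden-strings.local.txt";

/// Pick the rules path: --rules flag first, then the env var, then the default.
/// Both inputs are passed in explicitly so the function stays pure; a caller
/// would fill `env` from std::env::var("FORBIDDEN_STRINGS_RULES").ok().
/// (Hypothetical helper, not taken from the scanner's source.)
fn resolve_rules_path(flag: Option<&str>, env: Option<&str>) -> PathBuf {
    flag.or(env)
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(DEFAULT_RULES_FILE))
}
```

Keeping the lookup pure makes the precedence easy to unit-test without touching the process environment.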

# explicit path
forbidden-strings --rules ./other-rules.txt --all

# via env var (CI-friendly: materialize from a secret, then run)
FORBIDDEN_STRINGS_RULES=./materialized.txt forbidden-strings --all

Local hook setup

hk replaces husky for this repo. Wire git hooks once per machine:

hk install --global   # recommended; needs Git 2.54+
# or, per-repo:
hk install

Then git commit runs the pre-commit hook defined in hk.pkl, which in turn runs this scanner against the staged files.

CI integration (GitHub Actions)

Materialize the rule file from a repository secret at job start; the workflow YAML never echoes the secret value:

- name: Materialize rule file
  env:
    RULES: ${{ secrets.FORBIDDEN_STRINGS_LIST }}
  run: printenv RULES > forbidden-strings.local.txt

- name: Build scanner
  run: mise run //packages/cli/forbidden-strings:build

- name: Scan
  run: hk check --from-ref origin/main

Redirect printenv output to a file rather than interpolating the secret into a run: block; shell expansion in the latter can leak the value to the log even with GitHub's masking.

Rule file format

One rule per line. Two shapes:

  • A bare line is a case-sensitive literal. Match semantics depend on length:
    • Length below 7 bytes: conditional word-boundary check (grep -w semantics). A boundary is required at any end whose edge byte is a word character ([A-Za-z0-9_]); the file context on that side must be either start/end of file or a non-word byte. So a short alpha-only acronym matches a standalone occurrence in normal prose but does not match coincidentally as a substring of a longer identifier or inside random base64 noise. Short path-shaped literals like /tmp still match inside cat /tmp because the leading / is non-word, so no left-side boundary is enforced.
    • Length 7 bytes or more: pure case-sensitive substring match, no boundary check. A long literal matches anywhere it appears, including glued mid-identifier. Distinctiveness from sheer length makes coincidental substring match negligible. If a phrase exists in two written forms (with and without internal whitespace), add both as separate rules so each matches its respective form.
  • A line of the shape /PATTERN/FLAGS is a regex. The first / and last / delimit the pattern; FLAGS is zero or more lowercase letters and is rewritten to a resharp inline-flag prefix (e.g. /foo/i becomes (?i)foo). Use this form to opt into substring-anywhere semantics for short literals (write the literal between the slashes), or to ban literals matching ^/.+/[a-z]*$ (escape the slashes, e.g. ban the literal /etc/passwd as /\/etc\/passwd/).

Empty lines are ignored. Lines starting with # are comments.
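As a rough illustration of the line shapes described above, the classifier below splits a line into "skip", "literal", or "regex parts". It is a minimal sketch under assumptions: the Rule enum and parse_rule_line are hypothetical names, the caller is assumed to have stripped the line terminator, and flags are only shape-checked here (lowercase letters), not validated.

```rust
/// One parsed rule-file line (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Rule {
    /// Bare line: case-sensitive literal.
    Literal(String),
    /// /PATTERN/FLAGS line, split into its two parts (flags not validated).
    Regex { pattern: String, flags: String },
}

/// Classify one line; returns None for empty lines and # comments.
fn parse_rule_line(line: &str) -> Option<Rule> {
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Regex shape: starts with '/', has a later '/', and everything after
    // that last '/' is zero or more lowercase ASCII letters.
    if let Some(rest) = line.strip_prefix('/') {
        if let Some(last) = rest.rfind('/') {
            let (pattern, flags) = (&rest[..last], &rest[last + 1..]);
            if !pattern.is_empty() && flags.bytes().all(|b| b.is_ascii_lowercase()) {
                return Some(Rule::Regex {
                    pattern: pattern.to_string(),
                    flags: flags.to_string(),
                });
            }
        }
    }
    Some(Rule::Literal(line.to_string()))
}
```

Under this shape test a bare /etc/passwd line is regex-shaped (pattern etc, flags passwd), which is why the escaped form /\/etc\/passwd/ is needed to ban that literal.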

The 7-byte threshold for word-bounded vs substring-anywhere comes from a coincidence-rate calculation: a length-L literal in a case-sensitive alphabet of size A scanned over N random bytes has expected coincidence count ~= N * A^(-L). At L = 7, in 1 GB (N = 10^9 bytes) of dense base64 (A = 64) or random alphanumeric (A = 62) noise, the expected coincidence count per rule is ~2.3e-4 and ~2.8e-4 respectively, comfortably under 1 across realistic repo sizes and noise types. At L = 6 the same calculation gives ~0.015 / ~0.018, which becomes borderline once a repo has multiple GB of dense content or 100+ deny-list rules. See SUBSTRING_THRESHOLD in src/rules.rs for the threshold constant and derivation.
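The estimate N * A^(-L) is easy to recompute for other repo sizes, alphabets, or thresholds. The helper below is a hypothetical one-liner for that purpose; it assumes the noise is drawn uniformly from the alphabet and ignores the small (N - L + 1) edge correction.

```rust
/// Expected number of coincidental hits for one literal of length `len`
/// in `n_bytes` of noise drawn uniformly from `alphabet` symbols:
/// N * A^(-L). (Hypothetical helper, not from src/rules.rs.)
fn expected_coincidences(n_bytes: f64, alphabet: f64, len: i32) -> f64 {
    n_bytes * alphabet.powi(-len)
}
```

For example, expected_coincidences(1e9, 64.0, 7) evaluates to roughly 2.3e-4. Results shift by a few percent depending on whether 1 GB is taken as 10^9 or 2^30 bytes.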

One known regression under the new semantics: a short literal rule will not match a plural or suffixed form (e.g. a 3-letter acronym does not match the same acronym followed immediately by s, because the trailing s is a word char and the boundary fails). If plural matching is needed, express the rule as a regex with an optional trailing class: /ACRONYMs?/.
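The length-dependent literal semantics in this section, including the plural case, can be sketched with a naive byte scan. This is an illustration only: the real scanner is described as using Aho-Corasick rather than a windowed comparison, and literal_matches and is_word_byte are hypothetical names. The constant reuses the SUBSTRING_THRESHOLD name with the value 7 stated in the text.

```rust
/// Literals shorter than this get the conditional word-boundary check.
const SUBSTRING_THRESHOLD: usize = 7;

/// Word bytes are [A-Za-z0-9_].
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if `literal` occurs in `haystack` under the length-dependent rules.
/// Naive O(n * m) scan, for illustration only.
fn literal_matches(haystack: &[u8], literal: &[u8]) -> bool {
    if literal.is_empty() || literal.len() > haystack.len() {
        return false;
    }
    let short = literal.len() < SUBSTRING_THRESHOLD;
    let first = literal[0];
    let last = literal[literal.len() - 1];
    haystack.windows(literal.len()).enumerate().any(|(start, w)| {
        if w != literal {
            return false;
        }
        if !short {
            return true; // long literal: plain substring match
        }
        let end = start + literal.len();
        // A boundary is only required on a side whose edge byte is a word byte.
        let left_ok =
            !is_word_byte(first) || start == 0 || !is_word_byte(haystack[start - 1]);
        let right_ok =
            !is_word_byte(last) || end == haystack.len() || !is_word_byte(haystack[end]);
        left_ok && right_ok
    })
}
```

With this sketch a 3-letter literal matches in "see ABC here" but not in "ABCs", while an 8-byte literal matches even when glued inside a longer identifier.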

Set-algebra operators

Resharp extends standard regex with two top-level set operators that pure-PCRE engines lack:

  • A&B -- intersection: matches strings matched by both A and B
  • ~(A) -- complement: matches strings that do NOT match A

Combined, these express "match X but not Y" without lookaround. Example: ban any five-digit key except the all-zeros placeholder:

/key_[0-9]{5}&~(key_0{5})/

This flags key_12345 and key_99999 but lets key_00000 through. Class-level forms [A&&B] (intersection) and [A~~B] (symmetric difference) are also available inside character classes.
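To make the semantics of the example rule concrete, the sketch below spells out "A & ~(B)" as plain predicates over a candidate string. It does not use resharp at all; the three functions are hand-written stand-ins (hypothetical names) for the languages of key_[0-9]{5} and key_0{5}, applied to a whole candidate rather than searched within a file.

```rust
/// Whole-string test standing in for key_[0-9]{5}.
fn is_five_digit_key(s: &str) -> bool {
    match s.strip_prefix("key_") {
        Some(d) => d.len() == 5 && d.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

/// Whole-string test standing in for key_0{5}.
fn is_placeholder(s: &str) -> bool {
    s == "key_00000"
}

/// A & ~(B): in the language of A and not in the language of B.
fn is_forbidden_key(s: &str) -> bool {
    is_five_digit_key(s) && !is_placeholder(s)
}
```

Intersection becomes a logical AND of the two membership tests and complement becomes a NOT, which is the sense in which the rule expresses "match X but not Y" without lookaround.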

The scanner extracts a leading literal prefix from each regex rule and folds it into a shared Aho-Corasick gate so the regex engine only runs on files that contain the prefix. A pattern that starts with literal bytes (key_[0-9]{5}&~(...) extracts key_) stays on the fast path. A pattern that starts with ~(...) or another metacharacter falls into a smaller residual gate, still correct, just slower per file. Extracted prefixes preserve the regex source's original UTF-8 bytes verbatim, so a rule whose leading literal contains non-ASCII characters (em-dash, smart quotes, ellipsis, emoji) gates correctly against file content holding the same bytes; a walker that mojibake'd those bytes during extraction would silently disable the rule by registering a pattern AC could never match.
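A conservative version of the prefix extraction and gating idea can be sketched as follows. The assumptions are significant: literal_prefix and needs_regex_scan are hypothetical names, str::contains stands in for the shared Aho-Corasick gate, any '|' in the pattern disables extraction, and every metacharacter (including a backslash) ends the prefix. The scanner's real extractor may accept more shapes.

```rust
/// Leading run of literal characters in a regex source, or "" if none.
/// Conservative: gives up on any '|' and stops at the first metacharacter.
fn literal_prefix(pattern: &str) -> &str {
    const META: &str = r"\.[](){}*+?|^$&~";
    if pattern.contains('|') {
        return ""; // alternation: no single prefix is guaranteed
    }
    // Byte index of the first metacharacter (or end of pattern).
    let end = pattern
        .find(|c: char| META.contains(c))
        .unwrap_or(pattern.len());
    let prefix = &pattern[..end];
    // If a quantifier follows, the last literal char may be optional or
    // repeated, so drop it from the required prefix.
    match pattern[end..].chars().next() {
        Some('*') | Some('?') | Some('{') | Some('+') => {
            let cut = prefix.char_indices().last().map_or(0, |(i, _)| i);
            &prefix[..cut]
        }
        _ => prefix,
    }
}

/// Gate: only content containing the prefix needs the regex engine.
/// An empty prefix cannot gate, so the content is always a candidate.
fn needs_regex_scan(content: &str, pattern: &str) -> bool {
    content.contains(literal_prefix(pattern))
}
```

Because the prefix is a slice of the original pattern string, its UTF-8 bytes are carried over unchanged, which is the property the paragraph above depends on for non-ASCII leading literals.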

Complement-body limitations (resharp 0.5.x)

Resharp 0.5.x cannot reverse a complement whose body contains a lookaround. The parser rewrites several surface atoms to internal lookarounds, so the following shapes fail at compile time:

  • \b inside a ~(...) body. Rewritten to negative-lookahead / negative-lookbehind by the parser, then refused. Workaround: replace \b with \W (consumes a character on each side) or with literal whitespace, or move the boundary check outside the complement.
  • \B inside a ~(...) body. Refused at parse time when the neighbours are unclassifiable. No in-place rewrite; restructure the rule.
  • Unescaped ^ or $ inside a ~(...) body. Rewritten to lookbehind / lookahead in default-multiline mode and then refused. Workaround: use \A / \z for whole-content anchors, or move the anchor outside the complement. Inline (?-m) and group-scoped (?-m:^foo$) do NOT propagate into the complement body, so neither works as a workaround in resharp 0.5.x.
  • User-explicit lookarounds ((?=, (?!, (?<=, (?<!) inside a ~(...) body. Refused for the same reason as the rewritten cases. Lift the lookaround outside the complement.

forbidden-strings detects every shape above at rule load time and reports the specific trigger:

forbidden-strings: rule on line 42 (resharp): complement body contains \b;
resharp 0.5.x rewrites it to an internal lookaround which the reverse pass
refuses. Replace with \W ... See TROUBLESHOOTING.resharp.md for workarounds.

The doc at TROUBLESHOOTING.resharp.md in the repository root has the full trace, more workarounds, and the upstream-issue draft.
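A load-time check for the shapes listed above could look roughly like the scan below. It is a sketch under assumptions, not the scanner's detection code: complement_trigger is a hypothetical name, character classes are tracked only in their simple form (a leading literal ] inside a class is not handled), and the function reports just the first trigger it meets.

```rust
/// Return the first construct inside a ~(...) body that resharp 0.5.x is
/// described as refusing, or None if the pattern has no such construct.
fn complement_trigger(pattern: &str) -> Option<&'static str> {
    let b = pattern.as_bytes();
    let mut depth = 0usize; // current paren nesting depth
    let mut comp_depths: Vec<usize> = Vec::new(); // depths where a ~( opened
    let mut in_class = false; // inside [...]
    let mut i = 0;
    while i < b.len() {
        let inside = !comp_depths.is_empty();
        match b[i] {
            b'\\' => {
                // Escaped atom: only \b and \B matter; skip the escaped byte.
                if inside && !in_class {
                    match b.get(i + 1) {
                        Some(b'b') => return Some("\\b"),
                        Some(b'B') => return Some("\\B"),
                        _ => {}
                    }
                }
                i += 2;
                continue;
            }
            b'[' if !in_class => in_class = true,
            b']' if in_class => in_class = false,
            _ if in_class => {} // ^, $, ( are plain bytes inside a class
            b'~' if b.get(i + 1) == Some(&b'(') => {
                depth += 1;
                comp_depths.push(depth);
                i += 2;
                continue;
            }
            b'(' => {
                if inside {
                    let rest = &pattern[i..];
                    for la in ["(?<=", "(?<!", "(?=", "(?!"] {
                        if rest.starts_with(la) {
                            return Some(la);
                        }
                    }
                }
                depth += 1;
            }
            b')' => {
                if comp_depths.last() == Some(&depth) {
                    comp_depths.pop(); // this ')' closes a complement body
                }
                depth = depth.saturating_sub(1);
            }
            b'^' if inside => return Some("^"),
            b'$' if inside => return Some("$"),
            _ => {}
        }
        i += 1;
    }
    None
}
```

The returned trigger string is what an error message like the one above would interpolate; constructs outside any complement body are deliberately ignored.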

Supported regex flags

The flag string accepts these lowercase letters, applied via resharp's inline-flag group:

  • i -- case-insensitive
  • m -- multiline (^/$ match at line boundaries)
  • s -- dot-matches-newline
  • u -- toggle Unicode \w/\d semantics
  • x -- ignore whitespace and # comments inside the pattern

Resharp's parser also recognises U (swap greed) and R (CRLF line terminators), but the validator deliberately rejects uppercase flags. Both are useless in this scanner: U only affects match span length (not whether something matched), and the rare pattern that needs CRLF-aware anchors can write \r?$ directly. If you ever need them locally inside one pattern, use the inline form: (?U)foo or (?R)bar.
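The flag handling described in this section amounts to a whitelist check followed by a string rewrite. The sketch below shows one way to do that; inline_flag_prefix and to_regex_source are hypothetical names, and the treatment of repeated letters (accepted as-is here) is an assumption.

```rust
/// Turn a FLAGS string into an inline-flag prefix, e.g. "i" -> "(?i)".
/// Returns Err with the first rejected character (anything outside imsux,
/// which includes the uppercase U and R).
fn inline_flag_prefix(flags: &str) -> Result<String, char> {
    const ALLOWED: &str = "imsux";
    if let Some(bad) = flags.chars().find(|c| !ALLOWED.contains(*c)) {
        return Err(bad);
    }
    if flags.is_empty() {
        Ok(String::new())
    } else {
        Ok(format!("(?{})", flags))
    }
}

/// Rewrite the PATTERN and FLAGS parts into one regex source string.
fn to_regex_source(pattern: &str, flags: &str) -> Result<String, char> {
    Ok(format!("{}{}", inline_flag_prefix(flags)?, pattern))
}
```

For instance, to_regex_source("foo", "i") yields (?i)foo, while a flag string containing U is rejected with that character as the error.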

Output

For each violation:

PATH:LINE:COL_START..COL_END rule=N

Columns are 1-based byte offsets within the matched line. The matched substring is never printed. Only the path, line number, column range, and the opaque rule index appear in failure output; otherwise a failing CI log becomes a leak surface. A contributor wanting to know which rule fired looks up the index against their local rule file.
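Deriving the redacted report line from a match's byte range might look like the sketch below. Several details are assumptions: format_violation is a hypothetical name, the match is assumed to sit on a single line, and COL_END is treated as exclusive (Rust range style), which the description above does not specify.

```rust
/// Format one violation as PATH:LINE:COL_START..COL_END rule=N.
/// `start..end` is the match's byte range within `content`.
/// Assumption: COL_END is exclusive; the matched bytes are never included.
fn format_violation(
    path: &str,
    content: &[u8],
    start: usize,
    end: usize,
    rule: usize,
) -> String {
    let before = &content[..start];
    // 1-based line number: count newlines before the match.
    let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
    // Byte offset where the matched line begins.
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |p| p + 1);
    // 1-based byte columns within that line.
    let col_start = start - line_start + 1;
    let col_end = end - line_start + 1;
    format!("{}:{}:{}..{} rule={}", path, line, col_start, col_end, rule)
}
```

Only positions and the rule index go into the string, so nothing from the file content or the rule pattern can reach the log through this path.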

Exit codes:

  • 0 -- no violations
  • 1 -- one or more violations
  • 2 -- usage error or rules-file error
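The exit-code contract can be captured in a small mapping. The Outcome enum and exit_code function are hypothetical names used for illustration; only the three numeric codes come from the list above.

```rust
/// Outcome of one scanner run (hypothetical type for illustration).
enum Outcome {
    /// No violations found.
    Clean,
    /// One or more violations; the count is carried for reporting.
    Violations(usize),
    /// Bad command line, or a rules file that is missing or invalid.
    UsageOrRulesError(String),
}

/// Map an outcome to the process exit code.
fn exit_code(outcome: &Outcome) -> i32 {
    match outcome {
        Outcome::Clean => 0,
        Outcome::Violations(_) => 1,
        Outcome::UsageOrRulesError(_) => 2,
    }
}
```

A CI step can then distinguish "the scan found something" (1) from "the scan could not run" (2), for example to fail loudly when the rule file was not materialized.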

Performance

Measured against this repo (2,700 files, 21 MiB) on an AMD Ryzen 7 8700F:

  • Realistic ruleset (17 rules), startup-only: 1.2 ms
  • Realistic ruleset, full-repo --all: 15.5 ms (~1.4 GiB/s end-to-end)
  • Synthetic 1,000-rule (500 literals + 500 prefixed regex), full-repo --all: 27 ms
  • Pathological case: 1,000,000 hits in a single 43 MB file: 1.5 s wall

Detailed benchmarks, methodology, and the rationale for rejected optimizations (mmap, extension pre-filtering, concat-and-scan) live in PERF.md.