forbidden-strings

Linear-time deny-list scanner for Git repos, built on the in-house forbidden-regex engine (package/rust-module/forbidden-regex). A native Rust binary with a sub-commit-budget startup, it scans working-tree files line by line against a deny list of literals and restricted-dialect regexes and reports each match as an opaque, redacted finding.

Rules are split across a baseline embedded in the binary (data/builtin-rules.txt, activated by --builtin-rules), a committed shared appendix (forbidden-strings.append.txt), a per-repo gitignored appendix (forbidden-strings.append.local.txt), and a CI-only secret (FORBIDDEN_STRINGS_LIST). The matched substring, the surrounding line, and the rule pattern are never printed in failure output, so a rule body that would itself leak if committed (a customer name, an unreleased project codename, a pre-disclosure partner ID) can live in the gitignored appendix or the CI secret without exposure on public CI logs.

What's different

**Native binary startup.** Rust with lto = true, codegen-units = 1, opt-level = 3, panic = "unwind", overflow-checks = true, strip = true. No Node startup, no WASM init, no per-invocation config parse, which is what a sub-100 ms pre-commit budget needs.
**Linear-time matching.** The engine is derivative and product based with no backtracking, so no rule combination can exhibit catastrophic-backtracking behaviour. A set-level SIMD prefilter lets clean lines skip per-rule work.
**Set-algebra rules.** Intersection A & B and complement ~(A) are first-class in the dialect, so "match X but not Y" needs no lookaround. Conventional regex engines such as PCRE and plain RE2, and the scanners built on such engines (gitleaks, trufflehog, secretlint), cannot express this directly; their workaround is per-rule allowlists, which scale badly.
**Sensitive rules can live out-of-band.** The committed baseline holds non-sensitive rules; the gitignored appendix and the CI-only FORBIDDEN_STRINGS_LIST secret hold sensitive rules. Failure output never prints the matched substring, the surrounding line, or the rule pattern, so a rule body itself can be a secret.

When to pick something else

forbidden-strings deliberately omits features other scanners ship as core capabilities:

CEL-based post-match filtering (entropy thresholds, BPE token efficiency, git-author predicates, file-path globs, string allowlists). No equivalent here.
Async HTTP validation. No way to call a provider API to confirm a detected secret is live; staleness review is on you.
Git history scanning. The walker enumerates working-tree files only. No equivalent of gitleaks git that scans every diff in every commit.
SARIF / JSON / CSV output. Hits go to stderr as plain text; no machine-readable format for GitHub code-scanning upload or CI dashboards.
Per-rule path scoping. Every rule runs against every non-skipped file; the scanner cannot apply rule X only to YAML files.
Per-rule allowlists. No way to say "rule X but skip when it matches in path Y".
Streaming or stdin input. Files only; the walker enumerates from disk.

If you need any of those, betterleaks or gitleaks is the right tool.

Prerequisites

Rust toolchain. Install via mise: mise install rust.
mise itself, since build commands are mise run tasks.

Build

mise run //package/cli/forbidden-strings:build

The release binary lands at package/cli/forbidden-strings/target/release/forbidden-strings. Root cli-git.config.ts gives that path to the bundled security/forbidden-strings policy; nothing needs to be on $PATH.

Setup

The scanner needs exactly one rules file at scan time. How you produce it is up to you.

Without file-enforcer (most consumers)

Put one rule per line in a file named forbidden-strings.local.txt at the repo root, or pass --rules <PATH> / set FORBIDDEN_STRINGS_RULES=<PATH> to point at any other path. That is the whole setup. For a zero-file start, pass --builtin-rules to scan with the embedded betterleaks-ported baseline (see "Built-in baseline" below). Add the file to .gitignore if the rules themselves are sensitive; otherwise commit it. The "Rule file format" section describes the line syntax. In CI, materialise the file from a secret (see "GitHub Actions" below) so the rule bodies never enter version control.

With file-enforcer (this monorepo's workflow)

Inside the Monochromatic monorepo, no rules file exists at the repository root (see doc/decision/gitignore-negations.md). The pieces:

The betterleaks baseline ships inside the scanner binary (data/builtin-rules.txt, regenerated by package/cli/forbidden-strings/src/mise.port-betterleaks.ts); repo invocations activate it with --builtin-rules (the cli-git policy sets builtinRules: true).
forbidden-strings.append.txt is the committed shared appendix of non-sensitive repo-wide rules.
forbidden-strings.append.local.txt is the per-repo additions. Gitignored, free-form, edited by hand. Place sensitive literals (codenames, customer names, partner IDs) here.
.cache/forbidden-strings.rules.txt is the runtime file consumed by the scanner. Generated by file-enforcer concatenating the two appendixes into the gitignored .cache/ scratch dir. The generated root mise.toml [env] points FORBIDDEN_STRINGS_RULES at it (absolute via {{config_root}}). Do not edit directly.

Run mise run file-enforcer after editing either appendix to regenerate the runtime file. The generator is generateForbiddenStringsRules in file-enforcer.config.ts. If you fork this scanner into a project that does not use file-enforcer, drop the appendix split and follow the single-file workflow above.

Usage

# scan a specific file list (uses ./forbidden-strings.local.txt by default)
forbidden-strings path/to/file other/file

# scan every working-tree file (.gitignore respected; .git/.jj skipped)
forbidden-strings --all

The rules path is resolved in this order: --rules <PATH> flag (highest), then FORBIDDEN_STRINGS_RULES env var, then ./forbidden-strings.local.txt in the current working directory.

# explicit path
forbidden-strings --rules ./other-rules.txt --all

# via env var (CI-friendly: materialize from a secret, then run)
FORBIDDEN_STRINGS_RULES=./materialized.txt forbidden-strings --all

# print version and exit
forbidden-strings --version    # or -V
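
The resolution order above can be modelled in a few lines. This is a minimal sketch, not the scanner's own code: `resolve_rules_path` is a hypothetical helper, its two arguments stand in for the `--rules` flag and the `FORBIDDEN_STRINGS_RULES` variable, and treating an empty value as "set" is an assumption.

```rust
use std::path::PathBuf;

/// Default rules file, looked up in the current working directory.
const DEFAULT_RULES_FILE: &str = "forbidden-strings.local.txt";

/// Hypothetical helper: picks the rules path by precedence.
/// `flag` models `--rules <PATH>`; `env` models `FORBIDDEN_STRINGS_RULES`.
fn resolve_rules_path(flag: Option<&str>, env: Option<&str>) -> PathBuf {
    flag.or(env) // the flag wins over the env var
        .map(PathBuf::from)
        // neither given: fall back to the default file in cwd
        .unwrap_or_else(|| PathBuf::from(".").join(DEFAULT_RULES_FILE))
}
```

The sketch only picks a path; whether a missing file is an error depends on how the path was chosen, as the `--builtin-rules` section describes.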

Built-in baseline (`--builtin-rules`)

The binary embeds the betterleaks-ported baseline ruleset. The text form (data/builtin-rules.txt) is exported as the library constant forbidden_strings::BUILTIN_RULES; the scan path loads the baseline from a serialized RegexSet precompiled at build time (compiling the full baseline at each startup is not viable, so the cost is paid once during the build). It is pure opt-in: without the flag the scanner never reads it, so existing invocations behave exactly as before the flag existed.

With --builtin-rules:

The baseline is appended after the resolved rules file, so rule=N numbers for your own rules do not shift: your rules keep ids 0..user_len and the baseline takes ids from user_len onward.
When no rules file resolves at all (no --rules, no env var, and no ./forbidden-strings.local.txt in cwd), the baseline alone is the ruleset; passing the flag is itself the configuration.
An explicitly named missing file (--rules <path> or the env var pointing at a path that does not exist) still exits 2: silently scanning without your rules would be a false-clean result.

# zero-file quick start: scan the tree with the embedded baseline only
forbidden-strings --builtin-rules --all

# your rules plus the baseline
forbidden-strings --builtin-rules --rules ./rules.txt --all
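
Because the baseline is appended after your own rules, a reported rule=N can be mapped back to its origin with simple arithmetic. The sketch below assumes you know `user_len`, the number of compiled rules in your own file (comments and blank lines do not count); `RuleOrigin` and `rule_origin` are hypothetical names, not part of the crate.

```rust
/// Where a reported rule id comes from.
#[derive(Debug, PartialEq)]
enum RuleOrigin {
    /// 0-based index into the rules compiled from your own file.
    User(usize),
    /// 0-based index into the embedded baseline.
    Baseline(usize),
}

/// Maps a reported `rule=N` id back to its origin.
/// User rules occupy `0..user_len`; the baseline starts at `user_len`.
fn rule_origin(rule_id: usize, user_len: usize) -> RuleOrigin {
    if rule_id < user_len {
        RuleOrigin::User(rule_id)
    } else {
        RuleOrigin::Baseline(rule_id - user_len)
    }
}
```

With no rules file at all, `user_len` is 0 and every id is a baseline index.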

--all and positional files are mutually exclusive in practice: if both are passed, the walker output silently overwrites the positional list. Use one or the other.

Rule file format

One rule per line. Two shapes:

A bare line is a case-sensitive literal. It is escaped into the engine's verbose dialect and matched as a plain substring: it fires wherever its exact bytes appear, including glued mid-identifier (ACR matches inside ACRYLIC). There is no length threshold and no word-boundary heuristic; over-matching in this direction is a ratified preference (a false positive inside a longer token is acceptable). If a short literal must match only as a whole word, write it as a regex with explicit boundaries (/\bACR\b/).
A line of the shape /PATTERN/FLAGS is a regex in the forbidden-regex dialect. The first / and the last / delimit the pattern. FLAGS is a trailing run of ASCII-lowercase letters; if the trailing run is not all-lowercase, the whole line is treated as a literal instead (so /foo/I is a literal scan for the six bytes /foo/I, not a case-insensitive regex).

Empty and whitespace-only lines are ignored. A line whose first non-whitespace byte is # is a comment. One leading UTF-8 BOM is stripped from the source. An empty source (no non-blank, non-comment line) is a rule-file error.
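
The line shapes above can be sketched as a small classifier. This is an illustration under stated assumptions, not the loader in src/rule/frx: `RuleLine` and `classify` are hypothetical names, trimming surrounding whitespace before classifying is an assumption, and BOM stripping and the empty-source check are left out.

```rust
/// Hypothetical classification of one rule-file line.
#[derive(Debug, PartialEq)]
enum RuleLine<'a> {
    /// Blank line or comment: ignored.
    Skip,
    /// Bare line: matched as a case-sensitive literal.
    Literal(&'a str),
    /// `/PATTERN/FLAGS` line: a regex plus its (possibly empty) flags.
    Regex { pattern: &'a str, flags: &'a str },
}

fn classify(line: &str) -> RuleLine<'_> {
    // Assumption: surrounding whitespace is not significant.
    let trimmed = line.trim();
    if trimmed.is_empty() || trimmed.starts_with('#') {
        return RuleLine::Skip;
    }
    // The first `/` and the last `/` delimit the pattern.
    if let Some(rest) = trimmed.strip_prefix('/') {
        if let Some(last) = rest.rfind('/') {
            let (pattern, flags) = (&rest[..last], &rest[last + 1..]);
            // FLAGS must be a run of ASCII-lowercase letters (may be empty).
            if flags.bytes().all(|b| b.is_ascii_lowercase()) {
                return RuleLine::Regex { pattern, flags };
            }
        }
    }
    // Everything else, including `/foo/I`, is a literal.
    RuleLine::Literal(trimmed)
}
```

A line holding a single `/` has no closing delimiter, so this sketch treats it as a literal; how the real loader handles that case is not stated above.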

Flags policy

The engine is always in multiline and verbose mode, so the only accepted flags are the ones those two modes already imply:

m (multiline) and x (verbose) are accepted as no-ops and dropped.
**Any other flag letter is a hard, fail-closed load error.** Silently dropping an i or an s would change match semantics (case folding, dot-matches-newline), so the loader rejects the whole ruleset rather than weaken a rule. Need one of those locally? Restructure the pattern (for case-insensitivity, spell the alternatives: [Aa][Bb][Cc]).
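
As a rough illustration of the policy, the sketch below pairs a flag check with a helper that spells out case-insensitive alternatives. Both functions and the `UnsupportedFlag` type are hypothetical; the real redacted error type is LoadError, whose shape is not reproduced here.

```rust
/// Hypothetical redacted error: carries only the rule's opaque index.
#[derive(Debug, PartialEq)]
struct UnsupportedFlag {
    rule_index: usize,
}

/// Accepts `m` and `x` as no-ops; any other letter rejects the rule.
fn check_flags(rule_index: usize, flags: &str) -> Result<(), UnsupportedFlag> {
    if flags.chars().all(|c| matches!(c, 'm' | 'x')) {
        Ok(())
    } else {
        Err(UnsupportedFlag { rule_index })
    }
}

/// Spells out case-insensitivity: "abc" becomes "[Aa][Bb][Cc]".
/// Returns None for anything other than ASCII letters, digits, or `_`,
/// because other characters may need escaping in the dialect.
fn spell_case_insensitive(word: &str) -> Option<String> {
    let mut out = String::new();
    for c in word.chars() {
        if c.is_ascii_alphabetic() {
            out.push('[');
            out.push(c.to_ascii_uppercase());
            out.push(c.to_ascii_lowercase());
            out.push(']');
        } else if c.is_ascii_digit() || c == '_' {
            out.push(c);
        } else {
            return None;
        }
    }
    Some(out)
}
```

The output of `spell_case_insensitive` is meant to be pasted between the slashes of a regex rule, for example /[Aa][Cc][Rr]/.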

Supported constructs

The dialect is a deliberately restricted subset (see package/rust-module/forbidden-regex/README.md for the full engine spec):

Literals and the escapes \t, \b (word boundary), backslash-escaped metacharacters, and backslash-escaped whitespace.
Character classes: [abc], [a-z], [a-zA-Z], negated [^...], and the shorthands \d \w \s \D \W \S (usable inside classes too).
. matches any byte except a newline.
Grouping and alternation: (?:a|b). Groups are non-capturing only.
Bounded repetition: a?, a{3}, a{3,6}.
Anchors: ^, $, \b. The word set is ASCII [A-Za-z0-9_]; ^ and $ anchor at line boundaries.
Set algebra: intersection &, complement ~(...).

Matching is an unanchored search over a single line's raw bytes: a pattern matches if it matches any substring. Because verbose mode is always on, unescaped whitespace outside character classes is ignored, so a rule may be spaced out for readability; to match literal whitespace, use an escaped space (\ ), \t for a tab, or [ ].
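
To make the verbose-mode rule concrete, here is a simplified model of which whitespace is ignored. It is only a sketch: `strip_verbose_whitespace` is a hypothetical function, the engine handles whitespace inside its own parser, and edge cases such as a `]` placed first inside a class are not modelled.

```rust
/// Removes whitespace that verbose mode would ignore: unescaped
/// whitespace outside character classes. Escapes and classes are
/// copied through unchanged.
fn strip_verbose_whitespace(pattern: &str) -> String {
    let mut out = String::new();
    let mut in_class = false;
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        match c {
            // A backslash keeps the next character, whatever it is.
            '\\' => {
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            // Unescaped whitespace outside a class is dropped.
            c if c.is_whitespace() && !in_class => {}
            _ => out.push(c),
        }
    }
    out
}
```

Under this model, `key_ [0-9]{5}` and `key_[0-9]{5}` describe the same rule, while an escaped space and `[ ]` both survive.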

Set-algebra operators

Two top-level set operators that pure-PCRE engines lack:

A & B (intersection): matches strings matched by both A and B.
~(A) (complement): matches strings that do NOT match A.

Operators & and | take single-atom operands: a literal, a class, ., an anchor, a (?:...) group, or a ~(...). A concatenation or a quantified atom must be wrapped in (?:...) to be an operand, so there is no operator precedence to remember. ~(...) is always parenthesized. A pattern that can match the empty string is rejected (unanchored, it would match everything), so ~(Y) alone is rejected while (?:X) & ~(Y) with a concrete X compiles. Example: ban any five-digit key except the all-zeros placeholder:

/(?:key_[0-9]{5}) & ~(key_00000)/
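
The semantics of that example can be modelled with plain predicates: a line is a hit when some substring belongs to X and does not belong to Y. The sketch below hard-codes X and Y for this one rule and has nothing to do with how the engine actually compiles it; all three function names are hypothetical.

```rust
/// Whole-string test for X = key_[0-9]{5}.
fn is_five_digit_key(s: &[u8]) -> bool {
    s.len() == 9 && s.starts_with(b"key_") && s[4..].iter().all(u8::is_ascii_digit)
}

/// Whole-string test for Y = key_00000.
fn is_placeholder(s: &[u8]) -> bool {
    s == b"key_00000"
}

/// Unanchored search for X & ~(Y): some substring is in X and not in Y.
/// Every string in X is 9 bytes long, so 9-byte windows are enough.
fn line_matches(line: &[u8]) -> bool {
    line.windows(9)
        .any(|w| is_five_digit_key(w) && !is_placeholder(w))
}
```

A line that contains only the placeholder is clean, but a line containing the placeholder and a real key still fires, because the search is over substrings rather than the whole line.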

Rejected at compile time (fail-closed)

Anything outside the supported set is a hard compile error naming the offending rule's opaque index, and the whole load fails closed (a bad ruleset never degrades to a partial scan): *, +, unbounded {n,}, \xNN byte escapes, capturing (, lookaround and inline-flag groups ((? not followed by :), backreferences, unknown escapes, unbalanced brackets, stacked quantifiers, {n,m} with n greater than m, repetition whose expansion exceeds the engine's cap, and any pattern that can match the empty string.

Load errors are redacted: they carry only the opaque rule index (0-based position in the compiled set, never a source line number) and the engine's own static reason, never the rule text. The redacted error type is LoadError in src/rule/frx/error.rs.

Output

For each violation:

PATH:LINE rule=N

LINE is the 1-based line number.
rule=N is the 0-based engine rule id. Your runtime rules take ids 0..user_len; under --builtin-rules the baseline is offset past them. The index is columnless: the engine reports per-line rule indices, not spans, so no COL_START..COL_END segment appears.
One finding is emitted per (line, rule) pair.
**The matched substring, the line content, and the rule pattern are never printed.** Only the path, line number, and opaque rule index appear, so a failing CI log never becomes a leak surface. A contributor looks the index up against their local rule file.

Two synthetic findings keep the scan fail-closed:

**Read errors.** A file that cannot be opened (broken symlink, permission denied, deleted during scan) produces PATH: read error: <reason> on stderr and counts toward the exit-1 total. A secret-scanning gate must not pass silently on a file it could not inspect.
**Engine errors.** If the matcher panics on a file, the catch_unwind boundary in scan_one_set (src/frx_scan.rs) catches it and emits PATH: engine error, again counting toward exit 1 rather than aborting or exiting clean.
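
The engine-error boundary follows a common std::panic::catch_unwind pattern, sketched below. `scan_fail_closed` is a hypothetical stand-in for scan_one_set: the matcher is modelled as a closure returning (line, rule) pairs, and the panic payload is discarded so nothing from it reaches the report. The pattern only works when the binary is built with panic = "unwind".

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `matcher` for one file and formats its findings. A panic inside
/// the matcher becomes a synthetic "engine error" finding instead of
/// aborting the scan or letting the file pass as clean.
fn scan_fail_closed<F>(path: &str, matcher: F) -> Vec<String>
where
    F: FnOnce() -> Vec<(usize, usize)>, // (1-based line, 0-based rule id)
{
    match panic::catch_unwind(AssertUnwindSafe(matcher)) {
        Ok(hits) => hits
            .into_iter()
            .map(|(line, rule)| format!("{path}:{line} rule={rule}"))
            .collect(),
        // The payload is dropped on purpose: only the path is reported.
        Err(_) => vec![format!("{path}: engine error")],
    }
}
```

Either branch yields at least one report line when something went wrong, which is what lets the caller count it toward exit 1.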

Hits go to stderr, not stdout; redirecting 2>/dev/null silently loses the report. Within a file, findings are emitted in set order (runtime rules before the baseline), then by line; across files, ordering is rayon-scheduler-determined, stable on a given input but not alphabetic. Callers that need deterministic cross-file reports should sort the output; because it arrives on stderr, redirect it first (for example, 2>&1 | sort).
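
A plain lexicographic sort puts line 10 before line 2, so a caller that post-processes the report in code may prefer a numeric ordering. The sketch below is one way to do that over captured report lines; `sort_key` and `sort_report` are hypothetical helpers, and lines that do not parse as PATH:LINE rule=N (read errors, engine errors) simply sort by their full text.

```rust
/// Sort key for one report line: path, then numeric line, then rule id.
/// Lines that are not regular findings fall back to their full text.
fn sort_key(line: &str) -> (String, u64, u64) {
    let parsed = line.rsplit_once(" rule=").and_then(|(left, rule)| {
        // Split on the last ':' so paths containing ':' still parse.
        let (path, line_no) = left.rsplit_once(':')?;
        Some((
            path.to_string(),
            line_no.parse::<u64>().ok()?,
            rule.parse::<u64>().ok()?,
        ))
    });
    parsed.unwrap_or_else(|| (line.to_string(), u64::MAX, u64::MAX))
}

/// Sorts captured report lines into a deterministic order.
fn sort_report(lines: &mut [String]) {
    lines.sort_by_key(|l| sort_key(l));
}
```

Sorting this way keeps findings for one file together and in line order, regardless of which worker thread reported them first.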

Exit codes:

0: no violations.
1: one or more violations (real hits, read errors, or engine errors).
2: usage error or rule-file error.
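
For wrappers that need to reason about these codes, the mapping can be sketched as follows. `Outcome` and `exit_code` are hypothetical names; the point is only that read errors and engine errors count exactly like real hits.

```rust
/// Hypothetical summary of one run.
enum Outcome {
    /// The scan ran; each counter is a number of emitted findings.
    Scanned {
        hits: usize,
        read_errors: usize,
        engine_errors: usize,
    },
    /// Bad arguments, or a rules file that could not be loaded.
    UsageOrRuleFileError,
}

/// Maps an outcome to the documented process exit code.
fn exit_code(outcome: &Outcome) -> i32 {
    match outcome {
        Outcome::UsageOrRuleFileError => 2,
        Outcome::Scanned { hits, read_errors, engine_errors } => {
            // Synthetic findings count toward exit 1, same as real hits.
            if hits + read_errors + engine_errors > 0 { 1 } else { 0 }
        }
    }
}
```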

Security model

The redaction guarantee is what lets a rule body itself be a secret. Two boundaries carry it:

**Load path.** Rule compilation reports only LoadError (src/rule/frx/error.rs), whose every variant is an opaque index plus the engine's static reason; no pattern bytes reach a diagnostic. The compiler builds through RegexSet::new / RegexSet::from_bytes, neither of which logs the pattern.
**Scan path.** Findings are formatted as PATH:LINE rule=N in src/frx_scan.rs; the matched bytes and the line content are never included. The fail-closed catch_unwind boundary emits only PATH: engine error.

Keep sensitive rule bodies out of tracked files: use the gitignored forbidden-strings.append.local.txt or the CI-only FORBIDDEN_STRINGS_LIST secret, never the committed baseline or appendix. In CI, pipe secrets through printenv rather than interpolating them into a workflow command; shell expansion can leak values even when log masking is enabled.

Integration

Local cli-git policy

Root cli-git.config.ts enables security/forbidden-strings at error severity. The PATH-shadowed cli-git wrapper evaluates selected would-be-committed bytes before commit, landed commit bytes before automatic push, and Git-native outgoing ranges before manual push. Native --no-verify skips Git hooks but does not skip this policy.

Run an explicit read-only check through the built shim with:

git cli-git check --policy security/forbidden-strings --all

The policy invokes the repository-built scanner directly. Scanner infrastructure failures remain distinct exit-2 engine failures; findings exit 1.

GitHub Actions

.github/workflows/forbidden-strings.yml remains independent of cli-git trust and local wrapper state. It downloads the release matching the scanner crate version, verifies the archive's GitHub build-provenance attestation, materializes the committed baseline plus shared appendix and optional repository secret, then invokes the scanner binary directly. Pull-request and merge-queue jobs scan changed files relative to origin/main; pushes to main scan the complete tracked tree. The same precedence applies locally and in CI: --rules > FORBIDDEN_STRINGS_RULES > ./forbidden-strings.local.txt.

Walker behaviour

**--all semantics.** Walks the working tree via ignore::WalkBuilder in src/walk.rs: .hidden(false) (dotfiles like .github/, .npmrc ARE scanned), .ignore(false) (the .ignore file is NOT consulted; .gitignore stays enabled). Files force-added past .gitignore (git add -f) are recovered via an in-process gix-index read of .git/index; no git subprocess.
**.git/ and .jj/ skipped.** Internal VCS state is never scanned.
**Symlinks NOT followed.** WalkBuilder's default follow_links is false; symlinked directories are not descended, and a symlinked file whose target is broken surfaces as a read-error synthetic hit.
**Non-UTF-8 paths silently dropped.** Index entries that are not valid UTF-8 are excluded from the walk.
**Binary-file 8 KiB tail cap.** Files whose first 8 KiB contains a NUL byte are scanned only in the first 8 KiB. The leading window always runs, so secrets there fire; the tail past 8 KiB is skipped. Constant BIN_PROBE_SIZE and read_with_binary_check in src/lib.rs.
**Self-skip set.** During --all, canonical paths are auto-skipped so rule bodies do not self-match: the materialised rules file (whatever --rules / env var / default resolves to), plus three generated-source paths:
- package/cli/forbidden-strings/data/betterleaks-default-config.toml
- package/cli/forbidden-strings/data/builtin-rules.txt
- package/cli/forbidden-strings/src/port-betterleaks-relaxations.ts
Skip is path-anchored via std::fs::canonicalize, not basename-anchored, so an unrelated file named forbidden-strings.local.txt in a subdirectory is still scanned. Paths that fail to canonicalize from the current cwd are silently dropped from the set. Explicit positional arguments bypass the --all skip; the scanner's own forbidden-strings.*.txt config files at cwd are skipped in both modes (is_config_file_at_cwd in src/lib.rs).
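
The binary-file tail cap described in the list above comes down to one decision on the first 8 KiB. The sketch below is an assumption-laden illustration, not read_with_binary_check itself: `scan_window` is a hypothetical helper that works on bytes already in memory, and only the constant's name and value come from the text.

```rust
/// Probe size from the text above: 8 KiB.
const BIN_PROBE_SIZE: usize = 8 * 1024;

/// Hypothetical helper: returns the part of `bytes` that gets scanned.
/// A NUL byte in the leading window marks the file as binary, and only
/// that window is scanned; otherwise the whole file is.
fn scan_window(bytes: &[u8]) -> &[u8] {
    let probe = &bytes[..bytes.len().min(BIN_PROBE_SIZE)];
    if probe.contains(&0) {
        probe
    } else {
        bytes
    }
}
```

Note the consequence: a NUL byte that first appears after the probe window does not trigger the cap, so such a file is scanned in full.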

Performance

The scanner is a native Rust binary tuned for a sub-commit-budget startup and linear-time matching; the release profile (Cargo.toml) sets lto, codegen-units = 1, opt-level = 3, panic = "unwind", overflow-checks = true, and strip. panic = "unwind" and overflow-checks = true are load-bearing for the fail-closed catch_unwind boundary, not speed: the forbidden-regex engine documents that it expects the caller's unwind boundary, and a wrapped overflow would otherwise let a corrupt rule fail open.

Full bench methodology and per-version regression history are in PERF.md. The headline figures there were measured against the pre-0.2.0 resharp/regex-crate engine; the 0.2.0 figures for the forbidden-regex engine are re-measured as part of the cutover differential validation.

Fuzzing

Coverage-guided fuzzing lives in its own package, package/fuzz/forbidden-strings, so a scoped nightly toolchain does not force this published crate onto nightly. The scanner exposes a curated internal surface (fuzz_api, behind the fuzzing Cargo feature) for the targets to drive. The teardown that removed the old engine also retired the gate, shard, and routing targets that fuzzed it; the surviving targets are retargeted onto the forbidden-regex load and scan path. See that package's README for prerequisites, mise commands, the bounded-container wrapper, and crash-reproduction guidance.

Architecture

**Two-form loader.** src/rule/frx owns the rule-file format: bare literals escape into the verbose dialect, /PATTERN/FLAGS lines pass through under the flags policy, and each rule is validated individually so the redacted error can name the first offender's index. src/frx_load.rs resolves the runtime rules file (compiled from text) and, under --builtin-rules, the precompiled baseline, into ordered RegexSets with disjoint rule-id ranges.
**Line-based batch scan.** src/frx_scan.rs splits each file's bytes into lines once and hands the buffer plus line-start offsets to RegexSet::line_matches, which resolves per-line rule ids in one SIMD prefilter sweep. Each set runs under a catch_unwind boundary so a matcher fault fails closed as a synthetic finding.
**Build-time precompilation.** build.rs compiles data/builtin-rules.ported.txt through the engine once at build time and serializes it (to_bytes); lib.rs embeds the blob with include_bytes! and the loader rebuilds it via the validating from_bytes, never recompiling. Only the small runtime rules file compiles from text at startup.
**Concurrent load and walk.** Rule loading and --all file walking run concurrently via rayon::join (they share no state); files then fan out across the rayon thread pool for the parallel scan.
**ignore crate walker + in-process gix-index union.** --all uses ignore::WalkBuilder (honouring .gitignore, .git/info/exclude, and global excludes) and unions the result with an in-process gix_index::File read of .git/index so git add -f files are still discovered. See src/walk.rs.
**Bundled data/betterleaks-default-config.toml.** Upstream-vendored provenance for the betterleaks port; the embedded baseline is derived from it, and port-betterleaks-relaxations.ts records the lossy translations applied during the port.
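
The concurrent load-and-walk step in the list above can be illustrated with the standard library alone. This sketch swaps std::thread::scope in for rayon::join to stay dependency-free, and `load_rules`, `walk`, and `load_and_walk` are hypothetical stand-ins for the real loader and walker; only the shape (two independent tasks, joined before the scan) reflects the text.

```rust
use std::thread;

/// Stand-in loader: keeps non-blank, non-comment lines.
fn load_rules(source: &str) -> Vec<String> {
    source
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(String::from)
        .collect()
}

/// Stand-in walker: drops VCS-internal paths from a fixed list.
fn walk(paths: &[&str]) -> Vec<String> {
    paths
        .iter()
        .filter(|p| !p.starts_with(".git/") && !p.starts_with(".jj/"))
        .map(|p| p.to_string())
        .collect()
}

/// Runs the loader on a scoped thread while the caller's thread walks.
/// The two tasks share no mutable state, so no locking is needed.
fn load_and_walk(source: &str, paths: &[&str]) -> (Vec<String>, Vec<String>) {
    thread::scope(|s| {
        let loader = s.spawn(|| load_rules(source));
        let files = walk(paths);
        let rules = loader.join().expect("rule loading panicked");
        (rules, files)
    })
}
```

Once both results are available, the files can be fanned out to workers for the parallel scan.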

forbidden-strings 0.2.0