# scrump
> Fast, format-aware secret scrubber for binary capture artifacts.
Born from a real incident: a `perf` profile tarball uploaded to a public
GitHub issue leaked a `GH_TOKEN`, because the environment block of the
captured process — including every API key in scope at runtime — sat
inside the binary blob, and GitHub's native secret scanning couldn't see
through the format.
`scrump` opens captures using the same on-disk specs the originating tools
use, finds dangerous content with a 1,100+ rule pattern engine, zero-fills
it in place, and returns the file in its original shape. The redacted
`perf.data` still loads in `perf report`; the redacted `nsys-rep` still
opens in Nsight Systems; the redacted SQLite still passes `sqlite3
".schema"`; the redacted pcap still opens in Wireshark.
## Formats
| Format | Crate | What it handles |
|---|---|---|
| `passthrough` | `scrump-format-passthrough` | Any file — single raw chunk fallback |
| `perf` | `scrump-format-perf` | `PERFILE2`. Header + feature sections (`HEADER_CMDLINE`, etc.) + data section. |
| `tar` | `scrump-format-tar` | `tar` / `tar.gz` / `tar.zst` / `zip`. Each member is recursively format-dispatched. |
| `sqlite` | `scrump-format-sqlite` | `SQLite format 3`. Walks every user table, redacts TEXT/BLOB cells via `UPDATE` + `VACUUM`. |
| `nsys` | `scrump-format-nsys` | NVIDIA `.nsys-rep` / `.ncu-rep`. tar envelope + format-aware SQLite handling for the inner DB. |
| `elf-core` | `scrump-format-core` | 64-bit LE ELF `ET_CORE`. Walks `PT_NOTE` for `NT_PRPSINFO` cmdline + `PT_LOAD` env pages. |
| `hprof` | `scrump-format-hprof` | Java HPROF. `JAVA PROFILE` header + record stream; tight UTF8 STRING chunks. |
| `jfr` | `scrump-format-jfr` | Java Flight Recorder. Walks chunks via `FLR\0` magic + `chunk_size`; refuses to touch chunk headers. |
| `pcap` | `scrump-format-pcap` | tcpdump pcap + pcapng. Per-packet payload chunks (Authorization headers, query strings); framing untouched. |
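The magic values in the table suggest how a dispatcher could pick a handler from the first bytes of a file. The following is a minimal sketch only: the `Sniffed` enum and `sniff` function are hypothetical names invented for this example, not scrump's actual dispatcher, and container formats (`tar`, `nsys`) are left out for brevity.

```rust
/// Hypothetical format tags for this sketch; not scrump's real types.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Sniffed {
    Perf,
    Sqlite,
    ElfCore,
    Hprof,
    Jfr,
    Pcap,
    Passthrough,
}

/// Guess a format from leading magic bytes, falling back to passthrough.
pub fn sniff(buf: &[u8]) -> Sniffed {
    if buf.starts_with(b"PERFILE2") {
        return Sniffed::Perf;
    }
    if buf.starts_with(b"SQLite format 3\0") {
        return Sniffed::Sqlite;
    }
    if buf.starts_with(b"JAVA PROFILE") {
        return Sniffed::Hprof;
    }
    if buf.starts_with(b"FLR\0") {
        return Sniffed::Jfr;
    }
    // ELF core: magic, EI_CLASS = 2 (64-bit), EI_DATA = 1 (little-endian),
    // and e_type = 4 (ET_CORE) stored as a little-endian u16 at offset 16.
    if buf.len() >= 18
        && buf.starts_with(b"\x7fELF")
        && buf[4] == 2
        && buf[5] == 1
        && u16::from_le_bytes([buf[16], buf[17]]) == 4
    {
        return Sniffed::ElfCore;
    }
    // Classic pcap (either byte order) or a pcapng Section Header Block.
    if buf.len() >= 4 {
        match [buf[0], buf[1], buf[2], buf[3]] {
            [0xd4, 0xc3, 0xb2, 0xa1] | [0xa1, 0xb2, 0xc3, 0xd4] | [0x0a, 0x0d, 0x0d, 0x0a] => {
                return Sniffed::Pcap;
            }
            _ => {}
        }
    }
    Sniffed::Passthrough
}
```

Anything unrecognized falls through to `Passthrough`, mirroring the single-raw-chunk fallback described in the first table row.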
## Install
```sh
# From source (Rust 1.75+)
cargo install --path crates/scrump-cli
# Or grab a pre-built binary from the latest release (see the Releases tab
# for the current version; tarballs are signed and shasum'd):
gh release download --repo avifenesh/scrump --pattern '*-x86_64-unknown-linux-gnu.tar.gz'
```
Supported targets out of the box:
| Target | Status |
|---|---|
| `x86_64-unknown-linux-gnu` | tier-1 (cross-compiled in CI release) |
| `aarch64-unknown-linux-gnu` | tier-1 (cross-compiled in CI release) |
| `aarch64-apple-darwin` | tier-1 (cross-compiled in CI release) |
## Use
```sh
scrump scan some-file # dry-run: report findings, never mutate
scrump scrub some-file # redact in place (atomic tmp+rename)
scrump scrub some-file -o clean # write clean copy elsewhere
scrump scrub some-file --backup # also keep the original at *.orig
scrump scrub some-file --format perf # force a specific format handler
scrump scrub some-file --rules-path my.yaml # add custom rules
```
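The `scrub` comments above mention an atomic tmp+rename and a `*.orig` backup. As a rough illustration of that pattern using only the standard library, here is a sketch; `write_atomic` and the `.tmp` suffix are assumptions made for this example and may not match what the CLI does internally.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Append a suffix to a path without touching its existing extension.
fn with_suffix(path: &Path, suffix: &str) -> PathBuf {
    let mut name = path.as_os_str().to_owned();
    name.push(suffix);
    PathBuf::from(name)
}

/// Replace `path` with `bytes` via a sibling temp file and a rename.
/// With `backup` set, the original is first copied to `<path>.orig`.
pub fn write_atomic(path: &Path, bytes: &[u8], backup: bool) -> io::Result<()> {
    if backup {
        fs::copy(path, with_suffix(path, ".orig"))?;
    }
    let tmp = with_suffix(path, ".tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Flush data to disk before the rename makes it visible under `path`.
    file.sync_all()?;
    drop(file);
    // On the same filesystem, rename replaces the destination in one step.
    fs::rename(&tmp, path)
}
```

Keeping the temp file next to the target matters: a rename across filesystems is not atomic and may fail outright.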
## How detection works
Two rule layers, plus hand-coded detectors:
1. **Curated default rules** (`crates/scrump-rules/rules/default.yaml`) —
tightly-scoped patterns for the ML/inference ecosystem: GitHub PATs,
HuggingFace, OpenAI, Anthropic, AWS, Slack, NVIDIA NGC, W&B, Stripe.
2. **Auto-extracted TruffleHog mirror** (`rules/trufflehog.yaml`,
regenerated by `cargo run -p scrump-trufflehog-compat --bin th-extract`) —
1,100+ rules covering every detector under `pkg/detectors/`.
3. **Hand-coded** detectors for things regex can't express alone —
currently `JwtHsAware`, which base64-decodes the JWT header and
rejects HMAC-signed tokens, mirroring TruffleHog's filtering.
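To make the hand-coded layer concrete, the sketch below shows one way a JWT header could be decoded and checked for an HMAC (`HS*`) algorithm. It is an illustration under assumptions, not the `JwtHsAware` implementation: the function names are hypothetical, the base64url decoder is hand-rolled to stay dependency-free, and the `alg` lookup is a plain substring check rather than a real JSON parse.

```rust
/// Decode base64url (RFC 4648 section 5), stopping at any '=' padding.
fn b64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut acc: u32 = 0;
    let mut bits: u32 = 0;
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break,
            _ => return None,
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the leftover low bits
        }
    }
    Some(out)
}

/// Some(true) if the token's header names an HS* algorithm, Some(false)
/// for any other algorithm, None if the input is not JWT-shaped.
pub fn jwt_is_hmac_signed(token: &str) -> Option<bool> {
    let mut parts = token.split('.');
    let header = parts.next()?;
    parts.next()?; // payload
    parts.next()?; // signature
    if parts.next().is_some() {
        return None; // more than three segments
    }
    let json = String::from_utf8(b64url_decode(header)?).ok()?;
    let compact: String = json.chars().filter(|c| !c.is_whitespace()).collect();
    if !compact.starts_with('{') {
        return None;
    }
    Some(compact.contains("\"alg\":\"HS"))
}
```

A detector built on this could drop candidates for which the function returns `Some(true)` and keep the rest for redaction.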
The engine supports `capture_index` for keyword-proximity patterns (e.g.
W&B's bare 40-hex token near a `wandb` keyword) and `post_filter` for
semantic constraints beyond regex.
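The idea behind a keyword-proximity pattern can be sketched without a regex engine: find the keyword, then look for the token shape within a short window after it, and report only the token's byte range (the role `capture_index` plays for a regex capture group). The function below is a hand-rolled stand-in written for this example; its name, the window parameter, and the fixed 40-hex shape are assumptions, not the engine's API.

```rust
use std::ops::Range;

/// Find a run of exactly 40 hex digits that starts within `window` bytes
/// after a case-insensitive occurrence of `keyword`. Only the token range
/// is returned, so the keyword itself would be left intact by a redactor.
pub fn token_near_keyword(hay: &[u8], keyword: &[u8], window: usize) -> Option<Range<usize>> {
    if keyword.is_empty() || hay.len() < keyword.len() {
        return None;
    }
    for start in 0..=hay.len() - keyword.len() {
        if !hay[start..start + keyword.len()].eq_ignore_ascii_case(keyword) {
            continue;
        }
        let from = start + keyword.len();
        let to = hay.len().min(from + window);
        let mut i = from;
        while i < to {
            if !hay[i].is_ascii_hexdigit() {
                i += 1;
                continue;
            }
            // Measure the full hex run; longer or shorter runs are rejected.
            let run_start = i;
            while i < hay.len() && hay[i].is_ascii_hexdigit() {
                i += 1;
            }
            if i - run_start == 40 {
                return Some(run_start..i);
            }
        }
    }
    None
}
```

Without the keyword constraint, any 40-hex string (a SHA-1 digest, for instance) would match, which is why a bare-token rule needs the proximity check.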
## Format support — how it stays structure-preserving
Each format crate implements:
```rust
pub trait Format: Send {
fn name(&self) -> &'static str;
fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a>;
fn apply(&mut self, hits: &[Hit]) -> Result<()>;
fn to_bytes(&self) -> Result<Vec<u8>>;
}
```
The format decides which byte ranges are *scannable* (cmdline strings,
TEXT cells, packet payloads, chunk bodies) and which are *structural*
(magic words, length prefixes, varints, checksums). `apply` refuses to
redact structural bytes. The result: scrump can never produce a file
its own format parser couldn't parse.
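A stripped-down model of that guarantee, assuming byte ranges stand in for `Chunk` and `Hit`: every hit must lie entirely inside one scannable range, and redaction is a same-length zero-fill. The `apply_hits` function and `ApplyError` type are hypothetical names for this sketch and are not part of the `Format` trait.

```rust
use std::ops::Range;

/// Error for a hit that is out of bounds or touches structural bytes.
#[derive(Debug, PartialEq)]
pub enum ApplyError {
    Structural(Range<usize>),
}

/// Zero-fill every hit in place, but only if each hit is fully contained
/// in some scannable range. All hits are validated before any byte is
/// written, so a rejected call leaves `buf` unchanged.
pub fn apply_hits(
    buf: &mut [u8],
    scannable: &[Range<usize>],
    hits: &[Range<usize>],
) -> Result<(), ApplyError> {
    for hit in hits {
        let in_bounds = hit.start <= hit.end && hit.end <= buf.len();
        let contained = scannable
            .iter()
            .any(|s| s.start <= hit.start && hit.end <= s.end);
        if !in_bounds || !contained {
            return Err(ApplyError::Structural(hit.clone()));
        }
    }
    for hit in hits {
        buf[hit.clone()].fill(0); // same length, so all offsets stay valid
    }
    Ok(())
}
```

Because the fill never changes the buffer length, length prefixes and offsets elsewhere in the file remain correct without being rewritten.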
## Compatibility evidence
scrump is validated against two third-party test corpora. The harnesses
live under `crates/scrump-{trufflehog,presidio}-compat/`.
### TruffleHog detectors
`cargo run -p scrump-trufflehog-compat --bin trufflehog-compat` walks
every `*_test.go` under TruffleHog's `pkg/detectors/`, parses each
parametrized test, and runs scrump against the test input.
**Last full run: 2,309 of 2,493 cases pass across 854 providers
(92.6%).** The remaining 184 are false positives on negative cases:
provider A's no-hit-expected input still trips provider B's
auto-extracted `PrefixRegex` (e.g. a `sugester` test input fires the
`tableau` rule). In a scrubbing context this is over-detection, not a
miss: nothing TruffleHog catches is missed by scrump. CI gates on
`SCRUMP_TH_MAX_FAILURES=184`; any increase fails the build, and the
number should only be lowered alongside the rule fixes that justify it.
After the #9 rule curation, 82 structurally-broken patterns are dropped
at load time (see `TH_QUARANTINE` in `scrump-rules`). That change drove
the harness from 201 failures down to 184 and the per-MB hit rate on
real SQLite log artifacts from ~85,000 to under 0.5.
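The gate arithmetic is simple enough to sketch. The helpers below are hypothetical and only model the described behaviour (fail when failures exceed the configured ceiling); how the real harness reads `SCRUMP_TH_MAX_FAILURES`, and what it does when the variable is unset, is an assumption here (this sketch defaults to zero tolerated failures).

```rust
/// Parse the ceiling from the raw env value; unset means 0 in this sketch.
pub fn parse_max_failures(raw: Option<&str>) -> Result<usize, String> {
    match raw {
        Some(s) => s
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("bad SCRUMP_TH_MAX_FAILURES: {}", e)),
        None => Ok(0),
    }
}

/// Return the failure count, or an error if it exceeds the ceiling.
pub fn gate(total: usize, passed: usize, max_failures: usize) -> Result<usize, String> {
    let failures = total.checked_sub(passed).ok_or("passed exceeds total")?;
    if failures > max_failures {
        Err(format!("{} failures exceed the ceiling of {}", failures, max_failures))
    } else {
        Ok(failures)
    }
}

/// Pass rate as a percentage; 0.0 for an empty run.
pub fn pass_rate(passed: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        passed as f64 * 100.0 / total as f64
    }
}
```

With the figures quoted above, `gate(2493, 2309, 184)` passes with exactly 184 failures, and a single additional failing case would trip the gate.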
### Microsoft Presidio (PII) cross-format
`cargo run -p scrump-presidio-compat --bin presidio-compat` takes every
Presidio recognizer test (52 recognizers, 671 cases), then for every
case embeds the test text into every binary format scrump supports
(7 + passthrough = 8) and runs the detector against the embedded blob.
**Last full run: 617 of 671 (92.0%) pass, with an identical pass count
across all 8 formats**, which indicates the format wrapper is transparent
to detection. The remaining 54 failures all come from Presidio patterns
that rely on lookbehind or backreferences, which Rust's `regex` crate
does not support (IP recognizer, MAC, Canadian SIN with backref-bound
separator).
```
FORMAT PASS FAIL SKIP PASS%
--------------------------------------------------------
passthrough 617 54 0 92.0%
tar 617 54 0 92.0%
perf 617 54 0 92.0%
sqlite 617 54 0 92.0%
elf-core 617 54 0 92.0%
hprof 617 54 0 92.0%
jfr 617 54 0 92.0%
pcap 617 54 0 92.0%
```
## Local dev loop
If you have [`just`](https://github.com/casey/just) installed:
```sh
just check # fmt + clippy + tests + docs
just e2e # all 8 phase-gate scripts
just compat-trufflehog # 864-provider parity (clones vendor/trufflehog on first run)
just compat-presidio # 52 recognizers × 8 formats
just deny # cargo-deny supply-chain audit
just ci # everything CI runs, in order
```
Without `just`, see the recipes in the [`Justfile`](Justfile) for the
underlying `cargo` invocations.
## Phase gates
End-to-end gates live under `tests/`:
- `tests/e2e.sh` — phase 0 (passthrough on a planted-token text file)
- `tests/e2e_phase1.sh` — perf.data
- `tests/e2e_phase2.sh` — tar / tar.gz / tar.zst / zip
- `tests/e2e_phase3.sh` — sqlite + nsys-rep
- `tests/e2e_phase4.sh` — ELF core
- `tests/e2e_phase5.sh` — Java HPROF
- `tests/e2e_phase6.sh` — JFR
- `tests/e2e_phase7.sh` — pcap
- `tests/e2e_all.sh` — master gate; runs all 8
Each gate plants known token shapes, runs `scrump scrub`, then asserts
the file size is preserved, the format's magic / structural fields are
untouched, the format's native tooling still parses it, and no token
prefix remains in the raw bytes.
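Three of those four assertions can be expressed directly over byte buffers; the native-tooling check needs the external tool and is left out. The sketch below is a Rust rendering of the idea only, with a hypothetical `check_gate` function. The gates themselves are shell scripts and may check these properties differently.

```rust
/// True if `needle` occurs anywhere in `hay` (empty needles never match).
fn contains(hay: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && hay.windows(needle.len()).any(|w| w == needle)
}

/// Check a scrubbed buffer against its original: the size is unchanged,
/// the leading `magic_len` bytes are identical, and none of the planted
/// token prefixes survive anywhere in the output.
pub fn check_gate(
    original: &[u8],
    scrubbed: &[u8],
    magic_len: usize,
    prefixes: &[&[u8]],
) -> Result<(), String> {
    if original.len() != scrubbed.len() {
        return Err(format!(
            "size changed: {} -> {}",
            original.len(),
            scrubbed.len()
        ));
    }
    if magic_len > original.len() || original[..magic_len] != scrubbed[..magic_len] {
        return Err("magic bytes were modified".to_string());
    }
    for p in prefixes {
        if contains(scrubbed, p) {
            return Err(format!(
                "token prefix {:?} still present",
                String::from_utf8_lossy(p)
            ));
        }
    }
    Ok(())
}
```

Searching the raw bytes for prefixes such as `ghp_` is deliberately format-blind: it catches a token that survived in a region the format handler never exposed as a chunk.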
## Project layout
```
scrump/
├── crates/
│ ├── scrump-core/ # Format trait, Hit, Dispatcher
│ ├── scrump-detect/ # regex + entropy engine
│ ├── scrump-rules/ # curated + auto-extracted rule sets
│ ├── scrump-cli/ # the `scrump` binary
│ ├── scrump-format-passthrough/ # text-and-anything fallback
│ ├── scrump-format-perf/ # PERFILE2
│ ├── scrump-format-tar/ # tar / zip / gz / zst (recursive)
│ ├── scrump-format-sqlite/ # SQLite3
│ ├── scrump-format-nsys/ # NVIDIA nsys-rep / ncu-rep
│ ├── scrump-format-core/ # ELF core dumps
│ ├── scrump-format-hprof/ # Java HPROF
│ ├── scrump-format-jfr/ # Java Flight Recorder
│ ├── scrump-format-pcap/ # pcap / pcapng
│ ├── scrump-test-fixtures/ # spec-compliant generators
│ ├── scrump-trufflehog-compat/ # 864-provider parity harness
│ └── scrump-presidio-compat/ # 8-format × 52-recognizer harness
├── tests/ # phase 0..7 e2e gates
└── docs/ # architecture, threat model
```
See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for the internal design,
and [`CONTRIBUTING.md`](CONTRIBUTING.md) for the format/detector
add-a-new-X checklists.
## Security
scrump is a security tool — please report vulnerabilities privately via
the process in [`SECURITY.md`](SECURITY.md).
## License
[Apache-2.0](LICENSE). Inspired by — but does not wrap — TruffleHog and
noseyparker (both Apache-2.0).