scrump-core 0.1.8

Core traits and types for the scrump secret scrubber.
# scrump

> Fast, format-aware secret scrubber for binary capture artifacts.

Born from a real incident: a `perf` profile tarball uploaded to a public
GitHub issue leaked a `GH_TOKEN`, because the environment block of the
captured process — including every API key in scope at runtime — sat
inside the binary blob, and GitHub's native secret scanning couldn't see
through the format.

`scrump` opens captures using the same on-disk specs the originating tools
use, finds dangerous content with a 1,100+ rule pattern engine, zero-fills
it in place, and returns the file in its original shape. The redacted
`perf.data` still loads in `perf report`; the redacted `nsys-rep` still
opens in Nsight Systems; the redacted SQLite still passes `sqlite3
".schema"`; the redacted pcap still opens in Wireshark.

[![CI](https://github.com/avifenesh/scrump/actions/workflows/ci.yml/badge.svg)](https://github.com/avifenesh/scrump/actions/workflows/ci.yml)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Rust: 1.75+](https://img.shields.io/badge/rust-1.75%2B-orange.svg)](https://www.rust-lang.org)
[![GitHub release](https://img.shields.io/github/v/release/avifenesh/scrump?include_prereleases&sort=semver)](https://github.com/avifenesh/scrump/releases)

## Formats

| Format | Crate | What it understands |
|---|---|---|
| `passthrough` | `scrump-format-passthrough` | Any file — single raw chunk fallback |
| `perf` | `scrump-format-perf` | `PERFILE2`. Header + feature sections (`HEADER_CMDLINE`, etc.) + data section. |
| `tar` | `scrump-format-tar` | `tar` / `tar.gz` / `tar.zst` / `zip`. Each member is recursively format-dispatched. |
| `sqlite` | `scrump-format-sqlite` | `SQLite format 3`. Walks every user table, redacts TEXT/BLOB cells via `UPDATE` + `VACUUM`. |
| `nsys` | `scrump-format-nsys` | NVIDIA `.nsys-rep` / `.ncu-rep`. tar envelope + format-aware SQLite handling for the inner DB. |
| `elf-core` | `scrump-format-core` | 64-bit LE ELF `ET_CORE`. Walks `PT_NOTE` for `NT_PRPSINFO` cmdline + `PT_LOAD` env pages. |
| `hprof` | `scrump-format-hprof` | Java HPROF. `JAVA PROFILE` header + record stream; tight UTF8 STRING chunks. |
| `jfr` | `scrump-format-jfr` | Java Flight Recorder. Walks chunks via `FLR\0` magic + `chunk_size`; refuses to touch chunk headers. |
| `pcap` | `scrump-format-pcap` | tcpdump pcap + pcapng. Per-packet payload chunks (Authorization headers, query strings); framing untouched. |
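
As an illustration of "per-packet payload chunks; framing untouched", here is a minimal sketch of walking a classic pcap buffer. It is not taken from `scrump-format-pcap`: the function name `payload_ranges` is hypothetical, and the sketch assumes the little-endian classic pcap layout only (24-byte global header, 16-byte record headers), ignoring pcapng and big-endian captures.

```rust
use std::ops::Range;

/// Sizes of the classic pcap global header and of each per-packet record header.
const GLOBAL_HEADER_LEN: usize = 24;
const RECORD_HEADER_LEN: usize = 16;

/// Return the byte ranges of packet payloads in a little-endian classic pcap
/// buffer. Hypothetical helper for illustration only.
fn payload_ranges(buf: &[u8]) -> Option<Vec<Range<usize>>> {
    // Little-endian pcap files store the magic 0xa1b2c3d4 as d4 c3 b2 a1.
    if buf.len() < GLOBAL_HEADER_LEN || buf[..4] != [0xd4, 0xc3, 0xb2, 0xa1] {
        return None;
    }
    let mut ranges = Vec::new();
    let mut pos = GLOBAL_HEADER_LEN;
    while pos < buf.len() {
        // Record header: ts_sec, ts_usec, incl_len, orig_len (4 bytes each).
        let header = buf.get(pos..pos + RECORD_HEADER_LEN)?;
        let incl_len =
            u32::from_le_bytes([header[8], header[9], header[10], header[11]]) as usize;
        let start = pos + RECORD_HEADER_LEN;
        let end = start.checked_add(incl_len)?;
        if end > buf.len() {
            return None; // truncated record
        }
        // Only the payload is scannable; the record header stays structural.
        ranges.push(start..end);
        pos = end;
    }
    Some(ranges)
}
```

Redacting only inside the returned ranges leaves every length field and timestamp as it was, which is why a capture processed this way would still be readable by tools that parse the framing.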

## Install

```sh
# From source (Rust 1.75+)
cargo install --path crates/scrump-cli

# Or grab a pre-built binary from the latest release (see the Releases tab
# for the current version; tarballs are signed and shasum'd):
gh release download --repo avifenesh/scrump --pattern '*-x86_64-unknown-linux-gnu.tar.gz'
```

Supported targets out of the box:

| Target | Tier |
|---|---|
| `x86_64-unknown-linux-gnu` | tier-1 (cross-compiled in CI release) |
| `aarch64-unknown-linux-gnu` | tier-1 (cross-compiled in CI release) |
| `aarch64-apple-darwin`     | tier-1 (cross-compiled in CI release) |

## Use

```sh
scrump scan some-file              # dry-run: report findings, never mutate
scrump scrub some-file             # redact in place (atomic tmp+rename)
scrump scrub some-file -o clean    # write clean copy elsewhere
scrump scrub some-file --backup    # also keep the original at *.orig
scrump scrub some-file --format perf    # force a specific format handler
scrump scrub some-file --rules-path my.yaml    # add custom rules
```
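
The "atomic tmp+rename" behaviour of an in-place scrub can be sketched with the standard library alone. This is an assumption about the general technique, not scrump's implementation: the helper name `write_atomically` and the `.scrump-tmp` suffix are made up for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replace `path` with `contents` so readers see either the old file or the
/// new one, never a partial write. Hypothetical helper, not scrump's code.
fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Keep the temporary file next to the target so the rename stays on one
    // filesystem; the ".scrump-tmp" suffix is an assumption for this sketch.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".scrump-tmp");
    let tmp_path = Path::new(&tmp_name);

    let mut tmp = File::create(tmp_path)?;
    tmp.write_all(contents)?;
    tmp.sync_all()?; // flush data to disk before the rename makes it visible
    drop(tmp);

    // Swap the finished file into place over the original.
    fs::rename(tmp_path, path)
}
```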

## How detection works

Two rule sets, plus hand-coded detectors:

1. **Curated default rules** (`crates/scrump-rules/rules/default.yaml`) —
   tightly-scoped patterns for the ML/inference ecosystem: GitHub PATs,
   HuggingFace, OpenAI, Anthropic, AWS, Slack, NVIDIA NGC, W&B, Stripe.
2. **Auto-extracted TruffleHog mirror** (`rules/trufflehog.yaml`,
   regenerated by `cargo run -p scrump-trufflehog-compat --bin th-extract`) —
   1,100+ rules covering every detector under `pkg/detectors/`.
3. **Hand-coded** detectors for things regex can't express alone —
   currently `JwtHsAware` which base64-decodes the JWT header and
   rejects HMAC-signed tokens, mirroring TruffleHog's filtering.

The engine supports `capture_index` for keyword-proximity patterns (e.g.
W&B's bare 40-hex token near a `wandb` keyword) and `post_filter` for
semantic constraints beyond regex.
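
The kind of check `JwtHsAware` is described as doing (decode the JWT header, look at the signing algorithm) can be sketched without any dependencies. The function names below are hypothetical and the string match on `"alg":"HS` is a deliberate simplification; the real detector may parse the header differently.

```rust
/// Decode base64url (the JWT alphabet), with or without padding.
/// Returns None on any character outside the alphabet.
fn base64url_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() * 3 / 4);
    let mut acc: u32 = 0; // bit accumulator
    let mut bits = 0; // number of valid bits currently held in `acc`
    for byte in input.bytes() {
        let value = match byte {
            b'A'..=b'Z' => byte - b'A',
            b'a'..=b'z' => byte - b'a' + 26,
            b'0'..=b'9' => byte - b'0' + 52,
            b'-' => 62,
            b'_' => 63,
            b'=' => break, // tolerate trailing padding
            _ => return None,
        };
        acc = (acc << 6) | u32::from(value);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// True when the token's header declares an HMAC algorithm (HS256/384/512).
/// A detector could use this to drop such tokens from its findings.
fn is_hmac_signed_jwt(token: &str) -> bool {
    let header_b64 = token.split('.').next().unwrap_or("");
    let header = match base64url_decode(header_b64) {
        Some(bytes) => bytes,
        None => return false,
    };
    // Strip whitespace so `"alg": "HS256"` and `"alg":"HS256"` both match.
    let compact: String = String::from_utf8_lossy(&header)
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect();
    compact.contains("\"alg\":\"HS")
}
```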

## Format support — how it stays structure-preserving

Each format crate implements:

```rust
pub trait Format: Send {
    fn name(&self) -> &'static str;
    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a>;
    fn apply(&mut self, hits: &[Hit]) -> Result<()>;
    fn to_bytes(&self) -> Result<Vec<u8>>;
}
```

The format decides which byte ranges are *scannable* (cmdline strings,
TEXT cells, packet payloads, chunk bodies) and which are *structural*
(magic words, length prefixes, varints, checksums). `apply` refuses to
redact structural bytes. The result: scrump can never produce a file
its own format parser couldn't parse.
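
To make the scannable/structural split concrete, here is a toy implementor of the trait above. Everything except the trait itself is an assumption: `Chunk`, `Hit`, and `Result` are hypothetical stand-ins for the real `scrump-core` types (which may be shaped differently), and `Toy` is an invented format with a 4-byte magic followed by one scannable body.

```rust
use std::ops::Range;

// Hypothetical stand-ins for the scrump-core types, for illustration only.
type Result<T> = std::result::Result<T, String>;

pub struct Chunk<'a> {
    pub offset: usize,   // position of `bytes` within the file
    pub bytes: &'a [u8], // scannable content
}

pub struct Hit {
    pub range: Range<usize>, // absolute byte range to redact
}

pub trait Format: Send {
    fn name(&self) -> &'static str;
    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a>;
    fn apply(&mut self, hits: &[Hit]) -> Result<()>;
    fn to_bytes(&self) -> Result<Vec<u8>>;
}

/// Invented format: a 4-byte magic (structural) plus one scannable body.
pub struct Toy {
    data: Vec<u8>,
}

const MAGIC_LEN: usize = 4;

impl Toy {
    pub fn new(data: Vec<u8>) -> Result<Self> {
        if data.len() < MAGIC_LEN {
            return Err("input shorter than the magic".to_string());
        }
        Ok(Toy { data })
    }
}

impl Format for Toy {
    fn name(&self) -> &'static str {
        "toy"
    }

    fn chunks<'a>(&'a self) -> Box<dyn Iterator<Item = Chunk<'a>> + 'a> {
        // Only the body is offered to the detector; the magic is never scanned.
        let body = Chunk { offset: MAGIC_LEN, bytes: &self.data[MAGIC_LEN..] };
        Box::new(std::iter::once(body))
    }

    fn apply(&mut self, hits: &[Hit]) -> Result<()> {
        // Validate every hit first so a rejected batch leaves the data untouched.
        for hit in hits {
            let r = &hit.range;
            if r.start < MAGIC_LEN || r.end > self.data.len() || r.start > r.end {
                return Err(format!("refusing to redact {}..{}", r.start, r.end));
            }
        }
        for hit in hits {
            self.data[hit.range.clone()].fill(0); // zero-fill, length unchanged
        }
        Ok(())
    }

    fn to_bytes(&self) -> Result<Vec<u8>> {
        Ok(self.data.clone())
    }
}
```

Because `apply` only overwrites bytes in place and rejects any range that touches the magic, the output of `to_bytes` has the same length and the same structural bytes as the input.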

## Compatibility evidence

scrump is validated against two third-party test corpora. The harnesses
live under `crates/scrump-{trufflehog,presidio}-compat/`.

### TruffleHog detectors

`cargo run -p scrump-trufflehog-compat --bin trufflehog-compat` walks
every `*_test.go` under TruffleHog's `pkg/detectors/`, parses each
parametrized test, and runs scrump against the test input.

**Last full run: 2,309 of 2,493 cases pass across 854 providers
(92.6%).** The remaining 184 are negative-case false-positives where
provider A's no-hit-expected input still trips provider B's
auto-extracted `PrefixRegex` (e.g. a `sugester` test input fires the
`tableau` rule). In a scrubbing context these are over-detections:
nothing TruffleHog catches is missed by scrump. CI gates on
`SCRUMP_TH_MAX_FAILURES=184`; the number should only be lowered
alongside rule fixes, and any increase fails the build. After the #9
rule curation, 82 structurally-broken patterns are dropped at load time
(see `TH_QUARANTINE` in `scrump-rules`). This is what drove the harness
from 201 failures down to 184 and the per-MB hit rate on real SQLite
log artifacts from ~85,000 to under 0.5.
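
The arithmetic and the gate described above can be sketched as follows. The function names are hypothetical, and how the real harness reads `SCRUMP_TH_MAX_FAILURES` (including what it does when the variable is unset) is an assumption.

```rust
use std::env;

/// Percentage of passing cases, e.g. 2,309 of 2,493.
fn pass_rate(passed: u32, total: u32) -> f64 {
    if total == 0 {
        return 0.0;
    }
    f64::from(passed) * 100.0 / f64::from(total)
}

/// Fail when the observed failure count exceeds the allowed budget.
fn gate(failures: u32, max_failures: u32) -> Result<(), String> {
    if failures > max_failures {
        Err(format!(
            "{} failures exceed the budget of {}",
            failures, max_failures
        ))
    } else {
        Ok(())
    }
}

/// Read the budget from SCRUMP_TH_MAX_FAILURES. Treating a missing or
/// non-numeric value as an error is a choice made for this sketch.
fn budget_from_env() -> Result<u32, String> {
    let raw = env::var("SCRUMP_TH_MAX_FAILURES").map_err(|e| e.to_string())?;
    raw.trim().parse::<u32>().map_err(|e| e.to_string())
}
```

With a budget of 184, a run with 184 failures passes and a run with 185 fails, which matches the "any increase fails the build" rule.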

### Microsoft Presidio (PII) cross-format

`cargo run -p scrump-presidio-compat --bin presidio-compat` takes every
Presidio recognizer test (52 recognizers, 671 cases), embeds each test
text into each of the 8 formats in the results table below (7 binary
formats + passthrough), and runs the detector against the embedded blob.

**Last full run: 617 of 671 (92.0%) pass, with an identical pass count
across all 8 formats**, which indicates the format wrapper is transparent
to detection. The remaining 54 failures all come from Presidio patterns
that rely on lookbehind or backreferences, neither of which Rust's
`regex` crate supports (IP recognizer, MAC, Canadian SIN with
backref-bound separator).

```
FORMAT             PASS     FAIL     SKIP    PASS%
--------------------------------------------------------
passthrough         617       54        0    92.0%
tar                 617       54        0    92.0%
perf                617       54        0    92.0%
sqlite              617       54        0    92.0%
elf-core            617       54        0    92.0%
hprof               617       54        0    92.0%
jfr                 617       54        0    92.0%
pcap                617       54        0    92.0%
```

## Local dev loop

If you have [`just`](https://github.com/casey/just) installed:

```sh
just check                  # fmt + clippy + tests + docs
just e2e                    # all 8 phase-gate scripts
just compat-trufflehog      # 864-provider parity (clones vendor/trufflehog on first run)
just compat-presidio        # 52 recognizers × 8 formats
just deny                   # cargo-deny supply-chain audit
just ci                     # everything CI runs, in order
```

Without `just`, see the recipes in the [`Justfile`](Justfile) for the
underlying `cargo` invocations.

## Phase gates

End-to-end gates live under `tests/`:

- `tests/e2e.sh` — phase 0 (passthrough on a planted-token text file)
- `tests/e2e_phase1.sh` — perf.data
- `tests/e2e_phase2.sh` — tar / tar.gz / tar.zst / zip
- `tests/e2e_phase3.sh` — sqlite + nsys-rep
- `tests/e2e_phase4.sh` — ELF core
- `tests/e2e_phase5.sh` — Java HPROF
- `tests/e2e_phase6.sh` — JFR
- `tests/e2e_phase7.sh` — pcap
- `tests/e2e_all.sh` — master gate; runs all 8

Each gate plants known token shapes, runs `scrump scrub`, then asserts
the file size is preserved, the format's magic / structural fields are
untouched, the format's native tooling still parses it, and no token
prefix remains in the raw bytes.
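
Three of those four assertions can be expressed on in-memory buffers, as in the sketch below. The gates themselves are shell scripts; this Rust helper, its name `check_scrubbed`, and its parameters are hypothetical, and the "native tooling still parses it" step is left out because it needs the external tool.

```rust
/// Check the byte-level invariants a phase gate asserts after a scrub.
/// `magic_len` is the number of leading structural bytes to compare and
/// `token_prefix` is the planted token's prefix (for example b"ghp_").
fn check_scrubbed(
    original: &[u8],
    scrubbed: &[u8],
    magic_len: usize,
    token_prefix: &[u8],
) -> Result<(), String> {
    // 1. File size is preserved.
    if scrubbed.len() != original.len() {
        return Err("file size changed".to_string());
    }
    // 2. The leading magic / structural bytes are untouched.
    if original.len() < magic_len || scrubbed[..magic_len] != original[..magic_len] {
        return Err("magic bytes changed".to_string());
    }
    // 3. No token prefix remains anywhere in the raw bytes.
    //    (`windows(0)` would panic, so an empty prefix is skipped.)
    if !token_prefix.is_empty()
        && scrubbed
            .windows(token_prefix.len())
            .any(|window| window == token_prefix)
    {
        return Err("token prefix still present".to_string());
    }
    Ok(())
}
```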

## Project layout

```
scrump/
├── crates/
│   ├── scrump-core/                # Format trait, Hit, Dispatcher
│   ├── scrump-detect/              # regex + entropy engine
│   ├── scrump-rules/               # curated + auto-extracted rule sets
│   ├── scrump-cli/                 # the `scrump` binary
│   ├── scrump-format-passthrough/  # text-and-anything fallback
│   ├── scrump-format-perf/         # PERFILE2
│   ├── scrump-format-tar/          # tar / zip / gz / zst (recursive)
│   ├── scrump-format-sqlite/       # SQLite3
│   ├── scrump-format-nsys/         # NVIDIA nsys-rep / ncu-rep
│   ├── scrump-format-core/         # ELF core dumps
│   ├── scrump-format-hprof/        # Java HPROF
│   ├── scrump-format-jfr/          # Java Flight Recorder
│   ├── scrump-format-pcap/         # pcap / pcapng
│   ├── scrump-test-fixtures/       # spec-compliant generators
│   ├── scrump-trufflehog-compat/   # 864-provider parity harness
│   └── scrump-presidio-compat/     # 8-format × 52-recognizer harness
├── tests/                          # phase 0..7 e2e gates
└── docs/                           # architecture, threat model
```

See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for the internal design,
and [`CONTRIBUTING.md`](CONTRIBUTING.md) for the format/detector
add-a-new-X checklists.

## Security

scrump is a security tool — please report vulnerabilities privately via
the process in [`SECURITY.md`](SECURITY.md).

## License

[Apache-2.0](LICENSE). Inspired by — but does not wrap — TruffleHog and
noseyparker (both Apache-2.0).