scrubbers 0.1.1

High-throughput redaction engine + CLI

ScrubbeRS

ScrubbeRS is a Rust-first, zero-copy (in-place) redaction engine with:

  • A stdin → stdout CLI for shell pipelines.
  • A Rust library API.
  • Optional Python and Node.js bindings.
  • Built-in high-confidence detector signatures for direct redaction.
  • Optional .scrub signature files for custom org-specific patterns.

Why this is fast

  • Redaction happens in place (&mut [u8]) with byte-mask filling.
  • Literal signatures are matched with Aho-Corasick (single-pass multi-pattern automaton).
  • Regex signatures use compiled regex::bytes::Regex and run against raw bytes.
  • Release profile is tuned (lto=fat, codegen-units=1, panic=abort).
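
The in-place idea from the bullets above can be sketched in a few lines of Python. This is an illustration only, not ScrubbeRS's implementation: mask_in_place and the token-shaped pattern are hypothetical, and Python's re over a bytearray stands in for regex::bytes::Regex over &mut [u8].

```python
import re

# Hypothetical pattern for illustration; not one of ScrubbeRS's built-in signatures.
TOKEN = re.compile(rb"ghp_[A-Za-z0-9]{36}")

def mask_in_place(buf: bytearray, pattern: re.Pattern, mask: bytes = b"*") -> int:
    """Overwrite every match in buf with the mask byte; return the match count."""
    # Collect spans first so the buffer is not mutated while the regex scans it.
    spans = [m.span() for m in pattern.finditer(buf)]
    for start, end in spans:
        # Same-length slice assignment: no reallocation, no second buffer.
        buf[start:end] = mask * (end - start)
    return len(spans)

buf = bytearray(b"prefix ghp_123456789012345678901234567890123456 suffix")
original_len = len(buf)
hits = mask_in_place(buf, TOKEN)
```

Because each secret is overwritten byte for byte, the buffer keeps its length and every offset outside the matched spans stays valid.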

CLI usage

# Build release binary
cargo build --release

# Pipe mode (stdin -> stdout)
cat app.log | ./target/release/scrubbers > redacted.log

# With custom signatures
cat app.log | ./target/release/scrubbers --scrub-file .scrub > redacted.log

# Custom mask byte
cat app.log | ./target/release/scrubbers --mask "#" > redacted.log

# Line-oriented streaming mode for log pipelines
tail -F app.log | ./target/release/scrubbers --stream-lines

.scrub format

Each non-empty, non-comment line is either:

  1. name=regex_or_literal
  2. regex_or_literal (auto-named)

Example:

# redact internal session tokens
session_token=sess_[A-Za-z0-9]{32}

# redact literal phrase
MY_INTERNAL_SECRET_PREFIX
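
The two line forms can be read with a small parser. The sketch below is an assumption-laden illustration rather than the crate's parser: parse_scrub is a hypothetical helper, the custom_<n> auto-naming scheme is invented for the example, "#" is taken as the comment marker from the example above, and splitting on the first "=" does not handle an unnamed pattern that itself contains "=".

```python
from typing import List, Tuple

def parse_scrub(text: str) -> List[Tuple[str, str]]:
    """Parse .scrub text into (name, pattern) pairs."""
    signatures: List[Tuple[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "#" comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, pattern = line.partition("=")
        if sep and name and pattern:
            # Form 1: name=regex_or_literal
            signatures.append((name.strip(), pattern.strip()))
        else:
            # Form 2: bare pattern. The naming scheme here is hypothetical.
            signatures.append((f"custom_{len(signatures)}", line))
    return signatures

SAMPLE = """\
# redact internal session tokens
session_token=sess_[A-Za-z0-9]{32}

# redact literal phrase
MY_INTERNAL_SECRET_PREFIX
"""
parsed = parse_scrub(SAMPLE)
```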

Rust API

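use std::io::Cursor;
use scrubbers::Scrubber;

// Assumes the crate's error types implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let scrubber = Scrubber::new()?;

    // Redact a whole buffer in place.
    let mut bytes = b"ghp_123456789012345678901234567890123456".to_vec();
    scrubber.scrub_in_place(&mut bytes);

    // Redact newline-delimited input from a reader into a writer.
    let mut output = Vec::new();
    scrubber.scrub_lines(
        Cursor::new(b"safe\nprefix ghp_123456789012345678901234567890123456 suffix\n"),
        &mut output,
    )?;
    Ok(())
}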

Python bindings

Install from PyPI once published:

uv add scrubbers

Build Python distributions locally with uv:

uv build

On Linux, a wheel built locally with uv build is useful for smoke testing but is not guaranteed to be uploadable to PyPI. The release workflow builds Linux wheels in a manylinux container.

Exposed functions:

  • scrubbers.scrub_bytes(data: bytes) -> bytes
  • scrubbers.scrub_text(data: str) -> str
  • scrubbers.scrub_lines_bytes(data: bytes) -> bytes
  • scrubbers.scrub_lines_text(data: str) -> str

The scrub_lines_* helpers apply the library's newline-delimited streaming path over the provided input.

Example:

import scrubbers

scrubbers.scrub_text("prefix ghp_123456789012345678901234567890123456 suffix")
# "prefix **************************************** suffix"

scrubbers.scrub_lines_text("safe\nprefix ghp_123456789012345678901234567890123456 suffix\n")
# "safe\nprefix **************************************** suffix\n"
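
To make "newline-delimited streaming" concrete, here is one plausible reading of that path in pure Python. It does not call the scrubbers package and is not the library's code: scrub_lines_sketch and the pattern are hypothetical, and the real path may buffer or split differently.

```python
import io
import re
from typing import BinaryIO

# Hypothetical stand-in pattern; the real engine uses its own signature set.
TOKEN = re.compile(rb"ghp_[A-Za-z0-9]{36}")

def scrub_lines_sketch(reader: BinaryIO, writer: BinaryIO, mask: bytes = b"*") -> None:
    """Scrub one newline-delimited record at a time and forward it immediately."""
    for line in reader:  # binary streams yield each line with its trailing b"\n"
        # Replace every match with a same-length run of the mask byte.
        writer.write(TOKEN.sub(lambda m: mask * len(m.group()), line))

src = io.BytesIO(b"safe\nprefix ghp_123456789012345678901234567890123456 suffix\n")
dst = io.BytesIO()
scrub_lines_sketch(src, dst)
result = dst.getvalue()
```

Processing one line at a time keeps memory bounded by the longest line, which is what makes this shape suitable for inputs such as tail -F that never end.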

Smoke test the built wheel and sdist locally:

python3 scripts/test_python_package.py --artifact all

You can still exercise the raw extension crate directly with:

cargo build --release --manifest-path bindings/python/Cargo.toml

Node.js bindings

Build the Node extension crate:

cargo build --release --manifest-path bindings/node/Cargo.toml

Exposed functions:

  • scrubBuffer(buf: Buffer) -> Buffer
  • scrubLinesBuffer(buf: Buffer) -> Buffer

scrubLinesBuffer(...) applies the library's newline-delimited streaming path over the provided buffer.

Example:

const { scrubBuffer, scrubLinesBuffer } = require("./scrubbers.node");

scrubBuffer(Buffer.from("prefix ghp_123456789012345678901234567890123456 suffix"))
// <Buffer 70 72 65 66 69 78 20 2a ...>

scrubLinesBuffer(
  Buffer.from("safe\nprefix ghp_123456789012345678901234567890123456 suffix\n", "utf8"),
).toString("utf8");
// "safe\nprefix **************************************** suffix\n"

Run binding smoke tests locally:

python3 scripts/test_bindings.py --binding all

Verify the publishable Python package locally:

python3 scripts/test_python_package.py --artifact all

Benchmark the Python binding in a logging-style path:

python3 scripts/bench_python_bindings.py

Publishing

Release publishing is tag-driven through publish.yml. Create and push the release tag with scripts/release.py:

python3 scripts/release.py

Dry run the release flow first:

python3 scripts/release.py --dry-run --verbose

The publish.yml workflow:

  • builds the Linux wheel in a manylinux2014 container and smoke tests it before upload
  • builds and smoke tests Python wheels on macOS and Windows with uv build
  • builds and smoke tests a source distribution
  • publishes Python distributions to PyPI with Trusted Publishing
  • verifies and publishes the scrubbers crate to crates.io

Local preflight checks:

cargo publish --dry-run --locked --package scrubbers
python3 scripts/test_python_package.py --artifact all

scripts/release.py reads the version from the Cargo manifests, checks for tracked local changes, switches to main, pulls fast-forward from origin, pushes main, creates v<version>, and pushes the tag.

Before the release workflow can publish, configure trusted publishers on both registries:

  • PyPI: add the GitHub repository/workflow as a trusted publisher for the scrubbers project and create the pypi environment.
  • crates.io: publish the crate manually once, then add this repository/workflow as a trusted publisher for scrubbers and create the crates-io environment.

TruffleHog parity workflow

TruffleHog detector coverage is tracked in src/generated_trufflehog.rs. Regenerate and verify it with:

python scripts/sync_trufflehog_signatures.py
go run ./scripts/sync_trufflehog_pattern_fixtures.go
python scripts/verify_trufflehog_coverage.py

CI runs these commands and fails if:

  • any upstream detector directory is missing from our generated signature surface,
  • generated signatures are missing when tests run, or
  • extracted positive fixtures are missing when tests run.

The generated TruffleHog data is tracked for parity and audit purposes, but it is not applied by default as raw redaction rules. Many upstream detectors rely on keyword gating and verifier callbacks, and running their extracted regexes directly creates false positives.

src/generated_trufflehog.rs is treated as a parity inventory, not a public API surface. Generated signature names are content-addressed hashes of the pattern data, so reordering upstream extraction no longer renumbers the whole file.
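
The benefit of content-addressed names can be shown with a short sketch. The hash function, the th_ prefix, and the 12-character truncation are all assumptions made for illustration; the generator script may use a different scheme.

```python
import hashlib

def signature_name(pattern: str, prefix: str = "th_") -> str:
    """Derive a name from the pattern text alone (hypothetical scheme)."""
    digest = hashlib.sha256(pattern.encode("utf-8")).hexdigest()
    return prefix + digest[:12]

patterns = ["sess_[A-Za-z0-9]{32}", "MY_INTERNAL_SECRET_PREFIX"]

# Names depend only on content, so extraction order is irrelevant.
names_forward = {p: signature_name(p) for p in patterns}
names_reversed = {p: signature_name(p) for p in reversed(patterns)}
```

With index-based names such as sig_0 and sig_1, inserting one upstream pattern would shift every later name; with hashed names only the new entry appears in the diff.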

The extracted positive fixtures are also used in the Rust test suite as inline redaction cases. Each case builds literal secret fragments from the upstream positive example and asserts the scrubber preserves length while masking the matched spans in place.
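
The property those cases assert can be written down as a small checker. This is a Python sketch of the invariant, not the Rust test code: check_masked_in_place, the sample input, and the span are hypothetical.

```python
from typing import Iterable, Tuple

def check_masked_in_place(
    before: bytes,
    after: bytes,
    spans: Iterable[Tuple[int, int]],
    mask: int = ord("*"),
) -> None:
    """Assert length is preserved, spans are masked, and other bytes are untouched."""
    assert len(after) == len(before), "redaction must preserve length"
    masked = set()
    for start, end in spans:
        masked.update(range(start, end))
    for i, (old, new) in enumerate(zip(before, after)):
        if i in masked:
            assert new == mask, f"byte {i} inside a secret span was not masked"
        else:
            assert new == old, f"byte {i} outside secret spans was modified"

before = b"key=SECRETVALUE;"
after = b"key=" + b"*" * 11 + b";"
check_masked_in_place(before, after, [(4, 15)])  # passes silently
```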

Benchmark

Run the native Criterion benchmark:

cargo bench --bench throughput -- --noplot

It generates a 64 MiB synthetic payload, injects multiple secret shapes, and compares:

  • raw memcpy
  • straight std::io::copy pass-through into a fixed buffer
  • scrubber/in_place
  • scrubber/stream_lines
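
The shape of that measurement can be sketched in Python at a smaller scale. Nothing here reproduces the Criterion benchmark: build_payload, the filler text, the secret shape, and the 1 MiB size are assumptions, and the timing uses Python's re rather than the scrubber.

```python
import re
import time

# Hypothetical stand-in for one injected "secret shape".
TOKEN = re.compile(rb"ghp_[A-Za-z0-9]{36}")
SECRET = b"ghp_" + b"A1" * 18  # 4-byte prefix + 36 alphanumeric bytes

def build_payload(size: int, every: int) -> bytearray:
    """Fill size bytes with log-like filler and overwrite a secret every `every` bytes."""
    filler = b"INFO request handled ok\n"
    payload = bytearray((filler * (size // len(filler) + 1))[:size])
    for offset in range(0, size - len(SECRET), every):
        payload[offset:offset + len(SECRET)] = SECRET
    return payload

payload = build_payload(1 << 20, 64 * 1024)  # 1 MiB here, not 64 MiB

start = time.perf_counter()
redacted = TOKEN.sub(lambda m: b"*" * len(m.group()), payload)
elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading

mib_per_s = (len(payload) / (1 << 20)) / elapsed
```

Comparing mib_per_s against a plain copy of the same payload gives the overhead ratio that the memcpy and std::io::copy baselines provide in the native benchmark.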

For a quick single-number smoke run, you can still use:

cargo run --release --bin scrub-bench