pii-masker
Rust port of the HydroXai pii-masker library, built around the same DeBERTa token-classification model and packaged as both:
- a reusable Rust library
- a stdin-friendly CLI
This crate is model-only. It does not add a regex fallback layer on top of the model output.
Install from crates.io:
What Is Bundled
The crate embeds the small model artifacts it needs at compile time:
- config.json
- tokenizer.json
The large weight file is not embedded in the binary by default:
model.safetensors
That keeps the compiled binary reasonably small and makes container deployment much cleaner.
Model Resolution
When you call PiiMasker::new(), the crate resolves weights in this order:
- PII_MASKER_MODEL_WEIGHTS
- model/model.safetensors inside the repo or deployed app directory
- Hugging Face download from hydroxai/pii_model_weight
If you want deterministic deployment behavior, set the weights path explicitly with the builder or set PII_MASKER_MODEL_WEIGHTS.
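The lookup order above can be sketched with the standard library alone. This is an illustration, not the crate's actual code: `WeightsSource`, `resolve_weights`, and `resolve_from_process_env` are hypothetical names, and the sketch only decides where weights would come from without downloading anything.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Where the weights would be loaded from (hypothetical type, not the crate's API).
#[derive(Debug, PartialEq)]
pub enum WeightsSource {
    /// Path taken from the PII_MASKER_MODEL_WEIGHTS environment variable.
    EnvPath(PathBuf),
    /// model/model.safetensors found under the app directory.
    LocalFile(PathBuf),
    /// Nothing found locally: fall back to the named Hugging Face repo.
    HubDownload(&'static str),
}

/// Apply the three-step order: explicit env value, then local file, then download.
pub fn resolve_weights(env_value: Option<String>, app_dir: &Path) -> WeightsSource {
    if let Some(path) = env_value {
        return WeightsSource::EnvPath(PathBuf::from(path));
    }
    let local = app_dir.join("model").join("model.safetensors");
    if local.is_file() {
        return WeightsSource::LocalFile(local);
    }
    WeightsSource::HubDownload("hydroxai/pii_model_weight")
}

/// Convenience wrapper that reads the real process environment.
pub fn resolve_from_process_env(app_dir: &Path) -> WeightsSource {
    resolve_weights(env::var("PII_MASKER_MODEL_WEIGHTS").ok(), app_dir)
}
```

Passing the environment value in as a parameter keeps the ordering logic testable without mutating process-wide state.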
Library Usage
Add the crate as a dependency:
[dependencies]
pii-masker = "0.1.0"
Basic usage:
use PiiMasker;
Example output:
John Doe lives at [ADDRESS].
The main library types are:
- PiiMasker
- PiiMaskerBuilder
- MaskResult
- PiiEntity
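To make the relationship between detected entities and the masked string concrete, here is a minimal sketch of span-based masking. It does not reproduce the crate's types: `Span` and `apply_mask` are hypothetical, and the byte-offset representation is an assumption chosen for illustration.

```rust
/// A detected span: byte offsets into the input plus a label.
/// Hypothetical shape, not the crate's PiiEntity.
pub struct Span {
    pub start: usize,
    pub end: usize,
    pub label: String,
}

/// Replace each span with "[LABEL]", leaving the rest of the text untouched.
/// Spans that are empty, overlapping, out of range, or not on UTF-8
/// character boundaries are skipped rather than causing a panic.
pub fn apply_mask(text: &str, spans: &[Span]) -> String {
    let mut ordered: Vec<&Span> = spans.iter().collect();
    ordered.sort_by_key(|s| s.start);

    let mut out = String::with_capacity(text.len());
    let mut cursor = 0;
    for span in ordered {
        let in_range = span.start >= cursor && span.start < span.end && span.end <= text.len();
        if !in_range || !text.is_char_boundary(span.start) || !text.is_char_boundary(span.end) {
            continue;
        }
        // Copy the unmasked text before the span, then the placeholder.
        out.push_str(&text[cursor..span.start]);
        out.push('[');
        out.push_str(&span.label);
        out.push(']');
        cursor = span.end;
    }
    // Copy whatever follows the last masked span.
    out.push_str(&text[cursor..]);
    out
}
```

Given an ADDRESS span covering the street address in a sentence, this produces output of the same form as the example above, with the address replaced by `[ADDRESS]`.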
CLI Usage
Run the CLI in release mode:
Pipe text on stdin:
|
Or pass text as an argument:
Use a specific weights file:
|
Or place the weights at:
model/model.safetensors
and let PiiMasker::new() or the CLI pick them up automatically.
Plain-text output:
John Doe lives at [ADDRESS].
JSON output:
Performance Notes
Use --release for anything real.
cargo run --bin pii-mask uses the debug profile, and model startup is much slower there. The model is also loaded fresh for each process invocation, so short-lived CLI runs pay the startup cost every time.
For services or web apps:
- construct PiiMasker once at process startup
- reuse it for all requests
- avoid rebuilding the model for each call
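One way to get "construct once, reuse everywhere" in a service is a process-wide `std::sync::OnceLock` (Rust 1.70 or later). The sketch below uses a stand-in type, since the real constructor's signature is not shown here: `ExpensiveMasker`, `shared_masker`, and `handle_request` are hypothetical, and the `LOADS` counter exists only to show that construction happens a single time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the stand-in model was built.
pub static LOADS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a masker that is expensive to construct.
/// Hypothetical type; in a real service this would wrap the library's masker.
pub struct ExpensiveMasker {
    placeholder: String,
}

impl ExpensiveMasker {
    /// Pretend to load model weights; only records that a load happened.
    fn load() -> Self {
        LOADS.fetch_add(1, Ordering::SeqCst);
        ExpensiveMasker { placeholder: "[REDACTED]".to_string() }
    }

    /// Toy masking: replaces the literal word "secret" with the placeholder.
    pub fn mask(&self, text: &str) -> String {
        text.replace("secret", &self.placeholder)
    }
}

/// Process-wide instance, initialized on first use and shared afterwards.
static MASKER: OnceLock<ExpensiveMasker> = OnceLock::new();

/// Returns the shared instance, building it only on the first call.
pub fn shared_masker() -> &'static ExpensiveMasker {
    MASKER.get_or_init(ExpensiveMasker::load)
}

/// A request handler reuses the shared instance instead of rebuilding it.
pub fn handle_request(body: &str) -> String {
    shared_masker().mask(body)
}
```

Calling `shared_masker()` once during startup moves the load cost out of the first request. An `Arc` passed through application state is an equally reasonable design if a global is undesirable.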
Container Deployment
The recommended deployment shape is:
- compile your app or CLI in a builder stage
- copy the binary into a slim runtime image
- copy model.safetensors into the image filesystem
- set PII_MASKER_MODEL_WEIGHTS
Example Dockerfile:
FROM rust:1.86-bookworm AS builder
WORKDIR /src
COPY pii-masker ./pii-masker
WORKDIR /src/pii-masker
RUN cargo build --release --bin pii-mask
FROM debian:bookworm-slim
WORKDIR /app
COPY --from=builder /src/pii-masker/target/release/pii-mask /usr/local/bin/pii-mask
COPY model.safetensors /app/model/model.safetensors
ENV PII_MASKER_MODEL_WEIGHTS=/app/model/model.safetensors
ENTRYPOINT ["pii-mask"]
For local development, a simple layout is:
pii-masker/
  model/
    model.safetensors
For a server that uses the library instead of the CLI, the pattern is the same:
- copy your server binary into the image
- copy model.safetensors
- set PII_MASKER_MODEL_WEIGHTS
- initialize PiiMasker once on startup
Single-Binary Deployment
If you absolutely need a single self-contained binary, you can embed model.safetensors with include_bytes! and switch to a buffered safetensors loader.
That is possible, but it is usually a bad tradeoff:
- the binary grows by roughly 700 MB
- cold start gets worse
- rebuilds get slower
- container layering gets less efficient
Bundling the model as a file in the image is the better default.
Testing
Run the library and CLI tests:
The test suite includes:
- pure unit tests for label normalization and span trimming
- optional model-backed smoke tests when PII_MASKER_TEST_MODEL_WEIGHTS or model/model.safetensors is available
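The crate's helpers for label normalization and span trimming are not shown here, so the following is only one plausible reading of those terms. It assumes the model emits BIO-style tags such as `B-ADDRESS`, and that trimming means dropping leading and trailing whitespace from a byte range. `normalize_label` and `trim_span` are hypothetical names.

```rust
/// Strip a BIO prefix ("B-" or "I-") and upper-case the label.
/// Assumes BIO-style tags; other labels pass through upper-cased.
pub fn normalize_label(raw: &str) -> String {
    let stripped = raw
        .strip_prefix("B-")
        .or_else(|| raw.strip_prefix("I-"))
        .unwrap_or(raw);
    stripped.to_ascii_uppercase()
}

/// Shrink a byte range so it no longer starts or ends with whitespace.
/// Returns None for invalid ranges or ranges that contain only whitespace.
pub fn trim_span(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    // `get` returns None for out-of-range or non-boundary offsets.
    let slice = text.get(start..end)?;
    let leading = slice.len() - slice.trim_start().len();
    let trailing = slice.len() - slice.trim_end().len();
    let trimmed_start = start + leading;
    let trimmed_end = end - trailing;
    if trimmed_start < trimmed_end {
        Some((trimmed_start, trimmed_end))
    } else {
        None
    }
}
```

Functions like these depend only on strings and offsets, which is why they can be covered by pure unit tests without model weights.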
Publishing
This repo follows the same basic crates.io flow used in several of the other Rust projects in ~/code/rust:
- ci.yml runs format, clippy, tests, and release build on pushes to main and pull requests
- on pushes to main, the workflow checks whether the current version already exists on crates.io
- if the version does not exist and CARGO_REGISTRY_TOKEN is configured in GitHub Actions secrets, the crate is published automatically
- release.yml creates a GitHub Release when you push a vX.Y.Z tag
To enable crates.io publishing in GitHub Actions, add this repository secret:
CARGO_REGISTRY_TOKEN
Example release flow:
- Bump the version in Cargo.toml.
- Merge to main.
- CI publishes to crates.io if that version is not already published.
- Push a tag like v0.1.0 to create the GitHub release page.
Current Behavior
This crate mirrors the model behavior closely. If the underlying model misses or truncates an entity, the Rust port will do the same.
For example, phone-number masking is currently weaker than email or address masking because the model itself is weaker there. This crate does not paper over that with regexes.