gaze-document
Reversible PII pseudonymization for documents — image + single-page PDF →
clean Markdown + a restorable gaze::Manifest + an OCR/PII report. Powers
the gaze document clean CLI verb on top of the same gaze-pii runtime
that handles streaming and structured inputs.
The crate inherits the project's north star: zero PII
leaks from agent to data owner, deterministic detection, and a manifest
contract that always restores. OCR is a subprocess call to the standard
tesseract binary so adopters never need a native build toolchain.
Install
Library
[dependencies]
gaze-document = "0.7.2"
CLI
The document feature is opt-in on gaze-cli so the default install stays
free of OCR / PDF dependencies.
Runtime requirements
Tesseract
gaze-document shells out to the tesseract CLI (Tesseract 4 or 5).
| Platform | Install |
|---|---|
| macOS | brew install tesseract |
| Debian/Ubuntu | sudo apt-get install tesseract-ocr |
| Fedora | sudo dnf install tesseract |
| Arch | sudo pacman -S tesseract |
| Windows | winget install --id UB-Mannheim.TesseractOCR |
If the binary is missing, clean() returns
DocumentError::TesseractNotFound with a per-OS install hint in the
message — fail-loud by design (Axis 1 reliability).
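A pre-flight check can surface the missing binary before any document is processed. The following is a minimal sketch using only the standard library; it is not the crate's own detection logic, and the `Probe` enum and `probe_binary` function are hypothetical names introduced for illustration.

```rust
use std::io::ErrorKind;
use std::process::{Command, Stdio};

/// Outcome of probing for an external binary (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Probe {
    Found,
    NotFound,
    Failed(String),
}

/// Spawn `<binary> --version` and classify the result.
/// Any successful spawn counts as `Found`, whatever the exit status.
fn probe_binary(binary: &str) -> Probe {
    match Command::new(binary)
        .arg("--version")
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
    {
        Ok(_) => Probe::Found,
        // The OS could not locate the executable on PATH.
        Err(err) if err.kind() == ErrorKind::NotFound => Probe::NotFound,
        // Anything else (permissions, resource limits) is reported verbatim.
        Err(err) => Probe::Failed(err.to_string()),
    }
}

fn main() {
    match probe_binary("tesseract") {
        Probe::Found => println!("tesseract is on PATH"),
        Probe::NotFound => eprintln!("tesseract not found; see the install table above"),
        Probe::Failed(reason) => eprintln!("could not run tesseract: {reason}"),
    }
}
```

Running a probe like this at startup keeps the failure close to its cause instead of deferring it to the first OCR call.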
pdfium (only for PDF input)
PDF rasterization uses pdfium-render,
which loads the pdfium shared library at runtime. Prebuilt binaries
for every major OS / arch are published by
bblanchon/pdfium-binaries:
| Platform | What to do |
|---|---|
| macOS (arm64) | Download pdfium-mac-arm64.tgz; place lib/libpdfium.dylib on DYLD_LIBRARY_PATH or in /usr/local/lib. |
| macOS (x64) | Download pdfium-mac-x64.tgz; same placement. |
| Linux (x64) | Download pdfium-linux-x64.tgz; place lib/libpdfium.so on LD_LIBRARY_PATH or in /usr/local/lib. |
| Windows | Download pdfium-win-x64.zip; place pdfium.dll on PATH or next to the binary. |
Image-only workflows (PNG / JPG) do not need pdfium.
Quickstart (library)
use std::path::Path;
let bundle = clean?;
// Tokenized Markdown safe to hand to an LLM.
let _ = &bundle.clean_markdown;
// Restorable manifest — pair with a `gaze::Session` to round-trip.
let _ = &bundle.manifest;
// Provenance: OCR confidence + PII counts.
println!;
# Ok::
Quickstart (CLI)
Writes:
safe/
clean.md # OCR text with PII replaced by reversible tokens
manifest.json # gaze::Manifest — restorable, canonical
report.json # BundleReport — OCR + PII counts + provenance
Stdout carries a one-line JSON summary so callers can pipe it.
Bundle on-disk shapes
- clean.md: Markdown with a short header (# gaze-document safe bundle) plus the OCR text after token substitution.
- manifest.json: serialized gaze::Manifest (re-exported from gaze-types). Compatible with gaze restore and the rest of the gaze runtime.
- report.json: BundleReport. Schema versioned via bundle_version: u32 = 1; the field set is #[non_exhaustive], so additive fields are SemVer-safe. Includes OCR confidence, per-class PII counts, PDF metadata, and the source kind.
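A consumer may want to confirm that a bundle directory is complete before handing clean.md to a model. The sketch below checks only for the three file names listed above, using the standard library; `missing_bundle_files` is a hypothetical helper, not part of the crate, and it does not validate the files' contents.

```rust
use std::fs;
use std::path::Path;

/// File names this README lists for an on-disk bundle.
const BUNDLE_FILES: [&str; 3] = ["clean.md", "manifest.json", "report.json"];

/// Return the bundle files that are absent from `dir`.
fn missing_bundle_files(dir: &Path) -> Vec<&'static str> {
    BUNDLE_FILES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}

fn main() -> std::io::Result<()> {
    // Build a throwaway directory so the sketch runs without OCR.
    let dir = std::env::temp_dir().join("gaze-document-bundle-sketch");
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("clean.md"), "# gaze-document safe bundle\n")?;
    // Placeholder content only; a real manifest is a serialized gaze::Manifest.
    fs::write(dir.join("manifest.json"), "{}")?;
    // report.json is deliberately left out to show the failure path.
    let _ = fs::remove_file(dir.join("report.json"));

    println!("missing: {:?}", missing_bundle_files(&dir));

    fs::remove_dir_all(&dir)?;
    Ok(())
}
```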
OCR brittleness + normalization
OCR is a lossy stage. Tesseract — like every engine — sometimes inserts
spurious whitespace between adjacent glyphs that share kerning. The most
common artifact in practice (and the most dangerous for axis-1
reliability) is a single space inserted next to the @ of an email:
jane.doe@example.com → "jane.doe @example.com"
The corrupted form is still unmistakably an email to a human or LLM but
slips past strict \S+@\S+ recognizers. To keep the bundle safe to hand
to a model, gaze-document applies a narrow normalization pass between
the OCR adapter and the redact pipeline.
Normalization rules
The full rule set is documented in source at
crates/gaze-document/src/ocr/normalize.rs. Today there is exactly one
rule:
- Email separator repair. Collapse intra-line horizontal whitespace immediately adjacent to @ when both sides are non-whitespace. Pattern: (\S)[ \t]*@[ \t]*(\S) → $1@$2. Newline-adjacent @ remains untouched.
More rules will land here as new artifact classes are discovered.
Every rule lives next to the others in
ocr::normalize, doc-commented with its trigger, scope, and a worked
example.
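To make the email separator rule concrete, here is a std-only sketch of the same idea. It is not the code in ocr::normalize: the function name `repair_email_separators` is hypothetical, it scans characters by hand instead of using the regex shown above, and its behaviour on unusual inputs (for example several @ signs in a row) may differ from the regex.

```rust
/// Horizontal whitespace that the rule is allowed to collapse.
fn is_gap(c: char) -> bool {
    c == ' ' || c == '\t'
}

/// Collapse spaces and tabs around `@` when both neighbours are non-whitespace.
fn repair_email_separators(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '@' {
            out.push(chars[i]);
            i += 1;
            continue;
        }
        // Walk left over the gap to find the character before it.
        let mut left = i;
        while left > 0 && is_gap(chars[left - 1]) {
            left -= 1;
        }
        // Walk right over the gap to find the character after it.
        let mut right = i + 1;
        while right < chars.len() && is_gap(chars[right]) {
            right += 1;
        }
        let left_ok = left > 0 && !chars[left - 1].is_whitespace();
        let right_ok = right < chars.len() && !chars[right].is_whitespace();
        if left_ok && right_ok {
            // Drop the gap already copied to `out`, then skip the gap after `@`.
            while out.ends_with(is_gap) {
                out.pop();
            }
            out.push('@');
            i = right;
        } else {
            // Line start, line end, or newline-adjacent: leave untouched.
            out.push('@');
            i += 1;
        }
    }
    out
}

fn main() {
    let repaired = repair_email_separators("Contact: jane.doe @example.com");
    println!("{repaired}");
}
```

Because only spaces and tabs are treated as a gap, an @ that sits directly after or before a newline is never joined across lines, which matches the scope stated in the rule.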
Brittleness limit
gaze-document assumes mostly-clean OCR — text where most glyphs
are recognized, line breaks are preserved, and only the documented
narrow artifacts (currently: whitespace around @) intrude on PII
shapes. Bundles produced from low-DPI rasterization, heavy noise, or
non-Latin scripts without the right --lang setting may still leak.
Two mitigations land at the test boundary so future drift fails loudly:
- The tests/e2e.rs fixtures assert with belt-and-braces negative substring checks (!contains("@example.com"), !contains("Jane Doe"), !contains("555-0142")) in addition to the positive :Email_, :Name_, :Custom:phone_ token assertions.
- BundleReport.ocr_mean_confidence is always surfaced to adopters unmodified (no silent floor) so downstream gates can route low-confidence bundles for human review.
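Adopters can apply both mitigations on their own side. The sketch below is illustrative only: `leaked`, `route`, and the `Route` enum are hypothetical names, the confidence is passed as a bare `f32` (the real field type and scale of ocr_mean_confidence are assumptions here), and the threshold is whatever the caller chooses in the same units as the report.

```rust
/// Raw PII strings from the fixtures named above; none may survive cleaning.
const FORBIDDEN: [&str; 3] = ["@example.com", "Jane Doe", "555-0142"];

/// Collect every forbidden substring that is still present in `markdown`.
fn leaked<'a>(markdown: &str, forbidden: &[&'a str]) -> Vec<&'a str> {
    forbidden
        .iter()
        .copied()
        .filter(|needle| markdown.contains(*needle))
        .collect()
}

/// Where a bundle goes next (hypothetical routing decision).
#[derive(Debug, PartialEq)]
enum Route {
    Release,
    HumanReview,
}

/// Release only when nothing leaked and OCR confidence meets the caller's bar.
fn route(markdown: &str, ocr_mean_confidence: f32, min_confidence: f32) -> Route {
    let clean = leaked(markdown, &FORBIDDEN).is_empty();
    if clean && ocr_mean_confidence >= min_confidence {
        Route::Release
    } else {
        Route::HumanReview
    }
}

fn main() {
    // Token suffixes such as `_1` are made up for this example.
    let markdown = "Reach :Name_1 at :Email_1 or :Custom:phone_1.";
    println!("leaks: {:?}", leaked(markdown, &FORBIDDEN));
    println!("route: {:?}", route(markdown, 0.62, 0.80));
}
```

Keeping the negative substring check separate from the confidence gate means a high-confidence bundle with a surviving raw value is still held back.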
If you observe a new artifact class slipping through, file an issue
with the OCR output and the expected normalization shape; the fix
belongs in ocr::normalize alongside the existing rules.
MCP feature
Enable mcp to register two agent-tier tools with gaze-mcp-core:
gaze_read_text for already-extracted text and gaze_read_file for PNG,
JPG, or PDF paths. Hosts still call them through PiiEnvelope::dispatch,
so args, responses, manifest rows, and auth stay on the MCP chokepoint.
use std::sync::Arc;
use ;
use ToolRegistry;
use ;
let mut registry = new;
register_tools?;
let frontend = stdio;
# Ok::
Both tools return a JSON object:
gaze_read_file defaults to a 25 MiB input cap. Override it with
GazeReadFile::with_max_file_size(bytes) or GazeReadOpts.
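Hosts that want to reject oversized inputs before they reach the tool can apply the same kind of cap themselves. This is a rough std-only sketch, not the crate's implementation: `check_file_size` and `DEFAULT_MAX_FILE_SIZE` are hypothetical names, and treating a file exactly at the cap as allowed is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// 25 MiB, matching the default cap described above.
const DEFAULT_MAX_FILE_SIZE: u64 = 25 * 1024 * 1024;

/// Reject inputs larger than `max_bytes` without reading them into memory.
fn check_file_size(path: &Path, max_bytes: u64) -> io::Result<u64> {
    let len = fs::metadata(path)?.len();
    if len > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} is {len} bytes; cap is {max_bytes}", path.display()),
        ));
    }
    Ok(len)
}

fn main() -> io::Result<()> {
    // A small temporary file stands in for a scanned document.
    let path = std::env::temp_dir().join("gaze-document-size-sketch.bin");
    fs::write(&path, b"0123456789")?;

    match check_file_size(&path, DEFAULT_MAX_FILE_SIZE) {
        Ok(len) => println!("accepted: {len} bytes"),
        Err(err) => eprintln!("rejected: {err}"),
    }

    fs::remove_file(&path)?;
    Ok(())
}
```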
gaze mcp install is planned as a separate gaze-cli flow; this crate only
provides the opt-in tool implementations.
Feature flags
| Feature | Default | What it enables |
|---|---|---|
| ocr-tesseract | yes | Tesseract subprocess OCR backend + clean() entry. |
| pdf-input | yes | pdfium-render PDF rasterization (single page). |
| mcp | no | gaze_read_file + gaze_read_text Tool impls. |
| extract-docling | no | Reserved for a future Docling layout adapter. |
| render-image | no | Reserved for a future redacted-preview renderer. |
The extract-docling and render-image features are intentionally empty
in v0.0.x so adopters can pin against the eventual flag names early.
License
Apache-2.0