gaze-document

Reversible PII pseudonymization for documents — image + single-page PDF → clean Markdown + a restorable gaze::Manifest + an OCR/PII report. Powers the gaze document clean CLI verb on top of the same gaze-pii runtime that handles streaming and structured inputs.

The crate inherits the project's north star: zero PII leaks from data owner to agent, deterministic detection, and a manifest contract that always restores. OCR is a subprocess call to the standard tesseract binary so adopters never need a native build toolchain.

Install

Library

[dependencies]
gaze-document = "0.7.2"

CLI

cargo install gaze-cli --features document

The document feature is opt-in on gaze-cli so the default install stays free of OCR / PDF dependencies.

Runtime requirements

Tesseract

gaze-document shells out to the tesseract CLI (Tesseract 4 or 5).

Platform	Install
macOS	`brew install tesseract`
Debian/Ubuntu	`sudo apt-get install tesseract-ocr`
Fedora	`sudo dnf install tesseract`
Arch	`sudo pacman -S tesseract`
Windows	`winget install --id UB-Mannheim.TesseractOCR`

If the binary is missing, clean() returns DocumentError::TesseractNotFound with a per-OS install hint in the message — fail-loud by design (Axis 1 reliability).
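
Callers that prefer to check before invoking clean() can probe for the binary themselves. The following is a minimal sketch using only the standard library; `binary_available` is a hypothetical helper, not part of gaze-document, and it only assumes that `tesseract --version` exits successfully on a working install.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Hypothetical preflight helper (not part of gaze-document).
/// Returns Ok(true) if `<binary> --version` runs and exits successfully,
/// Ok(false) if the binary is not found or exits with a failure status,
/// and Err for any other I/O failure while spawning.
fn binary_available(binary: &str) -> io::Result<bool> {
    match Command::new(binary)
        .arg("--version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
    {
        Ok(status) => Ok(status.success()),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(err) => Err(err),
    }
}
```

Calling `binary_available("tesseract")` at startup lets a host print its own guidance early; clean() remains the authoritative check.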

pdfium (only for PDF input)

PDF rasterization uses pdfium-render, which loads the pdfium shared library at runtime. Prebuilt binaries for every major OS / arch are published by bblanchon/pdfium-binaries:

Platform	What to do
macOS (arm64)	Download `pdfium-mac-arm64.tgz`; place `lib/libpdfium.dylib` on `DYLD_LIBRARY_PATH` or in `/usr/local/lib`.
macOS (x64)	Download `pdfium-mac-x64.tgz`; same placement.
Linux (x64)	Download `pdfium-linux-x64.tgz`; place `lib/libpdfium.so` on `LD_LIBRARY_PATH` or in `/usr/local/lib`.
Windows	Download `pdfium-win-x64.zip`; place `pdfium.dll` on `PATH` or next to the binary.

Image-only workflows (PNG / JPG) do not need pdfium.
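
To confirm the library landed in one of the locations listed above, a host can look for the file before attempting PDF input. This is a sketch for the Linux row only, using the standard library; `find_in_search_path` and `find_pdfium_linux` are hypothetical helpers, and the lookup mirrors the placements in the table rather than the actual resolution logic of pdfium-render or the dynamic loader.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Hypothetical helper: searches each directory of a PATH-style list
/// for `file_name` and returns the first regular file found.
fn find_in_search_path(file_name: &str, search_path: &OsStr) -> Option<PathBuf> {
    env::split_paths(search_path)
        .map(|dir| dir.join(file_name))
        .find(|candidate| candidate.is_file())
}

/// Hypothetical helper: checks LD_LIBRARY_PATH first, then /usr/local/lib,
/// matching the two placements suggested for Linux in the table above.
fn find_pdfium_linux() -> Option<PathBuf> {
    let from_env = env::var_os("LD_LIBRARY_PATH")
        .and_then(|paths| find_in_search_path("libpdfium.so", &paths));
    from_env.or_else(|| {
        let fallback = Path::new("/usr/local/lib/libpdfium.so");
        fallback.is_file().then(|| fallback.to_path_buf())
    })
}
```

A `None` result only means the file is not in these two places; the loader may still find it elsewhere.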

Quickstart (library)

use std::path::Path;

let bundle = gaze_document::clean(
    Path::new("invoice.pdf"),
    Path::new("./safe-out"),
)?;

// Tokenized Markdown safe to hand to an LLM.
let _ = &bundle.clean_markdown;

// Restorable manifest — pair with a `gaze::Session` to round-trip.
let _ = &bundle.manifest;

// Provenance: OCR confidence + PII counts.
println!(
    "tokens={} confidence={:?}",
    bundle.report.pii_token_count,
    bundle.report.ocr_mean_confidence,
);
# Ok::<(), gaze_document::DocumentError>(())

Quickstart (CLI)

gaze document clean ./invoice.pdf --out ./safe/

Writes:

safe/
  clean.md        # OCR text with PII replaced by reversible tokens
  manifest.json   # gaze::Manifest — restorable, canonical
  report.json     # BundleReport — OCR + PII counts + provenance

Stdout carries a one-line JSON summary so callers can pipe it.
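
A wrapper script or service may want to verify that all three files exist before handing the bundle on. The sketch below uses only the standard library; `missing_bundle_files` is a hypothetical helper, and the file names are taken from the listing above.

```rust
use std::path::{Path, PathBuf};

/// File names from the bundle layout described in this README.
const BUNDLE_FILES: [&str; 3] = ["clean.md", "manifest.json", "report.json"];

/// Hypothetical helper: returns the expected bundle paths that are
/// not present as regular files under `out_dir`.
fn missing_bundle_files(out_dir: &Path) -> Vec<PathBuf> {
    BUNDLE_FILES
        .iter()
        .map(|name| out_dir.join(name))
        .filter(|path| !path.is_file())
        .collect()
}
```

An empty result means the output directory has the expected shape; it says nothing about the contents of the files.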

Bundle on-disk shapes

clean.md — Markdown with a short header (# gaze-document safe bundle) plus the OCR text after token substitution.
manifest.json — serialized gaze::Manifest (re-exported from gaze-types). Compatible with gaze restore and the rest of the gaze runtime.
report.json — BundleReport. Schema versioned via bundle_version: u32 = 1; field set is #[non_exhaustive] so additive fields are SemVer-safe. Includes OCR confidence, per-class PII counts, PDF metadata, and the source kind.

OCR brittleness + normalization

OCR is a lossy stage. Tesseract, like every engine, sometimes inserts spurious whitespace between adjacent glyphs that share kerning. The most common artifact in practice (and the most dangerous for Axis 1 reliability) is a single space inserted next to the @ of an email:

jane.doe@example.com   →   "jane.doe @example.com"

The corrupted form is still unmistakably an email to a human or LLM but slips past strict \S+@\S+ recognizers. To keep the bundle safe to hand to a model, gaze-document applies a narrow normalization pass between the OCR adapter and the redact pipeline.

Normalization rules

The full rule set is documented in source at crates/gaze-document/src/ocr/normalize.rs. Today there is exactly one rule:

Email separator repair. Collapse intra-line horizontal whitespace immediately adjacent to @ when both sides are non-whitespace. Pattern: (\S)[ \t]*@[ \t]*(\S) → $1@$2. Newline-adjacent @ remains untouched.
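
To make the rule concrete, here is a hand-rolled sketch of the same idea without a regex dependency. It is an illustration, not the code in ocr::normalize: `repair_email_separators` is a hypothetical name, and in overlapping edge cases (for example a single character sitting between two spaced @ signs) it may behave differently from a single-pass regex replacement.

```rust
/// Hypothetical illustration of the email separator repair rule.
/// Removes spaces and tabs directly around '@' when a non-whitespace
/// character sits on both sides; anything touching a newline is left alone.
fn repair_email_separators(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let is_hspace = |c: char| c == ' ' || c == '\t';
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        // Skip a (possibly empty) run of horizontal whitespace starting at i.
        let mut j = i;
        while j < chars.len() && is_hspace(chars[j]) {
            j += 1;
        }
        if j < chars.len() && chars[j] == '@' {
            // Skip the horizontal whitespace that follows the '@'.
            let mut k = j + 1;
            while k < chars.len() && is_hspace(chars[k]) {
                k += 1;
            }
            let left_ok = out.chars().next_back().map_or(false, |c| !c.is_whitespace());
            let right_ok = k < chars.len() && !chars[k].is_whitespace();
            if left_ok && right_ok {
                // Both neighbours are non-whitespace: emit a bare '@'.
                out.push('@');
                i = k;
                continue;
            }
        }
        out.push(chars[i]);
        i += 1;
    }
    out
}
```

With this sketch, `repair_email_separators("jane.doe @example.com")` returns `"jane.doe@example.com"`, while `"contact\n@handle"` comes back unchanged because the @ is newline-adjacent.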

Additional rules will land here as additional artifact classes are discovered. Every rule lives next to the others in ocr::normalize, doc-commented with its trigger, scope, and a worked example.

Brittleness limit

gaze-document assumes mostly-clean OCR: text where most glyphs are recognized, line breaks are preserved, and only the documented narrow artifacts (currently: whitespace around @) intrude on PII shapes. Bundles produced from low-DPI rasterization, heavy noise, or non-Latin scripts without the right --lang setting may still leak. Two mitigations, one in the test suite and one in the report, make future drift fail loudly:

The tests/e2e.rs fixtures assert with belt-and-braces negative substring checks (!contains("@example.com"), !contains("Jane Doe"), !contains("555-0142")) in addition to the positive :Email_, :Name_, :Custom:phone_ token assertions.
BundleReport.ocr_mean_confidence is always surfaced to adopters unmodified — no silent floor — so downstream gates can route low-confidence bundles for human review.
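
A downstream gate on the reported confidence could look like the sketch below. Everything here is an assumption for illustration: the `Route` enum and `route_by_confidence` are hypothetical, the confidence is modelled as `Option<f32>` (absent for sources that skip OCR), and the threshold must use whatever scale the report actually emits.

```rust
/// Hypothetical routing decision for a cleaned bundle.
#[derive(Debug, PartialEq)]
enum Route {
    Accept,
    HumanReview,
}

/// Hypothetical gate: sends a bundle to human review when the OCR mean
/// confidence is below `min_confidence` or is NaN. A missing confidence
/// (no OCR stage ran) is accepted.
fn route_by_confidence(ocr_mean_confidence: Option<f32>, min_confidence: f32) -> Route {
    match ocr_mean_confidence {
        Some(c) if c.is_nan() || c < min_confidence => Route::HumanReview,
        _ => Route::Accept,
    }
}
```

Treating a missing value as acceptable is a design choice; a stricter host could route `None` to review for file inputs where OCR was expected.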

If you observe a new artifact class slipping through, file an issue with the OCR output and the expected normalization shape; the fix belongs in ocr::normalize alongside the existing rules.

MCP feature

Enable mcp to register two agent-tier tools with gaze-mcp-core: gaze_read_text for already-extracted text and gaze_read_file for PNG, JPG, or PDF paths. Hosts still call them through PiiEnvelope::dispatch, so args, responses, manifest rows, and auth stay on the MCP chokepoint.

use std::sync::Arc;

use gaze_document::mcp::{self, GazeReadOpts};
use gaze_mcp_core::ToolRegistry;
use gaze_mcp_rmcp::{FixedPrincipalResolver, RmcpFrontend};

let mut registry = ToolRegistry::new();
mcp::register_tools(&mut registry, GazeReadOpts::default())?;

let frontend = RmcpFrontend::stdio(Arc::new(
    FixedPrincipalResolver::agent("local-stdio"),
));
# Ok::<(), gaze_mcp_core::ToolRegistryError>(())

Both tools return a JSON object:

{
  "clean_markdown": "# gaze-document safe text\n\n...",
  "manifest_id": "01ARZ3NDEKTSV4RRFFQ69G5FAV",
  "file_metadata": {
    "source_kind": "text",
    "ocr_mean_confidence": null,
    "bundle_version": 1,
    "page_count": null
  }
}

gaze_read_file defaults to a 25 MiB input cap. Override it with GazeReadFile::with_max_file_size(bytes) or GazeReadOpts. gaze mcp install is planned as a separate gaze-cli flow; this crate only provides the opt-in tool implementations.
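
A host can reject oversized inputs before they reach the tool. The sketch below uses only the standard library; `within_size_cap` and `DEFAULT_MAX_FILE_SIZE` are hypothetical names, and treating the cap as inclusive is an assumption, since the exact boundary behaviour of gaze_read_file is not stated here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// 25 MiB expressed in bytes, matching the default cap described above.
const DEFAULT_MAX_FILE_SIZE: u64 = 25 * 1024 * 1024;

/// Hypothetical preflight check: true when the file at `path` is no larger
/// than `max_bytes` (inclusive bound is an assumption of this sketch).
fn within_size_cap(path: &Path, max_bytes: u64) -> io::Result<bool> {
    Ok(fs::metadata(path)?.len() <= max_bytes)
}
```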

Feature flags

Feature	Default	What it enables
`ocr-tesseract`	yes	Tesseract subprocess OCR backend + `clean()` entry.
`pdf-input`	yes	`pdfium-render` PDF rasterization (single page).
`mcp`	no	`gaze_read_file` + `gaze_read_text` Tool impls.
`extract-docling`	no	Reserved — future Docling layout adapter.
`render-image`	no	Reserved — future redacted-preview renderer.

The extract-docling and render-image features are intentionally empty in v0.0.x so adopters can pin against the eventual flag names early.

License

Apache-2.0

gaze-document 0.8.0