gaze-document 0.10.1

Reversible PII pseudonymization for documents — Tesseract OCR + Gaze redact → SafeBundle (clean Markdown + manifest + report).
gaze-document

Reversible PII pseudonymization for documents — image + single-page PDF → clean Markdown + a restorable gaze::Manifest + an OCR/PII report. Powers the gaze document clean CLI verb on top of the same gaze-pii runtime that handles streaming and structured inputs.

The crate inherits the project's north star: zero PII leaks from agent to data owner, deterministic detection, and a manifest contract that always restores. OCR is a subprocess call to the standard tesseract binary so adopters never need a native build toolchain.

Install

Library

[dependencies]
gaze-document = "0.10.1"

CLI

cargo install gaze-cli --version 0.10.1 --features document

The document feature is opt-in on gaze-cli so the default install stays free of OCR / PDF dependencies.

Runtime requirements

Tesseract

gaze-document shells out to the tesseract CLI (Tesseract 4 or 5).

| Platform | Install |
| --- | --- |
| macOS | brew install tesseract |
| Debian/Ubuntu | sudo apt-get install tesseract-ocr |
| Fedora | sudo dnf install tesseract |
| Arch | sudo pacman -S tesseract |
| Windows | winget install --id UB-Mannheim.TesseractOCR |

If the binary is missing, clean() returns DocumentError::TesseractNotFound with a per-OS install hint in the message — fail-loud by design (Axis 1 reliability).
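
Adopters who want to detect a missing binary before calling clean() can probe for it themselves. The following is a minimal sketch using only the standard library. binary_available is a hypothetical helper, not part of gaze-document, and it assumes that running the binary with --version exits successfully when the tool is installed, which holds for the tesseract CLI.

```rust
use std::process::{Command, Stdio};

/// Hypothetical preflight helper (not part of gaze-document): returns true
/// when `binary --version` can be spawned and exits successfully.
fn binary_available(binary: &str) -> bool {
    Command::new(binary)
        .arg("--version")
        .stdout(Stdio::null()) // discard the version banner
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false) // a spawn failure (e.g. not on PATH) counts as missing
}
```

Calling binary_available("tesseract") at startup lets a service refuse to start early. clean() still reports DocumentError::TesseractNotFound on its own, so the probe is optional.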

pdfium (only for PDF input)

PDF rasterization uses pdfium-render, which loads the pdfium shared library at runtime. Prebuilt binaries for every major OS / arch are published by bblanchon/pdfium-binaries:

| Platform | What to do |
| --- | --- |
| macOS (arm64) | Download pdfium-mac-arm64.tgz; place lib/libpdfium.dylib on DYLD_LIBRARY_PATH or in /usr/local/lib. |
| macOS (x64) | Download pdfium-mac-x64.tgz; same placement. |
| Linux (x64) | Download pdfium-linux-x64.tgz; place lib/libpdfium.so on LD_LIBRARY_PATH or in /usr/local/lib. |
| Windows | Download pdfium-win-x64.zip; place pdfium.dll on PATH or next to the binary. |

Image-only workflows (PNG / JPG) do not need pdfium.
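
To confirm that the shared library is where the loader will look, a deployment script can search the candidate directories for the platform-specific file name. This is a sketch under stated assumptions: pdfium_file_name and find_pdfium are hypothetical helpers, not part of gaze-document or pdfium-render, and the file names are the ones listed above.

```rust
use std::path::PathBuf;

/// Shared-library file name per platform, as listed above.
/// `os` takes the values of `std::env::consts::OS`.
fn pdfium_file_name(os: &str) -> Option<&'static str> {
    match os {
        "macos" => Some("libpdfium.dylib"),
        "linux" => Some("libpdfium.so"),
        "windows" => Some("pdfium.dll"),
        _ => None, // platform not covered by the table
    }
}

/// Hypothetical helper: first candidate directory that contains the library.
fn find_pdfium(dirs: &[PathBuf], os: &str) -> Option<PathBuf> {
    let name = pdfium_file_name(os)?;
    dirs.iter()
        .map(|dir| dir.join(name))
        .find(|path| path.is_file())
}
```

The candidate directories could come from std::env::split_paths over LD_LIBRARY_PATH, DYLD_LIBRARY_PATH, or PATH, plus /usr/local/lib where applicable.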

Quickstart (library)

use std::path::Path;

fn main() -> Result<(), gaze_document::DocumentError> {
    // Clean the document; agent-safe and owner-only outputs go to separate directories.
    let bundle = gaze_document::clean(
        Path::new("invoice.pdf"),
        gaze_document::AgentBundleDir::new("./agent-bundle")?,
        gaze_document::OwnerBundleDir::new("./owner-vault")?,
    )?;

    // Tokenized Markdown safe to hand to an LLM.
    let _ = &bundle.clean_markdown;

    // Restorable manifest: pair with a `gaze::Session` to round-trip.
    let _ = &bundle.manifest;

    // Provenance: per-page extraction confidence + PII counts.
    println!(
        "tokens={} first_page_confidence={:?}",
        bundle.report.pii_token_count,
        bundle.report.pages.first().and_then(|page| page.confidence),
    );
    Ok(())
}

Quickstart (CLI)

# Convenience shorthand: --out creates agent/ + owner/ subdirs
gaze document clean ./invoice.pdf --out ./safe/

# Explicit: caller controls both paths
gaze document clean ./invoice.pdf --agent-out ./agent-bundle/ --owner-out ./owner-vault/

Writes:

agent/
  clean.md        # OCR text with PII replaced by reversible tokens
  report.json     # BundleReport — OCR + PII counts + provenance
owner/
  manifest.json   # gaze::Manifest — restorable, canonical

Stdout carries a one-line JSON summary so callers can pipe it.

manifest.json carries restorable PII mapping material. It belongs in an owner-only path; uploading it alongside clean.md to an LLM workspace defeats pseudonymization. The split layout makes that axis-1 boundary a runtime contract instead of caller discipline.
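
Callers that choose their own --agent-out and --owner-out paths may want a cheap check that the owner directory is not nested inside the agent directory. The following is a minimal sketch, not the crate's own enforcement. Both helpers are hypothetical, and the comparison is lexical, so symlinks and relative-path tricks are not resolved.

```rust
use std::path::{Path, PathBuf};

/// Mirrors the `--out` shorthand: derive `agent/` and `owner/` under one root.
fn split_out_dir(out: &Path) -> (PathBuf, PathBuf) {
    (out.join("agent"), out.join("owner"))
}

/// Hypothetical guard: the owner directory must not sit inside the agent
/// directory, or uploading the agent bundle would carry the manifest with it.
/// Compares whole path components; it does not touch the filesystem.
fn owner_is_outside_agent(agent: &Path, owner: &Path) -> bool {
    !owner.starts_with(agent)
}
```

Because Path::starts_with compares whole components, a sibling such as bundle-vault is not mistaken for a child of bundle.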

Bundle on-disk shapes

  • agent/clean.md — Markdown with a short header (# gaze-document safe bundle) plus the OCR text after token substitution.
  • owner/manifest.json — serialized gaze::Manifest (re-exported from gaze-types). Compatible with gaze restore and the rest of the gaze runtime.
  • agent/report.json: serialized BundleReport. The schema is versioned via bundle_version: u32 = 2, and the struct is #[non_exhaustive], so additive fields are SemVer-safe. It includes per-page extraction source (vector_pdf or ocr), OCR backend, normalized confidence, low-confidence flag, column count, per-class PII counts, PDF metadata, and the source kind. Existing v1 reports still deserialize; new emission is always v2. The full field-by-field catalog, with stability per field, is in docs/metrics.md.

OCR brittleness + normalization

OCR is a lossy stage. Tesseract — like every engine — sometimes inserts spurious whitespace between adjacent glyphs that share kerning. The most common artifact in practice (and the most dangerous for axis-1 reliability) is a single space inserted next to the @ of an email:

jane.doe@example.invalid   →   "jane.doe @example.invalid"

The corrupted form is still unmistakably an email to a human or LLM but slips past strict \S+@\S+ recognizers. To keep the bundle safe to hand to a model, gaze-document applies a narrow normalization pass between the OCR adapter and the redact pipeline.
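
The failure mode can be reproduced without a regex engine. The sketch below is a token-level stand-in for an unanchored \S+@\S+ search, written for illustration only. strict_email_shape is a hypothetical name and is not the recognizer that Gaze uses.

```rust
/// Token-level equivalent of an unanchored `\S+@\S+` search: some
/// whitespace-delimited token contains an `@` with at least one
/// character on each side of it.
fn strict_email_shape(text: &str) -> bool {
    text.split_whitespace().any(|token| {
        token
            .char_indices()
            .any(|(i, c)| c == '@' && i > 0 && i + c.len_utf8() < token.len())
    })
}
```

With this check, the clean form jane.doe@example.invalid is recognized, while the OCR-corrupted form jane.doe @example.invalid is not, because the space splits the address into two tokens and neither has text on both sides of the @.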

Normalization rules

The full rule set is documented in source at crates/gaze-document/src/ocr/normalize.rs. Today there is exactly one rule:

  • Email separator repair. Collapse intra-line horizontal whitespace immediately adjacent to @ when both sides are non-whitespace. Pattern: (\S)[ \t]*@[ \t]*(\S) → $1@$2. Newline-adjacent @ remains untouched.

Additional rules will land here as additional artifact classes are discovered. Every rule lives next to the others in ocr::normalize, doc-commented with its trigger, scope, and a worked example.
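
The email separator rule can be sketched with the standard library alone. This is an illustrative approximation, not the code in ocr::normalize: repair_email_separators and is_hspace are hypothetical names, and edge cases such as consecutive @ characters may differ from what the regex pattern produces.

```rust
/// Horizontal whitespace as used by the rule: space or tab only.
fn is_hspace(c: char) -> bool {
    c == ' ' || c == '\t'
}

/// Removes spaces and tabs next to `@` when a non-whitespace character
/// sits on each side; an `@` next to a newline is left as-is.
fn repair_email_separators(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '@' {
            // Right side: skip horizontal whitespace, then require non-whitespace.
            let mut j = i + 1;
            while j < chars.len() && is_hspace(chars[j]) {
                j += 1;
            }
            let right_ok = j < chars.len() && !chars[j].is_whitespace();
            // Left side: ignore trailing horizontal whitespace already emitted,
            // then require that the character before it is non-whitespace.
            let kept = out.trim_end_matches(is_hspace).len();
            let left_ok = out[..kept]
                .chars()
                .next_back()
                .map_or(false, |c| !c.is_whitespace());
            if left_ok && right_ok {
                out.truncate(kept); // drop whitespace before '@'
                out.push('@');
                i = j; // drop whitespace after '@'
                continue;
            }
        }
        out.push(chars[i]);
        i += 1;
    }
    out
}
```

Applied to "jane.doe @example.invalid", the function returns "jane.doe@example.invalid". Text with an @ at the end or start of a line is returned unchanged.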

Brittleness limit

gaze-document assumes mostly-clean OCR: text where most glyphs are recognized, line breaks are preserved, and only the documented narrow artifacts (currently whitespace around @) intrude on PII shapes. Bundles produced from low-DPI rasterization, heavy noise, or non-Latin scripts without the right --lang setting may still leak PII into clean.md. Two mitigations, one in the test suite and one in the report, make such drift fail loudly instead of passing silently:

  • The tests/e2e.rs fixtures assert with belt-and-braces negative substring checks (!contains("@example.invalid"), !contains("Jane Doe"), !contains("555-0142")) in addition to the positive :Email_, :Name_, :Custom:phone_ token assertions.
  • BundleReport.pages[].confidence and pages[].low_confidence are always surfaced to adopters. The default threshold is 0.65, configurable with gaze_document::Pipeline::with_low_confidence_threshold(), so downstream gates can route low-confidence pages for human review.
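
A downstream gate over the per-page fields could look like the sketch below. PageReport is a local stand-in defined for this example, not the crate's type. The field names follow the text above, while the field types, the handling of pages without a confidence value, and the helper name pages_needing_review are assumptions.

```rust
/// Stand-in for one entry of BundleReport.pages (types are assumed).
struct PageReport {
    confidence: Option<f32>,
    low_confidence: bool,
}

/// Default threshold quoted in the README.
const DEFAULT_LOW_CONFIDENCE_THRESHOLD: f32 = 0.65;

/// Returns the zero-based indices of pages to route to human review:
/// pages already flagged, or pages whose confidence is below `threshold`.
/// Assumption: pages without a confidence value are not flagged here.
fn pages_needing_review(pages: &[PageReport], threshold: f32) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, page)| {
            page.low_confidence || page.confidence.map_or(false, |c| c < threshold)
        })
        .map(|(index, _)| index)
        .collect()
}
```

A stricter pipeline might instead treat a missing confidence value as a reason for review; which default is right depends on how much vector-PDF text the inputs contain.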

If you observe a new artifact class slipping through, file an issue with the OCR output and the expected normalization shape; the fix belongs in ocr::normalize alongside the existing rules.

MCP feature

Enable mcp to register two agent-tier tools with gaze-mcp-core: gaze_read_text for already-extracted text and gaze_read_file for PNG, JPG, or PDF paths. Hosts still call them through PiiEnvelope::dispatch, so args, responses, manifest rows, and auth stay on the MCP chokepoint.

use std::sync::Arc;

use gaze_document::mcp::{self, GazeReadOpts};
use gaze_mcp_core::ToolRegistry;
use gaze_mcp_rmcp::{FixedPrincipalResolver, RmcpFrontend};

fn main() -> Result<(), gaze_mcp_core::ToolRegistryError> {
    // Register gaze_read_text and gaze_read_file with default options.
    let mut registry = ToolRegistry::new();
    mcp::register_tools(&mut registry, GazeReadOpts::default())?;

    // Stdio frontend with a fixed agent-tier principal.
    let frontend = RmcpFrontend::stdio(Arc::new(
        FixedPrincipalResolver::agent("local-stdio"),
    ));
    Ok(())
}

Both tools return a JSON object:

{
  "clean_markdown": "# gaze-document safe text\n\n...",
  "manifest_id": "01ARZ3NDEKTSV4RRFFQ69G5FAV",
  "file_metadata": {
    "source_kind": "text",
    "ocr_mean_confidence": null,
    "bundle_version": 2,
    "page_count": null
  }
}

gaze_read_file defaults to a 25 MiB input cap. Override it with GazeReadFile::with_max_file_size(bytes) or GazeReadOpts. gaze-cli provides gaze mcp install, gaze mcp doctor, and gaze mcp serve; this crate only provides the opt-in tool implementations.
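
Hosts that want to reject oversized inputs before dispatching a tool call can apply the same limit themselves. The sketch below is illustrative only: the constant and both helpers are hypothetical names, and treating the cap as inclusive is an assumption rather than documented behavior.

```rust
use std::{fs, io, path::Path};

/// 25 MiB, matching the documented default cap.
const DEFAULT_MAX_FILE_SIZE: u64 = 25 * 1024 * 1024;

/// Assumption: a file exactly at the cap is still accepted.
fn within_cap(len: u64, cap: u64) -> bool {
    len <= cap
}

/// Hypothetical pre-check: reads the length from file metadata,
/// without opening or reading the file contents.
fn file_within_cap(path: &Path, cap: u64) -> io::Result<bool> {
    Ok(within_cap(fs::metadata(path)?.len(), cap))
}
```

A missing or unreadable path surfaces as an io::Error, so the caller can distinguish "too large" from "not found".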

Feature flags

| Feature | Default | What it enables |
| --- | --- | --- |
| ocr-tesseract | yes | Tesseract subprocess OCR backend + clean() entry. |
| pdf-input | yes | pdfium-render PDF text extraction + raster OCR fallback. |
| mcp | no | gaze_read_file + gaze_read_text Tool impls. |
| extract-docling | no | Reserved: future Docling layout adapter. |
| render-image | no | Reserved: future redacted-preview renderer. |

The extract-docling and render-image features are intentionally empty in v0.10.0 so adopters can pin against the eventual flag names early.

License

Dual-licensed under either of Apache-2.0 or MIT, at your option.