SafeBundle generation: OCR + Gaze redact → on-disk artifacts.
The top-level clean function is the public adopter entry point. It
routes any supported input (PNG / JPG / single-page PDF) through OCR,
pipes the extracted text through a gaze::Pipeline, and persists the
result as three files split across agent and owner target directories:
agent_out/
  clean.md       # OCR text with PII replaced by reversible tokens
  report.json    # BundleReport: OCR + PII counts + provenance
owner_out/
  manifest.json  # gaze::Manifest: restorable, canonical

The manifest contract is the same one the rest of the gaze runtime
uses (gaze::Manifest). Because it carries restore material, it is written
only to the owner output directory.
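The agent/owner split described above can be sketched with the standard library alone. This is an illustrative sketch, not the crate's implementation: `tokenize` and `write_bundle` are hypothetical helpers, the `[PII_n]` token format and both JSON shapes are placeholders, and the constant values are assumed from the filenames in the layout.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

// Assumed values, taken from the filenames in the layout above.
const CLEAN_MARKDOWN_FILE: &str = "clean.md";
const REPORT_FILE: &str = "report.json";
const MANIFEST_FILE: &str = "manifest.json";

/// Hypothetical stand-in for the redaction step: replaces each known PII
/// value with a token and records token -> original for later restore.
fn tokenize(text: &str, pii: &[&str]) -> (String, BTreeMap<String, String>) {
    let mut clean = text.to_string();
    let mut restore = BTreeMap::new();
    for (i, value) in pii.iter().enumerate() {
        let token = format!("[PII_{}]", i + 1);
        if clean.contains(value) {
            clean = clean.replace(value, &token);
            restore.insert(token, value.to_string());
        }
    }
    (clean, restore)
}

/// Hypothetical writer: agent files carry tokens and counts only, while the
/// restore material goes to the owner directory and nowhere else.
fn write_bundle(text: &str, pii: &[&str], agent_out: &Path, owner_out: &Path) -> io::Result<()> {
    let (clean, restore) = tokenize(text, pii);
    fs::create_dir_all(agent_out)?;
    fs::create_dir_all(owner_out)?;

    fs::write(agent_out.join(CLEAN_MARKDOWN_FILE), &clean)?;

    // Placeholder report shape: a count, never an original value.
    let report = format!("{{\"pii_count\": {}}}", restore.len());
    fs::write(agent_out.join(REPORT_FILE), report)?;

    // Placeholder manifest shape (no JSON escaping; illustration only).
    let entries: Vec<String> = restore
        .iter()
        .map(|(token, value)| format!("\"{}\": \"{}\"", token, value))
        .collect();
    let manifest = format!("{{{}}}", entries.join(", "));
    fs::write(owner_out.join(MANIFEST_FILE), manifest)?;
    Ok(())
}
```

In this sketch the agent directory can be handed to a downstream consumer without exposing any original value, because only the owner directory holds the mapping needed to reverse the tokens.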
Structs
- AgentBundleDir - Agent-visible SafeBundle output directory.
- BundleReport - Bundle audit + provenance report serialized to report.json.
- ClassCount - Per-class PII detection count for BundleReport.
- LayoutSummary - Opaque layout summary placeholder.
- OwnerBundleDir - Owner-only SafeBundle output directory.
- PageReport - Per-page OCR/layout provenance.
- Pipeline - Configurable document-cleaning pipeline.
- SafeBundle - Post-ingestion artifact paired with a Gaze Manifest.
Enums
- OcrSource - Per-page extraction source.
Constants
- BUNDLE_VERSION - Versioned report.json schema tag (bump on breaking shape changes).
- CLEAN_MARKDOWN_FILE - Bundle filename written into the agent output directory for tokenized Markdown.
- MANIFEST_FILE - Bundle filename written into the owner output directory for the restorable manifest.
- REPORT_FILE - Bundle filename written into the agent output directory for the OCR + PII provenance report.
Functions
- clean (feature: ocr-tesseract) - Top-level entry point: ingest one document, write a SafeBundle to disk.
- clean_with_ocr_backend (feature: ocr-tesseract) - Top-level entry point with an adopter-supplied OCR backend.
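An adopter-supplied backend is typically passed in as a trait object or generic parameter. The sketch below shows that pattern in isolation. The trait `OcrBackend`, its `recognize` method, `FixedTextBackend`, and `extract_text_with_backend` are all hypothetical names chosen for illustration; the crate's actual trait and the signature of clean_with_ocr_backend are not shown on this page and may differ.

```rust
/// Hypothetical backend trait; the crate's real trait name and methods may differ.
trait OcrBackend {
    /// Turns raw image bytes into extracted text, or an error message.
    fn recognize(&self, image_bytes: &[u8]) -> Result<String, String>;
}

/// Stub backend that ignores its input and returns fixed text.
/// Useful in tests where running a real OCR engine is undesirable.
struct FixedTextBackend(&'static str);

impl OcrBackend for FixedTextBackend {
    fn recognize(&self, _image_bytes: &[u8]) -> Result<String, String> {
        Ok(self.0.to_string())
    }
}

/// Hypothetical analogue of an entry point that accepts the backend as a
/// parameter instead of constructing a default one internally.
fn extract_text_with_backend(
    backend: &dyn OcrBackend,
    image_bytes: &[u8],
) -> Result<String, String> {
    let text = backend.recognize(image_bytes)?;
    // Trim surrounding whitespace before the text moves on to redaction.
    Ok(text.trim().to_string())
}
```

Taking the backend as a parameter keeps the rest of the pipeline independent of any particular OCR engine, so a stub like `FixedTextBackend` can stand in for it during testing.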