SafeBundle generation: OCR + Gaze redact → on-disk artifacts.
The top-level clean function is the public adopter entry point. It
routes any supported input (PNG / JPG / single-page PDF) through OCR,
pipes the extracted text through a gaze::Pipeline, and persists the
result as three files in a target directory:
out/
  clean.md       # OCR text with PII replaced by reversible tokens
  manifest.json  # gaze::Manifest: restorable, canonical
  report.json    # BundleReport: OCR + PII counts + provenance

The manifest contract is the same one the rest of the gaze runtime
uses (gaze::Manifest). Adopters can pair clean.md with manifest.json
and restore the original text via the standard gaze session APIs.
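The reversible-token idea can be illustrated with a minimal, standard-library-only sketch. It is not the gaze implementation: ToyManifest, redact, restore, and the <EMAIL_n> token format are all hypothetical names chosen for illustration, and the "contains '@'" check merely stands in for real PII detection.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a manifest: maps each token to the value it
/// replaced. The real gaze::Manifest format is not shown on this page.
#[derive(Debug, Default, PartialEq)]
pub struct ToyManifest {
    pub entries: BTreeMap<String, String>,
}

/// Replace every whitespace-separated word containing '@' with a numbered
/// token and record the mapping. Words are re-joined with single spaces, so
/// the round trip is exact only for single-space-separated input.
pub fn redact(text: &str) -> (String, ToyManifest) {
    let mut manifest = ToyManifest::default();
    let mut out: Vec<String> = Vec::new();
    for word in text.split_whitespace() {
        if word.contains('@') {
            // Reuse the existing token when the same value appears again.
            let existing = manifest
                .entries
                .iter()
                .find(|(_, v)| v.as_str() == word)
                .map(|(k, _)| k.clone());
            let token = existing.unwrap_or_else(|| {
                let t = format!("<EMAIL_{}>", manifest.entries.len() + 1);
                manifest.entries.insert(t.clone(), word.to_string());
                t
            });
            out.push(token);
        } else {
            out.push(word.to_string());
        }
    }
    (out.join(" "), manifest)
}

/// Substitute every token in `clean` with the original value from the manifest.
pub fn restore(clean: &str, manifest: &ToyManifest) -> String {
    let mut text = clean.to_string();
    for (token, original) in &manifest.entries {
        text = text.replace(token.as_str(), original);
    }
    text
}
```

Calling redact on a sentence and then restore on its output with the returned manifest yields the original sentence, which is the property that pairing clean.md with manifest.json relies on.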
Structs
- BundleReport - Bundle audit + provenance report serialized to report.json.
- ClassCount - Per-class PII detection count for BundleReport.
- LayoutSummary - Opaque layout summary placeholder.
- PageReport - Per-page OCR/layout provenance.
- Pipeline - Configurable document-cleaning pipeline.
- SafeBundle - Post-ingestion artifact paired with a Gaze Manifest.
Enums
- OcrSource - Per-page extraction source.
Constants
- BUNDLE_VERSION - Versioned report.json schema tag (bump on breaking shape changes).
- CLEAN_MARKDOWN_FILE - Bundle filename written into --out for tokenized Markdown.
- MANIFEST_FILE - Bundle filename written into --out for the restorable manifest.
- REPORT_FILE - Bundle filename written into --out for the OCR + PII provenance report.
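As a rough illustration of how the three filename constants relate to the on-disk layout, the sketch below writes a bundle directory using only the standard library. The constant values are assumed from the directory listing in the description, and write_bundle is a hypothetical helper, not a function exported by this module.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Assumed values, taken from the directory listing in the module description.
pub const CLEAN_MARKDOWN_FILE: &str = "clean.md";
pub const MANIFEST_FILE: &str = "manifest.json";
pub const REPORT_FILE: &str = "report.json";

/// Hypothetical helper: create `out_dir` if needed and write the three bundle
/// files into it, returning the paths in the order they were written.
pub fn write_bundle(
    out_dir: &Path,
    clean_md: &str,
    manifest_json: &str,
    report_json: &str,
) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(out_dir)?;
    let files = [
        (CLEAN_MARKDOWN_FILE, clean_md),
        (MANIFEST_FILE, manifest_json),
        (REPORT_FILE, report_json),
    ];
    let mut written = Vec::with_capacity(files.len());
    for (name, body) in files {
        let path = out_dir.join(name);
        fs::write(&path, body)?;
        written.push(path);
    }
    Ok(written)
}
```

Keeping the filenames in constants lets readers of a bundle (for example a restore tool) locate each artifact without hard-coding strings in several places.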
Functions
- clean (ocr-tesseract) - Top-level entry point: ingest one document, write a SafeBundle to disk.
- clean_with_ocr_backend (ocr-tesseract) - Top-level entry point with an adopter-supplied OCR backend.
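The signatures of clean and clean_with_ocr_backend are not shown on this page, so the following is only a sketch of the general pattern an adopter-supplied backend suggests: the caller passes in something that turns image bytes into text. OcrBackend, FixedTextBackend, and extract_text are hypothetical names, and the trait shape is an assumption rather than this crate's API.

```rust
/// Hypothetical backend abstraction: turns raw image bytes into text.
pub trait OcrBackend {
    fn recognize(&self, image_bytes: &[u8]) -> Result<String, String>;
}

/// Test double that ignores its input and returns canned text. Useful for
/// exercising downstream redaction without a real OCR engine installed.
pub struct FixedTextBackend(pub String);

impl OcrBackend for FixedTextBackend {
    fn recognize(&self, _image_bytes: &[u8]) -> Result<String, String> {
        Ok(self.0.clone())
    }
}

/// Run the supplied backend over the input and trim surrounding whitespace.
/// Empty input is rejected before the backend is called.
pub fn extract_text<B: OcrBackend>(backend: &B, image_bytes: &[u8]) -> Result<String, String> {
    if image_bytes.is_empty() {
        return Err("empty input".to_string());
    }
    let text = backend.recognize(image_bytes)?;
    Ok(text.trim().to_string())
}
```

With a shape like this, swapping the OCR engine only means providing another implementation of the trait; the rest of the cleaning flow stays unchanged.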