Document ingestion + safe-bundle generation for the Gaze runtime.
gaze-document turns a single image (PNG / JPG) or single-page PDF into a
SafeBundle: tokenized Markdown, a restorable gaze::Manifest, and a
structured OCR + PII BundleReport. PII detection flows through the
standard gaze::Pipeline so the manifest stays canonical and reversible
(Axis 2 reversibility).
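To make "tokenized" and "restorable" concrete, here is a toy round trip in plain Rust. It is only an illustration of the idea: ToyManifest, tokenize, restore, and the <PII_n> token format are hypothetical names invented for this sketch and say nothing about how gaze::Manifest or gaze::Pipeline are actually implemented.

```rust
use std::collections::HashMap;

/// Toy stand-in for a manifest: maps each placeholder token back to the
/// text it replaced. Hypothetical; not the layout of gaze::Manifest.
struct ToyManifest {
    entries: HashMap<String, String>,
}

/// Replace every occurrence of each sensitive value with a numbered token
/// and record the mapping so the substitution can be undone later.
fn tokenize(text: &str, sensitive: &[&str]) -> (String, ToyManifest) {
    let mut out = text.to_string();
    let mut entries = HashMap::new();
    for (i, value) in sensitive.iter().enumerate() {
        // Skip empty values: replacing "" would insert a token between every character.
        if value.is_empty() || !out.contains(*value) {
            continue;
        }
        let token = format!("<PII_{}>", i);
        out = out.replace(*value, &token);
        entries.insert(token, value.to_string());
    }
    (out, ToyManifest { entries })
}

/// Undo `tokenize` by substituting every recorded token with its original text.
fn restore(tokenized: &str, manifest: &ToyManifest) -> String {
    let mut out = tokenized.to_string();
    for (token, original) in &manifest.entries {
        out = out.replace(token.as_str(), original);
    }
    out
}
```

Calling tokenize on "Bill to Jane Roe" with the sensitive value "Jane Roe" yields text that no longer contains the name, and restore with the returned manifest gives back the original string. The toy assumes the input does not already contain text that looks like one of its tokens.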
§Quickstart
use std::path::Path;
let bundle = gaze_document::clean(
    Path::new("invoice.pdf"),
    Path::new("./safe-out"),
)?;
assert!(!bundle.clean_markdown.is_empty());
§Runtime requirements
- tesseract binary on PATH (Tesseract 4.x or 5.x).
- For PDF input: a pdfium dynamic library available to the process. See the crate README for per-OS install instructions.
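A caller may want to fail early when the tesseract binary is missing. The following is a minimal sketch using only the standard library; find_on_path is a hypothetical helper and not part of gaze-document. It checks that a file with the given name exists in some PATH directory. It does not check the execute permission or the Tesseract version.

```rust
use std::env;
use std::path::PathBuf;

/// Return the first PATH entry that contains a file named `name`
/// (with the platform's executable suffix appended, e.g. ".exe" on Windows).
/// Hypothetical helper for illustration; not part of gaze-document.
fn find_on_path(name: &str) -> Option<PathBuf> {
    // If PATH is unset there is nothing to search.
    let path_var = env::var_os("PATH")?;
    let file_name = format!("{}{}", name, env::consts::EXE_SUFFIX);
    env::split_paths(&path_var)
        .map(|dir| dir.join(&file_name))
        .find(|candidate| candidate.is_file())
}
```

For example, find_on_path("tesseract") returns Some(path) when the binary is found and None otherwise, so the caller can print an install hint before attempting OCR.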
§Feature flags
| Flag | Default | What it enables |
|---|---|---|
| ocr-tesseract | yes | Tesseract subprocess OCR backend. |
| pdf-input | yes | Single-page PDF rasterization via pdfium-render. |
| serde | yes | Serialize / Deserialize for BundleReport. |
| extract-docling | no | Reserved: future Docling layout adapter (no implementation yet). |
| render-image | no | Reserved: future redacted-preview renderer (no implementation yet). |
Re-exports§
- pub use bundle::clean; (feature: ocr-tesseract)
- pub use bundle::BundleReport;
- pub use bundle::ClassCount;
- pub use bundle::LayoutSummary;
- pub use bundle::SafeBundle;
- pub use bundle::BUNDLE_VERSION;
- pub use layout::ReadingOrder;
- pub use ocr::OcrAdapter;
- pub use render::Renderer;
Modules§
- bundle
- SafeBundle generation: OCR + Gaze redact → on-disk artifacts.
- extract
- Input extraction backends.
- layout
- Layout / reading-order contract surface.
- mcp (feature: mcp)
- MCP tool adapters for gaze-document.
- ocr
- OCR adapter contract surface and concrete backends.
- render
- Renderer contract surface.
Enums§
- DocumentError
- Crate-level error type for gaze-document.