Document ingestion + safe-bundle generation for the Gaze runtime.
gaze-document turns a single image (PNG / JPG) or PDF into a
SafeBundle: tokenized Markdown, a restorable gaze::Manifest, and a
structured OCR + PII BundleReport. PII detection flows through the
standard gaze::Pipeline so the manifest stays canonical and reversible
(Axis 2 reversibility).
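To make the reversibility idea concrete, here is a minimal, self-contained sketch of tokenize-then-restore. It does not use the real gaze::Pipeline or gaze::Manifest, whose APIs are not shown on this page; `ToyManifest`, `tokenize`, `restore`, and the `<PII_n>` token format are hypothetical names chosen for illustration only.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a manifest: maps each token back to the
/// original value it replaced. Not the real `gaze::Manifest`.
#[derive(Debug, Default)]
struct ToyManifest {
    entries: BTreeMap<String, String>,
}

/// Replace each listed sensitive value with a numbered token and record
/// the mapping so the substitution can be undone later.
fn tokenize(text: &str, sensitive: &[&str]) -> (String, ToyManifest) {
    let mut manifest = ToyManifest::default();
    let mut out = text.to_string();
    for (i, &value) in sensitive.iter().enumerate() {
        // Skip values that are empty or do not occur in the text.
        if value.is_empty() || !out.contains(value) {
            continue;
        }
        let token = format!("<PII_{}>", i + 1);
        out = out.replace(value, &token);
        manifest.entries.insert(token, value.to_string());
    }
    (out, manifest)
}

/// Undo `tokenize` by substituting every recorded token with its original.
fn restore(tokenized: &str, manifest: &ToyManifest) -> String {
    let mut out = tokenized.to_string();
    for (token, original) in &manifest.entries {
        out = out.replace(token.as_str(), original.as_str());
    }
    out
}
```

Calling `tokenize("Contact Jane Doe at jane@example.com.", &["Jane Doe", "jane@example.com"])` yields `Contact <PII_1> at <PII_2>.` plus a manifest, and `restore` on that pair returns the original string. The sketch assumes the input never contains a literal `<PII_n>` token. A real pipeline would need to guard against such collisions.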
§Quickstart
use std::path::Path;
let bundle = gaze_document::clean(
Path::new("invoice.pdf"),
Path::new("./safe-out"),
)?;
assert!(!bundle.clean_markdown.is_empty());
§Runtime requirements
- tesseract binary on PATH (Tesseract 4.x or 5.x).
- For PDF input: a pdfium dynamic library available to the process.
See the crate README for per-OS install instructions.
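A caller may want to check for the Tesseract binary before invoking the crate. The sketch below is one possible preflight check using only the standard library. `responds_to_version` is a hypothetical helper, not part of gaze-document, and it only confirms that the program can be spawned and exits successfully. It does not verify the 4.x / 5.x version requirement.

```rust
use std::process::{Command, Stdio};

/// Returns true if `program --version` can be spawned from PATH and exits
/// with a success status. Output is discarded; only the exit status matters.
fn responds_to_version(program: &str) -> bool {
    Command::new(program)
        .arg("--version")
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        // Spawning fails (for example, binary not found) => treat as absent.
        .unwrap_or(false)
}
```

For example, `responds_to_version("tesseract")` can gate a friendlier error message before OCR is attempted.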
§Feature flags
| Flag | Default | What it enables |
|---|---|---|
| ocr-tesseract | yes | Tesseract subprocess OCR backend. |
| pdf-input | yes | PDF text extraction + raster OCR fallback via pdfium-render. |
| serde | yes | Serialize / Deserialize for BundleReport. |
| extract-docling | no | Reserved: future Docling layout adapter (no impl yet). |
| render-image | no | Reserved: future redacted-preview renderer (no impl yet). |
Re-exports§
pub use bundle::clean; // feature: ocr-tesseract
pub use bundle::clean_with_ocr_backend; // feature: ocr-tesseract
pub use bundle::BundleReport;
pub use bundle::ClassCount;
pub use bundle::LayoutSummary;
pub use bundle::OcrSource;
pub use bundle::PageReport;
pub use bundle::Pipeline;
pub use bundle::SafeBundle;
pub use bundle::BUNDLE_VERSION;
pub use layout::ReadingOrder;
pub use ocr::TesseractBackend; // feature: ocr-tesseract
pub use ocr::detect_image_format;
pub use ocr::BBox;
pub use ocr::ImageFormat;
pub use ocr::ImageInput;
pub use ocr::LanguageTag;
pub use ocr::OcrBackend;
pub use ocr::OcrError;
pub use ocr::OcrHints;
pub use ocr::OcrSpan;
pub use render::Renderer;
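The signature and behavior of ocr::detect_image_format are not shown on this page. As a rough illustration of how format detection for the supported inputs (PNG, JPG, PDF) is commonly done, the sketch below inspects leading magic bytes. `SniffedFormat` and `sniff_format` are hypothetical names, and this is not the crate's implementation.

```rust
/// Hypothetical result type for this sketch; not the crate's `ImageFormat`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SniffedFormat {
    Png,
    Jpeg,
    Pdf,
}

// Well-known file signatures.
const PNG_MAGIC: &[u8] = &[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
const JPEG_MAGIC: &[u8] = &[0xFF, 0xD8, 0xFF];
const PDF_MAGIC: &[u8] = b"%PDF-";

/// Guess the format from the first bytes of a file.
/// Returns None when no known signature matches.
fn sniff_format(bytes: &[u8]) -> Option<SniffedFormat> {
    if bytes.starts_with(PNG_MAGIC) {
        Some(SniffedFormat::Png)
    } else if bytes.starts_with(JPEG_MAGIC) {
        Some(SniffedFormat::Jpeg)
    } else if bytes.starts_with(PDF_MAGIC) {
        Some(SniffedFormat::Pdf)
    } else {
        None
    }
}
```

Sniffing the content rather than trusting the file extension means a mislabeled file, such as a PDF saved with a `.png` extension, is still recognized.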
Modules§
- bundle
  SafeBundle generation: OCR + Gaze redact → on-disk artifacts.
- extract
  Input extraction backends.
- layout
  Layout / reading-order helpers.
- mcp (feature: mcp)
  MCP tool adapters for gaze-document.
- ocr
  OCR backend contract surface and concrete backends.
- render
  Renderer contract surface.
Enums§
- DocumentError
  Crate-level error type for gaze-document.