Crate gaze_document

Document ingestion + safe-bundle generation for the Gaze runtime.

gaze-document turns a single image (PNG / JPG) or PDF into a SafeBundle: tokenized Markdown, a restorable gaze::Manifest, and a structured OCR + PII BundleReport. PII detection flows through the standard gaze::Pipeline so the manifest stays canonical and reversible (Axis 2 reversibility).

§Quickstart

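use std::path::Path;

// Wrapped in `main` so that `?` has a function to return from. The boxed
// error type assumes the crate's error implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bundle = gaze_document::clean(
        Path::new("invoice.pdf"),
        Path::new("./safe-out"),
    )?;
    assert!(!bundle.clean_markdown.is_empty());
    Ok(())
}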

§Runtime requirements

  • tesseract binary on PATH (Tesseract 4.x or 5.x).
  • For PDF input: a pdfium dynamic library available to the process. See the crate README for per-OS install instructions.
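Because the Tesseract requirement is only discovered at run time, a caller may want a preflight check before ingesting documents. The following is a minimal sketch using only the standard library; `binary_responds` and `check_tesseract` are hypothetical helper names, not part of gaze-document, and the check assumes `tesseract --version` exits successfully when the binary is installed.

```rust
use std::process::{Command, Stdio};

/// Returns true if `program` can be spawned from PATH and exits
/// successfully when asked for its version. Output is discarded.
fn binary_responds(program: &str) -> bool {
    Command::new(program)
        .arg("--version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false) // spawn failure (e.g. not on PATH) counts as "absent"
}

/// Hypothetical preflight helper: reports a readable error when the
/// `tesseract` binary cannot be run.
fn check_tesseract() -> Result<(), String> {
    if binary_responds("tesseract") {
        Ok(())
    } else {
        Err("tesseract not found on PATH (Tesseract 4.x or 5.x is required)".to_string())
    }
}
```

Calling `check_tesseract()` once at startup lets an application report the missing dependency before any document is processed. It does not cover the pdfium dynamic library, whose location is OS-specific; see the crate README for that.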

§Feature flags

Flag             Default  What it enables
ocr-tesseract    yes      Tesseract subprocess OCR backend.
pdf-input        yes      PDF text extraction + raster OCR fallback via pdfium-render.
serde            yes      Serialize / Deserialize for BundleReport.
extract-docling  no       Reserved: future Docling layout adapter (no impl yet).
render-image     no       Reserved: future redacted-preview renderer (no impl yet).

Re-exports§

pub use bundle::clean; // crate feature: ocr-tesseract
pub use bundle::clean_with_ocr_backend; // crate feature: ocr-tesseract
pub use bundle::BundleReport;
pub use bundle::ClassCount;
pub use bundle::LayoutSummary;
pub use bundle::OcrSource;
pub use bundle::PageReport;
pub use bundle::Pipeline;
pub use bundle::SafeBundle;
pub use bundle::BUNDLE_VERSION;
pub use layout::ReadingOrder;
pub use ocr::TesseractBackend; // crate feature: ocr-tesseract
pub use ocr::detect_image_format;
pub use ocr::BBox;
pub use ocr::ImageFormat;
pub use ocr::ImageInput;
pub use ocr::LanguageTag;
pub use ocr::OcrBackend;
pub use ocr::OcrError;
pub use ocr::OcrHints;
pub use ocr::OcrSpan;
pub use render::Renderer;

Modules§

bundle
SafeBundle generation: OCR + Gaze redact → on-disk artifacts.
extract
Input extraction backends.
layout
Layout / reading-order helpers.
mcp (crate feature: mcp)
MCP tool adapters for gaze-document.
ocr
OCR backend contract surface and concrete backends.
render
Renderer contract surface.

Enums§

DocumentError
Crate-level error type for gaze-document.