Module hocr

hOCR 1.2 document processing.

Complete hOCR 1.2 specification support for extracting structured content from OCR documents.

Features

  • Full Element Support: All 40+ hOCR 1.2 element types
  • Complete Property Parsing: All 20+ hOCR properties (bbox, baseline, fonts, etc.)
  • Document Structure: Logical hierarchy (paragraphs, sections, chapters)
  • Spatial Table Reconstruction: Automatic table detection from bbox coordinates
  • Metadata Extraction: OCR system info, capabilities, languages
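hOCR stores properties such as `bbox` in an element's `title` attribute as semicolon-separated entries, each a name followed by space-separated values. The following is a minimal, self-contained sketch of reading a `bbox` from such a string. `SketchBBox` and `parse_bbox` are hypothetical names made up for illustration; they are not this module's `BBox` type or parser API, whose signatures are not shown here.

```rust
/// Hypothetical bounding box for this sketch (not the crate's `BBox`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SketchBBox {
    pub x0: u32,
    pub y0: u32,
    pub x1: u32,
    pub y1: u32,
}

/// Hypothetical helper: finds the `bbox` property in an hOCR `title`
/// attribute value such as `"bbox 10 20 110 45; x_wconf 93"`.
pub fn parse_bbox(title: &str) -> Option<SketchBBox> {
    // Properties are separated by semicolons.
    for prop in title.split(';') {
        let mut parts = prop.split_whitespace();
        // The first token of each property is its name.
        if parts.next() != Some("bbox") {
            continue;
        }
        // Every remaining token must be an unsigned integer.
        let nums: Option<Vec<u32>> = parts.map(|p| p.parse().ok()).collect();
        // Accept the property only if it has exactly four coordinates.
        if let Some(&[x0, y0, x1, y1]) = nums.as_deref() {
            return Some(SketchBBox { x0, y0, x1, y1 });
        }
    }
    None
}
```

The sketch skips malformed `bbox` entries instead of reporting an error; a real parser may prefer to surface them.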

Modules

  • types: Core hOCR element and property types
  • parser: Property parsing from title attributes
  • extractor: DOM to hOCR element tree extraction
  • converter: hOCR to Markdown conversion
  • spatial: Spatial table reconstruction from bounding boxes
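To give a feel for what reconstructing a table from bounding boxes can involve, here is a minimal sketch of one common approach: group words into rows by the proximity of their vertical centres, then order each row by its left edge. This is an assumption about the general technique, not a description of what the `spatial` module actually does. `SketchWord`, `group_into_rows`, and the `tolerance` parameter are hypothetical names.

```rust
/// Hypothetical word record for this sketch (not the crate's `HocrWord`).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchWord {
    pub text: String,
    pub left: u32,
    pub top: u32,
    pub height: u32,
}

impl SketchWord {
    pub fn new(text: &str, left: u32, top: u32, height: u32) -> Self {
        SketchWord { text: text.to_string(), left, top, height }
    }

    /// Vertical centre of the word's bounding box.
    fn center_y(&self) -> u32 {
        self.top + self.height / 2
    }
}

/// Groups words into rows, then sorts each row left to right.
/// A word joins the current row when its vertical centre is within
/// `tolerance` pixels of the previous word's centre.
pub fn group_into_rows(mut words: Vec<SketchWord>, tolerance: u32) -> Vec<Vec<String>> {
    // Process words top to bottom.
    words.sort_by_key(|w| w.center_y());

    let mut rows: Vec<Vec<SketchWord>> = Vec::new();
    let mut last_center: Option<u32> = None;
    for word in words {
        let center = word.center_y();
        // Centres are ascending after the sort, so the subtraction cannot underflow.
        let same_row = matches!(last_center, Some(prev) if center - prev <= tolerance);
        if same_row {
            if let Some(row) = rows.last_mut() {
                row.push(word);
            }
        } else {
            rows.push(vec![word]);
        }
        last_center = Some(center);
    }

    // Within each row, order cells by their left edge and keep only the text.
    rows.into_iter()
        .map(|mut row| {
            row.sort_by_key(|w| w.left);
            row.into_iter().map(|w| w.text).collect::<Vec<String>>()
        })
        .collect()
}
```

Because each word is compared with the previous one, a gently slanted line still ends up in one row; the trade-off is that a too-large `tolerance` can chain separate lines together.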

Re-exports

pub use converter::convert_to_markdown;
pub use extractor::extract_hocr_document;
pub use spatial::extract_hocr_words;
pub use spatial::reconstruct_table;
pub use spatial::table_to_markdown;
pub use spatial::HocrWord;
pub use types::BBox;
pub use types::Baseline;
pub use types::HocrElement;
pub use types::HocrElementType;
pub use types::HocrMetadata;
pub use types::HocrProperties;

Modules

  • converter: hOCR to Markdown conversion
  • extractor: hOCR element extraction
  • parser: hOCR property parser
  • spatial: Spatial table reconstruction from hOCR bounding box coordinates
  • types: hOCR 1.2 type definitions
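
As an illustration of the last step in a table pipeline, turning reconstructed rows into Markdown, the sketch below renders a grid of cell strings as a pipe table, treating the first row as the header. `rows_to_markdown` and `render_row` are hypothetical helpers written for this example; the signature and output format of the re-exported `table_to_markdown` are not documented here and may differ.

```rust
/// Hypothetical helper: renders rows of cells as a Markdown pipe table.
/// The first row is used as the header; an empty input yields an empty string.
pub fn rows_to_markdown(rows: &[Vec<String>]) -> String {
    let header = match rows.first() {
        Some(h) => h,
        None => return String::new(),
    };
    // The header determines the column count for every row.
    let cols = header.len();

    let mut out = String::new();
    out.push_str(&render_row(header, cols));
    // Separator line between header and body, e.g. "| --- | --- |".
    out.push_str(&format!("|{}\n", " --- |".repeat(cols)));
    for row in &rows[1..] {
        out.push_str(&render_row(row, cols));
    }
    out
}

/// Renders one table line, padding short rows with empty cells and
/// escaping literal pipes so they do not split a cell.
fn render_row(cells: &[String], cols: usize) -> String {
    let mut line = String::from("|");
    for i in 0..cols {
        let cell = cells.get(i).map(String::as_str).unwrap_or("");
        line.push_str(&format!(" {} |", cell.replace('|', "\\|")));
    }
    line.push('\n');
    line
}
```

Cells beyond the header's column count are dropped in this sketch, which keeps every line the same width.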