Module hocr

hOCR 1.2 document processing

This module supports the complete hOCR 1.2 specification and extracts structured content from OCR output.

§Features

  • Full Element Support: All 40+ hOCR 1.2 element types
  • Complete Property Parsing: All 20+ hOCR properties (bbox, baseline, fonts, etc.)
  • Document Structure: Logical hierarchy (paragraphs, sections, chapters)
  • Table Extraction: Spatial layout analysis for tabular data
  • Metadata Extraction: OCR system info, capabilities, languages

§Modules

  • types: Core hOCR element and property types
  • parser: Property parsing from title attributes
  • extractor: DOM to hOCR element tree extraction
  • converter: hOCR to Markdown conversion
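To illustrate what "property parsing from title attributes" involves: hOCR stores properties in an element's `title` attribute as semicolon-separated entries such as `bbox 100 200 300 400; x_wconf 95`. The following is a minimal sketch of that idea, not the `parser` module's actual code. The names `SketchBBox`, `split_properties`, and `parse_bbox` are hypothetical and exist only for this example; the crate's own `BBox` type may be shaped differently.

```rust
use std::collections::HashMap;

/// Bounding box in pixel coordinates.
/// Hypothetical type for this sketch; not the crate's `BBox`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SketchBBox {
    pub x0: u32,
    pub y0: u32,
    pub x1: u32,
    pub y1: u32,
}

/// Split an hOCR `title` attribute into (property name, raw value) pairs.
/// Entries are separated by `;`; the name is the first whitespace-delimited
/// token of each entry. Assumption: no `;` appears inside quoted values.
pub fn split_properties(title: &str) -> HashMap<&str, &str> {
    title
        .split(';')
        .filter_map(|entry| {
            let entry = entry.trim();
            if entry.is_empty() {
                return None;
            }
            match entry.split_once(char::is_whitespace) {
                Some((name, value)) => Some((name, value.trim())),
                // A property with a name but no value.
                None => Some((entry, "")),
            }
        })
        .collect()
}

/// Parse the value of a `bbox` property ("x0 y0 x1 y1").
/// Returns `None` unless there are exactly four non-negative integers.
pub fn parse_bbox(value: &str) -> Option<SketchBBox> {
    let mut nums = value.split_whitespace().map(|n| n.parse::<u32>());
    let bbox = SketchBBox {
        x0: nums.next()?.ok()?,
        y0: nums.next()?.ok()?,
        x1: nums.next()?.ok()?,
        y1: nums.next()?.ok()?,
    };
    // Reject trailing tokens such as a fifth number.
    if nums.next().is_some() {
        return None;
    }
    Some(bbox)
}
```

Calling `split_properties` on a `title` string and passing the `bbox` entry to `parse_bbox` yields a typed box. A production parser would also need to handle quoted values (for example the `image` property), which this sketch deliberately ignores.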

§Legacy Table Extraction

The original table extraction API is maintained for backward compatibility.

Re-exports§

pub use converter::convert_to_markdown;
pub use extractor::extract_hocr_document;
pub use types::BBox;
pub use types::Baseline;
pub use types::HocrElement;
pub use types::HocrElementType;
pub use types::HocrMetadata;
pub use types::HocrProperties;

Modules§

converter
hOCR to Markdown conversion
extractor
hOCR element extraction
parser
hOCR property parser
types
hOCR 1.2 type definitions

Structs§

HocrWord
Represents a word extracted from hOCR with position and confidence information

Functions§

detect_columns
Detect column positions from word positions
detect_rows
Detect row positions from word positions
extract_hocr_words
Extract hOCR words from a DOM tree
reconstruct_table
Reconstruct table structure from words
table_to_markdown
Convert table to markdown format
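
The functions above describe detecting columns and rows from word positions and then rebuilding a table. One common way to do this is gap-based clustering of coordinates, sketched below. This is an assumption about the general technique, not the crate's implementation: `SketchWord`, `cluster_positions`, and `sketch_grid` are hypothetical names, and the real `HocrWord`, `detect_columns`, `detect_rows`, and `reconstruct_table` may use different fields, signatures, and heuristics.

```rust
/// Hypothetical word record for this sketch; the crate's `HocrWord`
/// may carry different fields (for example width, height, confidence).
#[derive(Debug, Clone)]
pub struct SketchWord {
    pub text: String,
    pub left: u32,
    pub top: u32,
}

/// Cluster positions by gaps: after sorting, a new cluster starts whenever
/// the distance to the previous position exceeds `tolerance`.
/// Returns the smallest position of each cluster, in ascending order.
pub fn cluster_positions(mut positions: Vec<u32>, tolerance: u32) -> Vec<u32> {
    positions.sort_unstable();
    let mut starts = Vec::new();
    let mut prev: Option<u32> = None;
    for p in positions {
        match prev {
            // Close enough to the previous position: same cluster.
            Some(q) if p - q <= tolerance => {}
            // First position, or a gap larger than the tolerance.
            _ => starts.push(p),
        }
        prev = Some(p);
    }
    starts
}

/// Place each word in a rows x columns grid of cell strings.
/// Columns come from clustered `left` edges, rows from clustered `top` edges.
/// Simplification: the same tolerance is used on both axes.
pub fn sketch_grid(words: &[SketchWord], tolerance: u32) -> Vec<Vec<String>> {
    let cols = cluster_positions(words.iter().map(|w| w.left).collect(), tolerance);
    let rows = cluster_positions(words.iter().map(|w| w.top).collect(), tolerance);
    let mut grid = vec![vec![String::new(); cols.len()]; rows.len()];
    for w in words {
        // Index of the last cluster start that is <= the word's position.
        let r = rows.partition_point(|&s| s <= w.top).saturating_sub(1);
        let c = cols.partition_point(|&s| s <= w.left).saturating_sub(1);
        let cell = &mut grid[r][c];
        if !cell.is_empty() {
            cell.push(' ');
        }
        cell.push_str(&w.text);
    }
    grid
}
```

Words that share a cell are joined in input order, so callers of this sketch would pass words in reading order. Gap-based clustering is sensitive to the tolerance: too small a value splits one column into several, and too large a value merges neighbours.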