Module hocr

html_to_markdown_rs

hOCR 1.2 document processing

Complete hOCR 1.2 specification support for extracting structured content from OCR documents.

§Features

Full Element Support: All 40+ hOCR 1.2 element types
Complete Property Parsing: All 20+ hOCR properties (bbox, baseline, fonts, etc.)
Document Structure: Logical hierarchy (paragraphs, sections, chapters)
Table Extraction: Spatial layout analysis for tabular data
Metadata Extraction: OCR system info, capabilities, languages

§Modules

types: Core hOCR element and property types
parser: Property parsing from title attributes
extractor: DOM to hOCR element tree extraction
converter: hOCR to Markdown conversion
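
To illustrate what "property parsing from title attributes" involves, here is a minimal standalone sketch. hOCR stores properties in an element's title attribute as semicolon-separated entries, each a name followed by whitespace-separated values, for example "bbox 100 50 180 80; x_wconf 95". The names SketchBBox, parse_bbox, and parse_confidence are hypothetical and written only for this example; they are not the parser module's actual API, whose signatures are not shown on this page.

```rust
/// Bounding box in pixel coordinates.
/// Hypothetical stand-in for illustration, not the crate's `BBox` type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SketchBBox {
    pub x1: u32,
    pub y1: u32,
    pub x2: u32,
    pub y2: u32,
}

/// Find the `bbox x1 y1 x2 y2` property in an hOCR `title` attribute.
/// Returns `None` if the property is absent or malformed.
pub fn parse_bbox(title: &str) -> Option<SketchBBox> {
    for prop in title.split(';') {
        let mut parts = prop.split_whitespace();
        if parts.next() != Some("bbox") {
            continue;
        }
        // Every remaining token must be an unsigned integer.
        let nums: Vec<u32> = parts
            .map(|p| p.parse::<u32>())
            .collect::<Result<_, _>>()
            .ok()?;
        if nums.len() != 4 {
            return None;
        }
        return Some(SketchBBox {
            x1: nums[0],
            y1: nums[1],
            x2: nums[2],
            y2: nums[3],
        });
    }
    None
}

/// Find the `x_wconf` (word confidence) property and parse its value.
pub fn parse_confidence(title: &str) -> Option<f64> {
    title.split(';').find_map(|prop| {
        let mut parts = prop.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("x_wconf"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}
```

Returning Option keeps missing and malformed properties distinguishable from valid ones without panicking; a fuller parser would likely report which property failed.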

§Legacy Table Extraction

The original table extraction API is maintained for backward compatibility.

Re-exports§

pub use converter::convert_to_markdown;
pub use extractor::extract_hocr_document;
pub use types::BBox;
pub use types::Baseline;
pub use types::HocrElement;
pub use types::HocrElementType;
pub use types::HocrMetadata;
pub use types::HocrProperties;

Modules§

converter: hOCR to Markdown conversion
extractor: hOCR element extraction
parser: hOCR property parser
types: hOCR 1.2 type definitions

Structs§

HocrWord: Represents a word extracted from hOCR with position and confidence information

Functions§

detect_columns: Detect column positions from word positions
detect_rows: Detect row positions from word positions
extract_hocr_words: Extract hOCR words from a DOM tree
reconstruct_table: Reconstruct table structure from words
table_to_markdown: Convert table to markdown format
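
The functions above describe a pipeline: detect column and row positions from word positions, place words into a grid, and render the grid as Markdown. The following is a rough sketch of that general idea, assuming a simple gap-based clustering of left and top coordinates. It is not the crate's implementation, and SketchWord, cluster_positions, build_table, and to_markdown are hypothetical names that do not reflect the real signatures of detect_columns, detect_rows, reconstruct_table, or table_to_markdown.

```rust
/// Hypothetical word record: text plus left/top pixel position.
/// Not the crate's `HocrWord`.
#[derive(Debug, Clone)]
pub struct SketchWord {
    pub text: String,
    pub left: u32,
    pub top: u32,
}

/// Cluster positions: after sorting, a new group starts whenever the gap
/// to the previous position exceeds `tolerance`. Returns the first
/// (smallest) position of each group.
pub fn cluster_positions(mut positions: Vec<u32>, tolerance: u32) -> Vec<u32> {
    positions.sort_unstable();
    let mut anchors = Vec::new();
    let mut prev: Option<u32> = None;
    for p in positions {
        match prev {
            Some(q) if p - q <= tolerance => {} // same group as previous
            _ => anchors.push(p),               // gap too large: new group
        }
        prev = Some(p);
    }
    anchors
}

/// Index of the anchor closest to `value` (0 if `anchors` is empty).
fn nearest(anchors: &[u32], value: u32) -> usize {
    anchors
        .iter()
        .enumerate()
        .min_by_key(|(_, a)| a.abs_diff(value))
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// Place each word into the grid cell given by its nearest row and column.
pub fn build_table(words: &[SketchWord], col_tol: u32, row_tol: u32) -> Vec<Vec<String>> {
    let cols = cluster_positions(words.iter().map(|w| w.left).collect(), col_tol);
    let rows = cluster_positions(words.iter().map(|w| w.top).collect(), row_tol);
    let mut table = vec![vec![String::new(); cols.len()]; rows.len()];
    // Visit words row by row, left to right, so cell text reads in order.
    let mut ordered: Vec<&SketchWord> = words.iter().collect();
    ordered.sort_by_key(|w| (nearest(&rows, w.top), w.left));
    for w in ordered {
        let cell = &mut table[nearest(&rows, w.top)][nearest(&cols, w.left)];
        if !cell.is_empty() {
            cell.push(' ');
        }
        cell.push_str(&w.text);
    }
    table
}

/// Render the grid as a Markdown table, treating the first row as the header.
pub fn to_markdown(table: &[Vec<String>]) -> String {
    let mut out = String::new();
    for (i, row) in table.iter().enumerate() {
        out.push_str("| ");
        out.push_str(&row.join(" | "));
        out.push_str(" |\n");
        if i == 0 {
            out.push('|');
            for _ in row {
                out.push_str(" --- |");
            }
            out.push('\n');
        }
    }
    out
}
```

The tolerances control how far apart two words may start and still count as the same column or row. Because this sketch clusters on the left edge only, a multi-word cell whose words are spaced wider than the tolerance would be split across columns; a production implementation would need a more robust layout analysis.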