hOCR 1.2 document processing
Complete hOCR 1.2 specification support for extracting structured content from OCR documents.
§Features
- Full Element Support: All 40+ hOCR 1.2 element types
- Complete Property Parsing: All 20+ hOCR properties (bbox, baseline, fonts, etc.)
- Document Structure: Logical hierarchy (paragraphs, sections, chapters)
- Table Extraction: Spatial layout analysis for tabular data
- Metadata Extraction: OCR system info, capabilities, languages
§Modules
- types: Core hOCR element and property types
- parser: Property parsing from title attributes
- extractor: DOM to hOCR element tree extraction
- converter: hOCR to Markdown conversion
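To make "property parsing from title attributes" concrete: hOCR stores properties in an element's `title` attribute as semicolon-separated entries, for example `bbox 100 200 300 400; x_wconf 95`. The following is a minimal sketch of pulling the `bbox` property out of such a string. `SketchBBox` and `parse_bbox` are hypothetical names used for illustration only; they are not this crate's `BBox` type or `parser` API, whose fields and signatures may differ.

```rust
/// Hypothetical stand-in for a bounding box; NOT the crate's `BBox` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SketchBBox {
    pub x0: u32,
    pub y0: u32,
    pub x1: u32,
    pub y1: u32,
}

/// Parse the `bbox` property from an hOCR `title` attribute value.
/// Returns `None` if no `bbox` property with exactly four integer
/// coordinates is found. (Illustrative helper, assumed for this sketch.)
pub fn parse_bbox(title: &str) -> Option<SketchBBox> {
    // Properties are separated by semicolons: "bbox 0 0 10 10; x_wconf 95".
    for prop in title.split(';') {
        let mut parts = prop.split_whitespace();
        // The first token is the property name; skip anything that is not `bbox`.
        if parts.next() != Some("bbox") {
            continue;
        }
        // Parse the remaining tokens; any non-numeric token yields `None`.
        let coords = parts
            .map(|p| p.parse().ok())
            .collect::<Option<Vec<u32>>>()?;
        // Accept exactly four coordinates: x0 y0 x1 y1.
        return match coords.as_slice() {
            &[x0, y0, x1, y1] => Some(SketchBBox { x0, y0, x1, y1 }),
            _ => None,
        };
    }
    None
}
```

Under these assumptions, `parse_bbox("bbox 100 200 300 400; x_wconf 95")` yields a box with corners (100, 200) and (300, 400), and a title without a `bbox` entry yields `None`.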
§Legacy Table Extraction
The original table extraction API is maintained for backward compatibility.
Re-exports§
pub use converter::convert_to_markdown;
pub use extractor::extract_hocr_document;
pub use types::BBox;
pub use types::Baseline;
pub use types::HocrElement;
pub use types::HocrElementType;
pub use types::HocrMetadata;
pub use types::HocrProperties;
Modules§
- converter: hOCR to Markdown conversion
- extractor: hOCR element extraction
- parser: hOCR property parser
- types: hOCR 1.2 type definitions
Structs§
- HocrWord: Represents a word extracted from hOCR with position and confidence information
Functions§
- detect_columns: Detect column positions from word positions
- detect_rows: Detect row positions from word positions
- extract_hocr_words: Extract hOCR words from a DOM tree
- reconstruct_table: Reconstruct table structure from words
- table_to_markdown: Convert table to markdown format
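
The functions listed above suggest a pipeline: words with positions go in, column and row positions are detected, and the reconstructed table is rendered as Markdown. Their actual signatures and algorithms are not shown on this page, so the following is only a rough sketch of the general idea, assuming a simple gap-based clustering of left edges. `SketchWord`, `sketch_detect_columns`, `sketch_table_to_markdown`, and the `tolerance` parameter are all hypothetical and are not this crate's API.

```rust
/// Hypothetical, simplified word record; NOT the crate's `HocrWord`.
#[derive(Debug, Clone)]
pub struct SketchWord {
    pub text: String,
    pub left: u32,
    pub top: u32,
}

/// Cluster word left edges into column start positions.
/// A new column starts whenever the gap to the previous (sorted) left edge
/// exceeds `tolerance`. This is an assumed heuristic for illustration.
pub fn sketch_detect_columns(words: &[SketchWord], tolerance: u32) -> Vec<u32> {
    let mut lefts: Vec<u32> = words.iter().map(|w| w.left).collect();
    lefts.sort_unstable();

    let mut columns: Vec<u32> = Vec::new();
    let mut last: Option<u32> = None;
    for x in lefts {
        match last {
            // Close enough to the previous edge: same column, nothing to record.
            // `x >= prev` holds because `lefts` is sorted, so no underflow.
            Some(prev) if x - prev <= tolerance => {}
            // First edge, or a gap larger than `tolerance`: start a new column.
            _ => columns.push(x),
        }
        last = Some(x);
    }
    columns
}

/// Render rows of cells as a Markdown table, treating the first row as the header.
pub fn sketch_table_to_markdown(rows: &[Vec<String>]) -> String {
    let mut out = String::new();
    for (i, row) in rows.iter().enumerate() {
        out.push_str("| ");
        out.push_str(&row.join(" | "));
        out.push_str(" |\n");
        if i == 0 {
            // Markdown requires a separator line after the header row.
            let sep: Vec<&str> = row.iter().map(|_| "---").collect();
            out.push_str("| ");
            out.push_str(&sep.join(" | "));
            out.push_str(" |\n");
        }
    }
    out
}
```

In this sketch, words whose left edges fall at 10, 12, 100, 103, and 250 with a tolerance of 5 produce three column starts (10, 100, 250). A real implementation would likely also need to handle right-aligned cells, multi-word cells, and skewed scans, which this heuristic ignores.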