PDF ingestion for BookForge (ROADMAP §9b).
Layout extraction is delegated to poppler’s command-line tools;
everything after the pdftohtml -xml output is deterministic Rust:
line merging, column detection, reading order, paragraph clustering,
heading detection, and synthetic-EPUB assembly. The produced EPUB
flows through the ordinary BookForge pipeline — this crate is an
ingestion front-end, not a parallel translation path.
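The line-merging step named above can be illustrated with a minimal sketch. It is not this crate's implementation: the `Fragment` struct, the `merge_into_lines` and `line_text` functions, and the pixel `tolerance` parameter are all hypothetical names chosen for illustration. The only thing taken from poppler is that each `<text>` element in `pdftohtml -xml` output carries integer `top`, `left`, `width`, and `height` attributes.

```rust
/// Hypothetical text fragment mirroring the `top`/`left`/`width`/`height`
/// attributes of a `<text>` element in `pdftohtml -xml` output.
/// This is NOT the crate's `Span` or `Line` type.
#[derive(Debug, Clone, PartialEq)]
pub struct Fragment {
    pub top: i32,
    pub left: i32,
    pub width: i32,
    pub height: i32,
    pub text: String,
}

/// Groups fragments into lines. A fragment joins the current line when its
/// `top` is within `tolerance` pixels of that line's first fragment;
/// otherwise it starts a new line. The tolerance rule is an assumption.
pub fn merge_into_lines(mut fragments: Vec<Fragment>, tolerance: i32) -> Vec<Vec<Fragment>> {
    // Sort top-to-bottom, then left-to-right, so the result does not depend
    // on the order in which fragments were emitted.
    fragments.sort_by_key(|f| (f.top, f.left));

    let mut lines: Vec<Vec<Fragment>> = Vec::new();
    for frag in fragments {
        if let Some(line) = lines.last_mut() {
            if (frag.top - line[0].top).abs() <= tolerance {
                line.push(frag);
                continue;
            }
        }
        lines.push(vec![frag]);
    }

    // Fragments on one line may differ slightly in `top`, so restore
    // left-to-right order within each line.
    for line in &mut lines {
        line.sort_by_key(|f| f.left);
    }
    lines
}

/// Joins the fragment texts of one line with single spaces.
pub fn line_text(line: &[Fragment]) -> String {
    line.iter()
        .map(|f| f.text.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}
```

Sorting before grouping is what makes a sketch like this deterministic: the same set of fragments always yields the same lines, regardless of input order.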
Re-exports
pub use convert::ConvertOptions;
pub use convert::ConvertOutcome;
pub use convert::convert_pdf;
pub use model::ColumnMode;
pub use model::DocBlock;
pub use model::Line;
pub use model::Page;
pub use model::Span;
pub use parse::parse_pdf2xml;
pub use reconstruct::reconstruct;
pub use report::ConversionReport;
pub use tools::PopplerTools;
pub use tools::ToolError;
Modules
- convert
- End-to-end conversion orchestration: poppler → parse → reconstruct → EPUB + report.
- epub
- Synthetic EPUB assembly from reconstructed blocks. The output is a minimal, valid, reflowable EPUB 3 that the ordinary BookForge pipeline (inspect, translate, validate, review) consumes unchanged.
- model
- Page/line intermediate representation produced by the poppler XML parser and consumed by reconstruction. Coordinates are pdftohtml’s integer pixel units, top-left origin.
- parse
- Parser for pdftohtml -xml output into the page/fragment IR.
- reconstruct
- Deterministic layout reconstruction: fragments → lines → columns → reading order → paragraphs/headings.
- report
- Conversion fidelity report. The contract from ROADMAP §9b: pages that reconstruct badly are flagged, never hidden.
- tools
- Discovery and invocation of poppler command-line tools.
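As an illustration of what tool discovery can look like, the sketch below searches the directories on `PATH` for an executable such as `pdftohtml`. It is an assumption about the general approach, not the behaviour of `PopplerTools`: the function names `find_tool_in` and `find_tool_on_path` are hypothetical, and the sketch ignores platform details such as the `.exe` suffix on Windows and executable permission bits.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Searches `dirs` in order for a regular file named `tool` and returns the
/// first match. Hypothetical helper, not part of this crate's API.
pub fn find_tool_in<I, P>(tool: &str, dirs: I) -> Option<PathBuf>
where
    I: IntoIterator<Item = P>,
    P: AsRef<Path>,
{
    dirs.into_iter()
        .map(|dir| dir.as_ref().join(tool))
        .find(|candidate| candidate.is_file())
}

/// Looks `tool` up in the directories listed in the `PATH` environment
/// variable. Returns `None` if `PATH` is unset or no directory contains it.
pub fn find_tool_on_path(tool: &str) -> Option<PathBuf> {
    let path = env::var_os("PATH")?;
    find_tool_in(tool, env::split_paths(&path))
}
```

Separating the directory search from the `PATH` lookup keeps the search testable against a temporary directory. A caller would typically turn a `None` from `find_tool_on_path("pdftohtml")` into a descriptive error rather than attempting the conversion.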