
Crate bookforge_pdf


PDF ingestion for BookForge (ROADMAP §9b).

Layout extraction is delegated to poppler’s command-line tools; everything after the pdftohtml -xml output is deterministic Rust: line merging, column detection, reading order, paragraph clustering, heading detection, and synthetic-EPUB assembly. The produced EPUB flows through the ordinary BookForge pipeline: this crate is an ingestion front-end, not a parallel translation path.
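To make the first deterministic stage concrete, here is a minimal sketch of line merging: positioned text fragments that sit at nearly the same vertical offset are grouped into one line and joined left to right. The `Fragment` and `MergedLine` types, the `merge_lines` function, and the `tolerance` parameter are hypothetical stand-ins for illustration, not this crate's `Span`, `Line`, or `reconstruct` API. The sketch assumes only integer pixel coordinates with a top-left origin.

```rust
/// Hypothetical stand-in for one positioned text fragment from the
/// poppler XML output. Integer pixel units, top-left origin.
#[derive(Debug, Clone, PartialEq)]
pub struct Fragment {
    pub top: i32,
    pub left: i32,
    pub text: String,
}

/// Hypothetical merged line: the fragments of one visual line, joined.
#[derive(Debug, Clone, PartialEq)]
pub struct MergedLine {
    pub top: i32,
    pub left: i32,
    pub text: String,
}

/// Group fragments whose `top` lies within `tolerance` pixels of the first
/// fragment of the current group, then join each group left to right.
/// The sort is stable and nothing depends on hashing or iteration order of
/// unordered containers, so equal input always yields equal output.
pub fn merge_lines(mut fragments: Vec<Fragment>, tolerance: i32) -> Vec<MergedLine> {
    // Top-to-bottom, then left-to-right.
    fragments.sort_by_key(|f| (f.top, f.left));

    let mut groups: Vec<Vec<Fragment>> = Vec::new();
    for frag in fragments {
        // A fragment starts a new line when there is no current group or
        // when it sits too far below the group's first fragment.
        let starts_new = match groups.last() {
            Some(group) => (frag.top - group[0].top).abs() > tolerance,
            None => true,
        };
        if starts_new {
            groups.push(vec![frag]);
        } else if let Some(group) = groups.last_mut() {
            group.push(frag);
        }
    }

    groups
        .into_iter()
        .map(|mut group| {
            // The group's first fragment has the smallest `top`.
            let top = group[0].top;
            group.sort_by_key(|f| f.left);
            let text = group
                .iter()
                .map(|f| f.text.trim())
                .filter(|t| !t.is_empty())
                .collect::<Vec<_>>()
                .join(" ");
            MergedLine { top, left: group[0].left, text }
        })
        .collect()
}
```

Calling `merge_lines(fragments, 2)` on fragments at tops 100, 101, and 130 would produce two lines. A real implementation would also need to handle hyphenation, overlapping fragments, and font-size-dependent tolerances, none of which this sketch attempts.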

Re-exports

pub use convert::ConvertOptions;
pub use convert::ConvertOutcome;
pub use convert::convert_pdf;
pub use model::ColumnMode;
pub use model::DocBlock;
pub use model::Line;
pub use model::Page;
pub use model::Span;
pub use parse::parse_pdf2xml;
pub use reconstruct::reconstruct;
pub use report::ConversionReport;
pub use tools::PopplerTools;
pub use tools::ToolError;

Modules

convert
End-to-end conversion orchestration: poppler → parse → reconstruct → EPUB + report.
epub
Synthetic EPUB assembly from reconstructed blocks. The output is a minimal, valid, reflowable EPUB 3 that the ordinary BookForge pipeline (inspect, translate, validate, review) consumes unchanged.
model
Page/line intermediate representation produced by the poppler XML parser and consumed by reconstruction. Coordinates are pdftohtml’s integer pixel units, top-left origin.
parse
Parser for pdftohtml -xml output into the page/fragment IR.
reconstruct
Deterministic layout reconstruction: fragments → lines → columns → reading order → paragraphs/headings.
report
Conversion fidelity report. The contract from ROADMAP §9b: pages that reconstruct badly are flagged, never hidden.
tools
Discovery and invocation of poppler command-line tools.
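As a rough illustration of what invoking poppler from Rust can look like, the sketch below builds a `pdftohtml -xml` command with `std::process::Command` and captures its standard output. The function names `pdftohtml_xml_command` and `run_pdftohtml_xml` are hypothetical, and the flag set (`-xml`, `-i`, `-stdout`) is an assumption about a reasonable invocation; the flags the `tools` module actually passes, and how `PopplerTools` locates the binary, are not stated on this page.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Build (but do not run) a `pdftohtml` invocation that emits XML.
/// Assumed flags: `-xml` selects XML output, `-i` skips image extraction,
/// `-stdout` writes the result to standard output instead of a file.
pub fn pdftohtml_xml_command(pdf: &Path) -> Command {
    let mut cmd = Command::new("pdftohtml");
    cmd.arg("-xml").arg("-i").arg("-stdout").arg(pdf);
    cmd
}

/// Run the command and return the XML as text.
/// Fails if the binary is missing or exits with a non-zero status.
pub fn run_pdftohtml_xml(pdf: &Path) -> io::Result<String> {
    let output = pdftohtml_xml_command(pdf).output()?;
    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("pdftohtml failed ({}): {}", output.status, stderr.trim()),
        ));
    }
    // Lossy decoding keeps the sketch simple; invalid UTF-8 bytes become U+FFFD.
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Separating command construction from execution keeps the argument list inspectable without poppler installed, which is convenient for unit tests.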

Enums

PdfError

Type Aliases

Result