Module parse


Parses the output of pdftohtml -xml into the page/fragment IR.

The format, per page:

<page number="1" width="918" height="1188" ...>
  <fontspec id="0" size="17" family="Times" color="#000000"/>
  <text top="246" left="261" width="394" height="18" font="0">Line with <b>bold</b></text>
</page>
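
As a rough illustration of how the attributes in that sample map onto a fragment record, here is a minimal sketch using only the standard library. The names `Fragment`, `attr`, and `parse_text_tag` are hypothetical and are not taken from this module. The sketch also assumes that attribute values are double-quoted integers, as in the sample above. A real parser would more likely sit on top of a proper XML reader.

```rust
/// Hypothetical fragment record; the crate's real IR may differ.
#[derive(Debug, PartialEq)]
pub struct Fragment {
    pub top: i32,
    pub left: i32,
    pub width: i32,
    pub height: i32,
    pub font: u32,
}

/// Return the value of `name="..."` inside a single opening tag, if present.
/// Assumes double-quoted values with no embedded quotes.
fn attr<'a>(tag: &'a str, name: &str) -> Option<&'a str> {
    // The leading space keeps `width` from matching inside a longer name.
    let key = format!(" {}=\"", name);
    let start = tag.find(key.as_str())? + key.len();
    let len = tag[start..].find('"')?;
    Some(&tag[start..start + len])
}

/// Read the geometry and font attributes of one `<text ...>` opening tag.
/// Returns `None` if an attribute is missing or is not an integer.
pub fn parse_text_tag(tag: &str) -> Option<Fragment> {
    Some(Fragment {
        top: attr(tag, "top")?.parse().ok()?,
        left: attr(tag, "left")?.parse().ok()?,
        width: attr(tag, "width")?.parse().ok()?,
        height: attr(tag, "height")?.parse().ok()?,
        font: attr(tag, "font")?.parse().ok()?,
    })
}
```

Returning `Option` rather than panicking lets a caller skip a malformed `<text>` element and keep the rest of the page.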

<text> content may nest <b> and <i> (and occasionally <a>). Styling is flattened into spans; anchors are reduced to their plain text.
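
The flattening described above could look roughly like the following sketch. `Span`, `push_text`, and `flatten_inline` are hypothetical names, not this module's API. The sketch works on the raw inner markup of one <text> element and does not decode XML entities such as `&amp;`. It is meant to show the idea, not the actual implementation behind parse_pdf2xml.

```rust
/// Hypothetical span type; the crate's real IR may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub bold: bool,
    pub italic: bool,
}

/// Append text, merging with the previous span when the style is unchanged.
fn push_text(spans: &mut Vec<Span>, text: &str, bold: bool, italic: bool) {
    if text.is_empty() {
        return;
    }
    if let Some(last) = spans.last_mut() {
        if last.bold == bold && last.italic == italic {
            last.text.push_str(text);
            return;
        }
    }
    spans.push(Span { text: text.to_string(), bold, italic });
}

/// Flatten the inner markup of one <text> element into styled spans.
/// <b> and <i> adjust the style flags; <a ...> and any other tag is
/// dropped, so its content ends up as plain text.
pub fn flatten_inline(inner: &str) -> Vec<Span> {
    let mut spans = Vec::new();
    let (mut bold, mut italic) = (0u32, 0u32); // nesting depths
    let mut rest = inner;

    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            if let Some(end) = after_lt.find('>') {
                match &after_lt[..end] {
                    "b" => bold += 1,
                    "/b" => bold = bold.saturating_sub(1),
                    "i" => italic += 1,
                    "/i" => italic = italic.saturating_sub(1),
                    _ => {} // <a href=...>, </a>, and unknown tags are ignored
                }
                rest = &after_lt[end + 1..];
                continue;
            }
        }
        // Text run up to the next '<'. A stray '<' with no closing '>'
        // is kept as literal text, so skip past it when searching.
        let start = if rest.starts_with('<') { 1 } else { 0 };
        let end = rest[start..].find('<').map_or(rest.len(), |i| i + start);
        push_text(&mut spans, &rest[..end], bold > 0, italic > 0);
        rest = &rest[end..];
    }
    spans
}
```

Tracking nesting depth instead of a plain boolean keeps `<b><b>x</b>y</b>` bold through `y`. Merging adjacent runs of equal style means an anchor in the middle of a sentence does not split the surrounding text into separate spans.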

Functions

parse_pdf2xml