Parser for pdftohtml -xml output into the page/fragment IR.
The format, per page:
<page number="1" width="918" height="1188" ...>
<fontspec id="0" size="17" family="Times" color="#000000"/>
<text top="246" left="261" width="394" height="18" font="0">Line with <b>bold</b></text>
</page>

<text> content may nest <b>, <i> (and occasionally <a>);
styling is flattened into spans, and anchors are reduced to plain text.
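
The flattening rule above can be illustrated with a minimal sketch. It is not this module's actual implementation: the `Span` type, its field names, and the `flatten_text` function are hypothetical names chosen for illustration, and the sketch assumes it is handed only the inner markup of a single <text> element. It uses a small hand-rolled tag scanner rather than an XML library, and it does not decode entities such as `&amp;`.

```rust
/// One run of text with uniform styling.
/// Hypothetical type for this sketch, not the module's real IR.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub bold: bool,
    pub italic: bool,
}

/// Append `text` to the span list, merging with the previous span
/// when the styling is identical.
fn push_text(spans: &mut Vec<Span>, text: &str, bold: bool, italic: bool) {
    if text.is_empty() {
        return;
    }
    if let Some(last) = spans.last_mut() {
        if last.bold == bold && last.italic == italic {
            last.text.push_str(text);
            return;
        }
    }
    spans.push(Span { text: text.to_string(), bold, italic });
}

/// Flatten the inner markup of a `<text>` element into styled spans.
/// `<b>` and `<i>` toggle styling; `<a ...>` and any other tag are
/// dropped while their text content is kept. Entities are left as-is.
pub fn flatten_text(inner: &str) -> Vec<Span> {
    let mut spans = Vec::new();
    // Nesting depths, so that `<b><b>x</b>y</b>` keeps `y` bold.
    let mut bold: u32 = 0;
    let mut italic: u32 = 0;
    let mut rest = inner;

    while !rest.is_empty() {
        match rest.find('<') {
            // `rest` starts with a tag: update the style state and skip it.
            Some(0) => match rest.find('>') {
                Some(end) => {
                    let tag = &rest[1..end];
                    let closing = tag.starts_with('/');
                    let name = tag
                        .trim_start_matches('/')
                        .split_whitespace()
                        .next()
                        .unwrap_or("");
                    match name {
                        "b" if closing => bold = bold.saturating_sub(1),
                        "b" => bold += 1,
                        "i" if closing => italic = italic.saturating_sub(1),
                        "i" => italic += 1,
                        _ => {} // anchors and unknown tags: ignore the tag itself
                    }
                    rest = &rest[end + 1..];
                }
                // Unterminated '<': treat the remainder as literal text.
                None => {
                    push_text(&mut spans, rest, bold > 0, italic > 0);
                    rest = "";
                }
            },
            // Plain text up to the next tag.
            Some(pos) => {
                push_text(&mut spans, &rest[..pos], bold > 0, italic > 0);
                rest = &rest[pos..];
            }
            // No more tags: the remainder is plain text.
            None => {
                push_text(&mut spans, rest, bold > 0, italic > 0);
                rest = "";
            }
        }
    }
    spans
}
```

Under these assumptions, `flatten_text("Line with <b>bold</b>")` yields two spans: a plain "Line with " followed by a bold "bold". An input such as `see <a href="x.html">page 3</a>` collapses into a single plain span, because the anchor tags are dropped and adjacent text with identical styling is merged.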