docspec-docx-reader 1.7.1

DOCX to DocSpec event stream reader

docspec-docx-reader

Streaming DOCX to DocSpec event stream reader.

See the main DocSpec repository for documentation, architecture, and the event protocol.

Supported

  • Paragraphs (<w:p>) and direct text (<w:t> inside <w:r>)
  • Line breaks (<w:br>, including w:type="page" and w:type="column" — all emit LineBreak)
  • Tabs (<w:tab> — emitted as a Text event containing the single character "\t")
  • Tables (<w:tbl>, <w:tr>, <w:tc>) — emitted as structural events only; cell merging, header rows, and table styles are not represented
  • Run properties (<w:rPr>): <w:b> (bold), <w:i> (italic), <w:u> (underline — any w:val other than none), <w:strike> (strikethrough), <w:dstrike> (double-strike, collapses to strikethrough), <w:vertAlign> (subscript and superscript only; baseline resets to neither)
  • Paragraph properties (<w:pPr>): <w:jc> (alignment — left/start to Left, right/end to Right, center to Center, both/distribute to Justify)
  • Empty <w:rPr/> and <w:pPr/> are treated as no properties (default style / alignment None)
  • A <w:rPr> or <w:pPr> that appears after content in the same parent is silently ignored (per the OOXML spec, each must be the first child element of its parent)
  • Emits: StartDocument, StartParagraph, Text, LineBreak, EndParagraph, StartTable, StartTableRow, StartTableCell, EndTableCell, EndTableRow, EndTable, EndDocument
  • Compression: Stored and Deflated only
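The <w:jc> mapping listed above can be sketched as a small lookup. This is an illustration only, not the crate's code: the Alignment enum and the alignment_from_jc function are hypothetical names, and returning None for unlisted values is an assumption modelled on the "alignment None" default described for empty <w:pPr/>.

```rust
/// Hypothetical alignment type; the crate's real type may be named or shaped differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Alignment {
    Left,
    Right,
    Center,
    Justify,
}

/// Maps a `w:val` attribute of `<w:jc>` to an alignment, following the table above.
fn alignment_from_jc(val: &str) -> Option<Alignment> {
    match val {
        // Logical start/end are treated as left/right; paragraph direction is not tracked.
        "left" | "start" => Some(Alignment::Left),
        "right" | "end" => Some(Alignment::Right),
        "center" => Some(Alignment::Center),
        "both" | "distribute" => Some(Alignment::Justify),
        // Assumption: any value not listed above leaves the alignment unset.
        _ => None,
    }
}
```

Because start and end map to fixed sides, a right-to-left paragraph marked start would still come out as Left under this sketch, which matches the BiDi limitation listed under Out of Scope.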

Out of Scope (silently dropped)

  • Headings (any <w:pStyle> value — every paragraph is StartParagraph)
  • Style references (<w:rStyle>, <w:pStyle>)
  • Run formatting not listed above: <w:color>, <w:sz>, <w:szCs>, <w:rFonts>, <w:shd>, <w:highlight>, <w:caps>, <w:smallCaps>, <w:position>, <w:spacing>, <w:kern>, <w:lang>, <w:noProof>
  • Revision tracking (<w:rPrChange>, <w:pPrChange>)
  • Advanced paragraph layout beyond alignment: <w:numPr>, <w:ind>, <w:tabs>, <w:framePr>, <w:sectPr>
  • <w:rPr> nested inside <w:pPr> (paragraph mark / pilcrow run properties)
  • BiDi-aware logical alignment (start/end flipping based on paragraph direction is not tracked)
  • Math (m:rPr) and DrawingML (a:rPr) namespaces
  • Cell merging (<w:gridSpan>, <w:vMerge>) — every cell emits with colspan: None and rowspan: None
  • Header rows (<w:tblHeader>) — every cell emits as StartTableCell, never StartTableHeader
  • Table, row, and cell properties (<w:tblPr>, <w:trPr>, <w:tcPr>, <w:tblGrid>)
  • Lists
  • Hyperlinks (<w:hyperlink>)
  • Drawings and images (<w:drawing>, <w:pict>)
  • Structured document tags (<w:sdt>)
  • Comments, footnotes, headers, footers
  • Document metadata
  • Tracked changes (<w:ins>, <w:del>, <w:moveFrom>, <w:moveTo>)

Streaming Guarantee

DocxReader streams document.xml event by event, so memory use stays constant regardless of document size. The only part read fully into memory is _rels/.rels (typically a few hundred bytes), which is needed to discover the document target path.
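To make the target-path discovery concrete, here is a rough sketch of locating the main document part in the text of _rels/.rels. It is not how DocxReader is implemented: the function names are hypothetical, and the naive string scan stands in for a real XML parser. It assumes double-quoted attributes separated by single spaces, and it relies on the main document relationship having a Type that ends in "/officeDocument".

```rust
/// Returns the value of `name="..."` from the attribute text of one tag.
/// Naive: assumes double quotes and a single space before the attribute name.
fn attr<'a>(tag: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!(" {name}=\"");
    let start = tag.find(&needle)? + needle.len();
    let len = tag[start..].find('"')?;
    Some(&tag[start..start + len])
}

/// Finds the Target of the relationship whose Type ends in "/officeDocument".
fn main_document_target(rels_xml: &str) -> Option<&str> {
    rels_xml
        .split("<Relationship")
        .skip(1) // drop everything before the first match
        // Keep only <Relationship ...> elements, not the <Relationships> root.
        .filter(|chunk| chunk.starts_with(char::is_whitespace))
        // Cut each chunk down to the attribute text of its own tag.
        .filter_map(|chunk| chunk.split('>').next())
        .find(|tag| attr(tag, "Type").is_some_and(|t| t.ends_with("/officeDocument")))
        .and_then(|tag| attr(tag, "Target"))
}
```

For a typical package this yields a path such as word/document.xml, which is the part that would then be streamed rather than loaded whole.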

Quick Start

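use docspec_docx_reader::{DocxReader, EventSource};

fn main() -> Result<(), docspec_core::Error> {
    // Open the DOCX archive; the document body is streamed, not loaded whole.
    let mut reader = DocxReader::from_path("document.docx")?;

    // Pull events one at a time until the reader returns None at end of stream.
    while let Some(event) = reader.next_event()? {
        println!("{event:?}");
    }
    Ok(())
}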

See Also