
Crate docspec_docx_reader


DOCX to DocSpec event stream reader.

This crate provides a DocxReader that implements EventSource to convert DOCX documents into the DocSpec event stream format. It uses quick-xml for streaming XML parsing and zip for archive extraction.

§Scope

In scope:

  • Paragraphs (<w:p>)
  • Direct text (<w:t> inside <w:r>)
  • Line breaks (<w:br>, including w:type="page" and w:type="column"; all emitted as LineBreak)
  • Tabs (<w:tab>, emitted as a Text event whose content is the single character "\t")
  • Tables (<w:tbl>, <w:tr>, <w:tc>)
  • Lists (<w:p> with <w:numPr>, ordered and unordered)
  • Hyperlinks (<w:hyperlink>, resolved via word/_rels/document.xml.rels and emitted as StartLink/EndLink events around inline content)
  • Structured document tags (<w:sdt>; content emitted normally, <w:sdtPr>/<w:sdtEndPr> dropped)
  • Tracked insertions and moves (<w:ins>, <w:moveTo>; accept-changes semantics)

Emits: StartDocument, StartParagraph, StartTextStyle, Text, EndTextStyle, LineBreak, EndParagraph, StartTable, StartTableRow, StartTableCell, StartTableHeader, EndTableHeader, EndTableCell, EndTableRow, EndTable, StartLink, EndLink, StartOrderedListItem, EndOrderedListItem, StartUnorderedListItem, EndUnorderedListItem, EndDocument.
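The inline mapping rules above (text, tabs, and breaks) can be illustrated with a minimal sketch. The `InlineEvent` and `RunChild` types and the `map_run_child` function are hypothetical stand-ins written for this example; they are not the crate's actual types, and the real `Event` may carry different payloads.

```rust
/// Hypothetical, simplified mirror of two of the events named above.
/// The real crate's `Event` type may carry different payloads.
#[derive(Debug, Clone, PartialEq)]
enum InlineEvent {
    Text(String),
    LineBreak,
}

/// Hypothetical stand-in for the run-level elements the reader handles.
#[derive(Debug, Clone, PartialEq)]
enum RunChild {
    /// `<w:t>`: direct text.
    Text(String),
    /// `<w:tab>`.
    Tab,
    /// `<w:br>` with its optional `w:type` attribute ("page", "column", ...).
    Break(Option<String>),
}

/// Maps one run child to the event described in the scope list.
fn map_run_child(child: &RunChild) -> InlineEvent {
    match child {
        RunChild::Text(s) => InlineEvent::Text(s.clone()),
        // A tab becomes a Text event holding the single character "\t".
        RunChild::Tab => InlineEvent::Text("\t".to_string()),
        // Every break is a LineBreak; the w:type value is not consulted.
        RunChild::Break(_) => InlineEvent::LineBreak,
    }
}
```

Under this sketch a page break and a plain line break are indistinguishable downstream, which matches the documented behavior that all `<w:br>` variants are emitted as LineBreak.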

The elements listed under “Out of scope” are the reader’s denylist — their entire subtree is silently dropped. Every other element (known or unknown) is parsed normally; the reader continues into its children.

Out of scope (subtree silently dropped):

  • Run styling not listed in the crate README
  • Headings (any <w:pStyle> value — every paragraph is StartParagraph)
  • Vertical cell merging (<w:vMerge>) — every cell emits with rowspan: None
  • Header rows in nested tables — only the outermost table honors <w:tblHeader>
  • Table-level property exceptions (<w:tblPrEx>) — silently ignored
  • Table, row, and cell visual properties (<w:tblPr>, <w:trPr> visual fields, <w:tcPr> visual fields, <w:tblGrid>)
  • Drawings and images (<w:drawing>, <w:pict>)
  • Comments, footnotes, headers, footers
  • Document metadata
  • Tracked deletions (<w:del>, <w:moveFrom>): dropped, consistent with accept-changes semantics
  • Structured document tag properties (<w:sdtPr>, <w:sdtEndPr>)
  • Field-code hyperlinks (<w:fldChar> + <w:instrText>HYPERLINK ...): legacy form not currently supported; only the modern <w:hyperlink> element is recognized.
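The denylist behavior for element subtrees can be sketched with a depth counter over a token stream. This is an illustration under stated assumptions, not the reader's actual implementation: `Token`, `DENYLIST`, and `drop_denied_subtrees` are hypothetical names, the token type stands in for a streaming XML parser's events, the list contains only some of the element names mentioned above, and the input is assumed to be well-formed (balanced start and end tags).

```rust
/// Hypothetical token type standing in for a streaming XML parser's events.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Start(&'static str),
    End(&'static str),
    Text(&'static str),
}

/// Element names taken from the out-of-scope list above (not exhaustive).
const DENYLIST: &[&str] = &[
    "w:drawing", "w:pict", "w:del", "w:moveFrom", "w:sdtPr", "w:sdtEndPr",
];

/// Returns the tokens that survive denylist filtering.
/// Assumes balanced start/end tokens.
fn drop_denied_subtrees(tokens: &[Token]) -> Vec<Token> {
    let mut kept = Vec::new();
    // Nesting depth inside a denied subtree; 0 means "not skipping".
    let mut skip_depth = 0usize;
    for token in tokens {
        match token {
            Token::Start(name) => {
                if skip_depth > 0 {
                    // Already inside a denied subtree: track nesting only.
                    skip_depth += 1;
                } else if DENYLIST.contains(name) {
                    // Entering a denied element: start skipping.
                    skip_depth = 1;
                } else {
                    // Known or unknown element: keep it and descend.
                    kept.push(token.clone());
                }
            }
            Token::End(_) => {
                if skip_depth > 0 {
                    skip_depth -= 1;
                } else {
                    kept.push(token.clone());
                }
            }
            Token::Text(_) => {
                if skip_depth == 0 {
                    kept.push(token.clone());
                }
            }
        }
    }
    kept
}
```

Because only a counter is kept while skipping, a dropped subtree of any size adds no buffered state, which fits the streaming model described for this crate.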

§Lists

See the crate README for V1 list semantics and limitations.

§Streaming Guarantee

DocxReader streams word/document.xml event by event via quick-xml, so memory use for the document body stays constant regardless of document size. The two relationship parts, _rels/.rels and word/_rels/document.xml.rels, are read fully into memory at package-open time (typical combined size < 10 KB even for large documents). The internal event queue remains bounded regardless of document size or hyperlink count.
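One way to picture a bounded event queue in a pull-based reader is the sketch below. It is not the crate's implementation: `SketchEvent` and `SketchReader` are hypothetical, the input iterator stands in for the streaming XML parser, and `next_event` returns a plain `Option` here, whereas the real method is fallible (it is used with `?` in the Quick Start).

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified event type; the real crate's `Event` may differ.
#[derive(Debug, Clone, PartialEq)]
enum SketchEvent {
    StartParagraph,
    Text(String),
    EndParagraph,
}

/// Pull-based reader sketch: `I` stands in for a streaming XML parser that
/// yields one paragraph's text at a time.
struct SketchReader<I: Iterator<Item = String>> {
    input: I,
    queue: VecDeque<SketchEvent>,
    /// Largest queue length seen, tracked only to demonstrate the bound.
    max_queue_len: usize,
}

impl<I: Iterator<Item = String>> SketchReader<I> {
    fn new(input: I) -> Self {
        Self { input, queue: VecDeque::new(), max_queue_len: 0 }
    }

    /// Returns the next event, pulling one input item only when the queue
    /// is empty, so the queue never holds more than one item's events.
    fn next_event(&mut self) -> Option<SketchEvent> {
        if self.queue.is_empty() {
            let text = self.input.next()?;
            self.queue.push_back(SketchEvent::StartParagraph);
            self.queue.push_back(SketchEvent::Text(text));
            self.queue.push_back(SketchEvent::EndParagraph);
            self.max_queue_len = self.max_queue_len.max(self.queue.len());
        }
        self.queue.pop_front()
    }
}
```

The design point is that input is consumed lazily, only when the consumer asks for more, so the queue size depends on the largest single input item rather than on the length of the document.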

§Quick Start

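use docspec_docx_reader::{DocxReader, EventSource};

// Assumes the reader's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the DOCX package; word/document.xml is streamed as events are pulled.
    let mut reader = DocxReader::from_path("document.docx")?;
    // Pull events one at a time until the stream is exhausted.
    while let Some(event) = reader.next_event()? {
        println!("{event:?}");
    }
    Ok(())
}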

Structs§

DocxReader
A streaming DOCX reader that implements EventSource.

Traits§

EventSource
Produces a stream of crate::Events from a document source.