DOCX to DocSpec event stream reader.
This crate provides a DocxReader that implements EventSource to convert
DOCX documents into the DocSpec event stream format. It uses quick-xml for
streaming XML parsing and zip for archive extraction.
§Scope
In scope: Paragraphs (<w:p>), direct text (<w:t> inside <w:r>),
line breaks (<w:br> — including w:type="page" and w:type="column", all
emitted as LineBreak), tabs (<w:tab>, emitted as a Text event whose
content is the single character "\t"), tables (<w:tbl>, <w:tr>,
<w:tc>), lists (<w:p> with <w:numPr> — ordered and unordered),
hyperlinks (<w:hyperlink> — resolved via word/_rels/document.xml.rels
and emitted as StartLink/EndLink events around inline content),
structured document tags (<w:sdt> — content emitted normally;
<w:sdtPr>/<w:sdtEndPr> dropped), and tracked insertions and moves
(<w:ins>, <w:moveTo> — accept-changes semantics).
Emits: StartDocument, StartParagraph, StartTextStyle, Text,
EndTextStyle, LineBreak, EndParagraph, StartTable, StartTableRow,
StartTableCell, StartTableHeader, EndTableHeader, EndTableCell,
EndTableRow, EndTable, StartLink, EndLink, StartOrderedListItem,
EndOrderedListItem, StartUnorderedListItem, EndUnorderedListItem,
EndDocument.
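As a rough illustration of how a consumer might fold such a stream into plain text, the sketch below uses a hypothetical local `Event` enum that mirrors only a few of the names above. The crate's real event type, its payload shapes, and the `to_plain_text` helper are assumptions made for this example, not the crate's API.

```rust
/// Hypothetical stand-in for a small subset of the events listed above.
/// The real DocSpec event type and its payloads may differ.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartDocument,
    StartParagraph,
    Text(String),
    LineBreak,
    EndParagraph,
    EndDocument,
}

/// Folds events into plain text: `LineBreak` and `EndParagraph` both
/// become '\n', and tabs arrive as ordinary `Text("\t")` events.
fn to_plain_text(events: &[Event]) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            Event::Text(s) => out.push_str(s),
            Event::LineBreak | Event::EndParagraph => out.push('\n'),
            // Structural markers carry no text of their own.
            Event::StartDocument | Event::StartParagraph | Event::EndDocument => {}
        }
    }
    out
}
```

Because tabs are delivered as `Text` events, a consumer of this shape needs no dedicated tab case; page and column breaks are likewise indistinguishable from ordinary line breaks once they reach the stream.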
The elements listed under “Out of scope” are the reader’s denylist — their entire subtree is silently dropped. Every other element (known or unknown) is parsed normally; the reader continues into its children.
Out of scope (subtree silently dropped):
- Run styling not listed in the crate README
- Headings (any <w:pStyle> value; every paragraph is StartParagraph)
- Vertical cell merging (<w:vMerge>): every cell emits with rowspan: None
- Header rows in nested tables: only the outermost table honors <w:tblHeader>
- Table-level property exceptions (<w:tblPrEx>): silently ignored
- Table, row, and cell visual properties (<w:tblPr>, <w:trPr> visual fields, <w:tcPr> visual fields, <w:tblGrid>)
- Drawings and images (<w:drawing>, <w:pict>)
- Comments, footnotes, headers, footers
- Document metadata
- Tracked deletions (<w:del>, <w:moveFrom>): accept-changes semantics
- Structured document tag properties (<w:sdtPr>, <w:sdtEndPr>)
- Field-code hyperlinks (<w:fldChar> + <w:instrText> HYPERLINK ...): legacy form not currently supported; only the modern <w:hyperlink> element is recognized.
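One common way to drop a whole subtree in a streaming parser is a depth counter, sketched below. The `Token` type, the `DENYLIST` constant (a subset of the element names above), and `drop_denied` are hypothetical names for illustration; the reader's actual implementation is not shown in this documentation and may work differently.

```rust
/// Simplified XML token; a stand-in for what a streaming parser yields.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token<'a> {
    Start(&'a str),
    End(&'a str),
    Text(&'a str),
}

/// Illustrative subset of denied element names (assumption: matched by
/// qualified name exactly as written in the XML).
const DENYLIST: &[&str] = &[
    "w:del", "w:moveFrom", "w:drawing", "w:pict", "w:sdtPr", "w:sdtEndPr",
];

/// Returns the tokens that survive denylist filtering.
fn drop_denied<'a>(tokens: &[Token<'a>]) -> Vec<Token<'a>> {
    let mut kept = Vec::new();
    // Nesting depth inside a denied subtree; 0 means "not skipping".
    let mut skip_depth = 0usize;
    for &token in tokens {
        if skip_depth > 0 {
            // Inside a denied subtree: only track nesting, emit nothing.
            match token {
                Token::Start(_) => skip_depth += 1,
                Token::End(_) => skip_depth -= 1,
                Token::Text(_) => {}
            }
            continue;
        }
        match token {
            Token::Start(name) if DENYLIST.contains(&name) => skip_depth = 1,
            // Unknown elements are not denied, so they pass through
            // and their children are still visited.
            other => kept.push(other),
        }
    }
    kept
}
```

The counter needs only one integer of state however deeply the dropped subtree nests, which fits a constant-memory streaming design.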
§Lists
See the crate README for V1 list semantics and limitations.
§Streaming Guarantee
DocxReader streams word/document.xml event by event via quick-xml, using
constant memory regardless of document size. _rels/.rels and
word/_rels/document.xml.rels are both read fully into memory at
package-open time (typical combined size < 10 KB even for large documents).
The internal event queue remains bounded regardless of document size or
hyperlink count.
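To make the role of the preloaded relationship data concrete, here is a minimal sketch of an in-memory lookup table keyed by relationship id. The `Relationships` type and its methods are hypothetical, and how the reader behaves when an id is missing is not specified here; the sketch simply returns `None`.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory relationship table: relationship id -> target.
struct Relationships {
    targets: HashMap<String, String>,
}

impl Relationships {
    /// Builds the table from (id, target) pairs parsed once at open time.
    fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let targets = pairs
            .iter()
            .map(|(id, target)| (id.to_string(), target.to_string()))
            .collect();
        Relationships { targets }
    }

    /// Resolves the relationship id carried by a `<w:hyperlink>` element;
    /// returns `None` when the id is unknown.
    fn resolve(&self, rel_id: &str) -> Option<&str> {
        self.targets.get(rel_id).map(String::as_str)
    }
}
```

Since the table is small and built once, each hyperlink costs a single hash lookup during streaming, and nothing has to be buffered while waiting for a target to become known.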
§Quick Start
use docspec_docx_reader::{DocxReader, EventSource};

let mut reader = DocxReader::from_path("document.docx")?;
while let Some(event) = reader.next_event()? {
    println!("{event:?}");
}
Structs§
- DocxReader - A streaming DOCX reader that implements EventSource.
Traits§
- EventSource - Produces a stream of crate::Events from a document source.