DOCX to DocSpec event stream reader.
This crate provides a DocxReader that implements EventSource to convert
DOCX documents into the DocSpec event stream format. It uses quick-xml for
streaming XML parsing and zip for archive extraction.
§Scope
In scope: Paragraphs (<w:p>), direct text (<w:t> inside <w:r>),
line breaks (<w:br> — including w:type="page" and w:type="column", all
emitted as LineBreak), tabs (<w:tab>, emitted as a Text event whose
content is the single character "\t"), tables (<w:tbl>, <w:tr>,
<w:tc>), hyperlinks (<w:hyperlink> — link text content is emitted as
plain runs), structured document tags (<w:sdt> — content emitted normally;
<w:sdtPr>/<w:sdtEndPr> dropped), and tracked insertions and moves
(<w:ins>, <w:moveTo> — accept-changes semantics).
Emits: StartDocument, StartParagraph, StartTextStyle, Text,
EndTextStyle, LineBreak, EndParagraph, StartTable, StartTableRow,
StartTableCell, StartTableHeader, EndTableHeader, EndTableCell,
EndTableRow, EndTable, EndDocument.
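As a rough illustration of consuming the events listed above, the sketch below flattens a paragraph-level event sequence into plain text. The `Event` enum here is a hypothetical stand-in defined only for this example: the variant names mirror the list above, but the payload shapes (such as `Text(String)`) are assumptions, not the crate's actual definitions, and `to_plain_text` is an illustrative helper.

```rust
/// Hypothetical stand-in for the crate's event type. Only the variants
/// needed for this sketch are modelled, and the payloads are assumed.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartDocument,
    StartParagraph,
    Text(String),
    LineBreak,
    EndParagraph,
    EndDocument,
}

/// Flattens an event sequence into plain text.
/// Text is appended verbatim (a tab arrives as a Text event holding "\t");
/// LineBreak and EndParagraph each become a newline; the rest is ignored.
fn to_plain_text<I: IntoIterator<Item = Event>>(events: I) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            Event::Text(s) => out.push_str(&s),
            Event::LineBreak | Event::EndParagraph => out.push('\n'),
            _ => {}
        }
    }
    out
}
```

Because page and column breaks are also emitted as LineBreak, a consumer written this way cannot tell them apart from ordinary line breaks.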
The elements listed under “Out of scope” are the reader’s denylist — their entire subtree is silently dropped. Every other element (known or unknown) is parsed normally; the reader continues into its children.
Out of scope (subtree silently dropped):
- Run styling not listed in the crate README
- Headings (any <w:pStyle> value; every paragraph is StartParagraph)
- Vertical cell merging (<w:vMerge>): every cell emits with rowspan: None
- Header rows in nested tables: only the outermost table honors <w:tblHeader>
- Table-level property exceptions (<w:tblPrEx>): silently ignored
- Table, row, and cell visual properties (<w:tblPr>, <w:trPr> visual fields, <w:tcPr> visual fields, <w:tblGrid>)
- Lists
- Drawings and images (<w:drawing>, <w:pict>)
- Comments, footnotes, headers, footers
- Document metadata
- Tracked deletions (<w:del>, <w:moveFrom>): accept-changes semantics
- Structured document tag properties (<w:sdtPr>, <w:sdtEndPr>)
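One common way to drop a whole subtree in a streaming parser is to keep a depth counter while inside a denylisted element. The sketch below shows that idea on a simplified token type. `Token`, `DENYLIST`, and `drop_denied` are hypothetical names invented for this example; the real reader works on quick-xml events and may be structured differently.

```rust
/// Hypothetical, simplified XML token; not the quick-xml event type.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Start(&'static str),
    End(&'static str),
    Text(&'static str),
}

/// A few element names whose whole subtree is discarded in this sketch.
const DENYLIST: &[&str] = &["w:del", "w:moveFrom", "w:drawing", "w:pict"];

/// Returns the tokens that survive after denylisted subtrees are removed.
/// Assumes well-formed input (every Start has a matching End).
fn drop_denied(tokens: &[Token]) -> Vec<Token> {
    let mut kept = Vec::new();
    let mut skip_depth = 0usize; // > 0 while inside a dropped subtree
    for token in tokens {
        if skip_depth > 0 {
            // Inside a dropped subtree: only track nesting, emit nothing.
            match token {
                Token::Start(_) => skip_depth += 1,
                Token::End(_) => skip_depth -= 1,
                Token::Text(_) => {}
            }
            continue;
        }
        match token {
            // Entering a denylisted element starts skipping at depth 1.
            Token::Start(name) if DENYLIST.contains(name) => skip_depth = 1,
            // Everything else, known or unknown, passes through.
            other => kept.push(other.clone()),
        }
    }
    kept
}
```

The sketch collects into a Vec for easy inspection; a streaming reader would apply the same counter while pulling one event at a time, so only the counter itself is kept in memory.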
§Streaming Guarantee
DocxReader streams document.xml event by event using constant memory
regardless of document size. Only _rels/.rels (a few hundred bytes) is
fully read into memory to discover the document target path.
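To make the role of _rels/.rels concrete, the sketch below finds the main document target in a relationships file that has already been read into a string. It is a deliberately naive string scan using only the standard library, not the crate's quick-xml based parsing. `attr` and `document_target` are hypothetical helper names, and only the transitional OOXML relationship type URI is checked.

```rust
/// Relationship type that marks the main document part (transitional OOXML).
const OFFICE_DOCUMENT: &str =
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

/// Returns the value of ` name="..."` inside one element's attribute text.
fn attr<'a>(element: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!(" {name}=\"");
    let start = element.find(&needle)? + needle.len();
    let end = element[start..].find('"')? + start;
    Some(&element[start..end])
}

/// Naive scan for the Target of the officeDocument relationship.
/// Assumes double-quoted attributes and no '>' inside attribute values.
fn document_target(rels_xml: &str) -> Option<&str> {
    rels_xml
        .split("<Relationship")
        .skip(1)
        // Keep only the text up to the end of the opening tag.
        .map(|chunk| chunk.split('>').next().unwrap_or(chunk))
        // The root <Relationships> chunk has no Type attribute and is skipped.
        .find(|el| attr(el, "Type") == Some(OFFICE_DOCUMENT))
        .and_then(|el| attr(el, "Target"))
}
```

In a typical package the returned target is a path such as word/document.xml, which is the part that is then streamed rather than loaded whole.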
§Quick Start
use docspec_docx_reader::{DocxReader, EventSource};
let mut reader = DocxReader::from_path("document.docx")?;
while let Some(event) = reader.next_event()? {
println!("{event:?}");
}
Structs§
- DocxReader - A streaming DOCX reader that implements EventSource.
Traits§
- EventSource - Produces a stream of crate::Events from a document source.