DOCX to DocSpec event stream reader.
This crate provides a DocxReader that implements EventSource to convert
DOCX documents into the DocSpec event stream format. It uses quick-xml for
streaming XML parsing and zip for archive extraction.
§Scope
In scope: Paragraphs (<w:p>), direct text (<w:t> inside <w:r>),
line breaks (<w:br> — including w:type="page" and w:type="column", all
emitted as LineBreak), tabs (<w:tab>, emitted as a Text event whose
content is the single character "\t"), and tables (<w:tbl>, <w:tr>,
<w:tc> — emitted as structural events only; cell merging, header rows, and
table styles are not represented).
Emits exactly: StartDocument, StartParagraph, Text, LineBreak,
EndParagraph, StartTable, StartTableRow, StartTableCell,
EndTableCell, EndTableRow, EndTable, EndDocument.
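As a rough illustration of how a consumer might handle the events listed above, the sketch below flattens a stream into plain text. The `Event` enum is a hypothetical stand-in: only the variant names come from this section, while the payload shapes (for example `Text(String)`) and the helper `to_plain_text` are assumptions and may differ from the crate's real types.

```rust
/// Hypothetical stand-in for the crate's event type. Only the variant names
/// come from the documentation; the payload shapes are assumptions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartDocument,
    StartParagraph,
    Text(String),
    LineBreak,
    EndParagraph,
    StartTable,
    StartTableRow,
    StartTableCell,
    EndTableCell,
    EndTableRow,
    EndTable,
    EndDocument,
}

/// Flattens events into plain text. Text is copied through unchanged (tabs
/// arrive as a `Text` event containing "\t"), each `LineBreak` and each
/// `EndParagraph` becomes a newline, and every other event is ignored.
fn to_plain_text<I: IntoIterator<Item = Event>>(events: I) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            Event::Text(s) => out.push_str(&s),
            Event::LineBreak | Event::EndParagraph => out.push('\n'),
            _ => {}
        }
    }
    out
}
```

Because page and column breaks are also reported as LineBreak, a consumer written this way cannot tell them apart from ordinary line breaks.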
Out of scope (silently dropped):
- Run styling (<w:rPr>, bold, italic, etc.)
- Headings (any <w:pStyle> value; every paragraph is StartParagraph)
- Cell merging (<w:gridSpan>, <w:vMerge>); every cell emits with colspan: None and rowspan: None
- Header rows (<w:tblHeader>); every cell emits as StartTableCell, never StartTableHeader
- Table, row, and cell properties (<w:tblPr>, <w:trPr>, <w:tcPr>, <w:tblGrid>)
- Lists
- Hyperlinks (<w:hyperlink>)
- Drawings and images (<w:drawing>, <w:pict>)
- Structured document tags (<w:sdt>)
- Comments, footnotes, headers, footers
- Document metadata
- Tracked changes (<w:ins>, <w:del>, <w:moveFrom>, <w:moveTo>)
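Since merges and header rows are dropped, a table can be treated as a plain grid in which every cell fills exactly one slot. The sketch below shows one way a consumer might collect such a grid. The `Event` enum is a hypothetical subset of the real type: the `colspan` and `rowspan` field types are assumptions (the list above only says both are None), paragraph events inside cells are left out for brevity, and `table_rows` is an illustrative helper, not part of the crate.

```rust
/// Hypothetical subset of the event type, reduced to what this sketch needs.
/// The field types are assumptions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartTable,
    StartTableRow,
    StartTableCell { colspan: Option<u32>, rowspan: Option<u32> },
    Text(String),
    EndTableCell,
    EndTableRow,
    EndTable,
}

/// Collects the text of each cell into rows. No span bookkeeping is needed
/// because every cell is assumed to occupy a single grid slot.
/// Nested tables are not handled.
fn table_rows(events: &[Event]) -> Vec<Vec<String>> {
    let mut rows = Vec::new();
    let mut row: Vec<String> = Vec::new();
    let mut cell: Option<String> = None;
    for event in events {
        match event {
            Event::StartTableRow => row = Vec::new(),
            Event::StartTableCell { .. } => cell = Some(String::new()),
            Event::Text(s) => {
                // Text outside a cell is ignored by this sketch.
                if let Some(buf) = cell.as_mut() {
                    buf.push_str(s);
                }
            }
            Event::EndTableCell => {
                if let Some(buf) = cell.take() {
                    row.push(buf);
                }
            }
            Event::EndTableRow => rows.push(std::mem::take(&mut row)),
            Event::StartTable | Event::EndTable => {}
        }
    }
    rows
}
```

Rows may come out with different lengths when the source table used merged cells, because the merge information that would explain the gap is not in the stream.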
§Streaming Guarantee
DocxReader streams document.xml event by event using constant memory
regardless of document size. Only _rels/.rels (a few hundred bytes) is
fully read into memory to discover the document target path.
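To make the constant-memory property concrete, here is a minimal sketch of a pull-based consumer that keeps only a counter while draining a source. Everything in it is hypothetical: the trait below merely imitates the `next_event` shape (a `Result` wrapping an `Option`), and `SyntheticSource` and `count_paragraphs` are illustrative names that generate and consume events on demand instead of reading a DOCX file.

```rust
/// Hypothetical, reduced event type for this sketch.
#[allow(dead_code)]
#[derive(Debug)]
enum Event {
    StartParagraph,
    Text(String),
    EndParagraph,
}

/// Hypothetical pull-based trait; the real `EventSource` may differ.
trait EventSource {
    type Error;
    fn next_event(&mut self) -> Result<Option<Event>, Self::Error>;
}

/// Produces `remaining` paragraphs lazily. No event is stored after it has
/// been handed to the caller.
struct SyntheticSource {
    remaining: usize,
    step: u8,
}

impl EventSource for SyntheticSource {
    type Error = std::convert::Infallible;

    fn next_event(&mut self) -> Result<Option<Event>, Self::Error> {
        if self.remaining == 0 {
            return Ok(None);
        }
        // Cycle through start, text, end for each paragraph.
        let event = match self.step {
            0 => Event::StartParagraph,
            1 => Event::Text("x".to_string()),
            _ => {
                self.remaining -= 1;
                Event::EndParagraph
            }
        };
        self.step = (self.step + 1) % 3;
        Ok(Some(event))
    }
}

/// Counts paragraphs while holding one event at a time.
fn count_paragraphs<S: EventSource>(source: &mut S) -> Result<usize, S::Error> {
    let mut count = 0;
    while let Some(event) = source.next_event()? {
        if matches!(event, Event::StartParagraph) {
            count += 1;
        }
    }
    Ok(count)
}
```

The consumer's memory use stays flat only as long as it does not collect events itself; pushing every event into a `Vec` would give up the benefit of streaming.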
§Quick Start
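use docspec_docx_reader::{DocxReader, EventSource};

// `?` assumes an enclosing function that returns a compatible `Result`.
let mut reader = DocxReader::from_path("document.docx")?;
// Pull events one at a time until the stream is exhausted.
while let Some(event) = reader.next_event()? {
    println!("{event:?}");
}

Structs§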
- DocxReader - A streaming DOCX reader that implements EventSource.
Traits§
- EventSource - Produces a stream of crate::Events from a document source.