DOCX to DocSpec event stream reader.
This crate provides a DocxReader that implements EventSource to convert
DOCX documents into the DocSpec event stream format. It uses quick-xml for
streaming XML parsing and zip for archive extraction.
§Scope
In scope: Paragraphs (<w:p>) and direct text (<w:t> inside <w:r>).
Emits exactly: StartDocument, StartParagraph, Text, EndParagraph, EndDocument.
Out of scope (silently dropped):
- Run styling (<w:rPr>, bold, italic, etc.)
- Line and page breaks (<w:br>)
- Tabs (<w:tab>)
- Headings (any <w:pStyle> value; every paragraph is emitted as StartParagraph)
- Tables (<w:tbl>, <w:tr>, <w:tc>)
- Lists
- Hyperlinks (<w:hyperlink>)
- Drawings and images (<w:drawing>, <w:pict>)
- Structured document tags (<w:sdt>)
- Comments, footnotes, headers, footers
- Document metadata
- Tracked changes (<w:ins>, <w:del>, <w:moveFrom>, <w:moveTo>)
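Because only five event kinds are emitted, a consumer can be very small. The sketch below shows one way to fold such a stream back into paragraph strings. The Event enum here is a hypothetical stand-in defined locally so the example compiles on its own: the variant names follow the list under Scope, but the String payload on Text and the helper name paragraph_texts are assumptions, not the crate's actual definitions.

```rust
// Hypothetical stand-in for the crate's event type. Variant names follow the
// Scope section; the `Text(String)` payload is an assumption.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartDocument,
    StartParagraph,
    Text(String),
    EndParagraph,
    EndDocument,
}

/// Hypothetical helper: collects the text of each paragraph, concatenating
/// every `Text` event seen between `StartParagraph` and `EndParagraph`.
fn paragraph_texts(events: &[Event]) -> Vec<String> {
    let mut paragraphs = Vec::new();
    // `Some` while inside a paragraph, `None` otherwise.
    let mut current: Option<String> = None;

    for event in events {
        match event {
            Event::StartParagraph => current = Some(String::new()),
            Event::Text(text) => {
                // Text outside a paragraph is ignored in this sketch.
                if let Some(buffer) = current.as_mut() {
                    buffer.push_str(text);
                }
            }
            Event::EndParagraph => {
                if let Some(buffer) = current.take() {
                    paragraphs.push(buffer);
                }
            }
            Event::StartDocument | Event::EndDocument => {}
        }
    }
    paragraphs
}
```

A paragraph may arrive as several Text events (for example, one per run), which is why the helper concatenates instead of assuming a single Text per paragraph. An empty paragraph yields an empty string.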
§Streaming Guarantee
DocxReader streams document.xml event by event using constant memory
regardless of document size. Only _rels/.rels (a few hundred bytes) is
fully read into memory to discover the document target path.
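To make the target-path discovery step concrete, here is a minimal sketch of looking up the main document part in the contents of _rels/.rels. It is not the crate's implementation: it uses a naive string scan over the standard library only, and the function names find_document_target and attr are hypothetical. It assumes double-quoted attributes without XML entity escapes, and stripping a leading slash from the target is also an assumption of this sketch.

```rust
/// Relationship type that marks the main document part in an OPC package.
const OFFICE_DOCUMENT: &str =
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

/// Hypothetical helper: returns the value of a double-quoted attribute from
/// the text of a single element, or `None` if it is absent.
fn attr<'a>(element: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!("{name}=\"");
    let start = element.find(&needle)? + needle.len();
    let end = element[start..].find('"')? + start;
    Some(&element[start..end])
}

/// Hypothetical helper: scans the `.rels` XML for the officeDocument
/// relationship and returns its `Target`, without a leading slash.
fn find_document_target(rels_xml: &str) -> Option<String> {
    // Splitting on '<' yields one chunk per tag; the trailing space in the
    // prefix keeps the enclosing `<Relationships ...>` element out.
    for element in rels_xml.split('<').filter(|e| e.starts_with("Relationship ")) {
        if attr(element, "Type") == Some(OFFICE_DOCUMENT) {
            return attr(element, "Target").map(|t| t.trim_start_matches('/').to_string());
        }
    }
    None
}
```

For a typical package the returned target is word/document.xml, which is the archive entry that is then streamed. Reading this one small file eagerly keeps the lookup simple without affecting the memory bound for the document body.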
§Quick Start
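use docspec_docx_reader::{DocxReader, EventSource};

// `?` assumes an enclosing function that returns a compatible `Result`.
let mut reader = DocxReader::from_path("document.docx")?;
// Pull events one at a time until the stream is exhausted.
while let Some(event) = reader.next_event()? {
    println!("{event:?}");
}
Structs§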
- DocxReader - A streaming DOCX reader that implements EventSource.
Traits§
- EventSource - Produces a stream of crate::Events from a document source.