docspec-docx-reader
Streaming DOCX to DocSpec event stream reader.
See the main DocSpec repository for documentation, architecture, and the event protocol.
Supported
- Paragraphs (<w:p>) and direct text (<w:t> inside <w:r>)
- Line breaks (<w:br>, including w:type="page" and w:type="column"; all emit LineBreak)
- Tabs (<w:tab>, emitted as a Text event containing the single character "\t")
- Tables (<w:tbl>, <w:tr>, <w:tc>): structural events. Horizontal cell merging via <w:gridSpan> is emitted as colspan. Header rows via <w:trPr><w:tblHeader/></w:trPr> are emitted as StartTableHeader (with scope: Column) for cells in the contiguous header band at the top of the outermost table. Nested tables emit all cells as StartTableCell (header rows in nested tables are not supported). Vertical merging via <w:vMerge> is NOT yet supported; rowspan information is lost.
- Run properties (<w:rPr>): <w:b> (bold), <w:i> (italic), <w:u> (underline, for any w:val other than none), <w:strike> (strikethrough), <w:dstrike> (double-strike, collapses to strikethrough), <w:vertAlign> (subscript and superscript only; baseline resets to neither). These are emitted as deferred StartTextStyle { kind, id: None } / EndTextStyle wrapper events around the first run content, not as fields on Text events. Empty styled runs emit no style wrapper events; multiple <w:t> elements in one styled run share a single wrapper span.
- Run color properties (<w:rPr>):
  - <w:color w:val="HEX">: foreground (text) color. Emitted as StartTextStyle { kind: TextColor(Color::Rgb { r, g, b }) }. w:val="auto" and non-hex values are silently dropped. Black (0,0,0) is preserved by the reader; whether to treat it as "default color" is writer policy.
  - <w:highlight w:val="namedColor">: highlight color using the 17-entry ECMA-376 named palette. Emitted as StartTextStyle { kind: Mark(Color::Rgb { r, g, b }) }. w:val="none" and unknown names are silently dropped.
  - <w:shd w:fill="HEX">: background fill, used as a fallback highlight when <w:highlight> is absent. Emitted as StartTextStyle { kind: Mark(Color::Rgb { r, g, b }) }. w:fill="auto" and a missing w:fill attribute are silently dropped.
- Paragraph properties (<w:pPr>): <w:jc> (alignment: left/start to Left, right/end to Right, center to Center, both/distribute to Justify)
- Lists (<w:p> with <w:numPr>): ordered (Decimal/LowerAlpha/UpperAlpha/LowerRoman/UpperRoman) and unordered (Disc), emitted as Start*ListItem / End*ListItem with id (numId stringified), level (ilvl), start (Some(1) on first item per numId), and style_type. Per-level classification: the same numId can mix ordered and unordered levels. Continuation paragraphs (paragraphs without <w:numPr> between list items) attach to the preceding open list item as additional StartParagraph / EndParagraph content inside the still-open Start*ListItem. The list closes only at a heading, block quote, preformatted block, table boundary, table-cell boundary, or end of document.
- Empty <w:rPr/> and <w:pPr/> are treated as no properties (default style / alignment None)
- A <w:rPr> or <w:pPr> that appears after content in the same parent is silently ignored (per the OOXML spec, both must be the first child element)
- Hyperlinks (<w:hyperlink>): resolved via word/_rels/document.xml.rels and emitted as StartLink / EndLink events around inline content. Supports external URL targets (both Strict and Transitional OOXML relationship Type URIs), anchor-only links (w:anchor without r:id emits #fragment), and tooltips (w:tooltip → StartLink.title, XML-decoded). When the relationship cannot be resolved, the link wrapper is dropped and content passes through as plain runs.
- Structured Document Tags (<w:sdt>): the content of an SDT is emitted normally. The property containers <w:sdtPr> and <w:sdtEndPr> are dropped.
- Tracked insertions and moves (<w:ins>, <w:moveTo>): the inserted/moved-in content is emitted (accept-changes semantics).
- DrawingML images (<w:drawing>): emitted as Event::Image. See Image Support below.
- Emits: StartDocument, StartParagraph, StartTextStyle, Text, EndTextStyle, LineBreak, EndParagraph, StartTable, StartTableRow, StartTableCell, StartTableHeader, EndTableHeader, EndTableCell, EndTableRow, EndTable, StartLink, EndLink, StartOrderedListItem, EndOrderedListItem, StartUnorderedListItem, EndUnorderedListItem, Image, EndDocument
- Symbol font character normalization for Wingdings, Wingdings 2, Wingdings 3, Webdings, and Symbol fonts: codepoints are mapped to their Unicode equivalents; unmapped codepoints are dropped
- Compression: Stored and Deflated only
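The <w:jc> alignment mapping listed under Paragraph properties above can be sketched as a single match. This is a minimal illustration, not the reader's code: the Alignment enum and alignment_from_jc function are hypothetical local stand-ins, and returning None for unlisted values is an assumption (the list only names the mapped values).

```rust
/// Local stand-in for the alignment type (hypothetical; the real type
/// lives in the DocSpec crates and may be named differently).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Alignment {
    Left,
    Right,
    Center,
    Justify,
}

/// Map the `w:val` attribute of `<w:jc>` to an alignment.
/// Assumption: values not named in the README yield `None`.
pub fn alignment_from_jc(val: &str) -> Option<Alignment> {
    match val {
        "left" | "start" => Some(Alignment::Left),
        "right" | "end" => Some(Alignment::Right),
        "center" => Some(Alignment::Center),
        "both" | "distribute" => Some(Alignment::Justify),
        _ => None,
    }
}
```

Note that start and end map to Left and Right unconditionally here, which matches the out-of-scope note that BiDi-aware flipping is not tracked.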
Color and Highlight Precedence
When both <w:highlight> and <w:shd w:fill> appear in the same <w:rPr>, <w:highlight> wins. The <w:shd> fill is ignored for that run.
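The precedence rule can be illustrated with a small self-contained sketch. Everything here is hypothetical scaffolding: Rgb, parse_hex, named_highlight, and resolve_mark are illustrative names, only three of the 17 palette entries are shown, hex values are assumed to be exactly six digits, and the sketch assumes that a present but dropped <w:highlight> (for example "none") still suppresses the <w:shd> fallback.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rgb {
    pub r: u8,
    pub g: u8,
    pub b: u8,
}

/// Parse a six-digit hex color such as "FF8000".
/// "auto" and any other non-hex value yield `None` (silently dropped).
pub fn parse_hex(val: &str) -> Option<Rgb> {
    if val.len() != 6 || !val.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let n = u32::from_str_radix(val, 16).ok()?;
    Some(Rgb { r: (n >> 16) as u8, g: (n >> 8) as u8, b: n as u8 })
}

/// Partial named-highlight palette (illustrative subset of the 17 entries).
/// "none" and unknown names yield `None`.
pub fn named_highlight(name: &str) -> Option<Rgb> {
    match name {
        "yellow" => Some(Rgb { r: 255, g: 255, b: 0 }),
        "red" => Some(Rgb { r: 255, g: 0, b: 0 }),
        "black" => Some(Rgb { r: 0, g: 0, b: 0 }),
        _ => None,
    }
}

/// Resolve the Mark color for one run: `<w:highlight>` wins when present,
/// and `<w:shd w:fill>` is consulted only when `<w:highlight>` is absent.
pub fn resolve_mark(highlight: Option<&str>, shd_fill: Option<&str>) -> Option<Rgb> {
    match highlight {
        Some(name) => named_highlight(name),
        None => shd_fill.and_then(parse_hex),
    }
}
```

With these definitions, resolve_mark(Some("yellow"), Some("FF0000")) yields yellow, and resolve_mark(None, Some("FF0000")) falls back to the red fill.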
No-Collapse Rule
Adjacent runs with the same color emit separate StartTextStyle/EndTextStyle pairs. The reader maintains per-run discipline and does not merge consecutive runs, even when their style properties are identical.
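A minimal sketch of what per-run discipline means in practice. The Ev enum and emit_runs function are hypothetical simplifications (a style is reduced to a string label), not the crate's event types.

```rust
/// Simplified event model for illustration only (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Ev {
    StartTextStyle(String),
    Text(String),
    EndTextStyle,
}

/// Emit events run by run. Each styled, non-empty run gets its own
/// Start/End pair; nothing is merged across run boundaries.
pub fn emit_runs(runs: &[(Option<&str>, &str)]) -> Vec<Ev> {
    let mut out = Vec::new();
    for (style, text) in runs {
        // Empty runs emit no wrapper events at all.
        if text.is_empty() {
            continue;
        }
        if let Some(s) = style {
            out.push(Ev::StartTextStyle(s.to_string()));
        }
        out.push(Ev::Text(text.to_string()));
        if style.is_some() {
            out.push(Ev::EndTextStyle);
        }
    }
    out
}
```

Two adjacent runs labeled "red" therefore produce two complete StartTextStyle / Text / EndTextStyle triples. A writer that wants merged spans has to coalesce them itself.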
Out of Scope (subtree silently dropped)
The XML elements listed below are the reader's denylist — their entire subtree is silently dropped during parsing. Any element NOT listed (whether a known structural tag like <w:p> or an unknown extension) is processed normally; the reader just continues into its children.
- Headings (any <w:pStyle> value; every paragraph is StartParagraph)
- Style references (<w:rStyle>, <w:pStyle>)
- Run formatting not listed above: <w:sz>, <w:szCs>, <w:caps>, <w:smallCaps>, <w:position>, <w:spacing>, <w:kern>, <w:lang>, <w:noProof>
- <w:rFonts> (general font tracking is not exposed as events, except for symbol font resolution (Wingdings, Wingdings 2, Wingdings 3, Webdings, Symbol) which is used internally to normalize codepoints to Unicode)
- themeColor / themeTint / themeShade attributes on <w:color> and <w:shd>: silently dropped. The reader does not parse styles.xml or theme1.xml, so theme-referenced colors cannot be resolved. Future work.
- Revision tracking (<w:rPrChange>, <w:pPrChange>)
- Advanced paragraph layout beyond alignment: <w:ind>, <w:tabs>, <w:framePr>, <w:sectPr>
- <w:rPr> nested inside <w:pPr> (paragraph mark / pilcrow run properties)
- BiDi-aware logical alignment (start/end flipping based on paragraph direction is not tracked)
- Math (m:rPr) and DrawingML (a:rPr) namespaces
- Vertical cell merging (<w:vMerge>): every cell still emits with rowspan: None; covered cells emit as ordinary StartTableCell events (visual merge is lost)
- Header rows in nested tables: only the outermost table honors <w:tblHeader>
- Table-level property exceptions (<w:tblPrEx>): silently ignored (consistent with <w:tblPr>)
- Table, row, and cell visual properties (<w:tblPr>, <w:trPr> visual fields, <w:tcPr> visual fields, <w:tblGrid>): silently dropped
- VML images (<w:pict>): deferred to follow-up; subtree silently dropped
- Comments, footnotes, headers, footers
- Document metadata
- Tracked deletions and moves-from (<w:del>, <w:moveFrom>): silently dropped (accept-changes semantics). Their text content uses <w:delText>, which is not part of the reader's text-matching set.
- Structured document tag properties (<w:sdtPr>, <w:sdtEndPr>): metadata containers; subtree dropped.
- Field-code hyperlinks (<w:fldChar> + <w:instrText> HYPERLINK ...): legacy form not currently supported; only the modern <w:hyperlink> element is recognized.
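The denylist behavior described in this section (drop the whole subtree of a listed element, descend into everything else) is commonly implemented with a depth counter. The sketch below is an assumption about the general technique, not the reader's actual code: the Xml enum models a pre-tokenized stream rather than quick-xml events, and DENYLIST holds only a few of the element names from the list above.

```rust
/// Simplified, pre-tokenized XML stream (hypothetical; the real reader
/// consumes quick-xml events instead).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Xml {
    Start(&'static str),
    Text(&'static str),
    End(&'static str),
}

/// Illustrative subset of the denylist.
const DENYLIST: &[&str] = &["w:del", "w:moveFrom", "w:pict", "w:sdtPr", "w:sdtEndPr"];

/// Drop every denylisted element together with its entire subtree.
/// Elements not on the list, known or unknown, pass through unchanged.
/// Assumes well-formed input (every Start has a matching End).
pub fn drop_denied(input: &[Xml]) -> Vec<Xml> {
    let mut out = Vec::new();
    // Depth inside a dropped subtree; 0 means "not skipping".
    let mut skip_depth = 0usize;
    for ev in input {
        if skip_depth > 0 {
            match ev {
                Xml::Start(_) => skip_depth += 1,
                Xml::End(_) => skip_depth -= 1,
                Xml::Text(_) => {}
            }
            continue;
        }
        match ev {
            Xml::Start(name) if DENYLIST.contains(name) => skip_depth = 1,
            other => out.push(other.clone()),
        }
    }
    out
}
```

Because skipping needs only a counter, memory use stays constant no matter how deep the dropped subtree is.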
Lists (V1 cuts)
The following list features are intentionally out of scope for V1:
- No <w:start> element parsing: start is always Some(1) on the first item of each list, None thereafter
- No <w:lvlOverride> resolution: abstractNum's level definitions are authoritative
- No picture bullets (<w:lvlPicBulletId>): picture-bullet levels emit Disc
- No style-linked lists (<w:numStyleLink>, <w:styleLink>): fall back to Decimal defaults
- <w:multiLevelType> is ignored: per-level <w:numFmt> is authoritative (§17.9.12)
- No per-level marker text (<w:lvlText>): not parsed
- No level-specific font, color, or indent
- No per-level resolution for synthesised phantom levels: when the first authored item appears at ilvl > 0, intermediate levels inherit the target item's ordered/unordered classification and use Decimal/Disc defaults instead of resolving each phantom level's own numFmt
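The fixed start behavior from the first cut above can be sketched with a set of already-seen list IDs. StartTracker is a hypothetical helper, and keying on the stringified numId is an assumption taken from the "first item per numId" wording in the Supported section.

```rust
use std::collections::HashSet;

/// Tracks which numIds have already produced a list item (hypothetical helper).
#[derive(Default)]
pub struct StartTracker {
    seen: HashSet<String>,
}

impl StartTracker {
    /// Returns `Some(1)` for the first item of a given numId and `None`
    /// for every later item, since `<w:start>` is never parsed.
    pub fn start_for(&mut self, num_id: &str) -> Option<u32> {
        // `insert` returns true only when the id was not present before.
        if self.seen.insert(num_id.to_string()) {
            Some(1)
        } else {
            None
        }
    }
}
```

A writer that needs a different starting number has to get it from another source. In V1 the reader never reports anything other than 1.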
Image Support
<w:drawing> elements are parsed and emit Event::Image. The source field is one of two variants:
- Embedded image (r:embed): ImageSource::Asset { asset_id: "zip://word/media/image1.png" }. The asset_id is the resolved ZIP entry path with a zip:// scheme prefix. Use [DocxAssetProvider] to stream the raw bytes.
- External image (r:link): ImageSource::Uri { uri: "<url>" }. The URL is passed through verbatim from the relationship target.
When both r:embed and r:link appear on the same blip, r:embed wins.
A relationship marked TargetMode="External" is honored even when referenced via r:embed — the reader emits ImageSource::Uri in that case, matching Word and LibreOffice behavior.
If the relationship ID cannot be resolved (missing or malformed rels), the reader emits ImageSource::Asset { asset_id: "<rId>" } using the raw relationship ID with no zip:// prefix. Writers should apply their own missing-asset policy.
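The resolution rules above fit in one function. The sketch below is illustrative only: ImageSource is redefined locally, Rel and resolve_image are hypothetical names, and the path handling assumes relationship targets are simple paths relative to word/ (no ../ segments or absolute targets).

```rust
use std::collections::HashMap;

/// Local mirror of the two source variants described above (illustrative).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ImageSource {
    Asset { asset_id: String },
    Uri { uri: String },
}

/// One parsed relationship entry (hypothetical shape).
pub struct Rel {
    pub target: String,
    /// True when the relationship carries TargetMode="External".
    pub external: bool,
}

/// Resolve a blip's `r:embed` / `r:link` attributes to an image source.
/// Returns `None` only when the blip has neither attribute.
pub fn resolve_image(
    embed: Option<&str>,
    link: Option<&str>,
    rels: &HashMap<String, Rel>,
) -> Option<ImageSource> {
    // r:embed wins when both attributes are present.
    let (r_id, via_link) = match (embed, link) {
        (Some(id), _) => (id, false),
        (None, Some(id)) => (id, true),
        (None, None) => return None,
    };
    Some(match rels.get(r_id) {
        // External targets become URIs even when referenced via r:embed.
        Some(rel) if rel.external || via_link => ImageSource::Uri { uri: rel.target.clone() },
        // Assumption: internal targets are relative to the word/ directory.
        Some(rel) => ImageSource::Asset { asset_id: format!("zip://word/{}", rel.target) },
        // Unresolvable ID: pass the raw rId through with no zip:// prefix.
        None => ImageSource::Asset { asset_id: r_id.to_string() },
    })
}
```

A writer can tell the unresolved case apart by checking for the missing zip:// prefix on asset_id.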
wp:docPr/@descr maps to Event::Image.alt. The title field is always None in this release.
VML images (<w:pict>) are not supported in this release — their subtree is silently dropped.
DocxAssetProvider
[DocxAssetProvider] implements the AssetProvider trait and streams asset bytes from the DOCX ZIP archive on demand. Open it independently from [DocxReader] using the same file path or an in-memory buffer.
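```rust
// Import paths, the `path` value, and the loop body are assumptions; the originals were lost.
use docspec_core::{AssetProvider, Event};
use docspec_docx_reader::{DocxAssetProvider, DocxReader};

let path = "example.docx"; // hypothetical input file
let mut reader = DocxReader::from_path(path)?;
let provider = DocxAssetProvider::from_path(path)?;
while let Some(event) = reader.next_event()? {
    // Match Event::Image here and fetch the asset bytes through `provider`.
}
# Ok::<(), Box<dyn std::error::Error>>(())
```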
For in-memory DOCX data, use DocxAssetProvider::from_reader with a Cursor<Vec<u8>>.
Streaming Guarantee
DocxReader streams document.xml event by event using constant memory regardless
of document size. _rels/.rels and word/_rels/document.xml.rels are both fully
read into memory at package-open time (typical combined size < 10 KB even for large
documents). word/document.xml is consumed in streaming fashion via quick-xml.
The internal event queue remains bounded regardless of document size or hyperlink count.
Quick Start
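```rust
// The import path, file name, and loop body are assumptions; the originals were lost.
use docspec_docx_reader::DocxReader;

let mut reader = DocxReader::from_path("example.docx")?; // hypothetical input file
while let Some(event) = reader.next_event()? {
    // Handle `event` here.
}
# Ok::<(), Box<dyn std::error::Error>>(())
```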
See Also
- MANIFESTO.md — philosophy and values
- ARCHITECTURE.md — pipeline design, event model decisions, and pointers to the in-code event reference
- docspec_core on docs.rs: every event variant, field, and well-formedness rule