docspec-docx-reader 1.10.0

DOCX to DocSpec event stream reader

docspec-docx-reader

Streaming DOCX to DocSpec event stream reader.

See the main DocSpec repository for documentation, architecture, and the event protocol.

Supported

  • Paragraphs (<w:p>) and direct text (<w:t> inside <w:r>)
  • Line breaks (<w:br>, including w:type="page" and w:type="column" — all emit LineBreak)
  • Tabs (<w:tab> — emitted as a Text event containing the single character "\t")
  • Tables (<w:tbl>, <w:tr>, <w:tc>) — structural events. Horizontal cell merging via <w:gridSpan> is emitted as colspan. Header rows via <w:trPr><w:tblHeader/></w:trPr> are emitted as StartTableHeader (with scope: Column) for cells in the contiguous header band at the top of the outermost table. Nested tables emit all cells as StartTableCell (header rows in nested tables are not supported). Vertical merging via <w:vMerge> is NOT yet supported — rowspan information is lost.
  • Run properties (<w:rPr>): <w:b> (bold), <w:i> (italic), <w:u> (underline — any w:val other than none), <w:strike> (strikethrough), <w:dstrike> (double-strike, collapses to strikethrough), <w:vertAlign> (subscript and superscript only; baseline resets to neither). These are emitted as deferred StartTextStyle { kind, id: None } / EndTextStyle wrapper events around the first run content, not as fields on Text events. Empty styled runs emit no style wrapper events; multiple <w:t> elements in one styled run share a single wrapper span.
  • Run color properties (<w:rPr>):
    • <w:color w:val="HEX"> — foreground (text) color. Emitted as StartTextStyle { kind: TextColor(Color::Rgb { r, g, b }) }. w:val="auto" and non-hex values are silently dropped. Black (0,0,0) is preserved by the reader; whether to treat it as "default color" is writer policy.
    • <w:highlight w:val="namedColor"> — highlight color using the 17-entry ECMA-376 named palette. Emitted as StartTextStyle { kind: Mark(Color::Rgb { r, g, b }) }. w:val="none" and unknown names are silently dropped.
    • <w:shd w:fill="HEX"> — background fill, used as a fallback highlight when <w:highlight> is absent. Emitted as StartTextStyle { kind: Mark(Color::Rgb { r, g, b }) }. w:fill="auto" and a missing w:fill attribute are silently dropped.
  • Paragraph properties (<w:pPr>): <w:jc> (alignment — left/start to Left, right/end to Right, center to Center, both/distribute to Justify)
  • Lists (<w:p> with <w:numPr>): ordered (Decimal/LowerAlpha/UpperAlpha/LowerRoman/UpperRoman) and unordered (Disc) — emitted as Start*ListItem/End*ListItem with id (numId stringified), level (ilvl), start (Some(1) on first item per numId), and style_type. Per-level classification: same numId can mix ordered and unordered levels. Continuation paragraphs — paragraphs without <w:numPr> between list items — attach to the preceding open list item as additional StartParagraph/EndParagraph content inside the still-open Start*ListItem. The list closes only at a heading, block quote, preformatted block, table boundary, table-cell boundary, or end of document.
  • Empty <w:rPr/> and <w:pPr/> are treated as no properties (default style / alignment None)
  • A <w:rPr> or <w:pPr> that appears after content in the same parent is silently ignored (per the OOXML spec, both must be the first child element)
  • Hyperlinks (<w:hyperlink>): resolved via word/_rels/document.xml.rels and emitted as StartLink/EndLink events around inline content. Supports external URL targets (both Strict and Transitional OOXML relationship Type URIs), anchor-only links (w:anchor without r:id emits #fragment), and tooltips (w:tooltip maps to StartLink.title, XML-decoded). When the relationship cannot be resolved, the link wrapper is dropped and content passes through as plain runs.
  • Structured Document Tags (<w:sdt>) — the content of an SDT is emitted normally. The property containers <w:sdtPr> and <w:sdtEndPr> are dropped.
  • Tracked insertions and moves (<w:ins>, <w:moveTo>) — the inserted/moved-in content is emitted (accept-changes semantics).
  • DrawingML images (<w:drawing>) — emitted as Event::Image. See Image Support below.
  • Emits: StartDocument, StartParagraph, StartTextStyle, Text, EndTextStyle, LineBreak, EndParagraph, StartTable, StartTableRow, StartTableCell, StartTableHeader, EndTableHeader, EndTableCell, EndTableRow, EndTable, StartLink, EndLink, StartOrderedListItem, EndOrderedListItem, StartUnorderedListItem, EndUnorderedListItem, Image, EndDocument
  • Symbol font character normalization for Wingdings, Wingdings 2, Wingdings 3, Webdings, and Symbol fonts — codepoints are mapped to their Unicode equivalents; unmapped codepoints are dropped
  • Compression: Stored and Deflated only
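
As an illustration of the <w:jc> mapping listed above, here is a minimal sketch. The Alignment enum and the map_jc function are hypothetical stand-ins, not the crate's API, and returning None for unrecognized values is an assumption, since the text above does not say how such values are handled.

```rust
/// Hypothetical stand-in for the alignment value carried by paragraph events.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Alignment {
    Left,
    Right,
    Center,
    Justify,
}

/// Maps the `w:val` attribute of `<w:jc>` to an alignment.
/// `start`/`end` are treated as plain left/right because paragraph
/// direction is not tracked (see the BiDi item under Out of Scope).
pub fn map_jc(val: &str) -> Option<Alignment> {
    match val {
        "left" | "start" => Some(Alignment::Left),
        "right" | "end" => Some(Alignment::Right),
        "center" => Some(Alignment::Center),
        "both" | "distribute" => Some(Alignment::Justify),
        // Assumption: anything else yields no alignment.
        _ => None,
    }
}
```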

Color and Highlight Precedence

When both <w:highlight> and <w:shd w:fill> appear in the same <w:rPr>, <w:highlight> wins. The <w:shd> fill is ignored for that run.
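
The precedence rule can be sketched as follows. This is not the crate's code: Rgb, parse_hex, named_highlight, and resolve_mark are hypothetical names, only three palette entries are shown, and the sketch assumes that a present but unusable <w:highlight> (for example w:val="none") still suppresses the <w:shd> fallback, which is one reading of the rule above.

```rust
/// Hypothetical stand-in for the RGB payload of a Mark style.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rgb {
    pub r: u8,
    pub g: u8,
    pub b: u8,
}

/// Parses a six-digit hex string such as "FF8000".
/// "auto" and any other non-hex value yield None.
pub fn parse_hex(val: &str) -> Option<Rgb> {
    if val.len() != 6 || !val.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // Safe to slice by byte index: every byte is an ASCII hex digit.
    let channel = |i: usize| u8::from_str_radix(&val[i..i + 2], 16).ok();
    Some(Rgb { r: channel(0)?, g: channel(2)?, b: channel(4)? })
}

/// Abbreviated palette lookup; the real palette has more entries.
pub fn named_highlight(name: &str) -> Option<Rgb> {
    match name {
        "yellow" => Some(Rgb { r: 255, g: 255, b: 0 }),
        "green" => Some(Rgb { r: 0, g: 255, b: 0 }),
        "red" => Some(Rgb { r: 255, g: 0, b: 0 }),
        // "none" and unknown names are dropped.
        _ => None,
    }
}

/// Resolves the mark color for one run: highlight wins when present,
/// the shd fill is consulted only when highlight is absent.
pub fn resolve_mark(highlight: Option<&str>, shd_fill: Option<&str>) -> Option<Rgb> {
    match highlight {
        Some(name) => named_highlight(name),
        None => shd_fill.and_then(parse_hex),
    }
}
```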

No-Collapse Rule

Adjacent runs with the same color emit separate StartTextStyle/EndTextStyle pairs. The reader maintains per-run discipline and does not merge consecutive runs, even when their style properties are identical.
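
To make the per-run discipline concrete, the sketch below models a run with at most one style and shows the event order it would produce. Ev, Run, and emit are hypothetical simplifications (the real events carry a style kind and an id), so treat this as an illustration of the rule rather than the reader's implementation.

```rust
/// Simplified stand-ins for StartTextStyle / Text / EndTextStyle.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Ev {
    StartStyle(&'static str),
    Text(String),
    EndStyle(&'static str),
}

/// One run: an optional style plus the contents of its <w:t> elements.
pub struct Run<'a> {
    pub style: Option<&'static str>,
    pub texts: Vec<&'a str>,
}

/// Emits events run by run. Adjacent runs are never merged, even when
/// their styles are identical; runs without content emit nothing.
pub fn emit(runs: &[Run<'_>]) -> Vec<Ev> {
    let mut out = Vec::new();
    for run in runs {
        if run.texts.is_empty() {
            continue; // empty styled run: no wrapper events
        }
        if let Some(style) = run.style {
            out.push(Ev::StartStyle(style));
        }
        // All <w:t> elements of one run share a single wrapper span.
        for text in &run.texts {
            out.push(Ev::Text((*text).to_string()));
        }
        if let Some(style) = run.style {
            out.push(Ev::EndStyle(style));
        }
    }
    out
}
```

Two consecutive bold runs therefore produce two separate start/end pairs; a writer that prefers merged spans has to coalesce them itself.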

Out of Scope (subtree silently dropped)

The XML elements listed below are the reader's denylist — their entire subtree is silently dropped during parsing. Any element NOT listed (whether a known structural tag like <w:p> or an unknown extension) is processed normally; the reader just continues into its children.

  • Headings (any <w:pStyle> value — every paragraph is StartParagraph)
  • Style references (<w:rStyle>, <w:pStyle>)
  • Run formatting not listed above: <w:sz>, <w:szCs>, <w:caps>, <w:smallCaps>, <w:position>, <w:spacing>, <w:kern>, <w:lang>, <w:noProof>
  • <w:rFonts> (general font tracking is not exposed as events, except for symbol font resolution (Wingdings, Wingdings 2, Wingdings 3, Webdings, Symbol) which is used internally to normalize codepoints to Unicode)
  • themeColor / themeTint / themeShade attributes on <w:color> and <w:shd> — silently dropped. The reader does not parse styles.xml or theme1.xml, so theme-referenced colors cannot be resolved. Future work.
  • Revision tracking (<w:rPrChange>, <w:pPrChange>)
  • Advanced paragraph layout beyond alignment: <w:ind>, <w:tabs>, <w:framePr>, <w:sectPr>
  • <w:rPr> nested inside <w:pPr> (paragraph mark / pilcrow run properties)
  • BiDi-aware logical alignment (start/end flipping based on paragraph direction is not tracked)
  • Math (m:rPr) and DrawingML (a:rPr) namespaces
  • Vertical cell merging (<w:vMerge>) — every cell still emits with rowspan: None; covered cells emit as ordinary StartTableCell events (visual merge is lost)
  • Header rows in nested tables — only the outermost table honors <w:tblHeader>
  • Table-level property exceptions (<w:tblPrEx>) — silently ignored (consistent with <w:tblPr>)
  • Table, row, and cell visual properties (<w:tblPr>, <w:trPr> visual fields, <w:tcPr> visual fields, <w:tblGrid>) — silently dropped
  • VML images (<w:pict>) — deferred to follow-up; subtree silently dropped
  • Comments, footnotes, headers, footers
  • Document metadata
  • Tracked deletions and moves-from (<w:del>, <w:moveFrom>) — silently dropped (accept-changes semantics). Their text content uses <w:delText> which is not part of the reader's text-matching set.
  • Structured document tag properties (<w:sdtPr>, <w:sdtEndPr>) — metadata containers; subtree dropped.
  • Field-code hyperlinks (<w:fldChar> + <w:instrText>HYPERLINK ...): legacy form not currently supported; only the modern <w:hyperlink> element is recognized.
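
The denylist behavior can be pictured with a depth counter over a token stream. The sketch below is an assumption about the general technique, not the reader's actual parser: Tok, DENYLIST (an abbreviated subset), and visible_text are hypothetical, self-closing elements are not modeled, and text is collected from any non-skipped position rather than only from <w:t>.

```rust
/// Minimal token model for an XML stream.
pub enum Tok<'a> {
    Start(&'a str),
    End(&'a str),
    Text(&'a str),
}

/// Abbreviated subset of the elements whose subtree is dropped.
const DENYLIST: &[&str] = &["w:del", "w:moveFrom", "w:pict", "w:sdtPr", "w:sdtEndPr"];

/// Collects text outside denylisted subtrees. Elements that are not on
/// the denylist are simply descended into, whether known or unknown.
pub fn visible_text(tokens: &[Tok<'_>]) -> String {
    let mut skip_depth = 0usize;
    let mut out = String::new();
    for tok in tokens {
        match tok {
            Tok::Start(name) => {
                // Enter skip mode on a denylisted element, or go one
                // level deeper when already inside a skipped subtree.
                if skip_depth > 0 || DENYLIST.contains(name) {
                    skip_depth += 1;
                }
            }
            Tok::End(_) => {
                if skip_depth > 0 {
                    skip_depth -= 1;
                }
            }
            Tok::Text(text) => {
                if skip_depth == 0 {
                    out.push_str(text);
                }
            }
        }
    }
    out
}
```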

Lists (V1 cuts)

The following list features are intentionally out of scope for V1:

  • No <w:start> element parsing — start is always Some(1) on the first item of each list, None thereafter
  • No <w:lvlOverride> resolution — abstractNum's level definitions are authoritative
  • No picture bullets (<w:lvlPicBulletId>) — picture-bullet levels emit Disc
  • No style-linked lists (<w:numStyleLink>, <w:styleLink>) — fall back to Decimal defaults
  • <w:multiLevelType> is ignored — per-level <w:numFmt> is authoritative (§17.9.12)
  • No per-level marker text (<w:lvlText>) — not parsed
  • No level-specific font, color, or indent
  • No per-level resolution for synthesised phantom levels — when the first authored item appears at ilvl > 0, intermediate levels inherit the target item's ordered/unordered classification and use Decimal/Disc defaults instead of resolving each phantom level's own numFmt
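
The start rule from the first item above (Some(1) on the first item of each list, None thereafter) could be tracked with a set of seen numId values. StartTracker is a hypothetical helper written for this sketch; it assumes numId has already been parsed to an integer and that "first item" means the first time a numId is seen anywhere in the document.

```rust
use std::collections::HashSet;

/// Remembers which numId values have already produced a list item.
#[derive(Debug, Default)]
pub struct StartTracker {
    seen: HashSet<u32>,
}

impl StartTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns Some(1) for the first item of a numId and None for every
    /// later item. <w:start> is never consulted, matching the V1 cut.
    pub fn start_for(&mut self, num_id: u32) -> Option<u32> {
        // HashSet::insert returns true only when the value was new.
        if self.seen.insert(num_id) {
            Some(1)
        } else {
            None
        }
    }
}
```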

Image Support

<w:drawing> elements are parsed and emit Event::Image. The source field is one of two variants:

  • Embedded image (r:embed): ImageSource::Asset { asset_id: "zip://word/media/image1.png" }. The asset_id is the resolved ZIP entry path with a zip:// scheme prefix. Use [DocxAssetProvider] to stream the raw bytes.
  • External image (r:link): ImageSource::Uri { uri: "<url>" }. The URL is passed through verbatim from the relationship target.

When both r:embed and r:link appear on the same blip, r:embed wins.

A relationship marked TargetMode="External" is honored even when referenced via r:embed — the reader emits ImageSource::Uri in that case, matching Word and LibreOffice behavior.

If the relationship ID cannot be resolved (missing or malformed rels), the reader emits ImageSource::Asset { asset_id: "<rId>" } using the raw relationship ID with no zip:// prefix. Writers should apply their own missing-asset policy.
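
The resolution rules above can be summarized in one function. This is a sketch under stated assumptions, not the crate's code: ImageSource here is a local stand-in mirroring the two variants described above, Rel and resolve_blip are hypothetical, relationship targets are assumed to be relative to word/ (absolute and ../ targets are not handled), and the raw-ID fallback is applied to r:link as well as r:embed.

```rust
use std::collections::HashMap;

/// Local stand-in mirroring the two source variants described above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ImageSource {
    Asset { asset_id: String },
    Uri { uri: String },
}

/// Hypothetical relationship entry from word/_rels/document.xml.rels.
pub struct Rel {
    pub target: String,
    /// True when the relationship carries TargetMode="External".
    pub external: bool,
}

/// Resolves the image source for one blip.
pub fn resolve_blip(
    embed: Option<&str>,
    link: Option<&str>,
    rels: &HashMap<String, Rel>,
) -> Option<ImageSource> {
    // r:embed wins when both attributes are present.
    let r_id = embed.or(link)?;
    Some(match rels.get(r_id) {
        // External targets become URIs even when referenced via r:embed.
        Some(rel) if rel.external || embed.is_none() => {
            ImageSource::Uri { uri: rel.target.clone() }
        }
        // Internal target: resolved ZIP path with a zip:// prefix.
        Some(rel) => ImageSource::Asset {
            asset_id: format!("zip://word/{}", rel.target),
        },
        // Unresolved: raw relationship ID, no zip:// prefix.
        None => ImageSource::Asset { asset_id: r_id.to_string() },
    })
}
```

A writer can detect the unresolved case by checking whether asset_id starts with zip://.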

wp:docPr/@descr maps to Event::Image.alt. The title field is always None in this release.

VML images (<w:pict>) are not supported in this release — their subtree is silently dropped.

DocxAssetProvider

[DocxAssetProvider] implements the AssetProvider trait and streams asset bytes from the DOCX ZIP archive on demand. Open it independently from [DocxReader] using the same file path or an in-memory buffer.

use docspec_docx_reader::{DocxReader, DocxAssetProvider, EventSource};
use docspec_core::{AssetProvider, Event, ImageSource};

fn main() -> Result<(), docspec_core::Error> {
    // The reader and the provider are opened independently on the same file.
    let mut reader = DocxReader::from_path("document.docx")?;
    let provider = DocxAssetProvider::from_path("document.docx")?;

    while let Some(event) = reader.next_event()? {
        if let Event::Image { source: ImageSource::Asset { asset_id }, .. } = &event {
            let mut buf = Vec::new();
            if let Some(result) = provider.stream_to(asset_id, &mut buf) {
                result?;
            }
            // buf now contains the raw image bytes (or is empty if the asset was missing)
        }
    }
    Ok(())
}

For in-memory DOCX data, use DocxAssetProvider::from_reader with a Cursor<Vec<u8>>.

Streaming Guarantee

DocxReader streams word/document.xml event by event via quick-xml, using constant memory regardless of document size. The two relationship parts, _rels/.rels and word/_rels/document.xml.rels, are read fully into memory at package-open time (typical combined size < 10 KB even for large documents). The internal event queue remains bounded regardless of document size or hyperlink count.

Quick Start

use docspec_docx_reader::{DocxReader, EventSource};

fn main() -> Result<(), docspec_core::Error> {
    let mut reader = DocxReader::from_path("document.docx")?;
    // Print every event in document order.
    while let Some(event) = reader.next_event()? {
        println!("{event:?}");
    }
    Ok(())
}

See Also