Crate docspec_docx_reader

DOCX to DocSpec event stream reader.

This crate provides a DocxReader that implements EventSource to convert DOCX documents into the DocSpec event stream format. It uses quick-xml for streaming XML parsing and zip for archive extraction.

§Scope

In scope: paragraphs (<w:p>) and direct text (<w:t> inside <w:r>). The reader emits exactly five event types: StartDocument, StartParagraph, Text, EndParagraph, and EndDocument.
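Because only these five events occur, a consumer can be a small state machine. The following is a minimal sketch of collecting paragraph text from such a stream. The `Event` enum here is a hypothetical local stand-in: the variant names come from the list above, but the payload shape (for example, `Text` carrying a `String`) is an assumption and may differ from the crate's actual type. The `paragraphs` helper is likewise illustrative, not part of the crate.

```rust
/// Hypothetical stand-in for the crate's event type. Variant names follow
/// the Scope section; the `String` payload on `Text` is an assumption.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartDocument,
    StartParagraph,
    Text(String),
    EndParagraph,
    EndDocument,
}

/// Collects the text of each paragraph, concatenating every `Text` event
/// seen between a `StartParagraph` and its matching `EndParagraph`.
fn paragraphs(events: impl IntoIterator<Item = Event>) -> Vec<String> {
    let mut out = Vec::new();
    // `Some(buffer)` while inside a paragraph, `None` otherwise.
    let mut current: Option<String> = None;
    for event in events {
        match event {
            Event::StartParagraph => current = Some(String::new()),
            Event::Text(text) => {
                // Text outside a paragraph is ignored in this sketch.
                if let Some(buffer) = current.as_mut() {
                    buffer.push_str(&text);
                }
            }
            Event::EndParagraph => {
                if let Some(buffer) = current.take() {
                    out.push(buffer);
                }
            }
            Event::StartDocument | Event::EndDocument => {}
        }
    }
    out
}
```

A paragraph split across several runs arrives as several `Text` events, so the sketch concatenates them; an empty paragraph yields an empty string.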

Out of scope (silently dropped):

  • Run styling (<w:rPr>, bold, italic, etc.)
  • Line and page breaks (<w:br>)
  • Tabs (<w:tab>)
  • Headings (any <w:pStyle> value — every paragraph is StartParagraph)
  • Tables (<w:tbl>, <w:tr>, <w:tc>)
  • Lists
  • Hyperlinks (<w:hyperlink>)
  • Drawings and images (<w:drawing>, <w:pict>)
  • Structured document tags (<w:sdt>)
  • Comments, footnotes, headers, footers
  • Document metadata
  • Tracked changes (<w:ins>, <w:del>, <w:moveFrom>, <w:moveTo>)

§Streaming Guarantee

DocxReader streams document.xml event by event, using constant memory regardless of document size. Only _rels/.rels (typically a few hundred bytes) is read fully into memory, to discover the document target path.
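To illustrate what discovering the document target path involves, here is a simplified sketch that finds the main document part in the contents of a `_rels/.rels` file. It is not the crate's implementation (which uses quick-xml); it relies on plain string scanning, assumes double-quoted attribute values without entity escapes, and the function names `attr` and `document_target` are hypothetical. The relationship type URI is the standard Office Open XML one for the main document part.

```rust
/// Relationship type that marks the main document part of an OOXML package.
const OFFICE_DOCUMENT: &str =
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

/// Returns the value of attribute `name` within one element's text.
/// Simplified: assumes double-quoted values and no entity escapes.
fn attr<'a>(element: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!("{name}=\"");
    let mut from = 0;
    while let Some(pos) = element[from..].find(&needle) {
        let start = from + pos;
        let value_start = start + needle.len();
        // Require whitespace before the name so that, for example,
        // `ContentType="..."` is not mistaken for `Type="..."`.
        let preceded_by_space = element[..start]
            .chars()
            .next_back()
            .map_or(false, char::is_whitespace);
        if preceded_by_space {
            let len = element[value_start..].find('"')?;
            return Some(&element[value_start..value_start + len]);
        }
        from = value_start;
    }
    None
}

/// Finds the `Target` of the relationship whose `Type` is the main
/// document relationship, e.g. `word/document.xml`.
fn document_target(rels_xml: &str) -> Option<&str> {
    rels_xml
        .split("<Relationship")
        .skip(1)
        // Skip the `<Relationships ...>` root, whose remainder starts with `s`.
        .filter(|chunk| chunk.starts_with(char::is_whitespace))
        // Keep only the text up to the end of the opening tag.
        .filter_map(|chunk| chunk.split('>').next())
        .find(|element| attr(element, "Type") == Some(OFFICE_DOCUMENT))
        .and_then(|element| attr(element, "Target"))
}
```

Since the relationships file is small, reading it whole does not affect the constant-memory property of the main document stream.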

§Quick Start

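use docspec_docx_reader::{DocxReader, EventSource};

// Assumes the crate's error type converts into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = DocxReader::from_path("document.docx")?;
    // Pull events one at a time until the stream yields None.
    while let Some(event) = reader.next_event()? {
        println!("{event:?}");
    }
    Ok(())
}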

Structs§

DocxReader
A streaming DOCX reader that implements EventSource.

Traits§

EventSource
Produces a stream of crate::Event values from a document source.