officemd_docx 0.1.4

DOCX extraction for OfficeMD
Documentation

officemd_docx

DOCX extraction helpers built on officemd_core. This crate streams WordprocessingML parts to build the shared IR.

Features

  • Extracts body, headers, footers, footnotes, endnotes into DocSection blocks.
  • Resolves hyperlinks via .rels and a best-effort field-code fallback.
  • Parses comments and inserts inline anchors plus per-section footnotes.
  • Collects document properties from docProps/*.

Rust Usage

use officemd_docx::extract_ir;

let doc = extract_ir(docx_bytes)?;

Cargo Example

cargo run -p officemd_examples --bin extract_ir_docx -- path/to/document.docx

Python usage

from officemd import extract_ir_json, markdown_from_bytes

doc_json = extract_ir_json(docx_bytes)
markdown = markdown_from_bytes(docx_bytes, format="docx")
markdown_with_props = markdown_from_bytes(
    docx_bytes,
    format="docx",
    include_document_properties=True,
)

Tests

# Rust tests
cargo test -p officemd_docx

# Python bindings (from dedicated adapter crate)
cd crates/officemd_python && uv run maturin develop --release

Fixture note:

  • tests/fixtures/basic.docx is the crate-local sample fixture path (fixture-based tests skip when missing).
  • Real-world fixtures can be placed in tests/data/*.docx (crate local) or ../../tests/data/*.docx (repo root).