undocx 0.5.2

DOCX to Markdown converter written in Rust

Fast, accurate DOCX to Markdown converter built for LLM/RAG pipelines. Written in Rust with Python bindings.

  • 16.5x faster than pandoc — 3.3ms per file average
  • LLM-optimized — Clean Markdown output ready for embeddings, chunking, and retrieval
  • Full fidelity — Tables, footnotes, track changes, images, nested lists, and more

Benchmarks

Measured on 39 DOCX files × 10 iterations (reproduce it yourself):

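| Tool       | Avg (ms) | Median (ms) | Min (ms) | Max (ms) |
|------------|----------|-------------|----------|----------|
| undocx     | 3.34     | 3.22        | 2.89     | 5.46     |
| markitdown | 18.25    | 17.45       | 14.63    | 41.81    |
| pandoc     | 55.08    | 54.11       | 40.31    | 69.51    |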

undocx is 16.5x faster than pandoc and 5.5x faster than markitdown.
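
The ratios above follow directly from the average timings in the table. A quick check in Python, using only the figures reported here (the `speedup` helper is illustrative and not part of undocx):

```python
# Average per-file times in milliseconds, copied from the benchmark table.
avg_ms = {"undocx": 3.34, "markitdown": 18.25, "pandoc": 55.08}


def speedup(baseline: str, tool: str = "undocx") -> float:
    """Return how many times faster `tool` is than `baseline`, by average time."""
    return avg_ms[baseline] / avg_ms[tool]


for name in ("pandoc", "markitdown"):
    print(f"undocx vs {name}: {speedup(name):.1f}x")
# prints:
# undocx vs pandoc: 16.5x
# undocx vs markitdown: 5.5x
```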

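| Feature                  | undocx     | pandoc    | markitdown |
|--------------------------|------------|-----------|------------|
| Language                 | Rust       | Haskell   | Python     |
| Speed (avg)              | 3.3ms/file | 55ms/file | 18ms/file  |
| Tables (colspan/rowspan) | Yes        | Partial   | Yes        |
| Track changes            | Yes        | Yes       | No         |
| Footnotes/Endnotes       | Yes        | Yes       | No         |
| Comments                 | Yes        | No        | No         |
| VML legacy images        | Yes        | No        | No         |
| Korean numbering         | Yes        | No        | No         |
| Python API               | Yes        | CLI only  | Yes        |
| Rust API                 | Yes        | No        | No         |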

For Humans

Install and convert — that's it.

pip install undocx          # Python
cargo install undocx        # CLI

CLI

undocx report.docx output.md              # convert to file
undocx report.docx                         # print to stdout
undocx report.docx -o out.md --images-dir ./img  # extract images

Python

import undocx

markdown = undocx.convert_docx("report.docx")

For Agents

Designed for document preprocessing in LLM/RAG pipelines.

Python — RAG ingestion

import undocx

# Skip images for text-only RAG ingestion
md = undocx.convert_docx("report.docx", image_handling="skip")

# Process bytes from S3, HTTP, or any byte stream
md = undocx.convert_docx(doc_bytes, image_handling="skip")
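
For ingesting a whole folder, the same call can be wrapped in a small loop. This is a minimal sketch, not part of undocx: `convert_folder` is a hypothetical helper, and the converter is passed in as an argument so the function assumes nothing beyond the `convert_docx(path, image_handling="skip")` call shown above.

```python
from pathlib import Path
from typing import Callable, Dict


def convert_folder(folder: str, convert: Callable[..., str]) -> Dict[str, str]:
    """Convert every .docx under `folder`; return {relative path: markdown}."""
    root = Path(folder)
    results: Dict[str, str] = {}
    for path in sorted(root.rglob("*.docx")):
        # Word leaves lock files named "~$<name>.docx" next to open documents;
        # they are not real DOCX archives, so skip them.
        if path.name.startswith("~$"):
            continue
        key = path.relative_to(root).as_posix()
        # Same keyword argument as in the RAG ingestion example above.
        results[key] = convert(str(path), image_handling="skip")
    return results
```

Typical use would be `docs = convert_folder("./contracts", undocx.convert_docx)`. Errors raised by the converter propagate unchanged, so one unreadable file stops the run; wrap the call in your own error handling if you would rather log and continue.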

Rust — One-liner

let md = undocx::convert("report.docx")?;
let md = undocx::convert_bytes(&bytes)?;

Rust — Builder (optimal for RAG)

let md = undocx::builder()
    .skip_images()
    .convert("report.docx")?;

Rust — Pluggable architecture

let converter = DocxToMarkdown::with_components(
    ConvertOptions::default(),
    MyExtractor,    // impl AstExtractor
    MyRenderer,     // impl Renderer
);

See docs/API_POLICY.md for stability guarantees on these traits.

# Cargo.toml
[dependencies]
undocx = "0.5"

Tips for RAG pipelines:

  • Use image_handling="skip" to reduce token count
  • Output is clean Markdown — split on ## headers for semantic chunking
  • Footnotes and comments are preserved as [^ref] for full context
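
The header-based chunking tip can be sketched in a few lines of plain Python. This is an illustrative helper (`split_on_h2` is not part of undocx) that splits at level-2 headings and ignores `## ` lines inside fenced code blocks:

```python
from typing import List, Tuple


def split_on_h2(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at level-2 headings.

    Text before the first "## " heading is returned under an empty heading.
    Deeper headings (###, ####, ...) stay inside their parent chunk.
    """
    chunks: List[Tuple[str, str]] = []
    heading = ""
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the current chunk unless it is completely empty.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence and line.startswith("## "):
            flush()
            heading, body = line[3:].strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading alongside the body, so the heading can be stored as metadata or prepended to the text before embedding.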

Supported Features

| Category      | Elements                                                              |
|---------------|-----------------------------------------------------------------------|
| Text          | Bold, italic, underline, strikethrough, superscript/subscript         |
| Structure     | Heading 1-9, Title, Subtitle, alignment (center/right)                |
| Lists         | Ordered (decimal, letter, roman, Korean, circled), unordered, nested  |
| Tables        | Colspan, rowspan, nested tables, multi-paragraph cells                |
| Links         | External, internal bookmarks, TOC anchors                             |
| Images        | Inline, floating, VML legacy; base64 embed, save to dir, or skip      |
| Notes         | Footnotes, endnotes, comments (as Markdown [^ref])                    |
| Track changes | Insertions (<ins>), deletions (~~strikethrough~~)                     |
| Other         | Page/column/line breaks, SDT, field codes, bookmarks, symbols         |
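
For retrieval you may want the text as it would read with every tracked change accepted. A minimal post-processing sketch, assuming insertions appear as plain `<ins>` / `</ins>` pairs without attributes and deletions as `~~` spans, as listed above. The helper name `accept_changes` is hypothetical and not part of undocx.

```python
import re

# Assumption: insertion tags carry no attributes.
INS_TAG = re.compile(r"</?ins>")
# Non-greedy so that two deletions on one line are matched separately.
DELETION = re.compile(r"~~.+?~~", re.DOTALL)


def accept_changes(markdown: str) -> str:
    """Drop tracked deletions and unwrap tracked insertions."""
    without_deletions = DELETION.sub("", markdown)
    return INS_TAG.sub("", without_deletions)


print(accept_changes("The <ins>new </ins>~~old ~~plan"))
# prints: The new plan
```

Note that ordinary strikethrough formatting also renders as `~~` by default, so this sketch would drop that text too. Check the output for your documents before relying on it.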

Options

| Field                       | Default | Description                      |
|-----------------------------|---------|----------------------------------|
| image_handling              | Inline  | Inline / SaveToDir(path) / Skip  |
| preserve_whitespace         | false   | Keep original spacing            |
| html_underline              | true    | <u> tags for underline           |
| html_strikethrough          | false   | <s> tags instead of ~~           |
| strict_reference_validation | false   | Fail on broken note/comment refs |

Development

cargo test --all-features                                  # test
cargo clippy --all-features --tests -- -D warnings         # lint
python examples/benchmark_comparison.py ./tests/pandoc 10  # bench

See CONTRIBUTING.md for development setup and guidelines.

License

MIT — see LICENSE