undocx

Fast, accurate DOCX to Markdown converter built for LLM/RAG pipelines. Written in Rust with Python bindings.

16.5x faster than pandoc — 3.3ms per file average
LLM-optimized — Clean Markdown output ready for embeddings, chunking, and retrieval
Full fidelity — Tables, footnotes, track changes, images, nested lists, and more

Benchmarks

Measured on 39 DOCX files × 10 iterations (to reproduce, run the benchmark script listed under Development):

Tool	Avg (ms)	Median (ms)	Min (ms)	Max (ms)
undocx	3.34	3.22	2.89	5.46
markitdown	18.25	17.45	14.63	41.81
pandoc	55.08	54.11	40.31	69.51

undocx is 16.5x faster than pandoc and 5.5x faster than markitdown.
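
The speedup figures follow directly from the average latencies in the table above. A minimal sketch of the arithmetic, where the `speedup` helper is purely illustrative and not part of undocx:

```python
# Average latency per file in milliseconds, copied from the benchmark table.
avg_ms = {"undocx": 3.34, "markitdown": 18.25, "pandoc": 55.08}


def speedup(baseline: str, tool: str = "undocx") -> float:
    """Return how many times faster `tool` is than `baseline` (1 decimal)."""
    return round(avg_ms[baseline] / avg_ms[tool], 1)


print(speedup("pandoc"))      # 16.5
print(speedup("markitdown"))  # 5.5
```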

Feature	undocx	pandoc	markitdown
Language	Rust	Haskell	Python
Speed (avg)	3.3ms/file	55ms/file	18ms/file
Tables (colspan/rowspan)	Yes	Partial	Yes
Track changes	Yes	Yes	No
Footnotes/Endnotes	Yes	Yes	No
Comments	Yes	No	No
VML legacy images	Yes	No	No
Korean numbering	Yes	No	No
Python API	Yes	CLI only	Yes
Rust API	Yes	No	No

For Humans

Install and convert — that's it.

pip install undocx          # Python
cargo install undocx        # CLI

CLI

undocx report.docx output.md              # convert to file
undocx report.docx                         # print to stdout
undocx report.docx -o out.md --images-dir ./img  # extract images

Python

import undocx

markdown = undocx.convert_docx("report.docx")

For Agents

Designed for document preprocessing in LLM/RAG pipelines.

Python — RAG ingestion

import undocx

# Skip images for text-only RAG ingestion
md = undocx.convert_docx("report.docx", image_handling="skip")

# Process bytes from S3, HTTP, or any byte stream
md = undocx.convert_docx(doc_bytes, image_handling="skip")
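
When bytes arrive from mixed sources, it can help to filter out payloads that are clearly not DOCX before converting. The sketch below is one possible wrapper: `ingest` and `_default_convert` are hypothetical names, and the only undocx call used is the `convert_docx(..., image_handling="skip")` form shown above. The check relies on DOCX files being ZIP archives, which start with the bytes `PK`.

```python
from typing import Callable, Dict, Iterable, Tuple


def _default_convert(data: bytes) -> str:
    # Imported lazily so the helper can be tested without undocx installed.
    import undocx

    return undocx.convert_docx(data, image_handling="skip")


def ingest(
    blobs: Iterable[Tuple[str, bytes]],
    convert: Callable[[bytes], str] = _default_convert,
) -> Dict[str, str]:
    """Convert (name, bytes) pairs to Markdown, skipping non-ZIP payloads."""
    out: Dict[str, str] = {}
    for name, data in blobs:
        # DOCX is a ZIP container; ZIP files begin with the signature "PK".
        if not data.startswith(b"PK"):
            continue
        out[name] = convert(data)
    return out
```

Passing `convert` explicitly makes it easy to substitute a stub in unit tests or to change the image handling per pipeline.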

Rust — One-liner

let md = undocx::convert("report.docx")?;
let md = undocx::convert_bytes(&bytes)?;

Rust — Builder (optimal for RAG)

let md = undocx::builder()
    .skip_images()
    .convert("report.docx")?;

Rust — Pluggable architecture

let converter = DocxToMarkdown::with_components(
    ConvertOptions::default(),
    MyExtractor,    // impl AstExtractor
    MyRenderer,     // impl Renderer
);

See docs/API_POLICY.md for stability guarantees on these traits.

# Cargo.toml
[dependencies]
undocx = "0.5"

Tips for RAG pipelines:

Use image_handling="skip" to reduce token count
Output is clean Markdown — split on ## headers for semantic chunking
Footnotes and comments are preserved as [^ref] for full context
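
The header-based chunking tip can be sketched in a few lines of plain Python. `split_on_h2` is an illustrative helper, not part of undocx, and it assumes that `## ` at the start of a line always marks a section boundary (it does not special-case fenced code blocks).

```python
from typing import List


def split_on_h2(markdown: str) -> List[str]:
    """Split Markdown into chunks that each start at a `## ` heading.

    Text before the first `## ` heading becomes its own chunk.
    Deeper headings (`###` and below) stay inside their parent chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        # Start a new chunk whenever a level-2 heading begins.
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = "# Title\nintro\n\n## A\nalpha\n\n### A1\nsub\n\n## B\nbeta\n"
for chunk in split_on_h2(doc):
    print(repr(chunk))
```

Because footnote references stay inline as `[^ref]`, a chunk may cite a note whose definition lives in a different chunk; whether to copy definitions into each chunk is a pipeline decision.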

Supported Features

Category	Elements
Text	Bold, italic, underline, strikethrough, superscript/subscript
Structure	Heading 1-9, Title, Subtitle, alignment (center/right)
Lists	Ordered (decimal, letter, roman, Korean, circled), unordered, nested
Tables	Colspan, rowspan, nested tables, multi-paragraph cells
Links	External, internal bookmarks, TOC anchors
Images	Inline, floating, VML legacy — base64 embed, save to dir, or skip
Notes	Footnotes, endnotes, comments (as Markdown `[^ref]`)
Track changes	Insertions (`<ins>`), deletions (`~~strikethrough~~`)
Other	Page/column/line breaks, SDT, field codes, bookmarks, symbols
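
For pipelines that only want the final text, the track-changes markup listed above can be resolved after conversion. The following is a rough sketch rather than a feature of undocx: `accept_changes` is a hypothetical helper, and it assumes insertions appear as `<ins>text</ins>` and deletions as `~~text~~` on a single line. Note that ordinary strikethrough formatting also uses `~~` by default, so this would drop that text too.

```python
import re


def accept_changes(markdown: str) -> str:
    """Approximate 'accept all changes': drop deletions, keep insertions."""
    # Remove deleted spans, together with one preceding space if present.
    text = re.sub(r" ?~~.*?~~", "", markdown)
    # Unwrap inserted spans, keeping their text.
    return re.sub(r"</?ins>", "", text)


print(accept_changes("The <ins>new</ins> ~~old~~ plan"))  # The new plan
```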

Options

Field	Default	Description
`image_handling`	`Inline`	`Inline` / `SaveToDir(path)` / `Skip` (from Python, passed as a string, e.g. `image_handling="skip"`)
`preserve_whitespace`	`false`	Keep original spacing
`html_underline`	`true`	`<u>` tags for underline
`html_strikethrough`	`false`	`<s>` tags instead of `~~`
`strict_reference_validation`	`false`	Fail on broken note/comment refs

Development

cargo test --all-features                                  # test
cargo clippy --all-features --tests -- -D warnings         # lint
python examples/benchmark_comparison.py ./tests/pandoc 10  # bench

See CONTRIBUTING.md for development setup and guidelines.

License

MIT — see LICENSE

undocx 0.5.2