undocx
Fast, accurate DOCX to Markdown converter built for LLM/RAG pipelines. Written in Rust with Python bindings.
- 16.5x faster than pandoc — 3.3ms per file average
- LLM-optimized — Clean Markdown output ready for embeddings, chunking, and retrieval
- Full fidelity — Tables, footnotes, track changes, images, nested lists, and more
Benchmarks
Measured on 39 DOCX files × 10 iterations (reproduce it yourself):
| Tool | Avg (ms) | Median (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| undocx | 3.34 | 3.22 | 2.89 | 5.46 |
| markitdown | 18.25 | 17.45 | 14.63 | 41.81 |
| pandoc | 55.08 | 54.11 | 40.31 | 69.51 |
undocx is 16.5x faster than pandoc and 5.5x faster than markitdown.
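The summary columns and speedup figures above can be recomputed from raw timings. The sketch below is a generic harness, not the project's benchmark script: `time_converter`, `summarize`, `speedup`, and the `convert` argument are hypothetical names, and pooling every (file, iteration) sample before aggregating is an assumption about how the table was produced.

```python
import statistics
import time
from typing import Callable, Iterable


def time_converter(convert: Callable[[str], object],
                   paths: Iterable[str],
                   iterations: int = 10) -> list[float]:
    """Return one wall-clock sample in milliseconds per (file, iteration)."""
    samples = []
    for path in paths:
        for _ in range(iterations):
            start = time.perf_counter()
            convert(path)  # hypothetical: any callable that takes a file path
            samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Compute avg, median, min, and max over the pooled samples."""
    return {
        "avg": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
    }


def speedup(baseline_avg_ms: float, candidate_avg_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_avg_ms / candidate_avg_ms


# Averages copied from the benchmark table.
print(round(speedup(55.08, 3.34), 1))  # pandoc vs undocx
print(round(speedup(18.25, 3.34), 1))  # markitdown vs undocx
```

Passing the same file list and iteration count to each tool's wrapper keeps the comparison like-for-like.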
| Feature | undocx | pandoc | markitdown |
|---|---|---|---|
| Language | Rust | Haskell | Python |
| Speed (avg) | 3.3ms/file | 55ms/file | 18ms/file |
| Tables (colspan/rowspan) | Yes | Partial | Yes |
| Track changes | Yes | Yes | No |
| Footnotes/Endnotes | Yes | Yes | No |
| Comments | Yes | No | No |
| VML legacy images | Yes | No | No |
| Korean numbering | Yes | No | No |
| Python API | Yes | CLI only | Yes |
| Rust API | Yes | No | No |
For Humans
Install and convert — that's it.
CLI
Python
=
For Agents
Designed for document preprocessing in LLM/RAG pipelines.
Python — RAG ingestion
# Skip images for text-only RAG ingestion
=
# Process bytes from S3, HTTP, or any byte stream
=
Rust — One-liner
let md = convert?;
let md = convert_bytes?;
Rust — Builder (optimal for RAG)
let md = builder
.skip_images
.convert?;
Rust — Pluggable architecture
let converter = with_components;
See docs/API_POLICY.md for stability guarantees on these traits.
# Cargo.toml
[]
= "0.4"
Tips for RAG pipelines:
- Use `image_handling="skip"` to reduce token count
- Output is clean Markdown: split on `##` headers for semantic chunking
- Footnotes and comments are preserved as `[^ref]` for full context
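As a minimal sketch of the header-based chunking tip, the function below splits converted Markdown at each level-2 heading using only the standard library. `split_on_h2` is an illustrative name, not part of undocx, and the sketch does not guard against `## ` lines that appear inside fenced code blocks.

```python
import re


def split_on_h2(markdown: str) -> list[str]:
    """Split Markdown into chunks, each starting at a level-2 (##) heading.

    Text before the first ## heading becomes its own leading chunk.
    Deeper headings (###, ####) stay inside their parent chunk.
    """
    # Zero-width split point at the start of any line beginning with "## ".
    parts = re.split(r"(?m)^(?=## )", markdown)
    return [part.strip() for part in parts if part.strip()]


doc = "Intro\n\n## A\ntext a\n\n### A.1\nmore\n\n## B\ntext b\n"
for chunk in split_on_h2(doc):
    print(repr(chunk))
```

Each chunk keeps its own heading, so the heading text is still available as retrieval context when the chunk is embedded.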
Supported Features
| Category | Elements |
|---|---|
| Text | Bold, italic, underline, strikethrough, superscript/subscript |
| Structure | Heading 1-9, Title, Subtitle, alignment (center/right) |
| Lists | Ordered (decimal, letter, roman, Korean, circled), unordered, nested |
| Tables | Colspan, rowspan, nested tables, multi-paragraph cells |
| Links | External, internal bookmarks, TOC anchors |
| Images | Inline, floating, VML legacy — base64 embed, save to dir, or skip |
| Notes | Footnotes, endnotes, comments (as Markdown [^ref]) |
| Track changes | Insertions (<ins>), deletions (~~strikethrough~~) |
| Other | Page/column/line breaks, SDT, field codes, bookmarks, symbols |
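Because tracked insertions and deletions both survive in the output, a pipeline may want to resolve them before embedding. The sketch below is a post-processing step on the Markdown text, not an undocx feature; it assumes insertions appear as `<ins>…</ins>` and deletions as `~~…~~`, as the table describes, and the function names are illustrative. Ordinary strikethrough text also renders as `~~…~~` by default, so this cannot tell the two apart.

```python
import re

# Tracked insertion: <ins>text</ins>
INSERTION = re.compile(r"<ins>(.*?)</ins>", re.DOTALL)
# Tracked deletion: ~~text~~
DELETION = re.compile(r"~~(.*?)~~", re.DOTALL)


def accept_changes(markdown: str) -> str:
    """Keep inserted text and drop deleted text."""
    text = INSERTION.sub(r"\1", markdown)
    return DELETION.sub("", text)


def reject_changes(markdown: str) -> str:
    """Drop inserted text and keep deleted text."""
    text = INSERTION.sub("", markdown)
    return DELETION.sub(r"\1", text)


sample = "The fee is ~~10~~<ins>12</ins> dollars."
print(accept_changes(sample))
print(reject_changes(sample))
```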
Options
| Field | Default | Description |
|---|---|---|
| `image_handling` | `Inline` | `Inline` / `SaveToDir(path)` / `Skip` |
| `preserve_whitespace` | `false` | Keep original spacing |
| `html_underline` | `true` | `<u>` tags for underline |
| `html_strikethrough` | `false` | `<s>` tags instead of `~~` |
| `strict_reference_validation` | `false` | Fail on broken note/comment refs |
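With `strict_reference_validation` left at its default, a broken note reference does not fail the conversion, so a pipeline may want its own check on the output. The sketch below is a standalone check on Markdown text, not undocx's validation logic; it assumes the common footnote convention of `[^id]` references with `[^id]: text` definitions, and the function names are illustrative.

```python
import re

# Definition line such as "[^1]: Footnote text".
DEFINITION = re.compile(r"(?m)^\[\^([^\]]+)\]:[ \t]*(.*)$")
# Inline reference such as "[^1]" that is not followed by a colon.
REFERENCE = re.compile(r"\[\^([^\]]+)\](?!:)")


def collect_notes(markdown: str) -> tuple[list[str], dict[str, str]]:
    """Return (reference ids in order of appearance, id -> definition text)."""
    definitions = {m.group(1): m.group(2) for m in DEFINITION.finditer(markdown)}
    references = [m.group(1) for m in REFERENCE.finditer(markdown)]
    return references, definitions


def unresolved_references(markdown: str) -> list[str]:
    """List reference ids that have no matching definition."""
    references, definitions = collect_notes(markdown)
    return [ref for ref in references if ref not in definitions]


text = "Claim.[^1] Another claim.[^c2]\n\n[^1]: Source one.\n"
print(unresolved_references(text))
```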
Development
See CONTRIBUTING.md for development setup and guidelines.
License
MIT — see LICENSE