# undocx
[](https://crates.io/crates/undocx)
[](https://pypi.org/project/undocx/)
[](https://docs.rs/undocx)
[](https://opensource.org/licenses/MIT)
**Fast, accurate DOCX to Markdown converter built for LLM/RAG pipelines.** Written in Rust with Python bindings.
- **16.5x faster than pandoc** — 3.3ms per file average
- **LLM-optimized** — Clean Markdown output ready for embeddings, chunking, and retrieval
- **Full fidelity** — Tables, footnotes, track changes, images, nested lists, and more
[For Humans](#for-humans) • [For Agents](#for-agents) • [Benchmarks](#benchmarks) • [Features](#supported-features) • [Contributing](CONTRIBUTING.md)
---
## Conversion Demo
<table>
<tr>
<td align="center" width="50%"><strong>DOCX (input)</strong></td>
<td align="center" width="50%"><strong>Markdown (output)</strong></td>
</tr>
<tr>
<td valign="top"><a href="https://github.com/KimSeogyu/undocx/blob/main/docs/undocx_showcase.pdf"><img src="https://raw.githubusercontent.com/KimSeogyu/undocx/main/docs/images/demo-docx.png" alt="DOCX input document"/></a></td>
<td valign="top"><a href="https://github.com/KimSeogyu/undocx/blob/main/docs/undocx_showcase.md"><img src="https://raw.githubusercontent.com/KimSeogyu/undocx/main/docs/images/demo-markdown.png" alt="Converted Markdown output"/></a></td>
</tr>
</table>
> Click images to see full GitHub-rendered files.
## Benchmarks
Measured on 39 DOCX files × 10 iterations ([reproduce it yourself](examples/benchmark_comparison.py)):
| **undocx** | **3.34** | **3.22** | **2.89** | **5.46** |
| markitdown | 18.25 | 17.45 | 14.63 | 41.81 |
| pandoc | 55.08 | 54.11 | 40.31 | 69.51 |
**undocx is 16.5x faster than pandoc and 5.5x faster than markitdown.**
| Language | Rust | Haskell | Python |
| Speed (avg) | 3.3ms/file | 55ms/file | 18ms/file |
| Tables (colspan/rowspan) | Yes | Partial | Yes |
| Track changes | Yes | Yes | No |
| Footnotes/Endnotes | Yes | Yes | No |
| Comments | Yes | No | No |
| VML legacy images | Yes | No | No |
| Korean numbering | Yes | No | No |
| Python API | Yes | CLI only | Yes |
| Rust API | Yes | No | No |
---
## For Humans
Install and convert — that's it.
```bash
pip install undocx # Python
cargo install undocx # CLI
```
**CLI**
```bash
undocx report.docx output.md # convert to file
undocx report.docx # print to stdout
undocx report.docx -o out.md --images-dir ./img # extract images
```
**Python**
```python
import undocx
markdown = undocx.convert_docx("report.docx")
```
---
## For Agents
Designed for document preprocessing in LLM/RAG pipelines.
**Python — RAG ingestion**
```python
import undocx
# Skip images for text-only RAG ingestion
md = undocx.convert_docx("report.docx", image_handling="skip")
# Process bytes from S3, HTTP, or any byte stream
md = undocx.convert_docx(doc_bytes, image_handling="skip")
```
**Rust — One-liner**
```rust
let md = undocx::convert("report.docx")?;
let md = undocx::convert_bytes(&bytes)?;
```
**Rust — Builder (optimal for RAG)**
```rust
let md = undocx::builder()
.skip_images()
.convert("report.docx")?;
```
**Rust — Pluggable architecture**
```rust
let converter = DocxToMarkdown::with_components(
ConvertOptions::default(),
MyExtractor, // impl AstExtractor
MyRenderer, // impl Renderer
);
```
See [docs/API_POLICY.md](docs/API_POLICY.md) for stability guarantees on these traits.
```toml
# Cargo.toml
[dependencies]
undocx = "0.4"
```
**Tips for RAG pipelines:**
- Use `image_handling="skip"` to reduce token count
- Output is clean Markdown — split on `## ` headers for semantic chunking
- Footnotes and comments are preserved as `[^ref]` for full context
---
## Supported Features
| **Text** | Bold, italic, underline, strikethrough, superscript/subscript |
| **Structure** | Heading 1-9, Title, Subtitle, alignment (center/right) |
| **Lists** | Ordered (decimal, letter, roman, Korean, circled), unordered, nested |
| **Tables** | Colspan, rowspan, nested tables, multi-paragraph cells |
| **Links** | External, internal bookmarks, TOC anchors |
| **Images** | Inline, floating, VML legacy — base64 embed, save to dir, or skip |
| **Notes** | Footnotes, endnotes, comments (as Markdown `[^ref]`) |
| **Track changes** | Insertions (`<ins>`), deletions (`~~strikethrough~~`) |
| **Other** | Page/column/line breaks, SDT, field codes, bookmarks, symbols |
## Options
| `image_handling` | `Inline` | `Inline` / `SaveToDir(path)` / `Skip` |
| `preserve_whitespace` | `false` | Keep original spacing |
| `html_underline` | `true` | `<u>` tags for underline |
| `html_strikethrough` | `false` | `<s>` tags instead of `~~` |
| `strict_reference_validation` | `false` | Fail on broken note/comment refs |
## Development
```bash
cargo test --all-features # test
cargo clippy --all-features --tests -- -D warnings # lint
python examples/benchmark_comparison.py ./tests/pandoc 10 # bench
```
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.
## License
MIT — see [LICENSE](LICENSE)