dm2xcod
DOCX to Markdown converter in Rust with Python bindings.
Table of Contents
- Why dm2xcod
- Requirements
- Installation
- Quick Start
- API Reference
- CLI Reference
- Architecture Overview
- Development
- License
Why dm2xcod
- Rust-based converter focused on predictable performance.
- Covers common DOCX structures: headings, lists, tables, notes, links, images.
- Supports image handling strategies: inline base64, save to directory, or skip.
- Exposes both CLI and Python (PyO3) entry points.
- Includes strict reference validation for footnote/comment/endnote integrity.
Requirements
- Rust 1.75+ (building from source)
- Python 3.12+ (ABI3 wheel compatibility)
Installation
Python package
CLI (cargo)
Rust library
[]
= "0.3"
Quick Start
CLI
# write to file
dm2xcod <INPUT> <OUTPUT>
# print markdown to stdout
dm2xcod <INPUT>
Python
# path input
=
# bytes input
=
Rust
use ;
API Reference
ConvertOptions
| Field | Type | Default | Description |
|---|---|---|---|
| `image_handling` | `ImageHandling` | `Inline` | Image output strategy |
| `preserve_whitespace` | `bool` | `false` | Preserve original spacing more strictly |
| `html_underline` | `bool` | `true` | Use HTML tags for underline output |
| `html_strikethrough` | `bool` | `false` | Use HTML tags for strikethrough output |
| `strict_reference_validation` | `bool` | `false` | Fail on unresolved note/comment references |
ImageHandling variants:
- `ImageHandling::Inline`
- `ImageHandling::SaveToDir(PathBuf)`
- `ImageHandling::Skip`
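To make the three strategies concrete, here is a minimal Python sketch of what each one could emit for a single image. It is illustrative only and is not the crate's Rust implementation: `render_image`, its parameters, the strategy strings, and the exact Markdown produced are all assumptions.

```python
import base64
from pathlib import Path
from typing import Optional


def render_image(data: bytes, name: str, mime: str, strategy: str,
                 out_dir: Optional[Path] = None) -> str:
    """Return the Markdown for one image under a strategy (hypothetical helper)."""
    if strategy == "inline":
        # Embed the image bytes directly as a base64 data URI.
        encoded = base64.b64encode(data).decode("ascii")
        return f"![{name}](data:{mime};base64,{encoded})"
    if strategy == "save_to_dir":
        if out_dir is None:
            raise ValueError("save_to_dir requires out_dir")
        # Write the image next to the output and link to the saved file.
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / name
        target.write_bytes(data)
        return f"![{name}]({target.as_posix()})"
    if strategy == "skip":
        # Drop the image from the output entirely.
        return ""
    raise ValueError(f"unknown strategy: {strategy}")
```

Inline output keeps the Markdown self-contained at the cost of size, while saving to a directory keeps the Markdown small but adds files that must travel with it.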
Example with non-default options:
use ;
Advanced: Custom extractor/renderer injection
DocxToMarkdown::with_components(options, extractor, renderer) lets you replace the default pipeline.
use AstExtractor;
use ConversionContext;
use ;
use Renderer;
use ;
use BodyContent;
;
;
Python API
- `dm2xcod.convert_docx(input: str | bytes) -> str`
- Current Python entry point uses default conversion options.
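Because `convert_docx` accepts either a path string or raw bytes, a thin wrapper can cover both cases. The sketch below is one possible shape, not part of the package: `convert_file`, `as_bytes`, and the injectable `convert` parameter are hypothetical names. The `dm2xcod` import is deferred so the helper can also be exercised with a stub converter.

```python
from pathlib import Path
from typing import Callable, Optional, Union

DocxInput = Union[str, bytes]


def convert_file(src: Path, dst: Path, as_bytes: bool = False,
                 convert: Optional[Callable[[DocxInput], str]] = None) -> Path:
    """Convert one DOCX file to Markdown and write the result to dst."""
    if convert is None:
        # Deferred import: requires the dm2xcod package to be installed.
        import dm2xcod
        convert = dm2xcod.convert_docx
    # Pass either the file contents (bytes input) or the path (str input).
    payload: DocxInput = src.read_bytes() if as_bytes else str(src)
    dst.write_text(convert(payload), encoding="utf-8")
    return dst
```

Calling `convert_file(Path("in.docx"), Path("out.md"))` uses the path form; passing `as_bytes=True` reads the file first, which is the form to use when the document comes from memory or a network response.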
CLI Reference
dm2xcod <INPUT> [OUTPUT] [--images-dir <DIR>] [--skip-images]
| Argument/Option | Description |
|---|---|
| `<INPUT>` | Input DOCX path (required) |
| `[OUTPUT]` | Output Markdown path (optional, otherwise stdout) |
| `--images-dir <DIR>` | Save extracted images to a directory |
| `--skip-images` | Skip image extraction/output |
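When the CLI is driven from a script, the argument list can be assembled from the synopsis above. This is a sketch under the assumption that the `dm2xcod` binary is on `PATH`; `build_command` and `run` are hypothetical helper names, and only the arguments and flags documented in the table are used.

```python
import subprocess
from typing import List, Optional


def build_command(input_path: str, output_path: Optional[str] = None,
                  images_dir: Optional[str] = None,
                  skip_images: bool = False) -> List[str]:
    """Build the argv for: dm2xcod <INPUT> [OUTPUT] [--images-dir <DIR>] [--skip-images]."""
    cmd = ["dm2xcod", input_path]
    if output_path is not None:
        cmd.append(output_path)
    if images_dir is not None:
        cmd += ["--images-dir", images_dir]
    if skip_images:
        cmd.append("--skip-images")
    return cmd


def run(cmd: List[str]) -> str:
    """Run the CLI and return its stdout (the Markdown when OUTPUT is omitted)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.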
Architecture Overview
Conversion pipeline:
- Parse DOCX (`rs_docx`)
- Build conversion context (relationships, numbering, styles, references, image strategy)
- Extract AST via adapter (`AstExtractor`)
- Validate references (optional strict mode)
- Render final markdown via renderer (`Renderer`)
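The last three stages can be sketched in a few lines of Python. This is a language-agnostic illustration of the flow, not the crate's Rust code: `Context`, `Block`, `UnresolvedReference`, and the callable signatures are assumptions, and parsing plus context building are represented by the `parsed` and `ctx` arguments.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Set


@dataclass
class Context:
    """Stand-in for the conversion context (only what this sketch needs)."""
    defined_notes: Set[str] = field(default_factory=set)
    strict: bool = False


@dataclass
class Block:
    """Stand-in for one AST node: its text and the note IDs it references."""
    text: str
    note_refs: List[str] = field(default_factory=list)


class UnresolvedReference(Exception):
    pass


def validate_references(blocks: List[Block], ctx: Context) -> List[str]:
    """Return unresolved note IDs; raise instead when strict mode is on."""
    missing = [r for b in blocks for r in b.note_refs
               if r not in ctx.defined_notes]
    if missing and ctx.strict:
        raise UnresolvedReference(", ".join(missing))
    return missing


def convert(parsed: Any, ctx: Context,
            extract: Callable[[Any, Context], List[Block]],
            render: Callable[[List[Block], Context], str]) -> str:
    blocks = extract(parsed, ctx)      # extract AST via the adapter
    validate_references(blocks, ctx)   # optional strict validation
    return render(blocks, ctx)         # render final Markdown
```

Keeping extraction and rendering behind separate callables mirrors the adapter/renderer boundary, which is what makes either side replaceable.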
Project layout:
src/
  adapters/    # Input adapters (DOCX -> AST extraction boundary)
  core/        # Shared AST/model types
  converter/   # Orchestration and conversion context
  render/      # Markdown rendering + escaping
  lib.rs       # Public API (Rust + Python bindings)
  main.rs      # CLI entrypoint
Development
Build from source
# Rust library/CLI
# Python extension in local env
Test and lint
Performance benchmark
# default: tests/aaa, 3 iterations, max 5 files
./scripts/run_perf_benchmark.sh
# custom: input_dir iterations max_files
./scripts/run_perf_benchmark.sh ./tests/aaa 10 10
Latest benchmark record (2026-02-14):
- Command: `./scripts/run_perf_benchmark.sh ./tests/aaa 10 10`
- Threshold gate: `./scripts/check_perf_threshold.sh ./output_tests/perf/latest.json 15.0` (pass)
- Environment: `macOS 26.2 (Darwin arm64)`, `rustc 1.92.0 (ded5c06cf 2025-12-08)`
- Result file: `output_tests/perf/latest.json`
Performance threshold gate
# fails if avg_ms exceeds threshold
./scripts/check_perf_threshold.sh ./output_tests/perf/latest.json 15.0
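The gate's rule (fail when `avg_ms` exceeds the threshold) can be sketched in Python as follows. This is not the shell script itself, whose contents are not shown here. It assumes `avg_ms` is a top-level numeric field in the result JSON; `check_threshold` and `main` are hypothetical names.

```python
import json
import sys
from pathlib import Path
from typing import List


def check_threshold(result_path: Path, threshold_ms: float) -> bool:
    """Return True when the recorded avg_ms does not exceed the threshold."""
    # Assumption: avg_ms is a top-level numeric field of the result file.
    result = json.loads(result_path.read_text(encoding="utf-8"))
    return float(result["avg_ms"]) <= threshold_ms


def main(argv: List[str]) -> int:
    """Exit code 0 on pass, 1 on fail, mirroring a CI gate."""
    result_path, threshold_ms = Path(argv[0]), float(argv[1])
    return 0 if check_threshold(result_path, threshold_ms) else 1


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

A value exactly equal to the threshold passes, since the rule only fails when the average exceeds it.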
Release notes
# auto-detect previous tag to HEAD
# explicit range and output file
API stability policy
See docs/API_POLICY.md.
License
MIT