html-to-markdown
High-performance HTML to Markdown converter built with Rust. Available as:
- Rust crate (html-to-markdown-rs on crates.io)
- Python package (html-to-markdown on PyPI)
- CLI binary (via Homebrew, Cargo, or direct download)
Cross-platform support for Linux, macOS, and Windows.
Part of the Kreuzberg ecosystem for document intelligence.
📚 Documentation
- Python Users - Python package documentation and examples
- Rust Users - Rust crate documentation and API reference
- Contributing - Development setup and contribution guidelines
- Changelog - Version history and migration guides
⚡ Benchmarks
Throughput (Python API)
Real Wikipedia documents on Apple M4:
| Document | Size | Latency | Throughput | Docs/sec |
|---|---|---|---|---|
| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
Throughput ranges from 144 to 208 MB/s across these document sizes.
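The Throughput and Docs/sec columns follow directly from the Size and Latency columns. The short check below recomputes them from the table values, assuming decimal units (1 MB = 1000 KB); the function names are illustrative and not part of the package.

```python
# (size in KB, latency in ms), copied from the benchmark table above.
BENCHMARKS = {
    "Lists (Timeline)": (129, 0.62),
    "Tables (Countries)": (360, 2.02),
    "Mixed (Python wiki)": (656, 4.56),
}


def throughput_mb_per_s(size_kb: float, latency_ms: float) -> float:
    """Throughput in MB/s, assuming 1 MB = 1000 KB."""
    return (size_kb / 1000) / (latency_ms / 1000)


def docs_per_second(latency_ms: float) -> float:
    """Documents converted per second at the given per-document latency."""
    return 1000 / latency_ms


for name, (size_kb, latency_ms) in BENCHMARKS.items():
    print(
        f"{name}: {throughput_mb_per_s(size_kb, latency_ms):.0f} MB/s, "
        f"{docs_per_second(latency_ms):.0f} docs/sec"
    )
```

Rounded to whole numbers, the results match the published columns.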
Memory Usage
| Document Size | Memory Delta | Peak RSS | Leak Detection |
|---|---|---|---|
| 10KB | < 2 MB | < 20 MB | ✅ None |
| 50KB | < 8 MB | < 35 MB | ✅ None |
| 500KB | < 40 MB | < 80 MB | ✅ None |
Memory usage is linear and stable across 50+ repeated conversions.
v2 is 19-30x faster than the v1 Python/BeautifulSoup implementation.
Features
- 🚀 Blazing Fast: Pure Rust core with ultra-fast tl HTML parser
- 🐍 Python Bindings: Clean Python API via PyO3 with full type hints
- 🦀 Native CLI: Rust CLI binary with comprehensive options
- 📊 hOCR 1.2 Compliant: Full support for all 40+ elements and 20+ properties
- 📝 CommonMark Compliant: Follows CommonMark specification for list formatting
- 🎯 Type Safe: Full type hints and .pyi stubs for excellent IDE support
- 🌍 Cross-Platform: Wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), Windows (x86_64)
- ✅ Well-Tested: 900+ tests with dual Python + Rust coverage
Installation
📦 Package Names: Due to a naming conflict on crates.io, the Rust crate is published as html-to-markdown-rs, while the Python package remains html-to-markdown on PyPI. The CLI binary name is html-to-markdown for both.
Python Package
Rust Library
CLI Binary
via Homebrew (macOS/Linux)
via Cargo
Direct Download
Download pre-built binaries from GitHub Releases.
Quick Start
Python API
Simple function-based API:
=
# Basic conversion
=
# With custom options
=
Output:
This is **fast** Rust-powered conversion!
* +-
For detailed Python documentation, see README_PYPI.md.
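As a rough sketch of the function-based usage, the snippet below calls the converter once with defaults and once with keyword options. It assumes the v1-compatible entry point is named `convert_to_markdown` and that it accepts the option names from the Configuration Reference as keyword arguments; the sample HTML is inferred from the output line above, and `convert_if_available` is a hypothetical helper that guards the import so the snippet still runs where the package is not installed.

```python
import importlib.util

# Sample input, inferred from the output shown above (an assumption).
HTML = "<h1>Hello</h1><p>This is <strong>fast</strong> Rust-powered conversion!</p>"


def convert_if_available(html: str, **options):
    """Return Markdown, or None when html-to-markdown is not installed."""
    if importlib.util.find_spec("html_to_markdown") is None:
        return None
    # Assumed entry point: the v1-style function kept for compatibility.
    from html_to_markdown import convert_to_markdown

    return convert_to_markdown(html, **options)


# Basic conversion
basic = convert_if_available(HTML)

# With custom options (names taken from the Configuration Reference)
custom = convert_if_available(
    HTML,
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
)

print(basic)
print(custom)
```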
Rust API
use ;
For detailed Rust documentation, see crates/html-to-markdown/README.md.
CLI Usage
# Convert file
# From stdin
|
# With options
# Clean web-scraped content
Configuration
Python Configuration
All options available as keyword arguments:
=
Rust Configuration
use ;
let options = ConversionOptions ;
let markdown = convert?;
Common Use Cases
Discord/Slack Compatible Lists
=
Clean Web-Scraped HTML
=
hOCR 1.2 Support
Complete hOCR 1.2 specification compliance:
# Basic hOCR conversion (document structure)
=
# With table extraction from bounding boxes
=
hOCR Features:
- ✅ All 40 element types (logical structure, typesetting, floats, inline, engine-specific)
- ✅ All 20+ properties (bbox, baseline, textangle, poly, confidence scores, fonts, etc.)
- ✅ All 5 metadata fields (system, capabilities, languages, scripts, page count)
- ✅ Semantic markdown conversion (headings, sections, quotes, images, math, etc.)
For complete hOCR documentation, see README_PYPI.md.
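For orientation, hOCR stores layout data such as `bbox` and `x_wconf` in each element's `title` attribute as semicolon-separated properties. The sketch below parses such a string and groups left edges into columns using a pixel threshold, in the spirit of `hocr_table_column_threshold`. It is an illustration under simplifying assumptions (no quoted values containing semicolons, columns detected from left edges only), not the library's actual algorithm; all function names are hypothetical.

```python
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]


def parse_hocr_title(title: str) -> Dict[str, List[str]]:
    """Split an hOCR title attribute into {property: [values]}."""
    props: Dict[str, List[str]] = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def bbox_of(title: str) -> BBox:
    """Return (x0, y0, x1, y1) from the bbox property."""
    x0, y0, x1, y1 = (int(v) for v in parse_hocr_title(title)["bbox"])
    return (x0, y0, x1, y1)


def group_columns(boxes: List[BBox], threshold: int = 50) -> List[List[int]]:
    """Cluster left edges; a gap larger than threshold starts a new column."""
    columns: List[List[int]] = []
    for x0 in sorted(box[0] for box in boxes):
        if columns and x0 - columns[-1][-1] <= threshold:
            columns[-1].append(x0)
        else:
            columns.append([x0])
    return columns


titles = [
    "bbox 100 200 180 230; x_wconf 95",
    "bbox 110 240 190 270; x_wconf 91",
    "bbox 400 200 470 230; x_wconf 88",
    "bbox 405 240 480 270; x_wconf 93",
]
boxes = [bbox_of(t) for t in titles]
print(group_columns(boxes))  # two clusters of left edges
```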
Configuration Reference
ConversionOptions
| Option | Type | Default | Description |
|---|---|---|---|
| heading_style | str | "atx" | Heading format: "atx" (#), "atx_closed" (# #), "underlined" (===) |
| list_indent_width | int | 2 | Spaces per list indent level (CommonMark: 2) |
| list_indent_type | str | "spaces" | "spaces" or "tabs" |
| bullets | str | "*+-" | Bullet chars for unordered lists (cycles through levels) |
| strong_em_symbol | str | "*" | Symbol for bold/italic: "*" or "_" |
| escape_asterisks | bool | True | Escape * in text |
| escape_underscores | bool | True | Escape _ in text |
| escape_misc | bool | False | Escape other Markdown special chars |
| code_language | str | "" | Default language for code blocks |
| code_block_style | str | "backticks" | "indented" (4 spaces), "backticks" (```), "tildes" (~~~) |
| highlight_style | str | "double-equal" | "double-equal" (==), "html" (<mark>), "bold" (**), "none" |
| extract_metadata | bool | True | Extract HTML metadata as comment |
| hocr_extract_tables | bool | True | Enable hOCR table extraction |
| hocr_table_column_threshold | int | 50 | Column detection threshold (pixels) |
| hocr_table_row_threshold_ratio | float | 0.5 | Row grouping threshold ratio |
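To make a few of these options concrete, the sketch below renders a heading in each `heading_style` and a nested list whose bullet cycles through `bullets` by nesting level, indented by `list_indent_width` spaces. It is a standalone illustration of the described behavior, not the library's code; the fallback to ATX for underlined headings below level 2 is an assumption of this sketch.

```python
from typing import List, Union

Item = Union[str, list]


def render_heading(text: str, level: int = 1, style: str = "atx") -> str:
    """Render one heading in the given style."""
    marks = "#" * level
    if style == "atx":
        return f"{marks} {text}"
    if style == "atx_closed":
        return f"{marks} {text} {marks}"
    if style == "underlined":
        # Underlined (setext) headings exist only for levels 1 and 2;
        # deeper levels fall back to ATX here (an assumption).
        if level > 2:
            return f"{marks} {text}"
        return f"{text}\n{('=' if level == 1 else '-') * len(text)}"
    raise ValueError(f"unknown heading style: {style!r}")


def render_nested_list(
    items: List[Item], bullets: str = "*+-", indent_width: int = 2, depth: int = 0
) -> List[str]:
    """Render nested lists; the bullet character cycles with depth."""
    bullet = bullets[depth % len(bullets)]
    lines: List[str] = []
    for item in items:
        if isinstance(item, list):
            lines.extend(render_nested_list(item, bullets, indent_width, depth + 1))
        else:
            lines.append(" " * (indent_width * depth) + f"{bullet} {item}")
    return lines


print(render_heading("Hello", 2, "atx_closed"))
print("\n".join(render_nested_list(["a", ["b", ["c", ["d"]]]])))
```

With the default `"*+-"`, the fourth nesting level wraps around to `*` again.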
PreprocessingOptions
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | False | Enable HTML preprocessing |
| preset | str | "standard" | "minimal", "standard", "aggressive" |
| remove_navigation | bool | True | Remove <nav> and navigation elements |
| remove_forms | bool | True | Remove <form> and form inputs |
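To show what `remove_navigation` and `remove_forms` amount to conceptually, here is a minimal standard-library sketch that drops `<nav>` and `<form>` subtrees before conversion. It is not the library's preprocessing implementation: it assumes reasonably well-formed HTML, discards comments and doctype declarations, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class StripNavAndForms(HTMLParser):
    """Re-emit HTML while skipping <nav> and <form> subtrees."""

    SKIPPED = {"nav", "form"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.depth = 0  # how many skipped elements we are currently inside
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, verbatim.
        if self.depth == 0 and tag not in self.SKIPPED:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")


def strip_nav_and_forms(html: str) -> str:
    parser = StripNavAndForms()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<nav><a href="/">Home</a></nav><p>Hi &amp; bye<br/></p><form><input name="q"></form>'
print(strip_nav_and_forms(page))
```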
CLI Options
All Python options are available as CLI flags. Use html-to-markdown --help for full reference.
Common CLI flags:
- --heading-style <STYLE>: atx, atx-closed, underlined
- --list-indent-width <N>: Number of spaces for list indentation
- --bullets <CHARS>: Bullet characters (e.g., *+-)
- --code-language <LANG>: Default language for code blocks
- --preprocess: Enable HTML preprocessing
- --preset <PRESET>: Preprocessing preset (minimal, standard, aggressive)
- -o, --output <FILE>: Write output to file
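When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below builds an argv list using only the flags listed above; passing the input file as a positional argument is an assumption (check `html-to-markdown --help`), and `build_cli_args` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def build_cli_args(
    input_path: str,
    *,
    output: Optional[str] = None,
    heading_style: Optional[str] = None,
    list_indent_width: Optional[int] = None,
    bullets: Optional[str] = None,
    code_language: Optional[str] = None,
    preprocess: bool = False,
    preset: Optional[str] = None,
) -> List[str]:
    """Build an argv list for the html-to-markdown binary."""
    # Assumption: the input file is given as a positional argument.
    args = ["html-to-markdown", input_path]
    if heading_style is not None:
        args += ["--heading-style", heading_style]
    if list_indent_width is not None:
        args += ["--list-indent-width", str(list_indent_width)]
    if bullets is not None:
        args += ["--bullets", bullets]
    if code_language is not None:
        args += ["--code-language", code_language]
    if preprocess:
        args.append("--preprocess")
    if preset is not None:
        args += ["--preset", preset]
    if output is not None:
        args += ["-o", output]
    return args


argv = build_cli_args(
    "page.html", output="page.md", heading_style="atx", preprocess=True, preset="aggressive"
)
print(shlex.join(argv))  # shell-quoted form, suitable for logging
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the binary is installed.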
Upgrading from v1.x
Backward Compatibility
Existing v1 code works without changes:
= # Still works!
Modern API (Recommended)
For new projects, use the dataclass-based API:
=
=
What Changed in v2
Core Rewrite:
- Complete Rust rewrite using tl HTML parser
- 19-30x performance improvement over v1
- CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
- No BeautifulSoup or lxml dependencies
Removed Features:
- code_language_callback - use code_language for default language
- strip / convert options - use strip_tags or preprocessing
- convert_to_markdown_stream() - not supported in v2
Planned:
- custom_converters - planned for future release
See CHANGELOG.md for complete v1 vs v2 comparison and migration guide.
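When migrating a larger codebase, a quick audit of the keyword arguments passed to v1 calls can surface the removed and not-yet-available options listed above. The helper below is a hypothetical sketch built only from that list; it covers keyword arguments, so calls to `convert_to_markdown_stream()` still need to be found separately.

```python
from typing import Any, Dict, List

# Options removed in v2, with the replacement suggested above.
REMOVED = {
    "code_language_callback": "use code_language for default language",
    "strip": "use strip_tags or preprocessing",
    "convert": "use strip_tags or preprocessing",
}

# Options planned for a future release.
PLANNED = {"custom_converters": "planned for future release"}


def audit_v1_kwargs(kwargs: Dict[str, Any]) -> List[str]:
    """Return one note per keyword argument that needs attention in v2."""
    notes = []
    for name in sorted(kwargs):
        if name in REMOVED:
            notes.append(f"{name}: removed in v2 ({REMOVED[name]})")
        elif name in PLANNED:
            notes.append(f"{name}: not available yet ({PLANNED[name]})")
    return notes


legacy_call = {"heading_style": "atx", "strip": ["script"], "custom_converters": {}}
for note in audit_v1_kwargs(legacy_call):
    print(note)
```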
Kreuzberg Ecosystem
html-to-markdown is part of the Kreuzberg ecosystem, a comprehensive framework for document intelligence and processing. While html-to-markdown focuses on converting HTML to Markdown with maximum performance, Kreuzberg provides a complete solution for:
- Document Extraction: Extract text, images, and metadata from 50+ document formats
- OCR Processing: Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
- Table Extraction: Vision-based and OCR-based table detection
- Document Classification: Automatic detection of contracts, forms, invoices, etc.
- RAG Pipelines: Integration with retrieval-augmented generation workflows
Learn more at kreuzberg.dev or join our Discord community.
Contributing
See CONTRIBUTING.md for development setup, testing, and contribution guidelines.
License
MIT License - see LICENSE for details.
Acknowledgments
Version 1 started as a fork of markdownify, rewritten, extended, and enhanced with better typing and features. Version 2 is a complete Rust rewrite for high performance.
Support
If you find this library useful, consider:
Your support helps maintain and improve this library!