html-to-markdown-cli-2.0.1 is not a library.

html-to-markdown

High-performance HTML to Markdown converter built with Rust. Available as:

Rust crate (html-to-markdown-rs on crates.io)
Python package (html-to-markdown on PyPI)
CLI binary (via Homebrew, Cargo, or direct download)

Cross-platform support for Linux, macOS, and Windows.

Part of the Kreuzberg ecosystem for document intelligence.

📚 Documentation

Python Users - Python package documentation and examples
Rust Users - Rust crate documentation and API reference
Contributing - Development setup and contribution guidelines
Changelog - Version history and migration guides

⚡ Benchmarks

Throughput (Python API)

Real Wikipedia documents on Apple M4:

Document	Size	Latency	Throughput	Docs/sec
Lists (Timeline)	129KB	0.62ms	208 MB/s	1,613
Tables (Countries)	360KB	2.02ms	178 MB/s	495
Mixed (Python wiki)	656KB	4.56ms	144 MB/s	219

Throughput ranges from 144 to 208 MB/s across these documents, decreasing gradually as document size grows.
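The Throughput and Docs/sec columns follow from the Size and Latency columns. The sketch below rechecks that arithmetic. It assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes), which the benchmark description does not state explicitly.

```python
# Recompute throughput and docs/sec from the size and latency columns above.
# Assumption: decimal units (1 KB = 1_000 bytes, 1 MB = 1_000_000 bytes).

ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]


def throughput_mb_per_s(size_kb: float, latency_ms: float) -> float:
    """Bytes processed per second, expressed in MB/s."""
    bytes_per_second = (size_kb * 1_000) / (latency_ms / 1_000)
    return bytes_per_second / 1_000_000


def docs_per_second(latency_ms: float) -> float:
    """How many conversions of this latency fit into one second."""
    return 1_000 / latency_ms


for name, size_kb, latency_ms in ROWS:
    mb_s = throughput_mb_per_s(size_kb, latency_ms)
    dps = docs_per_second(latency_ms)
    print(f"{name}: {mb_s:.0f} MB/s, {dps:,.0f} docs/sec")
```

Under that assumption the computed values match the table: 208, 178, and 144 MB/s, and 1,613, 495, and 219 docs/sec.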

Memory Usage

Document Size	Memory Delta	Peak RSS	Leak Detection
10KB	< 2 MB	< 20 MB	✅ None
50KB	< 8 MB	< 35 MB	✅ None
500KB	< 40 MB	< 80 MB	✅ None

Memory usage scales linearly with document size and stays stable across 50+ repeated conversions.
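The README does not describe how these numbers were measured. As one possible way to run a similar check yourself, here is a minimal sketch that tracks peak RSS across repeated conversions. It is Unix-only (it relies on the stdlib `resource` module), and `check_repeated_conversions` and `convert_fn` are hypothetical names; pass in whichever converter you want to test.

```python
# Sketch of a repeated-conversion memory check (Unix-only, stdlib `resource`).
# `convert_fn` is a stand-in for the converter under test.
import resource
import sys
from typing import Callable


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def check_repeated_conversions(
    convert_fn: Callable[[str], str], html: str, repeats: int = 50
) -> dict:
    """Run `convert_fn` repeatedly and report how much peak RSS grew."""
    convert_fn(html)  # warm-up so one-time allocations are not counted
    before = peak_rss_mib()
    for _ in range(repeats):
        convert_fn(html)
    after = peak_rss_mib()
    return {
        "repeats": repeats,
        "peak_before_mib": before,
        "peak_after_mib": after,
        "growth_mib": after - before,
    }


if __name__ == "__main__":
    # Placeholder converter so the sketch runs without the package installed.
    report = check_repeated_conversions(lambda s: s.upper(), "<p>hello</p>" * 1_000)
    print(report)
```

A growth value that keeps climbing as `repeats` increases would hint at a leak; a value near zero after warm-up is what "stable" would look like under this measurement.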

v2 is 19-30x faster than the v1 Python/BeautifulSoup implementation.

Features

🚀 Blazing Fast: Pure Rust core with ultra-fast tl HTML parser
🐍 Python Bindings: Clean Python API via PyO3 with full type hints
🦀 Native CLI: Rust CLI binary with comprehensive options
📊 hOCR 1.2 Compliant: Full support for all 40+ elements and 20+ properties
📝 CommonMark Compliant: Follows CommonMark specification for list formatting
🎯 Type Safe: Full type hints and .pyi stubs for excellent IDE support
🌍 Cross-Platform: Wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), Windows (x86_64)
✅ Well-Tested: 900+ tests with dual Python + Rust coverage

Installation

📦 Package Names: Due to a naming conflict on crates.io, the Rust crate is published as html-to-markdown-rs, while the Python package remains html-to-markdown on PyPI. The CLI binary name is html-to-markdown for both.

Python Package

pip install html-to-markdown

Rust Library

cargo add html-to-markdown-rs

CLI Binary

via Homebrew (macOS/Linux)

brew tap goldziher/tap
brew install html-to-markdown

via Cargo

cargo install html-to-markdown-cli

Direct Download

Download pre-built binaries from GitHub Releases.

Quick Start

Python API

Simple function-based API:

from html_to_markdown import convert_to_markdown

html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
    <li>Blazing fast</li>
    <li>Type safe</li>
    <li>Easy to use</li>
</ul>
"""

# Basic conversion
markdown = convert_to_markdown(html)

# With custom options
markdown = convert_to_markdown(
    html,
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
)

print(markdown)

Output:

# Welcome

This is **fast** Rust-powered conversion!

* Blazing fast
* Type safe
* Easy to use

For detailed Python documentation, see README_PYPI.md.

Rust API

use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <h1>Welcome</h1>
        <p>This is <strong>fast</strong> conversion!</p>
        <ul>
            <li>Blazing fast</li>
            <li>Type safe</li>
            <li>Easy to use</li>
        </ul>
    "#;

    // Basic conversion
    let markdown = convert(html, None)?;

    // With custom options
    let options = ConversionOptions {
        heading_style: HeadingStyle::Atx,
        bullets: "*+-".to_string(),
        ..Default::default()
    };
    let markdown = convert(html, Some(options))?;

    println!("{}", markdown);
    Ok(())
}

For detailed Rust documentation, see crates/html-to-markdown/README.md.

CLI Usage

# Convert file
html-to-markdown input.html > output.md

# From stdin
cat input.html | html-to-markdown > output.md

# With options
html-to-markdown --heading-style atx --list-indent-width 2 input.html

# Clean web-scraped content
html-to-markdown \
    --preprocess \
    --preset aggressive \
    --no-extract-metadata \
    scraped.html > clean.md

Configuration

Python Configuration

All options available as keyword arguments:

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(
    html,
    # Heading options
    heading_style="atx",  # "atx", "atx_closed", "underlined"
    # List options
    list_indent_width=2,  # Discord/Slack: use 2
    bullets="*+-",  # Bullet characters (cycles through levels)
    # Text formatting
    strong_em_symbol="*",  # "*" or "_"
    escape_asterisks=True,  # Escape * in text
    escape_underscores=True,  # Escape _ in text
    # Code blocks
    code_language="python",  # Default code block language
    code_block_style="backticks",  # "indented", "backticks", "tildes"
    # HTML preprocessing
    preprocess=True,  # Enable HTML cleaning
    preprocessing_preset="standard",  # "minimal", "standard", "aggressive"
    # Metadata
    extract_metadata=True,  # Extract HTML metadata
)
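The `bullets` comment says the characters cycle through nesting levels. The sketch below illustrates that idea with a small stand-alone renderer. It is not the library's implementation: `bullet_for_depth` and `render_list` are hypothetical helpers, and nested lists are modelled as plain Python lists.

```python
# Illustration only: how a bullet string such as "*+-" can cycle by nesting
# depth. This is not the library's implementation.

def bullet_for_depth(bullets: str, depth: int) -> str:
    """Pick the bullet for a nesting depth, wrapping around at the end."""
    return bullets[depth % len(bullets)]


def render_list(
    items: list, bullets: str = "*+-", indent_width: int = 2, depth: int = 0
) -> list[str]:
    """Render nested lists; a nested Python list holds the children of the item before it."""
    lines = []
    for item in items:
        if isinstance(item, list):
            lines.extend(render_list(item, bullets, indent_width, depth + 1))
        else:
            indent = " " * (indent_width * depth)
            lines.append(f"{indent}{bullet_for_depth(bullets, depth)} {item}")
    return lines


nested = ["Fruit", ["Apple", ["Fuji", ["Early Fuji"]]], "Vegetables"]
print("\n".join(render_list(nested)))
# * Fruit
#   + Apple
#     - Fuji
#       * Early Fuji
# * Vegetables
```

Items at the same depth share a bullet, and the fourth level wraps back to the first character.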

Rust Configuration

use html_to_markdown_rs::{
    convert, CodeBlockStyle, ConversionOptions, HeadingStyle,
    PreprocessingOptions, PreprocessingPreset,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sample input so the example compiles on its own
    let html = "<h1>Title</h1><p>Body text</p>";

    let options = ConversionOptions {
        // Heading options
        heading_style: HeadingStyle::Atx,

        // List options
        list_indent_width: 2,
        bullets: "*+-".to_string(),

        // Text formatting
        strong_em_symbol: '*',
        escape_asterisks: false,
        escape_underscores: false,

        // Code blocks
        code_block_style: CodeBlockStyle::Backticks,
        code_language: "python".to_string(),

        // HTML preprocessing
        preprocessing: PreprocessingOptions {
            enabled: true,
            preset: PreprocessingPreset::Standard,
            ..Default::default()
        },

        // Everything else keeps its default value
        ..Default::default()
    };

    let markdown = convert(html, Some(options))?;
    println!("{}", markdown);
    Ok(())
}

Common Use Cases

Discord/Slack Compatible Lists

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, list_indent_width=2)

Clean Web-Scraped HTML

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(
    scraped_html,
    preprocess=True,
    preprocessing_preset="aggressive",
)
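For a folder of scraped pages, the same call can be wrapped in a small batch helper. This is a sketch under stated assumptions: `convert_directory` and `default_converter` are hypothetical names, and only `convert_to_markdown` with its `preprocess` and `preprocessing_preset` arguments comes from the example above. The converter is injectable so the helper can be tried without the package installed.

```python
# Sketch: convert every .html file in a directory to .md.
# `convert_directory` and `default_converter` are hypothetical helper names.
from pathlib import Path
from typing import Callable, Optional


def default_converter(html: str) -> str:
    # Imported lazily so this module loads even without the package installed.
    from html_to_markdown import convert_to_markdown

    return convert_to_markdown(
        html,
        preprocess=True,
        preprocessing_preset="aggressive",
    )


def convert_directory(
    src: Path, dst: Path, converter: Optional[Callable[[str], str]] = None
) -> list[Path]:
    """Write one .md file into `dst` for each .html file found in `src`."""
    converter = converter or default_converter
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for html_file in sorted(src.glob("*.html")):
        markdown = converter(html_file.read_text(encoding="utf-8"))
        target = dst / (html_file.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_directory(Path("scraped"), Path("clean"))` would then use the aggressive preset for every file; the directory names here are placeholders.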

hOCR 1.2 Support

Complete hOCR 1.2 specification compliance:

from html_to_markdown import convert_to_markdown

# Basic hOCR conversion (document structure)
markdown = convert_to_markdown(hocr_html)

# With table extraction from bounding boxes
markdown = convert_to_markdown(
    hocr_html,
    hocr_extract_tables=True,
    hocr_table_column_threshold=50,
)

hOCR Features:

✅ All 40 element types (logical structure, typesetting, floats, inline, engine-specific)
✅ All 20+ properties (bbox, baseline, textangle, poly, confidence scores, fonts, etc.)
✅ All 5 metadata fields (system, capabilities, languages, scripts, page count)
✅ Semantic markdown conversion (headings, sections, quotes, images, math, etc.)

For complete hOCR documentation, see README_PYPI.md.
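To give a feel for what bounding-box based table extraction works with, the sketch below parses the `bbox` property that hOCR stores in an element's `title` attribute and groups left edges into columns by a pixel threshold. The grouping rule is an assumption about what a column threshold of 50 pixels could mean; it is not the library's algorithm, and `parse_bbox` and `group_columns` are hypothetical helpers.

```python
# Illustration only. parse_bbox follows the hOCR convention of storing
# properties in the `title` attribute ("bbox x0 y0 x1 y1; ..."). group_columns
# is a guess at how a pixel threshold might separate columns.
from typing import Optional


def parse_bbox(title: str) -> Optional[tuple[int, int, int, int]]:
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, if present."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            x0, y0, x1, y1 = (int(p) for p in parts[1:])
            return (x0, y0, x1, y1)
    return None


def group_columns(left_edges: list[int], threshold: int = 50) -> list[list[int]]:
    """Group left edges; a gap larger than `threshold` px starts a new column."""
    columns: list[list[int]] = []
    for x in sorted(left_edges):
        if columns and x - columns[-1][-1] <= threshold:
            columns[-1].append(x)
        else:
            columns.append([x])
    return columns


titles = [
    "bbox 100 50 180 80; x_wconf 96",
    "bbox 104 90 200 120; x_wconf 93",
    "bbox 400 50 470 80; x_wconf 91",
    "bbox 410 90 480 120; x_wconf 95",
]
edges = [parse_bbox(t)[0] for t in titles]
print(group_columns(edges))  # [[100, 104], [400, 410]]
```

With these made-up coordinates, words starting near x=100 and x=400 fall into two columns because the gap between them exceeds the threshold.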

Configuration Reference

ConversionOptions

Option	Type	Default	Description
`heading_style`	str	`"atx"`	Heading format: `"atx"` (#), `"atx_closed"` (# #), `"underlined"` (===)
`list_indent_width`	int	`2`	Spaces per list indent level (CommonMark: 2)
`list_indent_type`	str	`"spaces"`	`"spaces"` or `"tabs"`
`bullets`	str	`"*+-"`	Bullet chars for unordered lists (cycles through levels)
`strong_em_symbol`	str	`"*"`	Symbol for bold/italic: `"*"` or `"_"`
`escape_asterisks`	bool	`True`	Escape `*` in text
`escape_underscores`	bool	`True`	Escape `_` in text
`escape_misc`	bool	`False`	Escape other Markdown special chars
`code_language`	str	`""`	Default language for code blocks
`code_block_style`	str	`"backticks"`	`"indented"` (4 spaces), `"backticks"` (```), `"tildes"` (~~~)
`highlight_style`	str	`"double-equal"`	`"double-equal"` (==), `"html"` (`<mark>`), `"bold"` (**), `"none"`
`extract_metadata`	bool	`True`	Extract HTML metadata as comment
`hocr_extract_tables`	bool	`True`	Enable hOCR table extraction
`hocr_table_column_threshold`	int	`50`	Column detection threshold (pixels)
`hocr_table_row_threshold_ratio`	float	`0.5`	Row grouping threshold ratio
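As a quick illustration of the three `heading_style` values in the table, the sketch below renders the same heading in each format. It mirrors the descriptions in the table rather than the library's code, and the fallback to ATX for underlined headings below level 2 is an assumption (underlined headings only exist for levels 1 and 2 in Markdown).

```python
# Illustration of the three heading formats named in the table; this mirrors
# the descriptions, not the library's code.

def render_heading(text: str, level: int, style: str = "atx") -> str:
    if style == "underlined" and level in (1, 2):
        # Underlined headings use "=" for level 1 and "-" for level 2.
        underline = ("=" if level == 1 else "-") * len(text)
        return f"{text}\n{underline}"
    hashes = "#" * level
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    # "atx", and the assumed fallback for underlined headings deeper than level 2.
    return f"{hashes} {text}"


for style in ("atx", "atx_closed", "underlined"):
    print(render_heading("Welcome", 1, style))
    print()
```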

PreprocessingOptions

Option	Type	Default	Description
`enabled`	bool	`False`	Enable HTML preprocessing
`preset`	str	`"standard"`	`"minimal"`, `"standard"`, `"aggressive"`
`remove_navigation`	bool	`True`	Remove `<nav>` and navigation elements
`remove_forms`	bool	`True`	Remove `<form>` and form inputs

CLI Options

All Python options are available as CLI flags. Use html-to-markdown --help for full reference.

Common CLI flags:

--heading-style <STYLE>: atx, atx-closed, underlined
--list-indent-width <N>: Number of spaces for list indentation
--bullets <CHARS>: Bullet characters (e.g., *+-)
--code-language <LANG>: Default language for code blocks
--preprocess: Enable HTML preprocessing
--preset <PRESET>: Preprocessing preset (minimal, standard, aggressive)
-o, --output <FILE>: Write output to file
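If you want to call the CLI from a script, a thin wrapper can assemble the argument list from the flags above. This is a sketch: `build_cli_args` and `run_cli` are hypothetical helper names, only flags listed in this section are used, and `run_cli` assumes the `html-to-markdown` binary is on your `PATH`.

```python
# Sketch: drive the CLI from Python using only the flags listed above.
# `build_cli_args` and `run_cli` are hypothetical helper names.
import subprocess
from typing import Optional


def build_cli_args(
    input_path: str,
    output_path: Optional[str] = None,
    heading_style: Optional[str] = None,
    list_indent_width: Optional[int] = None,
    preprocess: bool = False,
    preset: Optional[str] = None,
) -> list[str]:
    """Assemble the command line; unset options are simply left out."""
    args = ["html-to-markdown"]
    if heading_style is not None:
        args += ["--heading-style", heading_style]
    if list_indent_width is not None:
        args += ["--list-indent-width", str(list_indent_width)]
    if preprocess:
        args.append("--preprocess")
    if preset is not None:
        args += ["--preset", preset]
    if output_path is not None:
        args += ["--output", output_path]
    args.append(input_path)
    return args


def run_cli(args: list[str]) -> str:
    """Run the binary and return its stdout; raises if it exits non-zero."""
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


print(build_cli_args("scraped.html", preprocess=True, preset="aggressive"))
# ['html-to-markdown', '--preprocess', '--preset', 'aggressive', 'scraped.html']
```

Passing a list to `subprocess.run` avoids shell quoting issues with file names that contain spaces.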

Upgrading from v1.x

Backward Compatibility

Existing v1 code works without changes:

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, heading_style="atx")  # Still works!

Modern API (Recommended)

For new projects, use the dataclass-based API:

from html_to_markdown import convert, ConversionOptions

options = ConversionOptions(heading_style="atx", list_indent_width=2)
markdown = convert(html, options)

What Changed in v2

Core Rewrite:

Complete Rust rewrite using tl HTML parser
19-30x performance improvement over v1
CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
No BeautifulSoup or lxml dependencies

Removed Features:

code_language_callback - use code_language for default language
strip / convert options - use strip_tags or preprocessing
convert_to_markdown_stream() - not supported in v2

Planned:

custom_converters - planned for future release
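Before upgrading, it can help to scan existing call sites for options from the lists above. The sketch below does that for a dictionary of keyword arguments. The option names and suggested replacements are taken from those lists; `find_v1_only_options` is a hypothetical helper, not part of the package.

```python
# Sketch of a pre-upgrade check. Option names and notes come from the
# "Removed Features" and "Planned" lists; the helper itself is hypothetical.

MIGRATION_NOTES = {
    "code_language_callback": "removed; set a default with code_language",
    "strip": "removed; use strip_tags or preprocessing",
    "convert": "removed; use strip_tags or preprocessing",
    "custom_converters": "not available yet; planned for a future release",
}


def find_v1_only_options(kwargs: dict) -> dict:
    """Return a note for each keyword argument that is listed as removed or planned."""
    return {name: MIGRATION_NOTES[name] for name in kwargs if name in MIGRATION_NOTES}


old_call = {"heading_style": "atx", "strip": ["script"], "code_language_callback": None}
for name, note in find_v1_only_options(old_call).items():
    print(f"{name}: {note}")
```

Options that still exist in v2, such as `heading_style`, pass through without a note.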

See CHANGELOG.md for complete v1 vs v2 comparison and migration guide.

Kreuzberg Ecosystem

html-to-markdown is part of the Kreuzberg ecosystem, a comprehensive framework for document intelligence and processing. While html-to-markdown focuses on converting HTML to Markdown with maximum performance, Kreuzberg provides a complete solution for:

Document Extraction: Extract text, images, and metadata from 50+ document formats
OCR Processing: Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
Table Extraction: Vision-based and OCR-based table detection
Document Classification: Automatic detection of contracts, forms, invoices, etc.
RAG Pipelines: Integration with retrieval-augmented generation workflows

Learn more at kreuzberg.dev or join our Discord community.

Contributing

See CONTRIBUTING.md for development setup, testing, and contribution guidelines.

License

MIT License - see LICENSE for details.

Acknowledgments

Version 1 started as a fork of markdownify that was rewritten, extended, and enhanced with better typing and features. Version 2 is a complete Rust rewrite for high performance.

Support

Your support helps maintain and improve this library!
