html-to-markdown

High-performance HTML to Markdown converter built with Rust.

Fast, reliable HTML to Markdown conversion with full CommonMark compliance. Built with html5ever for correctness and ammonia for safe HTML preprocessing.

Rust Library

Installation

[dependencies]
html-to-markdown-rs = "2.0"

Basic Usage

use html_to_markdown_rs::convert;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <h1>Welcome</h1>
        <p>This is <strong>fast</strong> conversion!</p>
        <ul>
            <li>Built with Rust</li>
            <li>CommonMark compliant</li>
        </ul>
    "#;

    let markdown = convert(html, None)?;
    println!("{}", markdown);
    Ok(())
}

Configuration

use html_to_markdown_rs::{
    convert, CodeBlockStyle, ConversionOptions, HeadingStyle, ListIndentType, NewlineStyle,
};

// Override only the fields of interest; everything else keeps its default.
let options = ConversionOptions {
    heading_style: HeadingStyle::Atx,
    list_indent_width: 2,
    list_indent_type: ListIndentType::Spaces,
    bullets: "-".to_string(),
    strong_em_symbol: '*',
    escape_asterisks: false,
    escape_underscores: false,
    newline_style: NewlineStyle::Backslash,
    code_block_style: CodeBlockStyle::Indented,
    ..Default::default()
};

let markdown = convert(html, Some(options))?;

With Preprocessing

use html_to_markdown_rs::{convert, ConversionOptions, PreprocessingPreset};

let mut options = ConversionOptions::default();
options.preprocessing.enabled = true;
options.preprocessing.preset = PreprocessingPreset::Aggressive;
options.preprocessing.remove_navigation = true;
options.preprocessing.remove_forms = true;

let markdown = convert(scraped_html, Some(options))?;

hOCR Table Extraction

use html_to_markdown_rs::convert;

// hOCR documents (from Tesseract, etc.) are detected automatically.
// Tables and spatial layout are reconstructed without additional options.
let markdown = convert(hocr_html, None)?;

Python Library

Installation

pip install html-to-markdown

V2 API (Recommended)

Clean, type-safe configuration with dataclasses:

from html_to_markdown import convert, ConversionOptions

# Basic conversion
markdown = convert(html)

# With options
options = ConversionOptions(
    heading_style="atx",  # "atx", "atx_closed", "underlined"
    list_indent_width=2,  # CommonMark default
    bullets="-",  # Consistent bullet style
    strong_em_symbol="*",  # "*" or "_"
    escape_asterisks=False,  # Minimal escaping (CommonMark)
    escape_underscores=False,
    escape_misc=False,
    newline_style="backslash",  # "backslash" or "spaces"
    code_block_style="indented",  # "indented", "backticks", "tildes"
    extract_metadata=True,
    autolinks=True,
)

markdown = convert(html, options)
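
To make the three heading_style values concrete, the following standalone sketch shows what each style looks like in Markdown. It does not call html-to-markdown: render_heading is a hypothetical helper written for this illustration, and the fallback to ATX for levels above 2 in the underlined style is an assumption (underlined headings only exist for levels 1 and 2).

```python
def render_heading(text: str, level: int, style: str = "atx") -> str:
    """Render one heading in the given style (illustrative helper, not library code)."""
    if not 1 <= level <= 6:
        raise ValueError("heading level must be between 1 and 6")
    hashes = "#" * level
    if style == "atx":
        return f"{hashes} {text}"
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    if style == "underlined":
        # Underlines exist only for level 1 (=) and level 2 (-).
        # Falling back to ATX for deeper levels is an assumption.
        if level > 2:
            return f"{hashes} {text}"
        underline = ("=" if level == 1 else "-") * len(text)
        return f"{text}\n{underline}"
    raise ValueError(f"unknown heading style: {style!r}")


if __name__ == "__main__":
    for style in ("atx", "atx_closed", "underlined"):
        print(render_heading("Welcome", 1, style))
        print()
```

The style names match the string values accepted by heading_style in the example above; the rendering itself is only a picture of the expected shape, not the converter's output.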

Python Preprocessing

from html_to_markdown import convert, PreprocessingOptions

preprocessing = PreprocessingOptions(
    enabled=True,
    preset="aggressive",  # "minimal", "standard", "aggressive"
    remove_navigation=True,
    remove_forms=True,
)

markdown = convert(scraped_html, preprocessing=preprocessing)
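
As a rough picture of what remove_navigation and remove_forms mean for scraped pages, here is a standard-library sketch that drops nav and form subtrees before any conversion happens. It is not the library's preprocessing (which the introduction attributes to ammonia); NavFormStripper and strip_nav_and_forms are hypothetical names, and a real preset may remove more than these two elements.

```python
from html.parser import HTMLParser


class NavFormStripper(HTMLParser):
    """Echo HTML back, minus <nav> and <form> subtrees (comments are dropped too)."""

    SKIPPED = {"nav", "form"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed once, unchanged.
        if tag not in self.SKIPPED and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_nav_and_forms(html: str) -> str:
    parser = NavFormStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<nav><a href="/">Home</a></nav><p>Body text</p><form><input name="q"></form>'
    print(strip_nav_and_forms(page))
```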

Python hOCR Support

from html_to_markdown import convert

# hOCR documents are detected automatically; no extra configuration required.
markdown = convert(hocr_html)

V1 Compatibility API

Existing v1 code works without changes:

from html_to_markdown import convert_to_markdown

# All v1 kwargs still supported
markdown = convert_to_markdown(
    html,
    heading_style="atx",
    list_indent_width=2,
    escape_asterisks=True,
    preprocess=True,
)

CLI Installation

via Cargo

cargo install html-to-markdown-cli

via Homebrew (macOS/Linux)

brew tap goldziher/tap
brew install html-to-markdown

via uv (Python tool installer)

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Run the html-to-markdown CLI (uvx fetches the package into a temporary environment)
uvx --from html-to-markdown html-to-markdown

Download Binary

Download pre-built binaries from GitHub Releases.

CLI Usage

Basic Conversion

# From stdin
echo '<h1>Title</h1>' | html-to-markdown

# From file
html-to-markdown input.html

# To file
html-to-markdown input.html -o output.md

# From stdin to file
cat input.html | html-to-markdown > output.md

Common Options

# ATX-style headings (# Heading)
html-to-markdown --heading-style atx input.html

# 2-space list indentation (CommonMark)
html-to-markdown --list-indent-width 2 input.html

# Custom bullet style
html-to-markdown --bullets '*+-' input.html

# Escape special characters
html-to-markdown --escape-asterisks --escape-underscores input.html
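
The effect of the two escape flags can be sketched in a few lines of Python. This is an illustration only: escape_text is a hypothetical helper, and the assumption is that escaping means prefixing each matching character with a backslash so Markdown renders it literally instead of treating it as emphasis.

```python
def escape_text(text: str, escape_asterisks: bool = False,
                escape_underscores: bool = False) -> str:
    """Backslash-escape the selected characters (illustrative, not library code)."""
    targets = []
    if escape_asterisks:
        targets.append("*")
    if escape_underscores:
        targets.append("_")
    for ch in targets:
        text = text.replace(ch, "\\" + ch)
    return text


if __name__ == "__main__":
    sample = "snake_case and 2 * 3"
    print(escape_text(sample))                      # flags off: text unchanged
    print(escape_text(sample, True, True))          # both characters escaped
```

With both flags off, which is the v2 default, text passes through untouched.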

Web Scraping

# Clean web-scraped HTML
html-to-markdown \
    --preprocess \
    --preset aggressive \
    --keep-navigation false \
    --keep-forms false \
    scraped.html

Code Block Styles

# Indented code blocks (default, CommonMark)
html-to-markdown --code-block-style indented input.html

# Fenced code blocks with backticks
html-to-markdown --code-block-style backticks input.html

# With default language
html-to-markdown --code-block-style backticks --code-language python input.html
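
For readers unfamiliar with the three code block styles, this sketch renders the same snippet each way. render_code_block is a hypothetical helper, not part of html-to-markdown, and attaching the language only to fenced styles is an assumption based on how Markdown fences work (indented blocks have no place for a language tag).

```python
def render_code_block(code: str, style: str = "indented", language: str = "") -> str:
    """Render code in one of the three styles (illustrative, not library code)."""
    lines = code.rstrip("\n").split("\n")
    if style == "indented":
        # Four spaces per line; blank lines are left empty.
        return "\n".join("    " + line if line else line for line in lines)
    if style in ("backticks", "tildes"):
        fence = ("`" if style == "backticks" else "~") * 3
        return "\n".join([fence + language, *lines, fence])
    raise ValueError(f"unknown code block style: {style!r}")


if __name__ == "__main__":
    snippet = "print('hello')\n"
    for style in ("indented", "backticks", "tildes"):
        print(render_code_block(snippet, style, language="python"))
        print()
```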

Advanced Options

# Backslash line breaks (default, CommonMark)
html-to-markdown --newline-style backslash input.html

# Two-space line breaks
html-to-markdown --newline-style spaces input.html

# Custom subscript/superscript symbols
html-to-markdown --sub-symbol '~' --sup-symbol '^' input.html

# Strip specific tags (output text only)
html-to-markdown --strip-tags 'script,style' input.html

# Text wrapping
html-to-markdown --wrap --wrap-width 80 input.html
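
The two newline styles differ only in how a hard line break is marked at the end of a line. A minimal sketch, assuming the usual Markdown conventions (a trailing backslash, or two trailing spaces); join_with_hard_breaks is a hypothetical helper and not part of the CLI or library:

```python
def join_with_hard_breaks(lines: list, newline_style: str = "backslash") -> str:
    """Join lines with the hard-break marker for the chosen style."""
    markers = {"backslash": "\\", "spaces": "  "}
    if newline_style not in markers:
        raise ValueError(f"unknown newline style: {newline_style!r}")
    return (markers[newline_style] + "\n").join(lines)


if __name__ == "__main__":
    address = ["221B Baker Street", "London"]
    print(repr(join_with_hard_breaks(address, "backslash")))
    print(repr(join_with_hard_breaks(address, "spaces")))
```

The backslash form is visible in diffs and survives editors that trim trailing whitespace, which is one practical reason to prefer it.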

Shell Completions

# Bash
html-to-markdown --generate-completion bash > html-to-markdown.bash
source html-to-markdown.bash

# Zsh
html-to-markdown --generate-completion zsh > _html-to-markdown
# Move _html-to-markdown into a directory listed in $fpath

# Fish
html-to-markdown --generate-completion fish > html-to-markdown.fish
# Move html-to-markdown.fish into ~/.config/fish/completions/

Man Page

html-to-markdown --generate-man > html-to-markdown.1
man ./html-to-markdown.1

Configuration Reference

ConversionOptions

Field	Type	Default	Description
`heading_style`	enum	`Atx`	Heading format: `Atx` (#), `AtxClosed` (# #), `Underlined` (===)
`list_indent_width`	u8	`2`	Spaces per list indent level (CommonMark: 2)
`list_indent_type`	enum	`Spaces`	`Spaces` or `Tabs`
`bullets`	String	`"-"`	Bullet chars for unordered lists (cycles through levels)
`strong_em_symbol`	char	`'*'`	Symbol for bold/italic: `'*'` or `'_'`
`escape_asterisks`	bool	`false`	Escape `*` in text (minimal escaping by default)
`escape_underscores`	bool	`false`	Escape `_` in text (minimal escaping by default)
`escape_misc`	bool	`false`	Escape other Markdown special chars
`escape_ascii`	bool	`false`	Escape all ASCII punctuation
`code_language`	String	`""`	Default language for code blocks
`code_block_style`	enum	`Indented`	`Indented` (4 spaces), `Backticks` (```), `Tildes` (~~~)
`autolinks`	bool	`true`	Convert bare URLs to `<url>`
`default_title`	bool	`false`	Use href as link title if missing
`br_in_tables`	bool	`false`	Preserve `<br>` in table cells
`highlight_style`	enum	`DoubleEqual`	`DoubleEqual` (==), `Html` (`<mark>`), `Bold` (**), `None`
`extract_metadata`	bool	`true`	Extract HTML metadata as comment
`whitespace_mode`	enum	`Normalized`	`Normalized` or `Strict`
`strip_newlines`	bool	`false`	Strip newlines from input
`wrap`	bool	`false`	Enable text wrapping
`wrap_width`	usize	`80`	Wrap column width
`convert_as_inline`	bool	`false`	Treat block elements as inline
`sub_symbol`	String	`""`	Custom subscript symbol
`sup_symbol`	String	`""`	Custom superscript symbol
`newline_style`	enum	`Backslash`	`Backslash` (\) or `Spaces` (two spaces)
`keep_inline_images_in`	Vec<String>	`[]`	Elements in which inline images are kept
`strip_tags`	Vec<String>	`[]`	Tags to strip (output text only)
`debug`	bool	`false`	Enable debug output
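
The bullets and list_indent_width rows interact for nested lists. The sketch below shows one plausible reading of "cycles through levels": the bullet for a given depth is picked by position and wraps around when the string is exhausted. Both helpers are hypothetical, and the wrap-around rule is an assumption rather than confirmed library behavior.

```python
def bullet_for_depth(bullets: str, depth: int) -> str:
    """Pick the bullet for a nesting depth, wrapping around (assumed behavior)."""
    if not bullets:
        raise ValueError("bullets must contain at least one character")
    return bullets[depth % len(bullets)]


def render_list(items, bullets: str = "-", indent_width: int = 2, depth: int = 0) -> list:
    """Render nested (text, children) pairs as indented Markdown list lines."""
    lines = []
    for text, children in items:
        indent = " " * (indent_width * depth)
        lines.append(f"{indent}{bullet_for_depth(bullets, depth)} {text}")
        lines.extend(render_list(children, bullets, indent_width, depth + 1))
    return lines


if __name__ == "__main__":
    tree = [("fruit", [("apple", [("fuji", [])]), ("pear", [])]), ("veg", [])]
    print("\n".join(render_list(tree, bullets="*+-")))
```

With the default bullets value of "-", every level uses the same marker; a value such as "*+-" gives each depth its own.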

PreprocessingOptions

Field	Type	Default	Description
`enabled`	bool	`false`	Enable HTML preprocessing
`preset`	enum	`Standard`	`Minimal`, `Standard`, `Aggressive`
`remove_navigation`	bool	`true`	Remove `<nav>` and navigation elements
`remove_forms`	bool	`true`	Remove `<form>` and form inputs

V2 Changes from V1

Key Differences

V2 Defaults (CommonMark-compliant):

list_indent_width: 2 (was 4 in v1)
bullets: "-" (was "*+-" in v1)
escape_asterisks: false (was true in v1)
escape_underscores: false (was true in v1)
escape_misc: false (was true in v1)
newline_style: "backslash" (was "spaces" in v1)
code_block_style: "indented" (was "backticks" in v1)
heading_style: "atx" (was "underlined" in v1)
preprocessing.enabled: false (was true in v1)
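
When migrating, it can help to keep the old defaults in one place and apply them explicitly. The sketch below records the v1 and v2 values from the list above as plain dictionaries; v1_style_options is a hypothetical helper, and preprocessing.enabled is left out because it lives in PreprocessingOptions rather than ConversionOptions.

```python
# Defaults copied from the list above (ConversionOptions fields only).
V1_DEFAULTS = {
    "list_indent_width": 4,
    "bullets": "*+-",
    "escape_asterisks": True,
    "escape_underscores": True,
    "escape_misc": True,
    "newline_style": "spaces",
    "code_block_style": "backticks",
    "heading_style": "underlined",
}

V2_DEFAULTS = {
    "list_indent_width": 2,
    "bullets": "-",
    "escape_asterisks": False,
    "escape_underscores": False,
    "escape_misc": False,
    "newline_style": "backslash",
    "code_block_style": "indented",
    "heading_style": "atx",
}


def v1_style_options(overrides=None) -> dict:
    """Return v1 defaults as keyword arguments, with explicit overrides winning."""
    return {**V1_DEFAULTS, **(overrides or {})}


if __name__ == "__main__":
    for key, value in v1_style_options({"heading_style": "atx"}).items():
        print(f"{key} = {value!r}")
```

The resulting dictionary could presumably be unpacked into the V2 constructor, for example ConversionOptions(**v1_style_options()), since every key appears as a keyword in the V2 API example; that call is not exercised here.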

Removed Features:

code_language_callback - use code_language for default language
strip option - use strip_tags instead
convert option - all tags converted by default
convert_to_markdown_stream() - not supported by html5ever

Not Yet Implemented:

custom_converters - planned for future release

Performance

10-30x faster than the v1 Python implementation:

Document Type	Size	v1 Time	v2 Time	Speedup
Small HTML	5KB	12ms	0.8ms	15x
Medium Docs	150KB	180ms	8ms	22x
Large Docs	800KB	950ms	35ms	27x
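
The Speedup column can be reproduced from the two timing columns. This sketch uses the figures from the table, converted to microseconds to keep the arithmetic in integers, and assumes the published speedups are truncated rather than rounded (180ms / 8ms is 22.5, shown as 22x).

```python
# (document type, v1 time in microseconds, v2 time in microseconds), from the table above
BENCHMARKS = [
    ("Small HTML", 12_000, 800),
    ("Medium Docs", 180_000, 8_000),
    ("Large Docs", 950_000, 35_000),
]


def speedup(v1_us: int, v2_us: int) -> int:
    """Whole-number speedup factor, truncated toward zero."""
    return v1_us // v2_us


if __name__ == "__main__":
    for name, v1_us, v2_us in BENCHMARKS:
        print(f"{name}: {speedup(v1_us, v2_us)}x")
```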

License

MIT License

html-to-markdown-rs 2.1.2
