html-to-markdown-rs 2.1.2

High-performance HTML to Markdown converter using tl parser. Part of the Kreuzberg ecosystem.

html-to-markdown

High-performance HTML to Markdown converter built with Rust.


Fast, reliable HTML to Markdown conversion with full CommonMark compliance. Built with html5ever for correctness and ammonia for safe HTML preprocessing.

Rust Library

Installation

[dependencies]
html-to-markdown-rs = "2.0"

Basic Usage

use html_to_markdown_rs::convert;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <h1>Welcome</h1>
        <p>This is <strong>fast</strong> conversion!</p>
        <ul>
            <li>Built with Rust</li>
            <li>CommonMark compliant</li>
        </ul>
    "#;

    let markdown = convert(html, None)?;
    println!("{}", markdown);
    Ok(())
}

Configuration

use html_to_markdown_rs::{
    convert, CodeBlockStyle, ConversionOptions, HeadingStyle, ListIndentType, NewlineStyle,
};

// Override only the fields you care about; the rest come from Default.
let options = ConversionOptions {
    heading_style: HeadingStyle::Atx,
    list_indent_width: 2,
    list_indent_type: ListIndentType::Spaces,
    bullets: "-".to_string(),
    strong_em_symbol: '*',
    escape_asterisks: false,
    escape_underscores: false,
    newline_style: NewlineStyle::Backslash,
    code_block_style: CodeBlockStyle::Indented,
    ..Default::default()
};

let markdown = convert(html, Some(options))?;

With Preprocessing

use html_to_markdown_rs::{convert, ConversionOptions, PreprocessingPreset};

// Preprocessing is off by default; enable it and pick a preset.
let mut options = ConversionOptions::default();
options.preprocessing.enabled = true;
options.preprocessing.preset = PreprocessingPreset::Aggressive;
options.preprocessing.remove_navigation = true;
options.preprocessing.remove_forms = true;

let markdown = convert(scraped_html, Some(options))?;

hOCR Table Extraction

use html_to_markdown_rs::convert;

// hOCR documents (from Tesseract, etc.) are detected automatically.
// Tables and spatial layout are reconstructed without additional options.
let markdown = convert(hocr_html, None)?;

Python Library

Installation

pip install html-to-markdown

V2 API (Recommended)

Clean, type-safe configuration with dataclasses:

from html_to_markdown import convert, ConversionOptions, PreprocessingOptions

# Basic conversion
markdown = convert(html)

# With options
options = ConversionOptions(
    heading_style="atx",  # "atx", "atx_closed", "underlined"
    list_indent_width=2,  # CommonMark default
    bullets="-",  # Consistent bullet style
    strong_em_symbol="*",  # "*" or "_"
    escape_asterisks=False,  # Minimal escaping (CommonMark)
    escape_underscores=False,
    escape_misc=False,
    newline_style="backslash",  # "backslash" or "spaces"
    code_block_style="indented",  # "indented", "backticks", "tildes"
    extract_metadata=True,
    autolinks=True,
)

markdown = convert(html, options)
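
To make the escape_asterisks and escape_underscores flags more concrete, here is a minimal, self-contained sketch of backslash escaping. The function name escape_markdown is hypothetical and this is not the library's implementation; it only illustrates the kind of output difference the two flags control.

```python
import re


def escape_markdown(
    text: str,
    escape_asterisks: bool = False,
    escape_underscores: bool = False,
) -> str:
    """Backslash-escape * and/or _ in plain text (illustrative sketch only)."""
    chars = ""
    if escape_asterisks:
        chars += "*"
    if escape_underscores:
        chars += "_"
    if not chars:
        # Both flags off: text passes through untouched (minimal escaping).
        return text
    # Prefix every selected character with a single backslash.
    return re.sub(f"([{re.escape(chars)}])", r"\\\1", text)


print(escape_markdown("2 * 3 and snake_case"))
print(escape_markdown("2 * 3 and snake_case", escape_asterisks=True))
print(escape_markdown("2 * 3 and snake_case", escape_underscores=True))
```

With both flags left at their v2 defaults (False), literal asterisks and underscores are emitted as-is, which keeps output readable but can occasionally be parsed as emphasis by a Markdown renderer.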

Python Preprocessing

from html_to_markdown import (
    convert,
    ConversionOptions,
    PreprocessingOptions,
)

preprocessing = PreprocessingOptions(
    enabled=True,
    preset="aggressive",  # "minimal", "standard", "aggressive"
    remove_navigation=True,
    remove_forms=True,
)

markdown = convert(scraped_html, preprocessing=preprocessing)
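
As a rough illustration of what remove_navigation and remove_forms mean, the sketch below drops <nav> and <form> subtrees using only Python's standard library. The names ElementStripper and strip_elements are hypothetical, and the real preprocessing is done inside the Rust core, so treat this as a conceptual model rather than the library's behavior.

```python
from html.parser import HTMLParser


class ElementStripper(HTMLParser):
    """Re-emit HTML while skipping selected elements and their contents."""

    def __init__(self, drop_tags):
        super().__init__()
        self.drop_tags = set(drop_tags)
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop_tags:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if self.skip_depth == 0 and tag not in self.drop_tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Note: HTMLParser decodes character references in text by default.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_elements(html, remove_navigation=True, remove_forms=True):
    drop = set()
    if remove_navigation:
        drop.add("nav")
    if remove_forms:
        drop.add("form")
    parser = ElementStripper(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = '<nav><a href="/">Home</a></nav><p>Body</p><form><input name="q"></form>'
print(strip_elements(page))
```

Stripping this kind of page chrome before conversion is what keeps menus and search boxes out of the resulting Markdown.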

Python hOCR Support

from html_to_markdown import convert

# hOCR documents are detected automatically; no extra configuration required.
markdown = convert(hocr_html)
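
For readers unfamiliar with hOCR, the following sketch shows one plausible way to recognize such a document: hOCR marks up OCR output with class names such as ocr_page and ocrx_word, and usually an ocr-system meta tag. The function looks_like_hocr is hypothetical and is an assumption about how detection could work, not a description of the library's actual heuristic.

```python
import re

# Class names defined by the hOCR format start with "ocr_" or "ocrx_".
HOCR_CLASS = re.compile(r"""class\s*=\s*["'][^"']*\bocrx?_\w+""", re.IGNORECASE)
# hOCR producers typically declare themselves in a <meta> tag.
HOCR_META = re.compile(
    r"""<meta[^>]+name\s*=\s*["']ocr-(system|capabilities)["']""", re.IGNORECASE
)


def looks_like_hocr(html: str) -> bool:
    """Heuristic check for hOCR markup (illustrative sketch only)."""
    return bool(HOCR_META.search(html) or HOCR_CLASS.search(html))


sample = "<div class='ocr_page'><span class='ocrx_word'>Hi</span></div>"
print(looks_like_hocr(sample))
print(looks_like_hocr('<p class="intro">Hello</p>'))
```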

V1 Compatibility API

Existing v1 code works without changes:

from html_to_markdown import convert_to_markdown

# All v1 kwargs still supported
markdown = convert_to_markdown(
    html,
    heading_style="atx",
    list_indent_width=2,
    escape_asterisks=True,
    preprocess=True,
)

CLI Installation

via Cargo

cargo install html-to-markdown-cli

via Homebrew (macOS/Linux)

brew tap goldziher/tap
brew install html-to-markdown

via uv (Python tool installer)

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Run the html-to-markdown CLI through uvx (uses a temporary environment; nothing is installed permanently)
uvx --from html-to-markdown html-to-markdown

Download Binary

Download pre-built binaries from GitHub Releases.

CLI Usage

Basic Conversion

# From stdin
echo '<h1>Title</h1>' | html-to-markdown

# From file
html-to-markdown input.html

# To file
html-to-markdown input.html -o output.md

# From stdin to file
cat input.html | html-to-markdown > output.md

Common Options

# ATX-style headings (# Heading)
html-to-markdown --heading-style atx input.html

# 2-space list indentation (CommonMark)
html-to-markdown --list-indent-width 2 input.html

# Custom bullet style
html-to-markdown --bullets '*+-' input.html

# Escape special characters
html-to-markdown --escape-asterisks --escape-underscores input.html
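
The --bullets '*+-' option supplies several bullet characters that are cycled by nesting level. A minimal sketch of that idea follows, assuming the character is chosen as depth modulo the number of bullets; the helper names are hypothetical and the library's exact cycling rule may differ.

```python
def bullet_for_depth(bullets: str, depth: int) -> str:
    """Pick the bullet character for a nesting depth, wrapping around."""
    return bullets[depth % len(bullets)]


def render_list(items, bullets="-", indent_width=2, depth=0):
    """Render a nested list (strings and sub-lists) as Markdown lines."""
    lines = []
    for item in items:
        if isinstance(item, list):
            # A nested Python list represents one more level of nesting.
            lines.extend(render_list(item, bullets, indent_width, depth + 1))
        else:
            indent = " " * (indent_width * depth)
            lines.append(f"{indent}{bullet_for_depth(bullets, depth)} {item}")
    return lines


print("\n".join(render_list(["a", ["b", ["c", ["d"]]]], bullets="*+-")))
```

With the default bullets value of "-", every level uses the same character and only the indentation changes.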

Web Scraping

# Clean web-scraped HTML
html-to-markdown \
    --preprocess \
    --preset aggressive \
    --keep-navigation false \
    --keep-forms false \
    scraped.html

Code Block Styles

# Indented code blocks (default, CommonMark)
html-to-markdown --code-block-style indented input.html

# Fenced code blocks with backticks
html-to-markdown --code-block-style backticks input.html

# With default language
html-to-markdown --code-block-style backticks --code-language python input.html
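
The three code block styles differ only in how the block is delimited. The sketch below renders the same snippet in each style; render_code_block is a hypothetical helper written for illustration, not part of the CLI or library. Note that an indented block has no place for a language tag, so the language argument is ignored in that style.

```python
def render_code_block(code: str, style: str = "indented", language: str = "") -> str:
    """Render code as an indented or fenced Markdown block (sketch only)."""
    code = code.rstrip("\n")
    if style == "indented":
        # Four leading spaces per line; blank lines are left empty.
        return "\n".join("    " + line if line else line for line in code.split("\n"))
    fences = {"backticks": "`" * 3, "tildes": "~" * 3}
    fence = fences[style]
    # The language tag, if any, follows the opening fence.
    return f"{fence}{language}\n{code}\n{fence}"


for style in ("indented", "backticks", "tildes"):
    print(render_code_block("print(1)", style, language="python"))
    print()
```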

Advanced Options

# Backslash line breaks (default, CommonMark)
html-to-markdown --newline-style backslash input.html

# Two-space line breaks
html-to-markdown --newline-style spaces input.html

# Custom subscript/superscript symbols
html-to-markdown --sub-symbol '~' --sup-symbol '^' input.html

# Strip specific tags (output text only)
html-to-markdown --strip-tags 'script,style' input.html

# Text wrapping
html-to-markdown --wrap --wrap-width 80 input.html
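
The two newline styles correspond to the two hard line break forms in CommonMark: a backslash before the newline, or two trailing spaces. A small sketch, using hypothetical helper names, of how a <br> between two text runs might be rendered under each style:

```python
def render_line_break(style: str = "backslash") -> str:
    """Return the Markdown hard line break for the given style."""
    if style == "backslash":
        return "\\\n"  # a literal backslash followed by a newline
    if style == "spaces":
        return "  \n"  # two trailing spaces followed by a newline
    raise ValueError(f"unknown newline style: {style!r}")


def join_with_breaks(lines, style="backslash"):
    """Join text runs that were separated by <br> in the source HTML."""
    return render_line_break(style).join(lines)


print(repr(join_with_breaks(["first", "second"])))
print(repr(join_with_breaks(["first", "second"], style="spaces")))
```

The backslash form survives editors that trim trailing whitespace, which is one practical reason to prefer it.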

Shell Completions

# Bash
html-to-markdown --generate-completion bash > html-to-markdown.bash
source html-to-markdown.bash

# Zsh
html-to-markdown --generate-completion zsh > _html-to-markdown
# Move _html-to-markdown into a directory listed in your $fpath

# Fish
html-to-markdown --generate-completion fish > html-to-markdown.fish
# Move html-to-markdown.fish into ~/.config/fish/completions/

Man Page

html-to-markdown --generate-man > html-to-markdown.1
man ./html-to-markdown.1

Configuration Reference

ConversionOptions

| Field | Type | Default | Description |
|---|---|---|---|
| heading_style | enum | Atx | Heading format: Atx (#), AtxClosed (# #), Underlined (===) |
| list_indent_width | u8 | 2 | Spaces per list indent level (CommonMark: 2) |
| list_indent_type | enum | Spaces | Spaces or Tabs |
| bullets | String | "-" | Bullet chars for unordered lists (cycles through levels) |
| strong_em_symbol | char | '*' | Symbol for bold/italic: '*' or '_' |
| escape_asterisks | bool | false | Escape * in text (minimal escaping by default) |
| escape_underscores | bool | false | Escape _ in text (minimal escaping by default) |
| escape_misc | bool | false | Escape other Markdown special chars |
| escape_ascii | bool | false | Escape all ASCII punctuation |
| code_language | String | "" | Default language for code blocks |
| code_block_style | enum | Indented | Indented (4 spaces), Backticks (```), Tildes (~~~) |
| autolinks | bool | true | Convert bare URLs to <url> |
| default_title | bool | false | Use href as link title if missing |
| br_in_tables | bool | false | Preserve <br> in table cells |
| highlight_style | enum | DoubleEqual | DoubleEqual (==), Html (<mark>), Bold (**), None |
| extract_metadata | bool | true | Extract HTML metadata as comment |
| whitespace_mode | enum | Normalized | Normalized or Strict |
| strip_newlines | bool | false | Strip newlines from input |
| wrap | bool | false | Enable text wrapping |
| wrap_width | usize | 80 | Wrap column width |
| convert_as_inline | bool | false | Treat block elements as inline |
| sub_symbol | String | "" | Custom subscript symbol |
| sup_symbol | String | "" | Custom superscript symbol |
| newline_style | enum | Backslash | Backslash (\) or Spaces (two spaces) |
| keep_inline_images_in | Vec | [] | Elements to keep inline images |
| strip_tags | Vec | [] | Tags to strip (output text only) |
| debug | bool | false | Enable debug output |
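
To show what the three heading_style values look like in output, here is a short self-contained sketch. render_heading is a hypothetical function, and the fallback to ATX for levels 3 and deeper under "underlined" is an assumption made here because underlined (setext) headings only exist for levels 1 and 2 in CommonMark; the library may handle that case differently.

```python
def render_heading(text: str, level: int, style: str = "atx") -> str:
    """Render a heading in atx, atx_closed, or underlined style (sketch only)."""
    if style == "underlined" and level in (1, 2):
        # Setext headings: '=' underline for level 1, '-' for level 2.
        underline = ("=" if level == 1 else "-") * len(text)
        return f"{text}\n{underline}"
    hashes = "#" * level
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    # "atx", and the assumed fallback for deep "underlined" headings.
    return f"{hashes} {text}"


for style in ("atx", "atx_closed", "underlined"):
    print(render_heading("Title", 2, style))
    print()
```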

PreprocessingOptions

Field Type Default Description
|---|---|---|---|
| enabled | bool | false | Enable HTML preprocessing |
| preset | enum | Standard | Minimal, Standard, Aggressive |
| remove_navigation | bool | true | Remove <nav> and navigation elements |
| remove_forms | bool | true | Remove <form> and form inputs |

V2 Changes from V1

Key Differences

V2 Defaults (CommonMark-compliant):

  • list_indent_width: 2 (was 4 in v1)
  • bullets: "-" (was "*+-" in v1)
  • escape_asterisks: false (was true in v1)
  • escape_underscores: false (was true in v1)
  • escape_misc: false (was true in v1)
  • newline_style: "backslash" (was "spaces" in v1)
  • code_block_style: "indented" (was "backticks" in v1)
  • heading_style: "atx" (was "underlined" in v1)
  • preprocessing.enabled: false (was true in v1)
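
When migrating, it can help to see the changed defaults as data. The sketch below encodes the values from the list above in two plain dictionaries and computes which options would need to be set explicitly to reproduce v1 output. The dictionary and function names are hypothetical; preprocessing.enabled is left out because it lives on PreprocessingOptions rather than ConversionOptions.

```python
# Defaults copied from the list above (v1 values are the "was ..." entries).
V1_DEFAULTS = {
    "list_indent_width": 4,
    "bullets": "*+-",
    "escape_asterisks": True,
    "escape_underscores": True,
    "escape_misc": True,
    "newline_style": "spaces",
    "code_block_style": "backticks",
    "heading_style": "underlined",
}

V2_DEFAULTS = {
    "list_indent_width": 2,
    "bullets": "-",
    "escape_asterisks": False,
    "escape_underscores": False,
    "escape_misc": False,
    "newline_style": "backslash",
    "code_block_style": "indented",
    "heading_style": "atx",
}


def v1_style_overrides() -> dict:
    """Options that must be set explicitly in v2 to mimic v1 defaults."""
    return {key: v1 for key, v1 in V1_DEFAULTS.items() if V2_DEFAULTS[key] != v1}


for key, value in v1_style_overrides().items():
    print(f"{key} = {value!r}")
```

Since the keys match the keyword arguments shown in the V2 API example, the resulting dictionary could plausibly be passed as ConversionOptions(**v1_style_overrides()), though that usage is untested here.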

Removed Features:

  • code_language_callback - use code_language for default language
  • strip option - use strip_tags instead
  • convert option - all tags converted by default
  • convert_to_markdown_stream() - not supported by html5ever

Not Yet Implemented:

  • custom_converters - planned for future release

Performance

10-30x faster than the v1 Python implementation:

| Document Type | Size | v1 Time | v2 Time | Speedup |
|---|---|---|---|---|
| Small HTML | 5KB | 12ms | 0.8ms | 15x |
| Medium Docs | 150KB | 180ms | 8ms | 22x |
| Large Docs | 800KB | 950ms | 35ms | 27x |
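
The speedup column is simply the v1 time divided by the v2 time. A quick check of the figures above, using only the numbers from the table (the variable and function names are illustrative):

```python
# (document type, v1 time in ms, v2 time in ms), taken from the table above.
BENCHMARKS = [
    ("Small HTML", 12.0, 0.8),
    ("Medium Docs", 180.0, 8.0),
    ("Large Docs", 950.0, 35.0),
]


def speedup(v1_ms: float, v2_ms: float) -> float:
    """Ratio of v1 time to v2 time."""
    return v1_ms / v2_ms


for name, v1_ms, v2_ms in BENCHMARKS:
    print(f"{name}: {speedup(v1_ms, v2_ms):.1f}x")
```

The computed ratios are about 15.0x, 22.5x, and 27.1x, so the table's speedup column appears to round down to whole numbers.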

License

MIT License