html-to-markdown
High-performance HTML to Markdown converter built with Rust.
Fast, reliable HTML to Markdown conversion with full CommonMark compliance. Built with html5ever for correctness and ammonia for safe HTML preprocessing.
Rust Library
Installation
[]
= "2.0"
Basic Usage
use ;
Configuration
use ;
let options = ConversionOptions ;
let markdown = convert?;
With Preprocessing
use ;
let mut options = default;
options.preprocessing.enabled = true;
options.preprocessing.preset = Aggressive;
options.preprocessing.remove_navigation = true;
options.preprocessing.remove_forms = true;
let markdown = convert?;
hOCR Table Extraction
use convert;
// hOCR documents (from Tesseract, etc.) are detected automatically.
// Tables and spatial layout are reconstructed without additional options.
let markdown = convert?;
Python Library
Installation
V2 API (Recommended)
Clean, type-safe configuration with dataclasses:
# Basic conversion
=
# With options
=
=
Python Preprocessing
=
=
Python hOCR Support
# hOCR documents are detected automatically; no extra configuration required.
=
V1 Compatibility API
Existing v1 code works without changes:
# All v1 kwargs still supported
=
CLI Installation
via Cargo
via Homebrew (macOS/Linux)
via uv (Python tool installer)
# Install uv if needed
|
# Install html-to-markdown CLI
Download Binary
Download pre-built binaries from GitHub Releases.
CLI Usage
Basic Conversion
# From stdin
|
# From file
# To file
# From stdin to file
|
Common Options
# ATX-style headings (# Heading)
# 2-space list indentation (CommonMark)
# Custom bullet style
# Escape special characters
Web Scraping
# Clean web-scraped HTML
Code Block Styles
# Indented code blocks (default, CommonMark)
# Fenced code blocks with backticks
# With default language
Advanced Options
# Backslash line breaks (default, CommonMark)
# Two-space line breaks
# Custom subscript/superscript symbols
# Strip specific tags (output text only)
# Text wrapping
Shell Completions
# Bash
# Zsh
# Move to completion directory
# Fish
# Move to completion directory
Man Page
Configuration Reference
ConversionOptions
| Field | Type | Default | Description |
|---|---|---|---|
| heading_style | enum | Atx | Heading format: Atx (#), AtxClosed (# #), Underlined (===) |
| list_indent_width | u8 | 2 | Spaces per list indent level (CommonMark: 2) |
| list_indent_type | enum | Spaces | Spaces or Tabs |
| bullets | String | "-" | Bullet chars for unordered lists (cycles through levels) |
| strong_em_symbol | char | '*' | Symbol for bold/italic: '*' or '_' |
| escape_asterisks | bool | false | Escape * in text (minimal escaping by default) |
| escape_underscores | bool | false | Escape _ in text (minimal escaping by default) |
| escape_misc | bool | false | Escape other Markdown special chars |
| escape_ascii | bool | false | Escape all ASCII punctuation |
| code_language | String | "" | Default language for code blocks |
| code_block_style | enum | Indented | Indented (4 spaces), Backticks (```), Tildes (~~~) |
| autolinks | bool | true | Convert bare URLs to <url> |
| default_title | bool | false | Use href as link title if missing |
| br_in_tables | bool | false | Preserve <br> in table cells |
| highlight_style | enum | DoubleEqual | DoubleEqual (==), Html (<mark>), Bold (**), None |
| extract_metadata | bool | true | Extract HTML metadata as comment |
| whitespace_mode | enum | Normalized | Normalized or Strict |
| strip_newlines | bool | false | Strip newlines from input |
| wrap | bool | false | Enable text wrapping |
| wrap_width | usize | 80 | Wrap column width |
| convert_as_inline | bool | false | Treat block elements as inline |
| sub_symbol | String | "" | Custom subscript symbol |
| sup_symbol | String | "" | Custom superscript symbol |
| newline_style | enum | Backslash | Backslash (\) or Spaces (two spaces) |
| keep_inline_images_in | Vec<String> | [] | Elements in which inline images are kept |
| strip_tags | Vec<String> | [] | Tags to strip (output text only) |
| debug | bool | false | Enable debug output |
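
The bullets description says the characters cycle through nesting levels. A minimal sketch of one way to read that, assuming the character for a level is chosen by index modulo the length of the string. Both `bullet_for_level` and `render_nested` are hypothetical helpers written for illustration; they are not part of the library and may not match its exact behavior.

```python
def bullet_for_level(bullets: str, level: int) -> str:
    """Pick the bullet for a zero-based nesting level, wrapping around.

    Assumption: cycling means indexing modulo len(bullets).
    """
    if not bullets:
        raise ValueError("bullets must contain at least one character")
    return bullets[level % len(bullets)]


def render_nested(items, bullets="-", indent_width=2):
    """Render (level, text) pairs as an unordered Markdown list.

    The defaults mirror the v2 defaults in the table: "-" and 2 spaces.
    """
    lines = []
    for level, text in items:
        indent = " " * (indent_width * level)
        lines.append(f"{indent}{bullet_for_level(bullets, level)} {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    outline = [(0, "Fruit"), (1, "Apple"), (2, "Fuji"), (3, "Organic")]
    print(render_nested(outline, bullets="*+-", indent_width=4))
```

With a single-character string such as the v2 default "-", every level uses the same bullet; a longer string such as "*+-" wraps back to its first character at the fourth level.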
PreprocessingOptions
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable HTML preprocessing |
| preset | enum | Standard | Minimal, Standard, Aggressive |
| remove_navigation | bool | true | Remove <nav> and navigation elements |
| remove_forms | bool | true | Remove <form> and form inputs |
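
To make the effect of remove_navigation and remove_forms concrete, here is a rough sketch of the idea using only the Python standard library. It is not how the library does it (the real preprocessing is built on ammonia in Rust); `ElementStripper` and `strip_elements` are hypothetical names, and this toy version also drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class ElementStripper(HTMLParser):
    """Re-emit HTML while dropping selected elements and everything inside them."""

    def __init__(self, skip_tags):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.skip_tags = set(skip_tags)
        self.depth = 0  # > 0 while inside a skipped element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close in one step.
        if tag not in self.skip_tags and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")


def strip_elements(html, skip_tags=("nav", "form")):
    """Return html without the given elements (hypothetical helper)."""
    parser = ElementStripper(skip_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<nav><a href="/">Home</a></nav><p>Hello <b>world</b></p>'
    print(strip_elements(page))
```

Running a pass like this before conversion keeps menus and form controls out of the Markdown, which is the motivation for enabling preprocessing on web-scraped pages.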
V2 Changes from V1
Key Differences
V2 Defaults (CommonMark-compliant):
- list_indent_width: 2 (was 4 in v1)
- bullets: "-" (was "*+-" in v1)
- escape_asterisks: false (was true in v1)
- escape_underscores: false (was true in v1)
- escape_misc: false (was true in v1)
- newline_style: "backslash" (was "spaces" in v1)
- code_block_style: "indented" (was "backticks" in v1)
- heading_style: "atx" (was "underlined" in v1)
- preprocessing.enabled: false (was true in v1)
Removed Features:
- code_language_callback - use code_language for default language
- strip option - use strip_tags instead
- convert option - all tags converted by default
- convert_to_markdown_stream() - not supported by html5ever
Not Yet Implemented:
- custom_converters - planned for future release
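
When upgrading, it can help to see which changed defaults a given call site relies on implicitly. The sketch below records the v1 and v2 defaults from the list above as plain dictionaries and reports every option the caller did not set explicitly. `changed_defaults` is a hypothetical helper, the dotted key "preprocessing.enabled" is only a label for the nested option, and nothing here calls the library itself.

```python
# Defaults copied from the "V2 Defaults" list; values are plain Python
# literals, with enum values written as the lowercase strings used there.
V1_DEFAULTS = {
    "list_indent_width": 4,
    "bullets": "*+-",
    "escape_asterisks": True,
    "escape_underscores": True,
    "escape_misc": True,
    "newline_style": "spaces",
    "code_block_style": "backticks",
    "heading_style": "underlined",
    "preprocessing.enabled": True,
}

V2_DEFAULTS = {
    "list_indent_width": 2,
    "bullets": "-",
    "escape_asterisks": False,
    "escape_underscores": False,
    "escape_misc": False,
    "newline_style": "backslash",
    "code_block_style": "indented",
    "heading_style": "atx",
    "preprocessing.enabled": False,
}


def changed_defaults(explicit_options):
    """Map each option left implicit by the caller to its (v1, v2) defaults.

    Options the caller sets explicitly are unaffected by the default change,
    so they are omitted from the result.
    """
    return {
        name: (V1_DEFAULTS[name], V2_DEFAULTS[name])
        for name in V1_DEFAULTS
        if name not in explicit_options
    }


if __name__ == "__main__":
    # Example: a call site that only ever set heading_style and bullets.
    for name, (old, new) in changed_defaults({"heading_style", "bullets"}).items():
        print(f"{name}: {old!r} -> {new!r}")
```

Any option reported here will behave differently after the upgrade unless it is set explicitly to its v1 value.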
Performance
V2 is 10-30x faster than the v1 Python implementation:
| Document Type | Size | v1 Time | v2 Time | Speedup |
|---|---|---|---|---|
| Small HTML | 5KB | 12ms | 0.8ms | 15x |
| Medium Docs | 150KB | 180ms | 8ms | 22x |
| Large Docs | 800KB | 950ms | 35ms | 27x |
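
The Speedup column is the ratio of the two timing columns. As a quick check, the sketch below recomputes it from the table's own numbers; the exact ratios are 15, 22.5 and about 27.1, so the column appears to be rounded down. The `speedup` function is a hypothetical helper for this check only.

```python
# (document type, v1 time in ms, v2 time in ms), copied from the table above.
BENCHMARKS = [
    ("Small HTML", 12.0, 0.8),
    ("Medium Docs", 180.0, 8.0),
    ("Large Docs", 950.0, 35.0),
]


def speedup(v1_ms: float, v2_ms: float) -> float:
    """Return how many times faster v2 is than v1 for one measurement."""
    if v2_ms <= 0:
        raise ValueError("v2_ms must be positive")
    return v1_ms / v2_ms


if __name__ == "__main__":
    for name, v1_ms, v2_ms in BENCHMARKS:
        print(f"{name}: {speedup(v1_ms, v2_ms):.1f}x")
```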
License
MIT License