html-to-markdown-cli-3.6.0-rc.20 is not a library.

html-to-markdown

Fast, robust HTML → Markdown for 16 languages. A tiered converter that picks the fastest safe path for each input without losing content.

Documentation | API Reference

Highlights

16 languages, one Rust core. Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, C ABI.
Tiered dispatch. Byte scanner for clean HTML → DOM walker for complex inputs → html5ever repair for malformed HTML. Byte-equal output across tiers.
Real-HTML robust. Unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings — handled without losing content.
GFM tables, Djot output, metadata extraction, visitor API, inline images, configurable preprocessing presets.
CommonMark-compatible Markdown with GFM-style tables. output_format = "djot" switches to Djot.
116-snapshot oracle + per-group regression gates in CI. Performance and correctness both enforced.

Architecture

The converter routes each input through one of three tiers based on a fast prescan of the byte stream:

Tier-1 — Single-pass byte scanner. Walks html.as_bytes() once and emits Markdown directly. Handles 110+ HTML tags including paragraphs, headings, lists, GFM tables, links, images, inline emphasis, blockquotes, indented code blocks. Bails (returns a structured error) on any construct it cannot prove byte-equivalent to Tier-2 — custom elements, CDATA, malformed entities, nested tables, mixed table sections, multi-line table cells, etc.
Tier-2 — tl::parse DOM walker. Picks up Tier-1's bails and inputs the classifier rejected up front (non-default style options, non-Markdown output, etc.). Handles the full HTML5 spec via a tolerant DOM walk.
Tier-3 — html5ever standards-conformant parser. Engaged when Tier-2 detects HTML requiring full HTML5 repair (custom elements, structural recovery, weird namespace transitions).

The dispatcher is invisible to the caller. The same convert() call works regardless of which tier handled the input; the output is byte-identical across tiers (enforced by a 116-snapshot oracle).
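
The fall-through between tiers can be sketched as below. This is a minimal illustration, not the crate's implementation: every name (`Bail`, `Tier`, `tier1_scan`, `tier2_walk`, `tier3_repair`, `dispatch`) is hypothetical, the bail conditions are simplified stand-ins, and the "conversion" is a toy tag stripper shared by all tiers so that the output is trivially identical whichever tier runs.

```rust
/// Reasons a tier may decline an input (hypothetical names).
#[derive(Debug, PartialEq)]
enum Bail {
    Cdata,
    CustomElement,
    NeedsRepair,
}

/// Records which tier produced the output, so the routing can be inspected.
#[derive(Debug, PartialEq)]
enum Tier {
    Scanner,
    DomWalker,
    Repair,
}

/// Toy conversion shared by all tiers: drops everything between '<' and '>'.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// True if any tag name contains a hyphen, e.g. <my-widget>.
fn has_custom_element(html: &str) -> bool {
    html.split('<').skip(1).any(|rest| {
        let name = rest
            .trim_start_matches('/')
            .split(|c: char| c == '>' || c.is_whitespace())
            .next()
            .unwrap_or("");
        !name.starts_with('!') && name.contains('-')
    })
}

/// Stand-in for Tier-1: bails on anything it is not sure about.
fn tier1_scan(html: &str) -> Result<String, Bail> {
    if html.contains("<![CDATA[") {
        return Err(Bail::Cdata);
    }
    if has_custom_element(html) {
        return Err(Bail::CustomElement);
    }
    Ok(strip_tags(html))
}

/// Stand-in for Tier-2: accepts CDATA, defers custom elements to repair.
fn tier2_walk(html: &str) -> Result<String, Bail> {
    if has_custom_element(html) {
        return Err(Bail::NeedsRepair);
    }
    Ok(strip_tags(html))
}

/// Stand-in for Tier-3: the last resort, so it never bails.
fn tier3_repair(html: &str) -> String {
    strip_tags(html)
}

/// Tries each tier in order and returns the first successful result.
fn dispatch(html: &str) -> (Tier, String) {
    if let Ok(md) = tier1_scan(html) {
        return (Tier::Scanner, md);
    }
    if let Ok(md) = tier2_walk(html) {
        return (Tier::DomWalker, md);
    }
    (Tier::Repair, tier3_repair(html))
}
```

In this sketch, `dispatch("<p>Hello</p>")` stays in the scanner, while `dispatch("<my-widget>hi</my-widget>")` falls through to the repair tier. The returned `Tier` exists only to make the routing observable; the real dispatcher is described above as invisible to the caller.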

Performance

Best-of-3 measurements on the harness corpus (Apple Silicon, cargo build --release):

Fixture	Size	Time (best of 3)	Throughput
`wikipedia/medium_python.html`	1.24 MB	62.58 ms	19.0 MB/s
`wikipedia/large_rust.html`	1.07 MB	37.17 ms	27.3 MB/s
`wikipedia/small_html.html`	973 KB	29.32 ms	31.6 MB/s
`wikipedia/tables_countries.html`	756 KB	18.95 ms	38.1 MB/s
`mdream/github-markdown-complete.html`	430 KB	10.57 ms	38.7 MB/s
`mdream/react-learn.html`	265 KB	12.11 ms	20.9 MB/s
`mdream/wikipedia-small.html`	166 KB	5.63 ms	28.1 MB/s
`issues/gh-121-hacker-news.html`	57 KB	1.08 ms	50.3 MB/s
`mdream/nuxt-example.html`	3.6 KB	0.029 ms	116.1 MB/s
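
A best-of-N timing and the throughput derived from it can be sketched as follows. This is not the project's harness: `best_of` and `throughput_mib_per_s` are hypothetical helpers, and the unit choice (bytes divided by 2^20) is an assumption, since the table does not say how MB is defined. Under that assumption, 57,000 bytes in 1.08 ms works out to roughly 50.3, in line with the `gh-121-hacker-news` row.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `runs` times and returns the fastest wall-clock duration.
fn best_of<F: FnMut()>(runs: usize, mut work: F) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .min()
        .expect("runs must be at least 1")
}

/// Throughput in MiB/s, assuming 1 MiB = 2^20 bytes (an assumption, see above).
fn throughput_mib_per_s(bytes: usize, elapsed: Duration) -> f64 {
    bytes as f64 / (1024.0 * 1024.0) / elapsed.as_secs_f64()
}
```

Taking the minimum rather than the mean filters out scheduler noise and cold-cache runs, which is the usual reason to report a best-of-N figure.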

Corpus: 29 fixtures totalling 6.4 MB across clean_small, clean_medium, clean_large, spec_rules, adversarial, and fallthrough_* groups. Per-group regression thresholds (5–30%) are enforced on every PR via task bench:compare. Run task bench:run to reproduce on your hardware.
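
A per-group regression gate of the kind described above might look like the sketch below. The `GroupResult` type, the `regressions` function, and the idea of storing one baseline time per group are assumptions made for illustration; the source does not describe how `task bench:compare` is implemented or which threshold applies to which group.

```rust
/// One benchmark group with its stored baseline and allowed slowdown.
/// Field names and layout are hypothetical.
struct GroupResult<'a> {
    name: &'a str,
    baseline_ms: f64,
    current_ms: f64,
    threshold_pct: f64,
}

/// Slowdown of `current_ms` relative to `baseline_ms`, in percent.
/// Negative values mean the new run was faster.
fn slowdown_pct(baseline_ms: f64, current_ms: f64) -> f64 {
    (current_ms - baseline_ms) / baseline_ms * 100.0
}

/// Returns the names of all groups whose slowdown exceeds their threshold.
fn regressions<'a>(results: &[GroupResult<'a>]) -> Vec<&'a str> {
    results
        .iter()
        .filter(|r| slowdown_pct(r.baseline_ms, r.current_ms) > r.threshold_pct)
        .map(|r| r.name)
        .collect()
}
```

A CI job built on this would fail the PR whenever `regressions` returns a non-empty list. Tight thresholds suit stable groups and looser ones suit noisy groups, which would explain a 5–30% range, though that reading is a guess.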

Capabilities

HTML element coverage: 110+ tags handled natively in Tier-1; full HTML5 coverage via Tier-2/Tier-3 fallback.
GFM-style tables with padded cells, alignment, and pipe escaping.
Djot output: set ConversionOptions { output_format: OutputFormat::Djot, .. } to emit Djot instead of Markdown.
Metadata extraction: parse <head> into structured HtmlMetadata (open-graph, twitter, JSON-LD, microdata, RDFa, header hierarchy).
Inline images: opt-in via inline-images feature; mirrors data URIs and remote image references.
Visitor API: feature-gated traversal that lets callers transform the converted Markdown AST (visitor feature).
Configurable preprocessing: standard, strict, lenient presets — or build your own.
Tiered fallback: Tier-3 (html5ever) handles inputs the other tiers cannot, so the converter never silently corrupts malformed HTML.
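
To make the GFM table item above concrete, here is a minimal sketch of rendering a table with padded cells, alignment markers, and pipe escaping. It is not the crate's renderer: `Align`, `escape_cell`, and `render_table` are hypothetical names, only the pipe character is escaped, and cell text is always left-padded regardless of the column's alignment.

```rust
/// Column alignment as expressed in the GFM delimiter row.
#[derive(Clone, Copy)]
enum Align {
    Left,
    Center,
    Right,
}

/// Escapes pipes so cell text cannot be mistaken for a column boundary.
fn escape_cell(text: &str) -> String {
    text.replace('|', "\\|")
}

/// Escapes a row and pads or truncates it to exactly `cols` cells.
fn escape_row(cells: &[&str], cols: usize) -> Vec<String> {
    (0..cols)
        .map(|i| escape_cell(cells.get(i).copied().unwrap_or("")))
        .collect()
}

/// Builds the delimiter cell for one column; `width` must be at least 3.
fn delimiter(align: Align, width: usize) -> String {
    match align {
        Align::Left => format!(":{}", "-".repeat(width - 1)),
        Align::Center => format!(":{}:", "-".repeat(width - 2)),
        Align::Right => format!("{}:", "-".repeat(width - 1)),
    }
}

/// Joins cells into one table line, padding each cell to its column width.
fn format_line(cells: &[String], widths: &[usize]) -> String {
    let padded: Vec<String> = cells
        .iter()
        .zip(widths)
        .map(|(cell, &w)| format!("{:<width$}", cell, width = w))
        .collect();
    format!("| {} |", padded.join(" | "))
}

/// Renders a header, a delimiter row, and body rows as a GFM table.
fn render_table(header: &[&str], aligns: &[Align], rows: &[Vec<&str>]) -> String {
    let cols = header.len();
    let head = escape_row(header, cols);
    let body: Vec<Vec<String>> = rows.iter().map(|r| escape_row(r, cols)).collect();

    // Column width is the longest escaped cell, with a floor of 3 so the
    // delimiter row always has room for its hyphens and colons.
    let widths: Vec<usize> = (0..cols)
        .map(|i| {
            body.iter()
                .map(|row| row[i].chars().count())
                .chain(std::iter::once(head[i].chars().count()))
                .max()
                .unwrap_or(0)
                .max(3)
        })
        .collect();

    let delims: Vec<String> = (0..cols)
        .map(|i| delimiter(aligns.get(i).copied().unwrap_or(Align::Left), widths[i]))
        .collect();

    let mut lines = vec![format_line(&head, &widths), format_line(&delims, &widths)];
    lines.extend(body.iter().map(|row| format_line(row, &widths)));
    lines.join("\n")
}
```

Widths are computed after escaping, so a cell such as `a|b` is measured as the four characters it occupies in the output and the columns stay aligned in the raw Markdown.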

Quick Start

# Rust
cargo add html-to-markdown-rs

# Python
pip install html-to-markdown

# TypeScript / Node.js
npm install @kreuzberg/html-to-markdown-node

# Ruby
gem install html-to-markdown

# CLI
cargo install html-to-markdown-cli
# or
brew install kreuzberg-dev/tap/html-to-markdown

See the package READMEs for the full list: PHP, Go, Java, C#, Elixir, R, Dart, Kotlin (Android), Swift, Zig, WASM, and a C ABI for everything else.

Usage

convert() is the single entry point. It returns a structured ConversionResult:

# Python
from html_to_markdown import convert

result = convert("<h1>Hello</h1><p>World</p>")
print(result.content)        # # Hello\n\nWorld
print(result.metadata)       # title, links, headings, …

// TypeScript / Node.js
import { convert } from "@kreuzberg/html-to-markdown-node";

const result = convert("<h1>Hello</h1><p>World</p>");
console.log(result.content); // # Hello\n\nWorld
console.log(result.metadata); // title, links, headings, …

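// Rust
use html_to_markdown_rs::convert;

// `?` needs an enclosing function that returns Result, so the call lives in main.
// Assumes the crate's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let result = convert("<h1>Hello</h1><p>World</p>", None)?;
    // `content` is optional; fall back to an empty string when it is absent.
    println!("{}", result.content.unwrap_or_default());
    Ok(())
}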

Language Bindings

Language	Package	Install
Rust	html-to-markdown-rs	`cargo add html-to-markdown-rs`
Python	html-to-markdown	`pip install html-to-markdown`
TypeScript / Node.js	@kreuzberg/html-to-markdown-node	`npm install @kreuzberg/html-to-markdown-node`
WebAssembly	@kreuzberg/html-to-markdown-wasm	`npm install @kreuzberg/html-to-markdown-wasm`
Ruby	html-to-markdown	`gem install html-to-markdown`
PHP	kreuzberg-dev/html-to-markdown	`composer require kreuzberg-dev/html-to-markdown`
Go	htmltomarkdown	`go get github.com/kreuzberg-dev/html-to-markdown/packages/go/v3`
Java	dev.kreuzberg:html-to-markdown	Maven / Gradle
C#	KreuzbergDev.HtmlToMarkdown	`dotnet add package KreuzbergDev.HtmlToMarkdown`
Elixir	html_to_markdown	`mix deps.get html_to_markdown`
R	htmltomarkdown	`install.packages("htmltomarkdown")`
C (FFI)	releases	Pre-built `.so` / `.dll` / `.dylib`

Part of Kreuzberg.dev

Kreuzberg — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
alef — the polyglot binding generator that produces all per-language bindings.
Discord — community, roadmap, announcements.

Contributing

Contributions welcome! See CONTRIBUTING.md for setup instructions and guidelines.

License

MIT License — see LICENSE for details.
