html-to-markdown-cli 3.6.0-rc.20

Command-line interface for html-to-markdown - high-performance HTML to Markdown converter

html-to-markdown

Fast, robust HTML → Markdown for 16 languages. A tiered converter that picks the safest, fastest path for each input without losing content.

Highlights

  • 16 languages, one Rust core. Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, C ABI.
  • Tiered dispatch. Byte scanner for clean HTML → DOM walker for complex inputs → html5ever repair for malformed HTML. Byte-equal output across tiers.
  • Real-HTML robust. Unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings — handled without losing content.
  • GFM tables, Djot output, metadata extraction, visitor API, inline images, configurable preprocessing presets.
  • CommonMark-compatible Markdown with GFM-style tables. output_format = "djot" switches to Djot.
  • 116-snapshot oracle + per-group regression gates in CI. Performance and correctness both enforced.

Architecture

The converter routes each input through one of three tiers based on a fast prescan of the byte stream:

  1. Tier-1 — Single-pass byte scanner. Walks html.as_bytes() once and emits Markdown directly. Handles 110+ HTML tags including paragraphs, headings, lists, GFM tables, links, images, inline emphasis, blockquotes, indented code blocks. Bails (returns a structured error) on any construct it cannot prove byte-equivalent to Tier-2 — custom elements, CDATA, malformed entities, nested tables, mixed table sections, multi-line table cells, etc.

  2. Tier-2 — tl::parse DOM walker. Picks up Tier-1's bails and inputs the classifier rejected up front (non-default style options, non-Markdown output, etc.). Handles the full HTML5 spec via a tolerant DOM walk.

  3. Tier-3 — html5ever standards-conformant parser. Engaged when Tier-2 detects HTML requiring full HTML5 repair (custom elements, structural recovery, weird namespace transitions).

The dispatcher is invisible to the caller. The same convert() call works regardless of which tier handled the input; the output is byte-identical across tiers (enforced by a 116-snapshot oracle).
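
The bail-and-fall-back pattern described above can be illustrated with a toy model. This is a minimal sketch, not the project's Rust implementation: the names tier1_fast, tier2_dom, and Bail are hypothetical, the fast path understands only flat headings and paragraphs, and the slow path uses Python's standard html.parser in place of tl or html5ever.

```python
import re
from html.parser import HTMLParser


class Bail(Exception):
    """Raised by the fast path when it meets input it cannot handle safely."""


# The toy fast path accepts only flat <h1>..<h6> and <p> blocks of plain text.
_BLOCK = re.compile(r"\s*<(h[1-6]|p)>([^<>&]*)</\1>\s*")


def tier1_fast(html: str) -> str:
    """Scan the input once; bail on anything outside the tiny grammar."""
    pos, blocks = 0, []
    while pos < len(html):
        m = _BLOCK.match(html, pos)
        if m is None:
            raise Bail(f"unsupported construct at offset {pos}")
        tag, text = m.group(1), m.group(2).strip()
        prefix = "#" * int(tag[1]) + " " if tag.startswith("h") else ""
        if text:
            blocks.append(prefix + text)
        pos = m.end()
    return "\n\n".join(blocks)


class _Walker(HTMLParser):
    """Tolerant slow path: also understands <em>, <strong>, and entities."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self.current = []
        self.prefix = ""

    def _is_block(self, tag: str) -> bool:
        return tag == "p" or re.fullmatch(r"h[1-6]", tag) is not None

    def handle_starttag(self, tag, attrs):
        if self._is_block(tag):
            self.flush()
            self.prefix = "#" * int(tag[1]) + " " if tag != "p" else ""
        elif tag in ("em", "strong"):
            self.current.append("*" if tag == "em" else "**")

    def handle_endtag(self, tag):
        if tag in ("em", "strong"):
            self.current.append("*" if tag == "em" else "**")
        elif self._is_block(tag):
            self.flush()

    def handle_data(self, data):
        self.current.append(data)

    def flush(self):
        """Close the block being collected and start a fresh one."""
        text = "".join(self.current).strip()
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""


def tier2_dom(html: str) -> str:
    walker = _Walker()
    walker.feed(html)
    walker.close()
    walker.flush()
    return "\n\n".join(walker.blocks)


def convert(html: str) -> tuple[str, str]:
    """Return (markdown, tier_used); a real caller would only see the markdown."""
    try:
        return tier1_fast(html), "tier1"
    except Bail:
        return tier2_dom(html), "tier2"
```

In this sketch, convert("<h1>Hello</h1><p>World</p>") stays on the fast path, while an input containing <em> or an entity makes tier1_fast raise Bail and is converted by tier2_dom instead. Running both tiers on an input that both accept and comparing the results is the toy equivalent of the byte-equality check the snapshot oracle enforces.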

Performance

Best-of-3 measurements on the harness corpus (Apple Silicon, cargo build --release):

Fixture                                 Size      Time (best of 3)   Throughput
wikipedia/medium_python.html            1.24 MB   62.58 ms           19.0 MB/s
wikipedia/large_rust.html               1.07 MB   37.17 ms           27.3 MB/s
wikipedia/small_html.html               973 KB    29.32 ms           31.6 MB/s
wikipedia/tables_countries.html         756 KB    18.95 ms           38.1 MB/s
mdream/github-markdown-complete.html    430 KB    10.57 ms           38.7 MB/s
mdream/react-learn.html                 265 KB    12.11 ms           20.9 MB/s
mdream/wikipedia-small.html             166 KB    5.63 ms            28.1 MB/s
issues/gh-121-hacker-news.html          57 KB     1.08 ms            50.3 MB/s
mdream/nuxt-example.html                3.6 KB    0.029 ms           116.1 MB/s

Corpus: 29 fixtures totalling 6.4 MB across clean_small, clean_medium, clean_large, spec_rules, adversarial, and fallthrough_* groups. Per-group regression thresholds (5–30%) are enforced on every PR via task bench:compare. Run task bench:run to reproduce on your hardware.
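
For readers who want to sanity-check numbers like these on their own inputs, the arithmetic behind a best-of-N timing and a percentage regression gate can be sketched as follows. This is not the project's harness (that is what task bench:run and task bench:compare are for). The function names are hypothetical, and the sketch assumes 1 MB = 1,000,000 bytes, which may differ from the unit used in the table above.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], runs: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `runs` calls."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def throughput_mb_s(size_bytes: int, seconds: float) -> float:
    """Throughput in MB/s, assuming 1 MB = 1,000,000 bytes."""
    return size_bytes / 1_000_000 / seconds


def regressed(baseline_ms: float, current_ms: float, threshold: float) -> bool:
    """True when current is slower than baseline by more than `threshold`.

    `threshold` is a fraction: 0.05 means a 5% slowdown is tolerated.
    """
    return (current_ms - baseline_ms) / baseline_ms > threshold
```

A typical use would be best_of(lambda: convert(html)) with the document held in memory, followed by throughput_mb_s(len(html.encode("utf-8")), seconds). Taking the minimum rather than the mean filters out scheduler noise, which is the usual reason for reporting best-of-N.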

Capabilities

  • HTML element coverage: 110+ tags handled natively in Tier-1; full HTML5 coverage via Tier-2/Tier-3 fallback.
  • GFM-style tables with padded cells, alignment, and pipe escaping.
  • Djot output: set ConversionOptions { output_format: OutputFormat::Djot, .. } to emit Djot instead of Markdown.
  • Metadata extraction: parse <head> into structured HtmlMetadata (open-graph, twitter, JSON-LD, microdata, RDFa, header hierarchy).
  • Inline images: opt-in via inline-images feature; mirrors data URIs and remote image references.
  • Visitor API: feature-gated traversal that lets callers transform the converted Markdown AST (visitor feature).
  • Configurable preprocessing: standard, strict, lenient presets — or build your own.
  • Tiered fallback: Tier-3 (html5ever) handles inputs the other tiers cannot, so the converter never silently corrupts malformed HTML.
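
To make the "padded cells, alignment, and pipe escaping" item concrete, here is a minimal Python sketch of how a GFM table can be rendered. It illustrates the output shape only and is not the converter's code: render_table and escape_cell are hypothetical names, and every row is assumed to have the same number of cells as the header.

```python
def escape_cell(text: str) -> str:
    """Escape pipes so cell text is not mistaken for a column separator."""
    return text.replace("|", "\\|")


def render_table(header, rows, align=None):
    """Render a padded GFM table.

    `align` holds one of 'left', 'right', 'center', or None per column.
    """
    table = [[escape_cell(c) for c in row] for row in [header, *rows]]
    ncols = len(header)
    align = align or [None] * ncols
    # A floor of 3 leaves room for ':-:' in every delimiter cell.
    widths = [max(3, *(len(r[i]) for r in table)) for i in range(ncols)]

    def delim(width, a):
        if a == "center":
            return ":" + "-" * (width - 2) + ":"
        if a == "right":
            return "-" * (width - 1) + ":"
        if a == "left":
            return ":" + "-" * (width - 1)
        return "-" * width

    def pad(text, width, a):
        if a == "right":
            return text.rjust(width)
        if a == "center":
            return text.center(width)
        return text.ljust(width)

    def line(cells):
        return "| " + " | ".join(cells) + " |"

    out = [line(pad(c, w, a) for c, w, a in zip(table[0], widths, align))]
    out.append(line(delim(w, a) for w, a in zip(widths, align)))
    for row in table[1:]:
        out.append(line(pad(c, w, a) for c, w, a in zip(row, widths, align)))
    return "\n".join(out)
```

Escaping happens before the widths are measured, so the extra backslash counts toward the padding and the columns stay aligned in the raw Markdown.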

Quick Start

# Rust
cargo add html-to-markdown-rs

# Python
pip install html-to-markdown

# TypeScript / Node.js
npm install @kreuzberg/html-to-markdown-node

# Ruby
gem install html-to-markdown

# CLI
cargo install html-to-markdown-cli
# or
brew install kreuzberg-dev/tap/html-to-markdown

See the package READMEs for the full list: PHP, Go, Java, C#, Elixir, R, Dart, Kotlin (Android), Swift, Zig, WASM, and a C ABI for everything else.

Usage

convert() is the single entry point. It returns a structured ConversionResult:

# Python
from html_to_markdown import convert

result = convert("<h1>Hello</h1><p>World</p>")
print(result.content)        # # Hello\n\nWorld
print(result.metadata)       # title, links, headings, …

// TypeScript / Node.js
import { convert } from "@kreuzberg/html-to-markdown-node";

const result = convert("<h1>Hello</h1><p>World</p>");
console.log(result.content); // # Hello\n\nWorld
console.log(result.metadata); // title, links, headings, …

// Rust
use html_to_markdown_rs::convert;

// `?` requires an enclosing function that returns a Result.
let result = convert("<h1>Hello</h1><p>World</p>", None)?;
println!("{}", result.content.unwrap_or_default());
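
Because convert() takes a string and returns an object with a content field, batch jobs are easy to build around it. The following is a minimal sketch of converting a directory tree. convert_tree is a hypothetical helper, not part of the package, and the converter is passed in as a parameter so the helper makes no assumptions beyond the result.content attribute shown above.

```python
from pathlib import Path
from typing import Callable


def convert_tree(src: Path, dst: Path, convert: Callable[[str], object]) -> list[Path]:
    """Convert every *.html file under `src` into a .md file under `dst`.

    The directory layout below `src` is mirrored below `dst`.
    """
    written = []
    for html_path in sorted(src.rglob("*.html")):
        result = convert(html_path.read_text(encoding="utf-8"))
        out_path = dst / html_path.relative_to(src).with_suffix(".md")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        # The Rust example treats content as optional, so guard against None.
        out_path.write_text(result.content or "", encoding="utf-8")
        written.append(out_path)
    return written
```

With the Python package installed, a call would look like convert_tree(Path("site"), Path("out"), convert), where convert is imported from html_to_markdown as in the example above.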

Language Bindings

Language               Package                             Install
Rust                   html-to-markdown-rs                 cargo add html-to-markdown-rs
Python                 html-to-markdown                    pip install html-to-markdown
TypeScript / Node.js   @kreuzberg/html-to-markdown-node    npm install @kreuzberg/html-to-markdown-node
WebAssembly            @kreuzberg/html-to-markdown-wasm    npm install @kreuzberg/html-to-markdown-wasm
Ruby                   html-to-markdown                    gem install html-to-markdown
PHP                    kreuzberg-dev/html-to-markdown      composer require kreuzberg-dev/html-to-markdown
Go                     htmltomarkdown                      go get github.com/kreuzberg-dev/html-to-markdown/packages/go/v3
Java                   dev.kreuzberg:html-to-markdown      Maven / Gradle
C#                     KreuzbergDev.HtmlToMarkdown         dotnet add package KreuzbergDev.HtmlToMarkdown
Elixir                 html_to_markdown                    add :html_to_markdown to deps in mix.exs, then mix deps.get
R                      htmltomarkdown                      install.packages("htmltomarkdown")
C (FFI)                releases                            Pre-built .so / .dll / .dylib

Part of Kreuzberg.dev

  • Kreuzberg — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
  • Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
  • kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
  • liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
  • tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
  • alef — the polyglot binding generator that produces all per-language bindings.
  • Discord — community, roadmap, announcements.

Contributing

Contributions welcome! See CONTRIBUTING.md for setup instructions and guidelines.

License

MIT License — see LICENSE for details.