html-to-markdown
Fast, robust HTML → Markdown for 16 languages. A tiered converter that picks the safest, fastest path per input without losing content.
What and Why?
html-to-markdown converts real-world HTML — unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings — into clean CommonMark (or Djot) without losing content, from one Rust core with native bindings for 16 languages.
It routes each input through three tiers: a single-pass byte scanner for clean HTML, a tolerant DOM walker for complex inputs, and an html5ever repair pass for malformed HTML. Output is byte-identical across tiers, which CI enforces with a 116-snapshot oracle alongside per-group performance gates. The dispatcher is invisible: the same convert() call works regardless of which tier runs.
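The dispatch idea can be illustrated with a small, self-contained Python sketch. It is not the library's implementation or API: `fast_scan`, `dom_walk`, `Escalate`, and `Result` are hypothetical names, only two of the three tiers are modeled (the html5ever repair pass is omitted), and the supported markup is limited to `<p>`, `<strong>`/`<b>`, and `<em>`/`<i>`. The point is the shape: try the cheapest tier first, escalate when it meets input it cannot handle safely, and keep the output identical whichever tier runs.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical names throughout; this is an illustration, not the library's API.
MARKERS = {"strong": "**", "b": "**", "em": "*", "i": "*"}
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>")  # attribute-free tags only


class Escalate(Exception):
    """Raised when a tier meets input it cannot handle safely."""


@dataclass
class Result:
    content: str
    tier: str
    warnings: list = field(default_factory=list)


def fast_scan(html: str) -> str:
    """Tier 1: one pass over attribute-free <p>/<strong>/<em> markup."""
    out, pos = [], 0
    for m in TAG.finditer(html):
        name = m.group(2).lower()
        if name != "p" and name not in MARKERS:
            raise Escalate(f"unsupported tag <{name}>")
        out.append(html[pos:m.start()])
        if name == "p":
            out.append("\n\n" if m.group(1) else "")  # only </p> emits a break
        else:
            out.append(MARKERS[name])
        pos = m.end()
    out.append(html[pos:])
    text = "".join(out)
    if "<" in text or "&" in text:  # attributes, stray brackets, entities
        raise Escalate("leftover markup or entities")
    return text.strip()


class _Walker(HTMLParser):
    """Tier 2: tolerant parser that accepts attributes and decodes entities."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in MARKERS:
            self.out.append(MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in MARKERS:
            self.out.append(MARKERS[tag])
        elif tag == "p":
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)


def dom_walk(html: str) -> str:
    walker = _Walker()
    walker.feed(html)
    walker.close()
    return "".join(walker.out).strip()


def convert(html: str) -> Result:
    """Try tiers from cheapest to most tolerant; callers never pick one."""
    warnings = []
    for name, tier in (("fast_scan", fast_scan), ("dom_walk", dom_walk)):
        try:
            return Result(tier(html), name, warnings)
        except Escalate as exc:
            warnings.append(f"{name}: {exc}")
    raise ValueError("no tier accepted the input")
```

In this sketch, `convert("<p>Hello <strong>world</strong></p>")` stays on the fast tier, while `convert('<p class="lead">Tom &amp; <em>Jerry</em></p>')` escalates to the walker and records why in `warnings`. Running both tiers on the same clean input and comparing the strings is a toy version of the snapshot check described above.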
Features
| Feature | Description |
|---|---|
| 16 languages, one Rust core | Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, and a C ABI |
| Tiered dispatch | Byte scanner → DOM walker → html5ever repair, with byte-identical output across tiers |
| Real-HTML robust | Unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings — handled without losing content |
| GFM tables | Padded cells, alignment, and pipe escaping |
| Djot output | Set output_format = "djot" to emit Djot instead of Markdown |
| Metadata extraction | Parse <head> into structured metadata (Open Graph, Twitter, JSON-LD, microdata, RDFa, header hierarchy) |
| Inline images | Opt-in mirroring of data URIs and remote image references |
| Visitor API | Feature-gated traversal to transform the converted Markdown AST |
| Configurable preprocessing | Standard, strict, and lenient presets — or build your own |
| Fast | 19–116 MB/s on the Wikipedia/mdream corpus; per-group regression thresholds enforced on every PR |
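To make the "padded cells, alignment, and pipe escaping" row concrete, here is a minimal Python sketch of what rendering a GFM table involves. It is a standalone illustration under simple assumptions (every row has as many cells as the header, cell width is measured in characters), and `render_gfm_table` is a hypothetical helper, not a function the library exposes.

```python
def render_gfm_table(headers, rows, align=None):
    """Render a GFM table. `align` holds 'left', 'center', 'right' or None per column.

    Hypothetical helper for illustration; assumes len(row) == len(headers).
    """

    def esc(cell):
        # A literal pipe would end the cell, so it must be backslash-escaped.
        return str(cell).replace("|", "\\|").replace("\n", " ")

    align = align or [None] * len(headers)
    table = [[esc(c) for c in headers]] + [[esc(c) for c in row] for row in rows]
    # Width 3 is the minimum so that ':-:' remains a valid delimiter cell.
    widths = [max(3, *(len(r[i]) for r in table)) for i in range(len(headers))]

    def pad(cell, width, how):
        if how == "center":
            return cell.center(width)
        if how == "right":
            return cell.rjust(width)
        return cell.ljust(width)

    def delim(width, how):
        if how == "center":
            return ":" + "-" * (width - 2) + ":"
        if how == "right":
            return "-" * (width - 1) + ":"
        if how == "left":
            return ":" + "-" * (width - 1)
        return "-" * width

    def line(cells):
        return "| " + " | ".join(cells) + " |"

    out = [line([pad(c, w, a) for c, w, a in zip(table[0], widths, align)])]
    out.append(line([delim(w, a) for w, a in zip(widths, align)]))
    for row in table[1:]:
        out.append(line([pad(c, w, a) for c, w, a in zip(row, widths, align)]))
    return "\n".join(out)
```

Calling `render_gfm_table(["Name", "Qty"], [["a|b", 2], ["pear", 10]], ["left", "right"])` yields a four-line table whose delimiter row is `| :--- | --: |` and whose first body cell is written as `a\|b`.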
Quick Start
convert() is the single entry point — it returns a structured result with content, warnings, and optional metadata.
Language Packages
- Rust: see Rust README for full documentation.
- Python: see Python README for full documentation.
- Node.js: see Node.js README for full documentation.
- Go: see Go README for full documentation.
- Java: available on Maven Central as dev.kreuzberg:html-to-markdown. See Java README for the dependency snippet and current version.
- C#: see C# README for full documentation.
- Ruby: see Ruby README for full documentation.
- PHP: see PHP README for full documentation.
- Elixir: add {:html_to_markdown, "~> 3.6"} to your mix.exs dependencies. See Elixir README for full documentation.
- R: see R README for full documentation.
- Dart: see Dart README for full documentation.
- Kotlin (Android): available on Maven Central as dev.kreuzberg:html-to-markdown-android. See Kotlin README for the dependency snippet and current version.
- Swift: add via Swift Package Manager. See Swift README for full documentation.
- Zig: see Zig README for installation and usage.
- WebAssembly: see WebAssembly README for full documentation.
- C ABI (FFI): pre-built .so / .dll / .dylib from GitHub Releases. See FFI crate for full documentation.
- CLI: see CLI usage for full documentation.
AI Coding Assistants
Install the html-to-markdown plugin from the kreuzberg-dev/plugins marketplace. It ships the html-to-markdown agent skills and works with every major coding agent; the install steps for each harness follow.
/plugin marketplace add kreuzberg-dev/plugins
/plugin install html-to-markdown@kreuzberg
/plugins add https://github.com/kreuzberg-dev/plugins
Then search for html-to-markdown and select Install Plugin.
Settings → Plugins → Add from URL → https://github.com/kreuzberg-dev/plugins, then select html-to-markdown.
gemini extensions install https://github.com/kreuzberg-dev/plugins
droid plugin marketplace add https://github.com/kreuzberg-dev/plugins
droid plugin install html-to-markdown@kreuzberg
copilot plugin marketplace add https://github.com/kreuzberg-dev/plugins
copilot plugin install html-to-markdown@kreuzberg
For opencode, add the package to opencode.json.
Documentation
Full guides, the convert() API for every binding, tier architecture, the metadata and visitor APIs, and performance benchmarks live at docs.html-to-markdown.kreuzberg.dev.
Part of Kreuzberg.dev
- Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
Contributing
Contributions welcome! See CONTRIBUTING.md for setup instructions and guidelines.
License
MIT License — see LICENSE for details.