H2M
Fast, extensible HTML-to-Markdown converter with optional web search — CommonMark + GFM, plugin architecture.
H2M converts HTML into clean Markdown with full CommonMark compliance and GitHub Flavored Markdown extensions. It uses a plugin-based rule system, supports reference-style links and relative URL resolution, and ships with an async CLI that can also search the web and pipe results through the same conversion pipeline. Search works with zero configuration via DuckDuckGo and Wikipedia, and also integrates with SearXNG, Brave Search, and Tavily.
Quick Start
Install the CLI
Shell (macOS / Linux):
PowerShell (Windows):
irm https://sh.qntx.fun/h2m/ps | iex
Or via Cargo:
CLI Structure
H2M uses a subcommand tree:
h2m <COMMAND> [OPTIONS] ...
Commands:
convert Convert HTML to Markdown (URL, file, stdin)
search Search the web and optionally scrape each hit to Markdown
convert — HTML → Markdown
Content extraction:
JSON output (agents / programmatic use):
Formatting:
search — Web search
H2M ships with five search providers. The default is duckduckgo, which
requires no API key, no registration, and no environment variables:
| Provider | Requires | Free tier | Notes |
|---|---|---|---|
| DuckDuckGo | - (zero-config) | unlimited* | Default. HTML scraping + lite fallback |
| Wikipedia | - (zero-config) | unlimited | Official MediaWiki API, 300+ languages |
| SearXNG | `H2M_SEARXNG_URL` | yes (self-host) | Open-source meta-search |
| Brave | `BRAVE_API_KEY` | $5/month credit | Independent index, transparent pagination |
| Tavily | `TAVILY_API_KEY` | 1000 req/month | AI-tuned snippets + LLM answers |
* DuckDuckGo uses unauthenticated HTML endpoints. Aggressive or datacenter
traffic may trigger anti-bot challenges; the provider auto-falls back to
lite.duckduckgo.com and emits a structured "kind":"captchaDetected" error
so you can automate provider switching. Wikipedia is the recommended fallback.
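The provider switching described in this note can be sketched as a small decision function. This is a minimal illustration, not H2M's own code: `ErrorKind`, `parse_kind`, and `fallback_provider` are hypothetical names, and the only facts taken from this document are the `kind` strings and the provider names.

```rust
/// Error kinds this README names; `Other` covers everything else.
/// (Hypothetical type for illustration, not part of the h2m API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    CaptchaDetected,
    AuthFailed,
    Other,
}

/// Map the `kind` string of a structured error to an `ErrorKind`.
fn parse_kind(kind: &str) -> ErrorKind {
    match kind {
        "captchaDetected" => ErrorKind::CaptchaDetected,
        "authFailed" => ErrorKind::AuthFailed,
        _ => ErrorKind::Other,
    }
}

/// Choose the provider to retry with, or `None` when no retry applies.
/// Assumption: only DuckDuckGo failures fall back, and always to Wikipedia.
fn fallback_provider(current: &str, kind: ErrorKind) -> Option<&'static str> {
    match (current, kind) {
        ("duckduckgo", ErrorKind::CaptchaDetected | ErrorKind::AuthFailed) => Some("wikipedia"),
        _ => None,
    }
}

fn main() {
    let kind = parse_kind("captchaDetected");
    match fallback_provider("duckduckgo", kind) {
        Some(next) => println!("retrying with --provider {next}"),
        None => println!("no fallback for this error"),
    }
}
```

A wrapper script or agent could apply the same rule to the `kind` field of the error JSON before re-running the search.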
Zero-config usage (nothing to configure, runs immediately):
All the usual flags work uniformly across providers:
Provider-specific keys (opt-in, via env vars or flags):
Tips:
- CAPTCHA handling: when DuckDuckGo returns `"kind":"captchaDetected"` or `"authFailed"`, switch to `--provider wikipedia` or a keyed provider. The error JSON always carries `kind` / `provider` / `status` fields.
- Windows + system proxy: if your system proxy intercepts `localhost` requests (Clash/V2Ray/etc.), set `NO_PROXY=127.0.0.1,localhost` before pointing `h2m` at a self-hosted SearXNG instance.
- Brave pagination: `--limit` up to 200 is supported (Brave caps at 20 per page; `h2m` paginates transparently via `offset`).
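The Brave pagination tip can be illustrated with a short, std-only sketch. It is not H2M's implementation: `page_offsets` and `fetch_all` are hypothetical helpers, and the sketch assumes `offset` counts pages rather than individual results, so every request asks for a full page and the surplus is trimmed afterwards.

```rust
const PAGE_CAP: usize = 20; // per-page cap stated in the tip
const MAX_LIMIT: usize = 200; // highest supported --limit stated in the tip

/// Page indices needed to cover `limit` results (clamped to MAX_LIMIT).
fn page_offsets(limit: usize) -> Vec<usize> {
    let wanted = limit.min(MAX_LIMIT);
    let pages = (wanted + PAGE_CAP - 1) / PAGE_CAP; // ceiling division
    (0..pages).collect()
}

/// Fetch page by page through `fetch_page`, stop early on a short page,
/// and trim the combined list to the requested limit.
fn fetch_all<F>(limit: usize, mut fetch_page: F) -> Vec<String>
where
    F: FnMut(usize) -> Vec<String>,
{
    let wanted = limit.min(MAX_LIMIT);
    let mut hits = Vec::with_capacity(wanted);
    for offset in page_offsets(limit) {
        let page = fetch_page(offset);
        let short_page = page.len() < PAGE_CAP;
        hits.extend(page);
        if short_page {
            break; // upstream has no further results
        }
    }
    hits.truncate(wanted);
    hits
}

fn main() {
    // Fake upstream that always returns a full page of made-up hit names.
    let fake = |offset: usize| -> Vec<String> {
        (0..PAGE_CAP)
            .map(|i| format!("hit-{}", offset * PAGE_CAP + i))
            .collect()
    };
    let hits = fetch_all(45, fake);
    println!("collected {} hits", hits.len());
}
```

Requesting full pages and truncating keeps the page arithmetic correct even when the limit is not a multiple of the page size.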
Search + scrape (runs every hit through the full convert pipeline,
streams NDJSON ScrapeResults):
A ready-made end-to-end smoke test lives at scripts/live_search_e2e.ps1
(Windows PowerShell) — it exercises DuckDuckGo and Wikipedia across English /
Chinese / Japanese and prints a classified summary table.
JSON Output
convert single URL (pretty JSON):
search response:
- `answer`: LLM-generated summary (Tavily `--include-answer` flag, opt-in).
- `score`: relevance in `[0, 1]` (Tavily only; other providers omit it).
- `engine`: upstream backend name (SearXNG only; aggregators omit it).
Fields marked Option are dropped from the JSON when absent, keeping output lean.
Multiple inputs (convert batch, or search --scrape) stream NDJSON — one JSON object per line.
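Because NDJSON puts one JSON object on each line, a consumer can process results as they arrive instead of buffering the whole stream. The sketch below shows only the line-splitting side, using the standard library. `for_each_record` is a hypothetical helper and the sample records are invented, so a real consumer would pass each line to a JSON parser of its choice.

```rust
use std::io::{self, BufRead};

/// Call `handle` once per NDJSON record, skipping blank lines.
/// Returns the number of records seen.
fn for_each_record<R: BufRead>(reader: R, mut handle: impl FnMut(&str)) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        let record = line.trim();
        if record.is_empty() {
            continue;
        }
        handle(record);
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // Stand-in for piped CLI output; these records are made up for the demo.
    let sample = "{\"id\":1}\n{\"id\":2}\n";
    let total = for_each_record(sample.as_bytes(), |record| {
        // A real consumer would hand `record` to a JSON parser here.
        println!("record: {record}");
    })?;
    println!("{total} records");
    Ok(())
}
```

Swapping `sample.as_bytes()` for `io::stdin().lock()` would let the same loop read from a pipe.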
Library Usage
// One-liner with CommonMark defaults
let md = convert;
assert_eq!;
// Full control with the builder
use ;
use Gfm;
use CommonMark;
let converter = builder
.options
.use_plugin
.use_plugin
.domain
.build;
let md = converter.convert;
assert_eq!;
Async Scraping
Enable the scrape feature for async HTTP scraping with built-in concurrency control, rate limiting, and streaming output:
use Scraper;
let scraper = builder
.concurrency
.gfm
.extract_links
.build?;
let result = scraper.scrape.await?;
println!;
let urls = vec!;
scraper.scrape_many_streaming.await;
Web Search
The h2m-search crate exposes the same provider abstraction the CLI uses.
The zero-config default is DuckDuckGo; no builder configuration required:
use ;
// Zero-config: uses DuckDuckGo (no API key, no env vars).
let client = builder.build?;
let response = client
.search
.await?;
for hit in &response.web
# Ok::
Design
- CommonMark + GFM: full spec compliance with tables, strikethrough, task lists, reference-style links
- Plugin architecture: extend with custom rules via the `Rule` trait
- Async batch pipeline: `tokio` + `reqwest`, semaphore concurrency, streaming NDJSON (`scrape` feature)
- Multi-provider search: `SearchClient` enum with static dispatch, one Cargo feature per provider
- Search + scrape composition: `search --scrape` funnels hits through the same `Scraper` pipeline, reusing all formatting / extraction flags
- JSON output: nested camelCase metadata aligned with Firecrawl conventions
- Smart readable extraction: two-phase content detection (semantic selectors, then noise stripping)
- Zero-copy fast paths: `Cow<str>` escaping, zero `unsafe`, `Send + Sync`
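The `Cow<str>` fast path in the last bullet can be illustrated with a minimal sketch: text that needs no escaping is returned as a borrow, and a new `String` is allocated only when a special character is found. The function name and the set of escaped characters are assumptions made for the example and may differ from what H2M actually does.

```rust
use std::borrow::Cow;

/// Characters escaped in this sketch (an assumed, deliberately small set).
fn needs_escape(c: char) -> bool {
    matches!(c, '\\' | '*' | '_' | '`' | '[' | ']')
}

/// Backslash-escape Markdown metacharacters, borrowing when nothing changes.
fn escape_markdown(input: &str) -> Cow<'_, str> {
    // Fast path: no special character, so hand back the input untouched.
    let first = match input.find(needs_escape) {
        Some(index) => index,
        None => return Cow::Borrowed(input),
    };

    // Slow path: copy the clean prefix, then escape the remainder.
    let mut out = String::with_capacity(input.len() + 8);
    out.push_str(&input[..first]);
    for c in input[first..].chars() {
        if needs_escape(c) {
            out.push('\\');
        }
        out.push(c);
    }
    Cow::Owned(out)
}

fn main() {
    for text in ["plain text", "snake_case *emphasis*"] {
        let escaped = escape_markdown(text);
        let kind = match &escaped {
            Cow::Borrowed(_) => "borrowed",
            Cow::Owned(_) => "owned",
        };
        println!("{kind}: {escaped}");
    }
}
```

Most text nodes in a typical page contain no metacharacters, which is why returning a borrow in the common case avoids most allocations.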
Supported HTML Elements
CommonMark (built-in)
| Element | Markdown Output |
|---|---|
| `<h1>`-`<h6>` | `# Heading` (ATX) or underline (Setext) |
| `<p>`, `<div>`, `<section>`, `<article>` | Block paragraph |
| `<strong>`, `<b>` | `**bold**` |
| `<em>`, `<i>` | `*italic*` |
| `<code>`, `<kbd>`, `<samp>`, `<tt>` | `` `inline code` `` |
| `<pre><code>` | Fenced code block with language detection |
| `<a href="...">` | `[text](url)` or reference-style |
| `<img src="..." alt="...">` | `![alt](src)` |
| `<ul>`, `<ol>`, `<li>` | Bullet/numbered lists with nesting |
| `<blockquote>` | `> quoted text` |
| `<hr>` | `---` |
| `<br>` | Hard line break |
| `<iframe>` | `[iframe](url)` |
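For the reference-style option in the link row, the following sketch shows one way such output can be assembled: each distinct URL gets a number, the inline text becomes `[text][n]`, and the definitions are emitted once at the end. `RefLinks` is a hypothetical helper written for this illustration, not a type from the h2m crate, and the numbering scheme is an assumption.

```rust
/// Collects link targets and hands out numeric reference labels.
struct RefLinks {
    urls: Vec<String>,
}

impl RefLinks {
    fn new() -> Self {
        RefLinks { urls: Vec::new() }
    }

    /// Return the inline form `[text][n]`, reusing `n` for a repeated URL.
    fn link(&mut self, text: &str, url: &str) -> String {
        let index = match self.urls.iter().position(|known| known == url) {
            Some(existing) => existing,
            None => {
                self.urls.push(url.to_string());
                self.urls.len() - 1
            }
        };
        format!("[{text}][{}]", index + 1)
    }

    /// Render the `[n]: url` definitions, one per line.
    fn definitions(&self) -> String {
        self.urls
            .iter()
            .enumerate()
            .map(|(i, url)| format!("[{}]: {url}", i + 1))
            .collect::<Vec<_>>()
            .join("\n")
    }
}

fn main() {
    let mut refs = RefLinks::new();
    let body = format!(
        "See {} and {}.",
        refs.link("About", "https://example.com/about"),
        refs.link("Docs", "https://example.com/docs"),
    );
    println!("{body}\n\n{}", refs.definitions());
}
```

Deduplicating by URL keeps the definition list short when a page links to the same target many times.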
GFM Extensions (with --gfm)
| Element | Markdown Output |
|---|---|
| `<table>` | GFM pipe table with alignment |
| `<del>`, `<s>`, `<strike>` | `~~strikethrough~~` |
| `<input type="checkbox">` | `[x]` or `[ ]` (task list) |
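To make "pipe table with alignment" concrete, here is a small std-only sketch that renders a header, a delimiter row carrying the alignment markers, and the body rows. The `Align` enum and `render_table` function are hypothetical and unrelated to H2M's internals. The sketch does not pad columns and does not check that every row has the same number of cells.

```rust
/// Column alignment as expressed in the GFM delimiter row.
#[derive(Clone, Copy)]
enum Align {
    Unset,
    Left,
    Center,
    Right,
}

/// Delimiter cell for one column.
fn delimiter(align: Align) -> &'static str {
    match align {
        Align::Unset => "---",
        Align::Left => ":---",
        Align::Center => ":---:",
        Align::Right => "---:",
    }
}

/// A literal pipe inside a cell must be escaped so it does not split the cell.
fn escape_cell(cell: &str) -> String {
    cell.replace('|', "\\|")
}

/// Join already-escaped cells into one `| a | b |` row.
fn render_row(cells: impl Iterator<Item = String>) -> String {
    format!("| {} |", cells.collect::<Vec<_>>().join(" | "))
}

/// Render header, delimiter row, and body rows, separated by newlines.
fn render_table(header: &[&str], aligns: &[Align], rows: &[Vec<&str>]) -> String {
    let mut lines = Vec::with_capacity(rows.len() + 2);
    lines.push(render_row(header.iter().map(|c| escape_cell(c))));
    lines.push(render_row(aligns.iter().map(|a| delimiter(*a).to_string())));
    for row in rows {
        lines.push(render_row(row.iter().map(|c| escape_cell(c))));
    }
    lines.join("\n")
}

fn main() {
    let table = render_table(
        &["Provider", "Free tier"],
        &[Align::Left, Align::Right],
        &[vec!["Tavily", "1000 req/month"]],
    );
    println!("{table}");
}
```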
Auto-removed
| Element | Behavior |
|---|---|
| `<script>` | Removed (content stripped) |
| `<style>` | Removed (content stripped) |
| `<noscript>` | Removed (content stripped) |
Custom Rules
Extend the converter with your own rules by implementing the Rule trait:
use ;
use CommonMark;
use ElementRef;
;
let mut builder = builder
.use_plugin;
builder.add_rule;
let converter = builder.build;
let md = converter.convert;
assert!;
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project shall be dual-licensed as above, without any additional terms or conditions.
A QNTX open-source project.
Code is law. We write both.