H2M
Fast, extensible HTML-to-Markdown converter with optional web search — CommonMark + GFM, plugin architecture.
H2M converts HTML into clean Markdown with full CommonMark compliance and GitHub Flavored Markdown extensions. It uses a plugin-based rule system and supports reference-style links and relative URL resolution. It ships with an async CLI that can also search the web (via SearXNG, Brave Search, or Tavily) and pipe the results through the same conversion pipeline.
Quick Start
Install the CLI
Shell (macOS / Linux):
|
PowerShell (Windows):
irm https://sh.qntx.fun/h2m/ps | iex
Or via Cargo:
CLI Structure
H2M uses a subcommand tree:
h2m <COMMAND> [OPTIONS] ...
Commands:
convert Convert HTML to Markdown (URL, file, stdin)
search Search the web and optionally scrape each hit to Markdown
convert — HTML → Markdown
|
|
Content extraction:
JSON output (agents / programmatic use):
Formatting:
search — Web search
H2M supports three search providers. Pick one via --provider or the
H2M_SEARCH_PROVIDER environment variable:
| Provider | Requires | Free tier | Notes |
|---|---|---|---|
| SearXNG | H2M_SEARXNG_URL | yes (self-host) | Default. Open-source meta-search |
| Brave | BRAVE_API_KEY | $5/month credit | Independent index |
| Tavily | TAVILY_API_KEY | 1000 req/month | AI-tuned snippets |
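The provider selection described above can be sketched in plain Rust. This is an illustration only, not H2M's actual code: the Provider enum, the resolve_provider function, the accepted spellings ("searxng", "brave", "tavily"), case-insensitive matching, and the rule that the --provider flag wins over H2M_SEARCH_PROVIDER are all assumptions. Only the environment variable names and the SearXNG default come from the table.

```rust
use std::str::FromStr;

/// Hypothetical provider enum; the variants mirror the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provider {
    SearXng,
    Brave,
    Tavily,
}

impl Provider {
    /// Environment variable each provider requires, as listed in the table.
    fn required_env(self) -> &'static str {
        match self {
            Provider::SearXng => "H2M_SEARXNG_URL",
            Provider::Brave => "BRAVE_API_KEY",
            Provider::Tavily => "TAVILY_API_KEY",
        }
    }
}

impl FromStr for Provider {
    type Err = String;

    // Assumption: provider names are matched case-insensitively.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "searxng" => Ok(Provider::SearXng),
            "brave" => Ok(Provider::Brave),
            "tavily" => Ok(Provider::Tavily),
            other => Err(format!("unknown search provider: {other}")),
        }
    }
}

/// Resolve the provider from an optional flag value and an optional
/// environment value. Assumption: the flag wins over the environment.
fn resolve_provider(flag: Option<&str>, env: Option<&str>) -> Result<Provider, String> {
    match flag.or(env) {
        Some(name) => name.parse(),
        None => Ok(Provider::SearXng), // documented default
    }
}
```

A caller would typically pass `std::env::var("H2M_SEARCH_PROVIDER").ok().as_deref()` as the second argument, then check that the variable named by `required_env()` is set before issuing a query.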
Pure search (returns titles/URLs/descriptions):
# Point at any SearXNG instance (self-host or public)
Search + scrape (runs every hit through the full convert pipeline,
streams NDJSON ScrapeResults):
JSON Output
convert single URL (pretty JSON):
search response:
Multiple inputs (convert batch, or search --scrape) stream NDJSON — one JSON object per line.
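Because each NDJSON line is a complete JSON document, a consumer can process results as they arrive instead of waiting for the whole batch. The sketch below is a std-only illustration of that consumption pattern; the ndjson_records helper is hypothetical, and a real consumer would hand each line to a JSON parser such as serde_json rather than keep it as a raw string.

```rust
use std::io::{self, BufRead};

/// Yield each NDJSON record (one JSON document per line) as a raw string,
/// skipping blank lines. Parsing is left to the caller.
fn ndjson_records<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<String>> {
    reader.lines().filter(|line| match line {
        Ok(text) => !text.trim().is_empty(),
        Err(_) => true, // surface I/O errors instead of hiding them
    })
}
```

Passing `std::io::stdin().lock()` as the reader lets a downstream program handle each record as soon as the CLI emits it.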
Library Usage
// One-liner with CommonMark defaults
let md = convert;
assert_eq!;
// Full control with the builder
use ;
use Gfm;
use CommonMark;
let converter = builder
.options
.use_plugin
.use_plugin
.domain
.build;
let md = converter.convert;
assert_eq!;
Async Scraping
Enable the scrape feature for async HTTP scraping with built-in concurrency control, rate limiting, and streaming output:
use Scraper;
let scraper = builder
.concurrency
.gfm
.extract_links
.build?;
let result = scraper.scrape.await?;
println!;
let urls = vec!;
scraper.scrape_many_streaming.await;
Web Search
The h2m-search crate exposes the same provider abstraction the CLI uses:
use ;
let client = builder
.provider
.searxng_url
.build?;
let response = client
.search
.await?;
for hit in &response.web
# Ok::
Design
- CommonMark + GFM: full spec compliance with tables, strikethrough, task lists, reference-style links
- Plugin architecture: extend with custom rules via the Rule trait
- Async batch pipeline: tokio + reqwest, semaphore concurrency, streaming NDJSON (scrape feature)
- Multi-provider search: SearchClient enum with static dispatch, one Cargo feature per provider
- Search + scrape composition: search --scrape funnels hits through the same Scraper pipeline, reusing all formatting / extraction flags
- JSON output: nested camelCase metadata aligned with Firecrawl conventions
- Smart readable extraction: two-phase content detection: semantic selectors → noise stripping
- Zero-copy fast paths: Cow<str> escaping, zero unsafe, Send + Sync
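The Cow<str> fast path mentioned in the last bullet can be illustrated with a small std-only sketch. It is not H2M's escaping code: the escape_markdown function and the SPECIAL character set are hypothetical and deliberately small. The point is the pattern: return the input borrowed when nothing needs escaping, and allocate only when a special character is found.

```rust
use std::borrow::Cow;

/// Characters this sketch treats as needing a backslash in Markdown text.
/// Assumption: a deliberately small set chosen for illustration.
const SPECIAL: &[char] = &['\\', '*', '_', '`', '[', ']'];

/// Escape Markdown-significant characters, borrowing the input unchanged
/// when nothing needs escaping (the zero-copy fast path).
fn escape_markdown(input: &str) -> Cow<'_, str> {
    // Fast path: no special character, so no allocation.
    let Some(first) = input.find(SPECIAL) else {
        return Cow::Borrowed(input);
    };

    // Slow path: copy the clean prefix once, then escape the rest.
    let mut out = String::with_capacity(input.len() + 8);
    out.push_str(&input[..first]);
    for ch in input[first..].chars() {
        if SPECIAL.contains(&ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    Cow::Owned(out)
}
```

Most text nodes in a typical page contain no Markdown-significant characters, so in this pattern the common case costs one scan and no copy.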
Supported HTML Elements
CommonMark (built-in)
| Element | Markdown Output |
|---|---|
| <h1>-<h6> | # Heading (ATX) or underline (Setext) |
| <p>, <div>, <section>, <article> | Block paragraph |
| <strong>, <b> | **bold** |
| <em>, <i> | *italic* |
| <code>, <kbd>, <samp>, <tt> | `inline code` |
| <pre><code> | Fenced code block with language detection |
| <a href="..."> | [text](url) or reference-style |
| <img src="..." alt="..."> |  |
| <ul>, <ol>, <li> | Bullet/numbered lists with nesting |
| <blockquote> | > quoted text |
| <hr> | --- |
| <br> | Hard line break |
| <iframe> | [iframe](url) |
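The two link styles in the table differ only in where the URL is written. The sketch below shows the difference with std-only Rust; the Link alias and both functions are hypothetical, and numbering references from 1 is an assumption rather than H2M's documented labelling scheme.

```rust
/// A link as (text, url). Hypothetical type for this sketch.
type Link<'a> = (&'a str, &'a str);

/// Render a link inline: [text](url).
fn inline_link((text, url): Link<'_>) -> String {
    format!("[{text}]({url})")
}

/// Render links reference-style: each link becomes [text][n] in the body,
/// and a numbered definition list is appended after a blank line.
fn reference_links(links: &[Link<'_>]) -> String {
    let mut body = Vec::new();
    let mut defs = Vec::new();
    for (i, (text, url)) in links.iter().enumerate() {
        let n = i + 1;
        body.push(format!("[{text}][{n}]"));
        defs.push(format!("[{n}]: {url}"));
    }
    format!("{}\n\n{}", body.join(" "), defs.join("\n"))
}
```

Reference style keeps long URLs out of the running text, which tends to make link-heavy output easier to read and diff.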
GFM Extensions (with --gfm)
| Element | Markdown Output |
|---|---|
| <table> | GFM pipe table with alignment |
| <del>, <s>, <strike> | ~~strikethrough~~ |
| <input type="checkbox"> | [x] or [ ] (task list) |
Auto-removed
| Element | Behavior |
|---|---|
| <script> | Removed (content stripped) |
| <style> | Removed (content stripped) |
| <noscript> | Removed (content stripped) |
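"Content stripped" means the element and everything inside it disappear from the output, not just the tags. The following std-only sketch shows that behavior on raw text. It is a naive illustration under stated assumptions (lowercase, well-formed, non-nested tags, prefix matching on the tag name); strip_element and strip_noise are hypothetical names, and a real converter would operate on a parsed DOM instead.

```rust
/// Remove every `<tag ...>...</tag>` span, including its content.
/// Assumptions: tags are lowercase, well-formed, and not nested. The opening
/// tag is matched by prefix, so `<style` would also match `<styled-x>`.
fn strip_element(html: &str, tag: &str) -> String {
    let open = format!("<{tag}");
    let close = format!("</{tag}>");
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find(&open) {
        out.push_str(&rest[..start]);
        match rest[start..].find(&close) {
            // Skip past the closing tag and keep scanning.
            Some(end) => rest = &rest[start + end + close.len()..],
            // Unclosed element: drop everything after the opening tag.
            None => return out,
        }
    }
    out.push_str(rest);
    out
}

/// Strip the three element kinds listed in the table above.
fn strip_noise(html: &str) -> String {
    ["script", "style", "noscript"]
        .iter()
        .fold(html.to_string(), |acc, tag| strip_element(&acc, tag))
}
```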
Custom Rules
Extend the converter with your own rules by implementing the Rule trait:
use ;
use CommonMark;
use ElementRef;
;
let mut builder = builder
.use_plugin;
builder.add_rule;
let converter = builder.build;
let md = converter.convert;
assert!;
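To show the general shape of rule-based extension without depending on the crate, here is a std-only sketch. It is not H2M's Rule trait: TagRule, Highlight, render, the ==text== output for <mark>, and the "last registered rule wins" policy are all hypothetical stand-ins chosen for illustration.

```rust
/// Hypothetical stand-in for a conversion rule; this is not H2M's `Rule` trait.
trait TagRule {
    /// Tag names this rule handles.
    fn tags(&self) -> &[&str];
    /// Turn the already-converted inner content into Markdown.
    fn apply(&self, content: &str) -> String;
}

/// Example rule: render <mark> as ==highlighted== text.
struct Highlight;

impl TagRule for Highlight {
    fn tags(&self) -> &[&str] {
        &["mark"]
    }

    fn apply(&self, content: &str) -> String {
        format!("=={content}==")
    }
}

/// Use the last registered rule that claims the tag (assumption: later
/// rules override earlier ones); fall back to the content unchanged.
fn render(rules: &[Box<dyn TagRule>], tag: &str, content: &str) -> String {
    rules
        .iter()
        .rev()
        .find(|rule| rule.tags().contains(&tag))
        .map(|rule| rule.apply(content))
        .unwrap_or_else(|| content.to_string())
}
```

Searching the rule list in reverse is one simple way to let user-supplied rules take priority over built-in ones for the same tag.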
Feature Flags
h2m crate:
- scrape: async HTTP scraping (adds tokio, reqwest, serde)
h2m-search crate:
- searxng (default): SearXNG provider
- brave: Brave Search API provider
- tavily: Tavily API provider
- all: all providers
h2m-cli binary:
- search (default): enables the search subcommand
- all-providers: bundles every provider (used for release builds)
Migration from 0.5
v0.6 introduced a subcommand tree. Update every invocation:
| Before (0.5) | After (0.6) |
|---|---|
| h2m https://example.com | h2m convert https://example.com |
| h2m --gfm page.html | h2m convert --gfm page.html |
| curl -s URL \| h2m -r | curl -s URL \| h2m convert -r |
| (new) | h2m search "query" --scrape |
All convert flags are unchanged; the only difference is that the leading subcommand is now required.
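For scripts that build h2m argument lists programmatically, the migration amounts to inserting one argument. The helper below is a hypothetical sketch of that rewrite, not part of H2M; it takes the arguments that follow `h2m` and assumes a first argument of "convert" or "search" means the invocation is already in 0.6 form.

```rust
/// Rewrite 0.5-style arguments (everything after `h2m`) to 0.6 style by
/// inserting the `convert` subcommand when no subcommand is present.
/// Caveat: a 0.5 input file literally named "convert" or "search" would be
/// mistaken for a subcommand by this check.
fn migrate_args(args: &[&str]) -> Vec<String> {
    let has_subcommand = matches!(args.first(), Some(&"convert") | Some(&"search"));
    let mut out = Vec::with_capacity(args.len() + 1);
    if !has_subcommand {
        out.push("convert".to_string());
    }
    out.extend(args.iter().map(|a| a.to_string()));
    out
}
```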
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project shall be dual-licensed as above, without any additional terms or conditions.
A QNTX open-source project.
Code is law. We write both.