h2m 0.6.0

HTML to Markdown converter.

H2M


Fast, extensible HTML-to-Markdown converter with optional web search — CommonMark + GFM, plugin architecture.

H2M converts HTML into clean Markdown with full CommonMark compliance and GitHub Flavored Markdown extensions. It uses a plugin-based rule system, supports reference-style links and relative URL resolution, and ships with an async CLI that can also search the web and pipe results through the same conversion pipeline (compatible with SearXNG, Brave Search, and Tavily).

Quick Start

Install the CLI

Shell (macOS / Linux):

curl -fsSL https://sh.qntx.fun/h2m | sh

PowerShell (Windows):

irm https://sh.qntx.fun/h2m/ps | iex

Or via Cargo:

cargo install h2m-cli

CLI Structure

H2M uses a subcommand tree:

h2m <COMMAND> [OPTIONS] ...

Commands:
  convert  Convert HTML to Markdown (URL, file, stdin)
  search   Search the web and optionally scrape each hit to Markdown

convert — HTML → Markdown

h2m convert https://example.com
h2m convert page.html
curl -s https://example.com | h2m convert
echo '<h1>Hi</h1>' | h2m convert

Content extraction:

h2m convert -r https://blog.example.com/post          # smart readable
h2m convert -s article https://blog.example.com/post  # CSS selector
h2m convert -s '#content' https://example.com         # by ID

JSON output (agents / programmatic use):

h2m convert --json https://example.com                # pretty JSON
h2m convert --json --extract-links https://example.com
h2m convert --json url1 url2 url3                     # NDJSON streaming
h2m convert --json --urls urls.txt -j 8 --delay 100

Formatting:

h2m convert --gfm https://example.com                 # tables, strikethrough, task lists
h2m convert --link-style referenced page.html         # reference-style links
h2m convert --heading-style setext page.html          # === / --- underlines
h2m convert --user-agent "MyBot/1.0" https://example.com
h2m convert -o output.md https://example.com
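To make the --link-style referenced option more concrete, here is a minimal, dependency-free sketch of how reference-style links can be produced. It is not H2M's implementation: the References type, its methods, the numeric labels, and the reuse of a label for a repeated URL are all assumptions made for illustration.

```rust
/// Hypothetical helper: collects link targets and hands out numbered labels.
#[derive(Default)]
struct References {
    urls: Vec<String>,
}

impl References {
    /// Returns `[text][n]`, reusing the label when the URL was seen before.
    fn link(&mut self, text: &str, url: &str) -> String {
        let index = match self.urls.iter().position(|u| u == url) {
            Some(i) => i,
            None => {
                self.urls.push(url.to_string());
                self.urls.len() - 1
            }
        };
        format!("[{text}][{}]", index + 1)
    }

    /// Renders the definition lines that go at the end of the document.
    fn definitions(&self) -> String {
        self.urls
            .iter()
            .enumerate()
            .map(|(i, url)| format!("[{}]: {url}", i + 1))
            .collect::<Vec<_>>()
            .join("\n")
    }
}

fn main() {
    let mut refs = References::default();
    let docs = refs.link("the docs", "https://example.com/docs");
    let faq = refs.link("the FAQ", "https://example.com/faq");
    println!("See {docs} and {faq}.\n\n{}", refs.definitions());
}
```

Compared with inline links, this keeps long URLs out of the running text, which tends to make the Markdown easier to read and diff.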

search — Web search

H2M supports three search providers. Pick one via --provider or the H2M_SEARCH_PROVIDER environment variable:

Provider  Requires         Free tier        Notes
SearXNG   H2M_SEARXNG_URL  yes (self-host)  Default. Open-source meta-search
Brave     BRAVE_API_KEY    $5/month credit  Independent index
Tavily    TAVILY_API_KEY   1000 req/month   AI-tuned snippets

Pure search (returns titles/URLs/descriptions):

# Point at any SearXNG instance (self-host or public)
export H2M_SEARXNG_URL=https://searx.example.org

h2m search "rust async trait"                    # pretty JSON response
h2m search "rust async trait" --json             # NDJSON (one hit per line)
h2m search "rust" --limit 5 --time-range week
h2m search "rust" --sources web,news --country us
h2m search "rust" --provider brave               # switch provider

Search + scrape (runs every hit through the full convert pipeline, streams NDJSON ScrapeResults):

h2m search "rust async" --scrape                 # raw markdown per hit
h2m search "rust async" --scrape --gfm --readable
h2m search "rust async" --scrape --selector article
h2m search "rust" --scrape -j 8 --timeout 20     # parallel scrape

JSON Output

convert single URL (pretty JSON):

{
  "markdown": "# Example Domain\n\n...",
  "metadata": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "language": "en",
    "ogImage": "https://example.com/og.png",
    "sourceUrl": "https://example.com",
    "url": "https://example.com/",
    "statusCode": 200,
    "contentType": "text/html; charset=UTF-8",
    "elapsedMs": 234
  },
  "links": ["https://example.com/about"]
}

search response:

{
  "query": "rust async",
  "provider": "searxng",
  "web": [
    {"title": "Rust", "url": "https://rust-lang.org", "description": "...", "engine": "duckduckgo"}
  ],
  "news": [],
  "images": [],
  "elapsedMs": 312
}

Multiple inputs (convert batch, or search --scrape) stream NDJSON — one JSON object per line.
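Because NDJSON puts exactly one JSON object on each line, a consumer can process results as they arrive instead of waiting for the whole batch. The sketch below shows only the line-splitting step, using the standard library alone. The ndjson_records function is a hypothetical helper, and the sample records are stand-ins rather than real H2M output; a real consumer would pass each record to a JSON parser.

```rust
use std::io::BufRead;

/// Hypothetical helper: returns one raw JSON record per non-empty line.
fn ndjson_records<R: BufRead>(reader: R) -> std::io::Result<Vec<String>> {
    let mut records = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if !trimmed.is_empty() {
            records.push(trimmed.to_string());
        }
    }
    Ok(records)
}

fn main() -> std::io::Result<()> {
    // Stand-in for piped CLI output; the field contents are illustrative only.
    let sample = "{\"markdown\":\"# A\"}\n{\"markdown\":\"# B\"}\n";
    for (i, record) in ndjson_records(sample.as_bytes())?.iter().enumerate() {
        // A real consumer would hand `record` to a JSON parser here.
        println!("record {i}: {record}");
    }
    Ok(())
}
```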

Library Usage

// One-liner with CommonMark defaults
let md = h2m::convert("<h1>Hello</h1><p>World</p>");
assert_eq!(md, "# Hello\n\nWorld");

// Full control with the builder
use h2m::{Converter, Options};
use h2m::plugins::Gfm;
use h2m::rules::CommonMark;

let converter = Converter::builder()
    .options(Options::default())
    .use_plugin(&CommonMark)
    .use_plugin(&Gfm)
    .domain("example.com")
    .build();

let md = converter.convert(r#"<a href="/about">About</a>"#);
assert_eq!(md, "[About](https://example.com/about)");
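The builder example above turns /about into https://example.com/about. As a rough illustration of what relative URL resolution involves, here is a standalone sketch. It is not H2M's code: the resolve and has_scheme functions are hypothetical, the HTTPS default is an assumption, and path-relative links are resolved against the site root because only a bare domain is available.

```rust
/// Returns true when `href` starts with a URI scheme such as `https:` or `mailto:`.
fn has_scheme(href: &str) -> bool {
    match href.split_once(':') {
        Some((scheme, _)) => {
            let mut chars = scheme.chars();
            // RFC 3986: a letter, then letters, digits, '+', '-', or '.'.
            chars.next().is_some_and(|c| c.is_ascii_alphabetic())
                && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'))
        }
        None => false,
    }
}

/// Hypothetical resolver: attaches `domain` to relative links, assuming HTTPS.
fn resolve(domain: &str, href: &str) -> String {
    if has_scheme(href) || href.starts_with('#') {
        // Already absolute, or a fragment that stays on the current page.
        href.to_string()
    } else if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative URL: only the scheme is missing.
        format!("https://{rest}")
    } else if href.starts_with('/') {
        format!("https://{domain}{href}")
    } else {
        // Assumption: with no base path known, resolve against the site root.
        format!("https://{domain}/{href}")
    }
}

fn main() {
    for href in ["/about", "docs/intro", "//cdn.example.com/a.png", "mailto:hi@example.com"] {
        println!("{href} -> {}", resolve("example.com", href));
    }
}
```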

Async Scraping

Enable the scrape feature for async HTTP scraping with built-in concurrency control, rate limiting, and streaming output:

use h2m::scrape::Scraper;

let scraper = Scraper::builder()
    .concurrency(8)
    .gfm(true)
    .extract_links(true)
    .build()?;

let result = scraper.scrape("https://example.com").await?;
println!("{}", result.markdown);

let urls = vec!["https://a.com".into(), "https://b.com".into()];
scraper.scrape_many_streaming(&urls, |result| {
    match result {
        Ok(r) => println!("{}", r.markdown),
        Err(e) => eprintln!("error: {e}"),
    }
}).await;
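The idea behind concurrency control with streaming output is that at most N jobs run at once and each result is handed to the callback as soon as it is ready. The sketch below shows that pattern with the standard library only. It deliberately swaps in OS threads and a channel for the tokio tasks and semaphore that H2M uses, and run_bounded is a hypothetical function, not part of the h2m API.

```rust
use std::sync::{mpsc, Mutex};
use std::thread;

/// Runs `job` over `inputs` with at most `limit` jobs in flight and calls
/// `on_result` for each result as it completes (completion order, not input order).
fn run_bounded<T, R>(
    inputs: Vec<T>,
    limit: usize,
    job: impl Fn(T) -> R + Sync,
    mut on_result: impl FnMut(R),
) where
    T: Send,
    R: Send,
{
    let queue = Mutex::new(inputs.into_iter());
    let (tx, rx) = mpsc::channel();
    thread::scope(|scope| {
        for _ in 0..limit.max(1) {
            let tx = tx.clone();
            let (queue, job) = (&queue, &job);
            scope.spawn(move || loop {
                // Hold the lock only long enough to take the next input.
                let next = queue.lock().unwrap().next();
                let Some(input) = next else { break };
                if tx.send(job(input)).is_err() {
                    break;
                }
            });
        }
        // Drop the original sender so `rx` ends once every worker has finished.
        drop(tx);
        for result in rx {
            on_result(result);
        }
    });
}

fn main() {
    let urls = vec!["https://a.example", "https://b.example", "https://c.example"];
    // Stand-in for a fetch-and-convert step; no network access happens here.
    run_bounded(urls, 2, |url| format!("fetched {url}"), |line| println!("{line}"));
}
```

The worker count plays the role of the semaphore permits: raising it increases throughput at the cost of more simultaneous requests against the target hosts.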

Web Search

The h2m-search crate exposes the same provider abstraction the CLI uses:

use h2m_search::{SearchClient, SearchQuery};

let client = SearchClient::builder()
    .provider("searxng")
    .searxng_url("https://searx.example.org")
    .build()?;

let response = client
    .search(&SearchQuery::new("rust async").with_limit(5))
    .await?;

for hit in &response.web {
    println!("{}: {}", hit.title, hit.url);
}
# Ok::<_, Box<dyn std::error::Error>>(())

Design

  • CommonMark + GFM: full spec compliance with tables, strikethrough, task lists, reference-style links
  • Plugin architecture: extend with custom rules via the Rule trait
  • Async batch pipeline: tokio + reqwest, semaphore concurrency, streaming NDJSON (scrape feature)
  • Multi-provider search: SearchClient enum with static dispatch, one Cargo feature per provider
  • Search + scrape composition: search --scrape funnels hits through the same Scraper pipeline, reusing all formatting / extraction flags
  • JSON output: nested camelCase metadata aligned with Firecrawl conventions
  • Smart readable extraction: two-phase content detection (semantic selectors → noise stripping)
  • Zero-copy fast paths: Cow<str> escaping, zero unsafe, Send + Sync
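The Cow<str> escaping mentioned in the last bullet refers to a common Rust pattern: return the input borrowed when nothing needs to change, and allocate only when an escape is actually required. A minimal sketch of that pattern follows. The escape_markdown function and its character set are assumptions for illustration and are likely narrower than what H2M escapes.

```rust
use std::borrow::Cow;

/// Backslash-escapes a small, illustrative set of Markdown-significant characters.
/// Returns the input borrowed (no allocation) when nothing needs escaping.
fn escape_markdown(text: &str) -> Cow<'_, str> {
    const SPECIAL: &[char] = &['\\', '*', '_', '`', '[', ']'];
    // Fast path: no special character found, so hand back the original slice.
    let Some(first) = text.find(SPECIAL) else {
        return Cow::Borrowed(text);
    };
    // Slow path: copy the clean prefix, then escape the remainder.
    let mut out = String::with_capacity(text.len() + 8);
    out.push_str(&text[..first]);
    for ch in text[first..].chars() {
        if SPECIAL.contains(&ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    Cow::Owned(out)
}

fn main() {
    for input in ["plain text", "snake_case and *stars*"] {
        let escaped = escape_markdown(input);
        let kind = if matches!(escaped, Cow::Borrowed(_)) { "borrowed" } else { "owned" };
        println!("{kind}: {escaped}");
    }
}
```

Since most text nodes in a typical page contain no special characters, the borrowed path is the one taken most often.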

Supported HTML Elements

CommonMark (built-in)

Element                           Markdown Output
<h1>-<h6>                         # Heading (ATX) or underline (Setext)
<p>, <div>, <section>, <article>  Block paragraph
<strong>, <b>                     **bold**
<em>, <i>                         *italic*
<code>, <kbd>, <samp>, <tt>       `inline code`
<pre><code>                       Fenced code block with language detection
<a href="...">                    [text](url) or reference-style
<img src="..." alt="...">         ![alt](src "title")
<ul>, <ol>, <li>                  Bullet/numbered lists with nesting
<blockquote>                      > quoted text
<hr>                              ---
<br>                              Hard line break
<iframe>                          [iframe](url)
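The two heading styles in the first row differ only in how the heading level is marked. The sketch below renders both. The HeadingStyle enum and render_heading function are hypothetical names, and falling back to ATX for levels 3 to 6 is an assumption based on the fact that CommonMark defines Setext underlines only for levels 1 and 2.

```rust
#[derive(Clone, Copy)]
enum HeadingStyle {
    Atx,
    Setext,
}

/// Renders a heading of the given level (clamped to 1..=6) in the chosen style.
fn render_heading(level: usize, text: &str, style: HeadingStyle) -> String {
    let level = level.clamp(1, 6);
    // The underline matches the text length for readability; any length >= 1 is valid.
    let underline = |ch: &str| ch.repeat(text.chars().count().max(1));
    match (style, level) {
        (HeadingStyle::Setext, 1) => format!("{text}\n{}", underline("=")),
        (HeadingStyle::Setext, 2) => format!("{text}\n{}", underline("-")),
        // ATX style, and the assumed fallback for Setext levels 3 to 6.
        _ => format!("{} {text}", "#".repeat(level)),
    }
}

fn main() {
    println!("{}\n", render_heading(1, "Title", HeadingStyle::Atx));
    println!("{}\n", render_heading(1, "Title", HeadingStyle::Setext));
    println!("{}", render_heading(3, "Deep", HeadingStyle::Setext));
}
```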

GFM Extensions (with --gfm)

Element                  Markdown Output
<table>                  GFM pipe table with alignment
<del>, <s>, <strike>     ~~strikethrough~~
<input type="checkbox">  [x] or [ ] (task list)
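For readers unfamiliar with GFM pipe tables, the sketch below shows how column alignment is encoded in the delimiter row and why literal pipes inside cells must be escaped. It is a standalone illustration, not H2M's table rule; Align, delimiter, escape_cell, and render_table are hypothetical names.

```rust
#[derive(Clone, Copy)]
enum Align {
    Left,
    Center,
    Right,
    Unset,
}

/// Delimiter-row cell: colons mark the aligned side(s).
fn delimiter(align: Align) -> &'static str {
    match align {
        Align::Left => ":---",
        Align::Center => ":---:",
        Align::Right => "---:",
        Align::Unset => "---",
    }
}

/// A literal pipe inside a cell would end the cell, so it is backslash-escaped.
fn escape_cell(cell: &str) -> String {
    cell.replace('|', "\\|")
}

fn line(cells: Vec<String>) -> String {
    format!("| {} |", cells.join(" | "))
}

/// Renders a header row, a delimiter row, and the body rows.
fn render_table(header: &[&str], aligns: &[Align], rows: &[Vec<&str>]) -> String {
    let mut out = vec![
        line(header.iter().map(|c| escape_cell(c)).collect()),
        line(aligns.iter().map(|a| delimiter(*a).to_string()).collect()),
    ];
    for row in rows {
        out.push(line(row.iter().map(|c| escape_cell(c)).collect()));
    }
    out.join("\n")
}

fn main() {
    let table = render_table(
        &["Name", "Qty"],
        &[Align::Left, Align::Right],
        &[vec!["apples", "3"], vec!["a|b", "10"]],
    );
    println!("{table}");
}
```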

Auto-removed

Element     Behavior
<script>    Removed (content stripped)
<style>     Removed (content stripped)
<noscript>  Removed (content stripped)

Custom Rules

Extend the converter with your own rules by implementing the Rule trait:

use h2m::{Converter, Rule, Action, Context};
use h2m::rules::CommonMark;
use scraper::ElementRef;

#[derive(Debug)]
struct HighlightRule;
impl Rule for HighlightRule {
    fn tags(&self) -> &'static [&'static str] { &["mark"] }

    fn apply(&self, content: &str, _el: &ElementRef<'_>, _ctx: &mut Context<'_>) -> Action {
        Action::Replace(format!("=={content}=="))
    }
}

let mut builder = Converter::builder()
    .use_plugin(CommonMark);
builder.add_rule(HighlightRule);
let converter = builder.build();

let md = converter.convert("<p>This is <mark>important</mark></p>");
assert!(md.contains("==important=="));
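To show the general shape of a plugin-style rule system without depending on the crate, here is a toy model of tag-based dispatch. Every name in it (ToyRule, Highlight, Registry) is hypothetical and much simpler than H2M's actual Rule, Action, and Context types. The choice that a later registration overrides an earlier one for the same tag is also an assumption.

```rust
use std::collections::HashMap;

/// Toy counterpart of a conversion rule: which tags it handles and how.
trait ToyRule {
    fn tags(&self) -> &'static [&'static str];
    fn apply(&self, content: &str) -> String;
}

struct Highlight;

impl ToyRule for Highlight {
    fn tags(&self) -> &'static [&'static str] {
        &["mark"]
    }

    fn apply(&self, content: &str) -> String {
        format!("=={content}==")
    }
}

/// Maps each tag name to the rule that handles it.
#[derive(Default)]
struct Registry {
    rules: Vec<Box<dyn ToyRule>>,
    by_tag: HashMap<&'static str, usize>,
}

impl Registry {
    fn add(&mut self, rule: impl ToyRule + 'static) {
        let index = self.rules.len();
        for &tag in rule.tags() {
            // Assumption: a later registration replaces an earlier one.
            self.by_tag.insert(tag, index);
        }
        self.rules.push(Box::new(rule));
    }

    /// Applies the matching rule, or passes the content through unchanged.
    fn render(&self, tag: &str, content: &str) -> String {
        match self.by_tag.get(tag) {
            Some(&index) => self.rules[index].apply(content),
            None => content.to_string(),
        }
    }
}

fn main() {
    let mut registry = Registry::default();
    registry.add(Highlight);
    println!("{}", registry.render("mark", "important"));
    println!("{}", registry.render("span", "untouched"));
}
```

Looking rules up by tag name keeps dispatch cost independent of how many rules are registered, which matters once several plugins are stacked.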

Feature Flags

h2m crate:

  • scrape — async HTTP scraping (adds tokio, reqwest, serde)

h2m-search crate:

  • searxng (default) — SearXNG provider
  • brave — Brave Search API provider
  • tavily — Tavily API provider
  • all — all providers

h2m-cli binary:

  • search (default) — enables the search subcommand
  • all-providers — bundles every provider (used for release builds)

Migration from 0.5

v0.6 introduced a subcommand tree. Update every invocation:

Before (0.5)             After (0.6)
h2m https://example.com  h2m convert https://example.com
h2m --gfm page.html      h2m convert --gfm page.html
curl -s URL | h2m -r     curl -s URL | h2m convert -r
(new)                    h2m search "query" --scrape

All convert flags are unchanged; the only difference is that the leading convert subcommand is now required.

License

Licensed under either of:

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project shall be dual-licensed as above, without any additional terms or conditions.


A QNTX open-source project.

Code is law. We write both.