supermarkdown
High-performance HTML to Markdown converter with full GitHub Flavored Markdown support. Written in Rust, available for Node.js and as a native Rust crate.
Features
- Fast - Written in Rust with O(n) algorithms, significantly faster than JavaScript alternatives
- Full GFM Support - Tables with alignment, strikethrough, autolinks, fenced code blocks
- Accurate - Handles malformed HTML gracefully via html5ever
- Configurable - Multiple heading styles, link styles, custom selectors
- Zero Dependencies - Single native binary, no JavaScript runtime overhead
- Cross-Platform - Pre-built binaries for Windows, macOS, and Linux (x64 & ARM64)
- TypeScript Ready - Full type definitions included
- Async Support - Non-blocking conversion for large documents
Installation
Node.js
npm install @vakra-dev/supermarkdown
Rust
CLI
Install the CLI binary via cargo:
Command Line Usage
The CLI allows you to convert HTML files from the command line or via stdin:
# Convert a file
# Pipe HTML from curl
|
# Exclude navigation and ads
# Use setext-style headings and referenced links
CLI Options
| Option | Description |
|---|---|
| -h, --help | Print help message |
| -v, --version | Print version |
| --heading-style <STYLE> | atx (default) or setext |
| --link-style <STYLE> | inline (default) or referenced |
| --code-fence <CHAR> | ` (default) or ~ |
| --bullet <CHAR> | - (default), *, or + |
| --exclude <SELECTORS> | CSS selectors to exclude (comma-separated) |
Quick Start
import { convert } from "@vakra-dev/supermarkdown";
const html = `
<h1>Hello World</h1>
<p>This is a <strong>test</strong> with a <a href="https://example.com">link</a>.</p>
`;
const markdown = convert(html);
console.log(markdown);
// # Hello World
//
// This is a **test** with a [link](https://example.com).
Common Use Cases
Cleaning Web Scrapes
When scraping websites, HTML often contains navigation, ads, and other non-content elements. Use selectors to extract only what you need:
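import { convert } from "@vakra-dev/supermarkdown";
// Raw HTML from a web scrape (placeholder URL)
const scrapedHtml = await fetch("https://example.com").then((res) => res.text());
// Clean conversion - remove nav, ads, sidebars (example selectors; adjust to the page's markup)
const markdown = convert(scrapedHtml, { excludeSelectors: ["nav", ".ad", ".sidebar"] });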
Preparing Content for LLMs
When feeding web content to LLMs, you want clean, focused text without HTML artifacts:
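import { convert } from "@vakra-dev/supermarkdown";
// Extract just the article content for RAG pipelines (example selectors; adjust to the page's markup)
const markdown = convert(html, { excludeSelectors: ["nav", "header", "footer", "aside"] });
// Now feed to your LLM (llm and complete() stand in for your own client)
const response = await llm.complete(markdown);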
Processing Blog Posts
Convert blog HTML while preserving code blocks and formatting:
import { convert } from "@vakra-dev/supermarkdown";
const blogHtml = `
<article>
<h1>Getting Started with Rust</h1>
<p>Rust is a systems programming language focused on safety.</p>
<pre><code class="language-rust">fn main() {
println!("Hello, world!");
}</code></pre>
<p>The <code>println!</code> macro prints to stdout.</p>
</article>
`;
const markdown = convert(blogHtml);
// Output:
// # Getting Started with Rust
//
// Rust is a systems programming language focused on safety.
//
// ```rust
// fn main() {
// println!("Hello, world!");
// }
// ```
//
// The `println!` macro prints to stdout.
Converting Documentation Pages
Handle tables, definition lists, and nested structures common in docs:
import { convert } from "@vakra-dev/supermarkdown";
const docsHtml = `
<h2>API Reference</h2>
<table>
<tr><th>Method</th><th>Description</th></tr>
<tr><td><code>convert()</code></td><td>Sync conversion</td></tr>
<tr><td><code>convertAsync()</code></td><td>Async conversion</td></tr>
</table>
<dl>
<dt>headingStyle</dt>
<dd>ATX (#) or Setext (underlines)</dd>
</dl>
`;
const markdown = convert(docsHtml);
// Output:
// ## API Reference
//
// | Method | Description |
// | --- | --- |
// | `convert()` | Sync conversion |
// | `convertAsync()` | Async conversion |
//
// headingStyle
// : ATX (#) or Setext (underlines)
Batch Processing
Process multiple documents efficiently with async conversion:
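import { convertAsync } from "@vakra-dev/supermarkdown";
const urls = ["https://example.com/page1", "https://example.com/page2"]; // placeholder URLs
// Fetch and convert in parallel
const markdownDocs = await Promise.all(
  urls.map(async (url) => convertAsync(await (await fetch(url)).text()))
);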
Usage
Basic Conversion
import { convert } from "@vakra-dev/supermarkdown";
const markdown = convert(html);
With Options
import { convert } from "@vakra-dev/supermarkdown";
const markdown = convert(html, {
  headingStyle: "setext",
  linkStyle: "referenced",
});
Async Conversion
For large documents, use convertAsync to avoid blocking the main thread:
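import { convertAsync } from "@vakra-dev/supermarkdown";
const markdown = await convertAsync(largeHtml);
// Process multiple documents in parallel (htmlDocs is an array of HTML strings)
const results = await Promise.all(htmlDocs.map((doc) => convertAsync(doc)));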
API Reference
convert(html, options?)
Converts HTML to Markdown synchronously.
Parameters:
- html (string) - The HTML string to convert
- options (object, optional) - Conversion options
Returns: string - The converted Markdown
convertAsync(html, options?)
Converts HTML to Markdown asynchronously.
Parameters:
- html (string) - The HTML string to convert
- options (object, optional) - Conversion options
Returns: Promise<string> - A promise that resolves to the converted Markdown
Options
| Option | Type | Default | Description |
|---|---|---|---|
| headingStyle | 'atx' \| 'setext' | 'atx' | ATX uses # prefix, Setext uses underlines |
| linkStyle | 'inline' \| 'referenced' | 'inline' | Inline: [text](url), Referenced: [text][1] |
| codeFence | '`' \| '~' | '`' | Character for fenced code blocks |
| bulletMarker | '-' \| '*' \| '+' | '-' | Character for unordered list items |
| baseUrl | string | undefined | Base URL for resolving relative links |
| excludeSelectors | string[] | [] | CSS selectors for elements to exclude |
| includeSelectors | string[] | [] | CSS selectors to force keep (overrides excludes) |
Supported Elements
Block Elements
| HTML | Markdown |
|---|---|
| <h1> - <h6> | # headings or setext underlines |
| <p> | Paragraphs with blank lines |
| <blockquote> | > quoted blocks (supports nesting) |
| <ul>, <ol> | - or 1. lists (supports start attribute) |
| <pre><code> | Fenced code blocks with language detection |
| <table> | GFM tables with alignment and captions |
| <hr> | --- horizontal rules |
| <dl>, <dt>, <dd> | Definition lists |
| <details>, <summary> | Collapsible sections |
| <figure>, <figcaption> | Images with captions |
Inline Elements
| HTML | Markdown |
|---|---|
| <a> | [text](url), [text][ref], or <url> (autolink) |
| <img> | ![alt](src) |
| <strong>, <b> | **bold** |
| <em>, <i> | *italic* |
| <code> | `code` (handles nested backticks) |
| <del>, <s>, <strike> | ~~strikethrough~~ |
| <sub> | <sub>subscript</sub> |
| <sup> | <sup>superscript</sup> |
| <br> | Line breaks |
HTML Passthrough
Elements without Markdown equivalents are preserved as HTML:
- <kbd> - Keyboard input
- <mark> - Highlighted text
- <abbr> - Abbreviations (preserves title attribute)
- <samp> - Sample output
- <var> - Variables
Advanced Features
Table Alignment
Extracts alignment from align attribute or text-align style:
Left
Center
Right
Output:
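GFM encodes column alignment with colons in the delimiter row. As a rough illustration of that mapping, here is a minimal sketch; `delimiterCell` and `headerRows` are hypothetical helpers, not supermarkdown's code, and the exact dash count the library emits is not confirmed here.

```javascript
// Hypothetical helper: map a column alignment to a GFM delimiter cell.
function delimiterCell(align) {
  switch (align) {
    case "left":
      return ":---";
    case "center":
      return ":---:";
    case "right":
      return "---:";
    default:
      return "---"; // no alignment specified
  }
}

// Build the header row and the delimiter row for a list of columns.
function headerRows(columns) {
  const header = "| " + columns.map((c) => c.text).join(" | ") + " |";
  const delimiter =
    "| " + columns.map((c) => delimiterCell(c.align)).join(" | ") + " |";
  return header + "\n" + delimiter;
}

console.log(
  headerRows([
    { text: "Left", align: "left" },
    { text: "Center", align: "center" },
    { text: "Right", align: "right" },
  ])
);
// | Left | Center | Right |
// | :--- | :---: | ---: |
```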
Ordered List Start
Respects the start attribute on ordered lists:
<ol start="5">
<li>Fifth item</li>
<li>Sixth item</li>
</ol>
Output:
5. Fifth item
6. Sixth item
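The numbering rule can be sketched in a few lines. This is an illustration only, assuming a missing or invalid start attribute falls back to 1; `parseStart` and `renderOrderedList` are hypothetical names, not part of the supermarkdown API.

```javascript
// Hypothetical helper: read the start attribute, falling back to 1.
function parseStart(attr) {
  const n = Number.parseInt(attr, 10);
  return Number.isNaN(n) ? 1 : n;
}

// Number each item, beginning at the list's start value.
function renderOrderedList(items, startAttr) {
  const start = parseStart(startAttr);
  return items.map((text, i) => `${start + i}. ${text}`).join("\n");
}

console.log(renderOrderedList(["Fifth item", "Sixth item"], "5"));
// 5. Fifth item
// 6. Sixth item
```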
Autolinks
When a link's text matches its URL or email, autolink syntax is used:
https://example.com
test@example.com
Output:
<https://example.com>
<test@example.com>
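The decision between autolink and inline syntax can be sketched as follows. This is not the library's implementation; `renderLink` is a hypothetical helper, and it assumes email links carry a `mailto:` href whose address equals the link text.

```javascript
// Hypothetical helper: choose autolink syntax when the text equals the target.
function renderLink(text, href) {
  // Assumption: email links use a mailto: href; compare against the bare address.
  const target = href.startsWith("mailto:") ? href.slice("mailto:".length) : href;
  if (text === target) {
    return `<${target}>`;
  }
  return `[${text}](${href})`;
}

console.log(renderLink("https://example.com", "https://example.com"));
// <https://example.com>
console.log(renderLink("test@example.com", "mailto:test@example.com"));
// <test@example.com>
console.log(renderLink("link", "https://example.com"));
// [link](https://example.com)
```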
Code Block Language Detection
Automatically detects language from class names:
- language-* (e.g., language-rust)
- lang-* (e.g., lang-python)
- highlight-* (e.g., highlight-go)
- hljs-* (highlight.js classes, excluding token classes like hljs-keyword)
- Bare language names (e.g., javascript, python) as fallback
fn main() {}
Output:
```rust
fn main() {}
```
Code blocks containing backticks automatically use more backticks as delimiters.
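The class-name patterns listed in this section can be approximated with a small lookup. The sketch below covers only the three prefix patterns and the bare-name fallback; it omits hljs-* handling, and the bare-name allowlist is an assumption, since the library's real list is not documented here. `detectLanguage` is a hypothetical name.

```javascript
const PREFIXES = ["language-", "lang-", "highlight-"];
// Assumed allowlist for the bare-name fallback (illustrative only).
const BARE_NAMES = new Set(["javascript", "python", "rust", "go"]);

// Hypothetical helper: return the language named in a class attribute, or null.
function detectLanguage(className) {
  const classes = className.split(/\s+/).filter(Boolean);
  // Prefixed class names take priority.
  for (const cls of classes) {
    for (const prefix of PREFIXES) {
      if (cls.startsWith(prefix) && cls.length > prefix.length) {
        return cls.slice(prefix.length);
      }
    }
  }
  // Fall back to a bare language name.
  for (const cls of classes) {
    if (BARE_NAMES.has(cls)) return cls;
  }
  return null;
}

console.log(detectLanguage("language-rust")); // rust
console.log(detectLanguage("hljs javascript")); // javascript
console.log(detectLanguage("code-sample")); // null
```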
Line Number Handling
Line number gutters are automatically stripped from code blocks. Elements with these class patterns are skipped:
- gutter
- line-number
- line-numbers
- lineno
- linenumber
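A check for these class names might look like the sketch below. It assumes whole-token matching against the class attribute; whether supermarkdown matches whole tokens or substrings is not stated here, and `isLineNumberGutter` is a hypothetical name.

```javascript
const GUTTER_CLASSES = new Set([
  "gutter",
  "line-number",
  "line-numbers",
  "lineno",
  "linenumber",
]);

// Hypothetical helper: true when any class token names a line-number gutter.
function isLineNumberGutter(className) {
  return className
    .split(/\s+/)
    .filter(Boolean)
    .some((cls) => GUTTER_CLASSES.has(cls));
}

console.log(isLineNumberGutter("td gutter")); // true
console.log(isLineNumberGutter("code")); // false
```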
URL Encoding
Spaces and parentheses in URLs are automatically percent-encoded:
// <a href="https://example.com/path (1)">link</a>
// → [link](https://example.com/path%20%281%29)
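The encoding step shown above can be reproduced with a single replace. This is a minimal sketch that handles only spaces and parentheses, the characters named in this section; `encodeUrlForMarkdown` is a hypothetical helper, not a function exported by supermarkdown.

```javascript
// Hypothetical helper: percent-encode characters that would break (url) syntax.
function encodeUrlForMarkdown(url) {
  return url.replace(
    /[ ()]/g,
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
}

console.log(encodeUrlForMarkdown("https://example.com/path (1)"));
// https://example.com/path%20%281%29
```

Encoding parentheses matters because an unescaped `)` in the URL would end the Markdown link destination early.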
Selector-Based Filtering
Remove unwanted elements like navigation, ads, or sidebars:
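// Example selectors; adjust to the page's markup
const markdown = convert(html, { excludeSelectors: ["nav", ".ad", ".sidebar"] });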
Limitations
Some HTML features cannot be fully represented in Markdown:
| Feature | Behavior |
|---|---|
| Table colspan/rowspan | Content placed in first cell |
| Nested tables | Inner tables converted inline |
| Form elements | Skipped |
| iframe/video/audio | Skipped (no standard Markdown equivalent) |
| CSS styling | Ignored (except text-align for tables) |
| Empty elements | Removed from output |
Edge Cases
supermarkdown handles many edge cases gracefully:
Malformed HTML
Invalid or malformed HTML is parsed via html5ever, which applies browser-like error recovery:
// Missing closing tags, nested issues - all handled
const html = "<p>Unclosed paragraph<div>Mixed<p>nesting</div>";
const markdown = convert(html); // Produces sensible output
Deeply Nested Lists
Nested lists maintain proper indentation:
const html = `
<ul>
<li>Level 1
<ul>
<li>Level 2
<ul>
<li>Level 3</li>
</ul>
</li>
</ul>
</li>
</ul>`;
// Output:
// - Level 1
// - Level 2
// - Level 3
Code Blocks with Backticks
When code contains backticks, the fence automatically uses more backticks:
const html = "<pre><code>Use `backticks` for code</code></pre>";
// Output uses 4 backticks as fence:
// ````
// Use `backticks` for code
// ````
Empty Elements
Empty paragraphs, divs, and spans are stripped to avoid blank lines:
const html = "<p></p><p>Real content</p><p> </p>";
const markdown = convert(html);
// Output: "Real content" (empty paragraphs removed)
Special Characters in URLs
Spaces, parentheses, and other special characters in URLs are percent-encoded:
const html = '<a href="https://example.com/file (1).pdf">Download</a>';
// Output: [Download](https://example.com/file%20%281%29.pdf)
Tables Without Headers
Tables missing <thead> use the first row as header:
const html = `
<table>
<tr><td>A</td><td>B</td></tr>
<tr><td>1</td><td>2</td></tr>
</table>`;
// Output:
// | A | B |
// | --- | --- |
// | 1 | 2 |
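The first-row-as-header rule is easy to picture with plain arrays. The sketch below is illustrative, not the library's code; `renderTable` is a hypothetical helper that takes rows already extracted from the table cells.

```javascript
// Hypothetical helper: render rows as a GFM table, promoting the first row to header.
function renderTable(rows) {
  const [header, ...body] = rows;
  const line = (cells) => "| " + cells.join(" | ") + " |";
  return [
    line(header),
    line(header.map(() => "---")), // delimiter row, one cell per column
    ...body.map(line),
  ].join("\n");
}

console.log(
  renderTable([
    ["A", "B"],
    ["1", "2"],
  ])
);
// | A | B |
// | --- | --- |
// | 1 | 2 |
```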
Mixed Content in Lists
List items with mixed block/inline content are handled:
const html = `
<ul>
<li>Simple item</li>
<li>
<p>Paragraph in list</p>
<pre><code>code block</code></pre>
</li>
</ul>`;
// Outputs proper markdown with preserved formatting
Troubleshooting
Empty or Minimal Output
Problem: convert() returns an empty string or very little content.
Causes & Solutions:
- Content is in excluded elements - Check whether your content is inside nav, header, or similar elements that might match default patterns
// Try without selectors first
const markdown = convert(html);
- JavaScript-rendered content - supermarkdown converts static HTML only. If the page uses client-side rendering, render it first (e.g., with Puppeteer or Playwright)
- Content in iframes - iframe content is not extracted. Fetch the iframe src separately if needed
Missing Code Block Language
Problem: Code blocks don't have language annotation.
Solution: supermarkdown looks for language-*, lang-*, or highlight-* class patterns. Ensure your HTML uses standard class naming:
<!-- Detected -->
...
...
<!-- Not detected -->
...
Tables Not Rendering Correctly
Problem: Tables appear as plain text or are malformed.
Causes & Solutions:
- Missing table structure - Ensure proper <table>, <tr>, <td> structure
- Nested tables - GFM doesn't support nested tables; inner tables are flattened
- colspan/rowspan - These are not supported in GFM; content goes in the first cell
Links Missing or Broken
Problem: Links don't appear or have wrong URLs.
Solutions:
- Relative URLs - Use the baseUrl option to resolve relative links:
const markdown = convert(html, { baseUrl: "https://example.com" });
- Links in excluded elements - Navigation links are often in <nav>, which may be excluded
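To see what resolving against a base URL produces, the standard `URL` constructor available in Node.js and browsers is a handy reference. The sketch below assumes supermarkdown's resolution behaves like WHATWG URL resolution, which this README does not confirm; `resolveHref` is a hypothetical helper.

```javascript
// Hypothetical helper: resolve an href against an optional base URL.
function resolveHref(href, baseUrl) {
  if (!baseUrl) return href; // nothing to resolve against
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return href; // leave unparseable values untouched
  }
}

console.log(resolveHref("/docs/intro", "https://example.com/blog/post"));
// https://example.com/docs/intro
console.log(resolveHref("images/a.png", "https://example.com/blog/post"));
// https://example.com/blog/images/a.png
console.log(resolveHref("https://other.example/x", "https://example.com"));
// https://other.example/x
```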
Performance Issues with Large Documents
Problem: Conversion is slow for very large HTML files.
Solutions:
- Use async - convertAsync() won't block the event loop
- Pre-filter HTML - Remove obvious non-content before conversion
- Stream processing - For very large docs, consider splitting into sections
Special Characters Appearing Wrong
Problem: Characters like <, >, & appear as entities.
Solution: This is usually correct behavior, since these characters need escaping in Markdown. If you're seeing &amp; where you expect &, the source HTML may have double-encoded entities.
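A quick way to check the source HTML for double encoding is to look for an encoded ampersand followed directly by an entity name. The sketch below is a heuristic only and can flag text that intentionally displays an entity literally; `hasDoubleEncodedEntities` is a hypothetical helper.

```javascript
// Hypothetical helper: detect entities that were encoded twice, e.g. &amp;amp;
function hasDoubleEncodedEntities(html) {
  return /&amp;(?:[a-zA-Z]+|#\d+|#x[0-9a-fA-F]+);/.test(html);
}

console.log(hasDoubleEncodedEntities("<p>Tom &amp;amp; Jerry</p>")); // true
console.log(hasDoubleEncodedEntities("<p>Tom &amp; Jerry</p>")); // false
```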
Rust Usage
Add to your Cargo.toml:
[dependencies]
supermarkdown = "0.0.2"
use ;
// Basic conversion
let markdown = convert;
// With options
let options = new
.heading_style
.exclude_selectors;
let markdown = convert_with_options;
Performance
supermarkdown is designed for high performance:
- Single-pass parsing - O(n) HTML traversal
- Pre-computed metadata - List indices and CSS selectors computed in one pass
- Zero-copy where possible - Minimal string allocations
- Native code - No JavaScript runtime overhead
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
# Clone the repository
# Run tests
# Build Node.js bindings
License
MIT License - see LICENSE for details.