# decruft

Extract clean, readable content from web pages. A Rust port of defuddle.
## What it does

Given a noisy HTML page (ads, navigation, sidebars, popups, tracking pixels, cookie banners...), decruft extracts the main content and metadata:

- **Content** — the article/post body as clean HTML
- **Metadata** — title, author, published date, description, image, language, site name, favicon
- **Schema.org** — parsed JSON-LD data
## Install

```sh
cargo install decruft
```

Or add to your `Cargo.toml`:

```toml
[dependencies]
decruft = "0.1"
```
## CLI

```sh
# Fetch a URL (auto-detected)
decruft https://example.com/article

# From file
decruft page.html

# From stdin
curl -s https://example.com/article | decruft -

# Output formats: json (default), html, text, markdown
decruft page.html --format markdown

# Debug mode (shows what was removed and why)
decruft page.html --debug
```
### Options

```
Usage: decruft [OPTIONS] [INPUT]

Arguments:
  [INPUT]  Path to an HTML file to process. Use - for stdin [default: -]

Options:
  -u, --url <URL>             URL (for resolving relative URLs and metadata)
  -s, --selector <SELECTOR>   CSS selector to override content root detection
  -f, --format <FORMAT>       Output format: json, html, text, or markdown [default: json]
  -d, --debug                 Include removal details in output
      --no-images             Strip all images
      --no-exact-selectors    Disable exact CSS selector removal
      --no-partial-selectors  Disable partial class/id pattern removal
      --no-hidden             Disable hidden element removal
      --no-scoring            Disable content scoring removal
      --no-patterns           Disable content pattern removal
      --no-standardize        Disable content standardization
      --no-replies            Exclude replies/comments from extracted content
  -h, --help                  Print help
  -V, --version               Print version
```
## Library

### Quick start

```rust
use decruft::parse_with_defaults;

// One-liner with defaults (the exact signature is assumed: HTML in, result out)
let html = std::fs::read_to_string("page.html")?;
let result = parse_with_defaults(&html)?;
assert!(!result.content.is_empty());
```
### With options

```rust
// `DecruftOptions` is the assumed name of the options struct.
use decruft::{parse, DecruftOptions};

let html = r#"<html>
<head><title>My Article - Blog Name</title></head>
<body>
<nav><a href="/">Home</a></nav>
<article>
<h1>My Article</h1>
<p>The actual content you want.</p>
</article>
<footer>Copyright 2025</footer>
</body>
</html>"#;

let mut options = DecruftOptions::default();
options.url = Some("https://example.com/my-article".to_string());

let result = parse(html, &options)?;
// Stripping the "- Blog Name" suffix is an illustrative expectation.
assert_eq!(result.metadata.title.as_deref(), Some("My Article"));
assert!(result.content.contains("The actual content you want."));
assert!(!result.content.contains("Copyright"));
```
### Result structure

`DecruftResult` metadata fields (`title`, `author`, `published`, `description`, `image`, `language`, `domain`, `favicon`, `site`, `canonical_url`, `content_type`, `modified`) are `Option<String>` — absent metadata is `None`, not an empty string. In JSON output, `None` fields are omitted entirely.

`content` is always a `String` (cleaned HTML). Set `options.markdown = true` to get markdown in `content` instead. Set `options.separate_markdown = true` to keep HTML in `content` and also get markdown in `content_markdown`.
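For example, a page with no detected author or image might serialize along these lines (an illustrative shape, not captured from a real run; the nesting under `metadata` is an assumption):

```json
{
  "content": "<article><h1>My Article</h1><p>...</p></article>",
  "metadata": {
    "title": "My Article",
    "language": "en"
  }
}
```

The undetected fields are simply absent, rather than present as `null` or `""`.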
### Fetching pages

The `fetch_page` function fetches URLs with browser-like defaults:

```rust
use decruft::{fetch_page, parse, DecruftOptions};

// fetch_page's exact signature is assumed here: URL in, HTML string out.
let url = "https://example.com/article";
let html = fetch_page(url).unwrap();

let mut options = DecruftOptions::default();
options.url = Some(url.to_string());
let result = parse(&html, &options)?;
```
## What gets removed

| Category | Examples |
|---|---|
| Ads | `.ad`, `[data-ad-wrapper]`, `.adsense`, `.promo` |
| Navigation | `<nav>`, `.menu`, `.navbar`, `[role="navigation"]` |
| Sidebars | `<aside>`, `.sidebar`, `[role="complementary"]` |
| Social | `.share`, `.social`, share buttons, follow widgets |
| Comments | `#comments`, `.comments-section` |
| Footers | `<footer>`, copyright notices |
| Popups | `.modal`, `.overlay`, `.popup`, cookie banners |
| Hidden | `display:none`, `visibility:hidden`, `[hidden]` |
| Metadata clutter | Bylines, read time, breadcrumbs, tags, TOC |
| Related content | "You might also like", "More stories", card grids |
| Newsletter CTAs | Subscribe forms, email signup blocks |
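The partial class/id matching can be sketched as a case-insensitive substring check; the patterns below are a tiny illustrative subset, not decruft's actual list:

```rust
/// Returns true if a class or id value looks like clutter.
/// The pattern list here is a small illustrative subset, not decruft's real one.
fn looks_like_clutter(attr: &str) -> bool {
    const PATTERNS: &[&str] = &["sidebar", "promo", "adsense", "newsletter", "share", "popup"];
    let attr = attr.to_ascii_lowercase();
    PATTERNS.iter().any(|p| attr.contains(p))
}
```

Substring matching is deliberately aggressive and can misfire (e.g. `share` inside an unrelated class name), which is one reason the CLI exposes `--no-partial-selectors` to turn this stage off.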
## Extraction pipeline

1. Parse HTML and extract schema.org JSON-LD
2. Extract metadata (title, author, date, etc.) via priority chains across meta tags, schema.org, and the DOM
3. Try site-specific extractors (GitHub, Reddit, Hacker News, X/Twitter, Substack, C2 Wiki, BBCode, AI chat conversations)
4. Find the main content element using scored entry-point selectors
5. Standardize math, footnotes, callouts, and code blocks into canonical formats
6. Remove ads, navigation, sidebars, and other clutter via CSS selectors
7. Remove elements matching ~500 partial class/id patterns
8. Score and remove non-content blocks (link-dense, nav indicators)
9. Remove content patterns (bylines, read time, boilerplate, related posts)
10. Standardize output (clean attributes, normalize headings, resolve URLs, deduplicate images)
11. Retry with progressively relaxed filters if too little content was extracted
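The final retry step can be sketched as a loop over decreasing strictness levels. This is a schematic under assumed names and an assumed length threshold, not decruft's actual control flow:

```rust
/// Schematic of "retry with progressively relaxed filters": run the
/// pipeline at decreasing strictness until enough content survives.
/// `pipeline` stands in for the real extraction; the 200-char threshold
/// is illustrative, not decruft's real heuristic.
fn extract_with_relaxation<F>(html: &str, pipeline: F) -> String
where
    F: Fn(&str, u8) -> String, // (html, strictness level) -> extracted content
{
    for strictness in (0..=3u8).rev() {
        let content = pipeline(html, strictness);
        if content.len() >= 200 {
            return content; // enough content extracted at this strictness
        }
    }
    pipeline(html, 0) // most relaxed pass as a last resort
}
```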
## Metadata priority chains

Each metadata field is extracted using a fallback chain:

- **Title:** `og:title` > `twitter:title` > schema.org `headline` > `<meta name="title">` > `<title>`
- **Author:** `<meta name="author">` > schema.org `author.name` > `[itemprop="author"]` > `.author`
- **Published:** schema.org `datePublished` > `article:published_time` > `<time>` element
- **Description:** `<meta name="description">` > `og:description` > `twitter:description` > schema.org
- **Image:** `og:image` > `twitter:image` > schema.org `image`
- **Language:** `<html lang>` > `content-language` meta > `og:locale`
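A fallback chain like these maps naturally onto Rust's `Option::or`; a minimal sketch for the title chain (the function and parameter names are hypothetical, not decruft's API):

```rust
/// First Some(...) in the chain wins, mirroring the title fallback order:
/// og:title > twitter:title > schema.org headline > <meta name="title"> > <title>.
fn resolve_title(
    og_title: Option<String>,
    twitter_title: Option<String>,
    schema_headline: Option<String>,
    meta_title: Option<String>,
    title_tag: Option<String>,
) -> Option<String> {
    og_title
        .or(twitter_title)
        .or(schema_headline)
        .or(meta_title)
        .or(title_tag)
}
```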
## Acknowledgements
Inspired by defuddle by Steph Ango. The selector lists, scoring heuristics, and extraction pipeline are adapted from defuddle's approach.
Test fixtures include pages from defuddle's test suite and Mozilla's Readability.js test suite (via readabilityrs).
## License
MIT