decruft
Extract clean, readable content from web pages. A Rust port of defuddle.
What it does
Given a noisy HTML page (ads, navigation, sidebars, popups, tracking pixels, cookie banners...), decruft extracts the main content and metadata:
- Content — the article/post body as clean HTML
- Metadata — title, author, published date, description, image, language, site name, favicon
- Schema.org — parsed JSON-LD data
Install
To use it as a library, add it to your Cargo.toml:
[dependencies]
decruft = "0.1"
CLI
# Fetch a URL (auto-detected)
decruft <URL>
# From file
decruft <FILE>
# From stdin
<COMMAND> | decruft -
# Output formats: json (default), html, text, markdown
decruft -f markdown <FILE>
# Debug mode (shows what was removed and why)
decruft --debug <FILE>
Options
Usage: decruft [OPTIONS] [INPUT]
Arguments:
[INPUT] Path to an HTML file to process. Use - for stdin [default: -]
Options:
-u, --url <URL> URL (for resolving relative URLs and metadata)
-s, --selector <SELECTOR> CSS selector to override content root detection
-f, --format <FORMAT> Output format: json, html, text, or markdown [default: json]
-d, --debug Include removal details in output
--no-images Strip all images
--no-exact-selectors Disable exact CSS selector removal
--no-partial-selectors Disable partial class/id pattern removal
--no-hidden Disable hidden element removal
--no-scoring Disable content scoring removal
--no-patterns Disable content pattern removal
--no-standardize Disable content standardization
--no-replies Exclude replies/comments from extracted content
-h, --help Print help
-V, --version Print version
Library
Quick start
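// One-liner with defaults
// Assumes `parse_with_defaults` takes the HTML as a string and the result exposes `content`.
let html = "<html><body><article><p>Hello world</p></article></body></html>";
let result = decruft::parse_with_defaults(html);
assert!(result.content.contains("Hello world"));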
With options
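// `DecruftOptions` is an assumed name for the options type.
use decruft::{parse, DecruftOptions};
let html = r#"<html>
<head><title>My Article - Blog Name</title></head>
<body>
<nav><a href="/">Home</a></nav>
<article>
<h1>My Article</h1>
<p>The actual content you want.</p>
</article>
<footer>Copyright 2025</footer>
</body>
</html>"#;
let mut options = DecruftOptions::default();
options.url = Some("https://example.com/my-article".into()); // placeholder URL
// The `parse(html, &options)` signature and the `title`/`content` fields are assumptions.
let result = parse(html, &options);
assert_eq!(result.title, "My Article");
assert!(result.content.contains("actual content"));
assert!(!result.content.contains("Copyright"));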
What gets removed
| Category | Examples |
|---|---|
| Ads | .ad, [data-ad-wrapper], .adsense, .promo |
| Navigation | <nav>, .menu, .navbar, [role="navigation"] |
| Sidebars | <aside>, .sidebar, [role="complementary"] |
| Social | .share, .social, share buttons, follow widgets |
| Comments | #comments, .comments-section |
| Footers | <footer>, copyright notices |
| Popups | .modal, .overlay, .popup, cookie banners |
| Hidden | display:none, visibility:hidden, [hidden] |
| Metadata clutter | Bylines, read time, breadcrumbs, tags, TOC |
| Related content | "You might also like", "More stories", card grids |
| Newsletter CTAs | Subscribe forms, email signup blocks |
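The class and id examples in the table suggest a simple way to picture partial matching (the behavior that `--no-partial-selectors` turns off): treat each pattern as a substring to look for in an element's `class` and `id` attributes. The following is a minimal sketch under that assumption, not decruft's actual code. `PARTIAL_PATTERNS` and `matches_partial_pattern` are hypothetical names, and the five patterns are an illustrative subset.

```rust
/// Hypothetical subset of partial patterns, for illustration only.
const PARTIAL_PATTERNS: &[&str] = &["sidebar", "navbar", "adsense", "popup", "share"];

/// Returns true when the element's class or id attribute contains any
/// pattern as a case-insensitive substring.
fn matches_partial_pattern(class_attr: &str, id_attr: &str) -> bool {
    // Join both attributes so a single scan covers class and id.
    let haystack = format!("{} {}", class_attr, id_attr).to_lowercase();
    PARTIAL_PATTERNS.iter().any(|p| haystack.contains(p))
}
```

Substring matching is deliberately loose: `main-sidebar` and `newsletter-popup` both match, but so would an unrelated class that happens to contain `share`. That trade-off is one reason a flag to disable this pass is useful.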
Extraction pipeline
- Parse HTML and extract schema.org JSON-LD
- Extract metadata (title, author, date, etc.) via priority chains across meta tags, schema.org, and DOM
- Try site-specific extractors (GitHub, Reddit, Hacker News, X/Twitter, Substack, C2 Wiki, BBCode, AI chat conversations)
- Find main content element using scored entry-point selectors
- Standardize math, footnotes, callouts, and code blocks into canonical formats
- Remove ads, navigation, sidebars, and other clutter via CSS selectors
- Remove elements matching ~500 partial class/id patterns
- Score and remove non-content blocks (link-dense, nav indicators)
- Remove content patterns (bylines, read time, boilerplate, related posts)
- Standardize output (clean attributes, normalize headings, resolve URLs, deduplicate images)
- Retry with progressively relaxed filters if too little content was extracted
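Two of the steps above, scoring link-dense blocks and retrying with relaxed filters, can be sketched together. This is an illustrative model only: the `Block` struct, the threshold values, and the function names are assumptions rather than decruft's internals, and real scoring weighs more signals than link density alone.

```rust
/// A content block reduced to the two numbers this heuristic needs.
struct Block {
    text_len: usize,      // total characters of text in the block
    link_text_len: usize, // characters that sit inside <a> elements
}

/// Fraction of a block's text that is link text (0.0 for empty blocks).
fn link_density(block: &Block) -> f64 {
    if block.text_len == 0 {
        0.0
    } else {
        block.link_text_len as f64 / block.text_len as f64
    }
}

/// Keeps blocks whose link density is at or below `max_density`
/// and returns the total length of the kept text.
fn extract_len(blocks: &[Block], max_density: f64) -> usize {
    blocks
        .iter()
        .filter(|b| link_density(b) <= max_density)
        .map(|b| b.text_len)
        .sum()
}

/// Tries each threshold in order (strictest first) and stops at the first
/// one that yields at least `min_len` characters. Returns the threshold
/// used and the extracted length, or the last attempt if none was enough.
fn extract_with_retry(
    blocks: &[Block],
    thresholds: &[f64],
    min_len: usize,
) -> Option<(f64, usize)> {
    let mut last = None;
    for &t in thresholds {
        let len = extract_len(blocks, t);
        last = Some((t, len));
        if len >= min_len {
            break;
        }
    }
    last
}
```

With blocks of density 0.05, 0.6, and 1.0, a strict threshold of 0.25 keeps only the first block. If that falls short of `min_len`, the next threshold admits the second block as well, which is the "progressively relaxed" behavior in miniature.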
Metadata priority chains
Each metadata field is extracted using a fallback chain:
- Title: og:title > twitter:title > schema.org headline > <meta name="title"> > <title>
- Author: <meta name="author"> > schema.org author.name > [itemprop="author"] > .author
- Published: schema.org datePublished > article:published_time > <time> element
- Description: <meta name="description"> > og:description > twitter:description > schema.org
- Image: og:image > twitter:image > schema.org image
- Language: <html lang> > content-language meta > og:locale
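A fallback chain of this kind amounts to "take the first source that has a usable value". The sketch below shows the title chain under stated assumptions: `TitleSources`, `first_present`, and `resolve_title` are hypothetical names, and treating whitespace-only values as missing is a choice made here, not something decruft is confirmed to do.

```rust
/// Hypothetical bag of raw title candidates gathered from a page.
#[derive(Default)]
struct TitleSources<'a> {
    og_title: Option<&'a str>,
    twitter_title: Option<&'a str>,
    schema_headline: Option<&'a str>,
    meta_title: Option<&'a str>,
    title_tag: Option<&'a str>,
}

/// Returns the first candidate that is present and non-empty after trimming.
fn first_present<'a>(candidates: &[Option<&'a str>]) -> Option<&'a str> {
    candidates
        .iter()
        .flatten()
        .copied()
        .map(str::trim)
        .find(|v| !v.is_empty())
}

/// Walks the title chain in the priority order listed above.
fn resolve_title<'a>(s: &TitleSources<'a>) -> Option<&'a str> {
    first_present(&[
        s.og_title,
        s.twitter_title,
        s.schema_headline,
        s.meta_title,
        s.title_tag,
    ])
}
```

Keeping the priority order in one array makes the chain easy to read against the list above, and the same `first_present` helper would serve the author, description, and image chains.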
Acknowledgements
Inspired by defuddle by Steph Ango. The selector lists, scoring heuristics, and extraction pipeline are adapted from defuddle's approach.
License
MIT