decruft 0.1.3

Extract clean, readable content from web pages. A Rust port of defuddle.


What it does

Given a noisy HTML page (ads, navigation, sidebars, popups, tracking pixels, cookie banners...), decruft extracts the main content and metadata:

  • Content — the article/post body as clean HTML
  • Metadata — title, author, published date, description, image, language, site name, favicon
  • Schema.org — parsed JSON-LD data

Install

cargo install decruft

Or add to your Cargo.toml:

[dependencies]
decruft = "0.1"

CLI

# Fetch a URL (auto-detected)
decruft https://example.com/article

# From file
decruft page.html --url https://example.com/page

# From stdin
cat page.html | decruft --url https://example.com

# Output formats: json (default), html, text, markdown
decruft page.html -f html
decruft page.html -f text
decruft page.html -f markdown

# Debug mode (shows what was removed and why)
decruft page.html --debug | jq '.debug.removals'

Options

Usage: decruft [OPTIONS] [INPUT]

Arguments:
  [INPUT]  Path to an HTML file to process. Use - for stdin [default: -]

Options:
  -u, --url <URL>             URL (for resolving relative URLs and metadata)
  -s, --selector <SELECTOR>   CSS selector to override content root detection
  -f, --format <FORMAT>       Output format: json, html, text, or markdown [default: json]
  -d, --debug                 Include removal details in output
      --no-images             Strip all images
      --no-exact-selectors    Disable exact CSS selector removal
      --no-partial-selectors  Disable partial class/id pattern removal
      --no-hidden             Disable hidden element removal
      --no-scoring            Disable content scoring removal
      --no-patterns           Disable content pattern removal
      --no-standardize        Disable content standardization
      --no-replies            Exclude replies/comments from extracted content
  -h, --help                  Print help
  -V, --version               Print version

Library

Quick start

// One-liner with defaults
let result = decruft::parse_with_defaults("<html><body><article><p>Content</p></article></body></html>");
assert!(result.content.contains("Content"));

With options

use decruft::{parse, DecruftOptions};

let html = r#"<html>
  <head><title>My Article - Blog Name</title></head>
  <body>
    <nav><a href="/">Home</a></nav>
    <article>
      <h1>My Article</h1>
      <p>The actual content you want.</p>
    </article>
    <footer>Copyright 2025</footer>
  </body>
</html>"#;

let mut options = DecruftOptions::default();
options.url = Some("https://example.com/article".into());

let result = parse(html, &options);

assert_eq!(result.title, "My Article - Blog Name");
assert!(result.content.contains("actual content"));
assert!(!result.content.contains("Copyright"));

What gets removed

  Category          Examples
  Ads               .ad, [data-ad-wrapper], .adsense, .promo
  Navigation        <nav>, .menu, .navbar, [role="navigation"]
  Sidebars          <aside>, .sidebar, [role="complementary"]
  Social            .share, .social, share buttons, follow widgets
  Comments          #comments, .comments-section
  Footers           <footer>, copyright notices
  Popups            .modal, .overlay, .popup, cookie banners
  Hidden            display:none, visibility:hidden, [hidden]
  Metadata clutter  Bylines, read time, breadcrumbs, tags, TOC
  Related content   "You might also like", "More stories", card grids
  Newsletter CTAs   Subscribe forms, email signup blocks
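
Partial matching flags an element when any of its class or id tokens contains a blocked substring. A minimal sketch of the idea in plain Rust (the names `PARTIAL_PATTERNS` and `looks_like_clutter` are illustrative, not decruft's internals, and the real pattern list has ~500 entries):

```rust
/// A few illustrative substring patterns that flag clutter.
const PARTIAL_PATTERNS: &[&str] = &["sidebar", "promo", "newsletter", "share"];

/// Returns true if any class token or the id contains a blocked substring.
fn looks_like_clutter(class_attr: &str, id_attr: &str) -> bool {
    class_attr
        .split_whitespace()
        .chain(std::iter::once(id_attr))
        .any(|token| {
            let token = token.to_ascii_lowercase();
            PARTIAL_PATTERNS.iter().any(|p| token.contains(*p))
        })
}

fn main() {
    assert!(looks_like_clutter("widget promo-box", ""));
    assert!(!looks_like_clutter("article-body", "main"));
}
```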

Extraction pipeline

  1. Parse HTML and extract schema.org JSON-LD
  2. Extract metadata (title, author, date, etc.) via priority chains across meta tags, schema.org, and DOM
  3. Try site-specific extractors (GitHub, Reddit, Hacker News, X/Twitter, Substack, C2 Wiki, BBCode, AI chat conversations)
  4. Find main content element using scored entry-point selectors
  5. Standardize math, footnotes, callouts, and code blocks into canonical formats
  6. Remove ads, navigation, sidebars, and other clutter via CSS selectors
  7. Remove elements matching ~500 partial class/id patterns
  8. Score and remove non-content blocks (link-dense, nav indicators)
  9. Remove content patterns (bylines, read time, boilerplate, related posts)
  10. Standardize output (clean attributes, normalize headings, resolve URLs, deduplicate images)
  11. Retry with progressively relaxed filters if too little content was extracted
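
The link-density part of step 8 can be sketched as follows; the 0.5 threshold and both function names are illustrative assumptions, not decruft's actual tuning:

```rust
/// Ratio of link text to total text in a block. Blocks that are mostly
/// links (nav bars, tag clouds, related-post lists) score high.
fn link_density(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 1.0; // an empty block carries no content
    }
    link_text_len as f64 / text_len as f64
}

/// Illustrative threshold: treat a block as non-content when more than
/// half of its text sits inside links.
fn is_link_dense(text_len: usize, link_text_len: usize) -> bool {
    link_density(text_len, link_text_len) > 0.5
}

fn main() {
    assert!(is_link_dense(100, 80));  // nav-like block: 80% link text
    assert!(!is_link_dense(500, 40)); // paragraph with a few inline links
}
```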

Metadata priority chains

Each metadata field is extracted using a fallback chain:

  • Title: og:title > twitter:title > schema.org headline > <meta name="title"> > <title>
  • Author: <meta name="author"> > schema.org author.name > [itemprop="author"] > .author
  • Published: schema.org datePublished > article:published_time > <time> element
  • Description: <meta name="description"> > og:description > twitter:description > schema.org
  • Image: og:image > twitter:image > schema.org image
  • Language: <html lang> > content-language meta > og:locale
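
A fallback chain like the ones above maps naturally onto `Option` iteration: walk the candidates in priority order and take the first non-empty value. A sketch (the function `first_non_empty` is a hypothetical helper, not part of decruft's API):

```rust
/// Pick the first candidate that yields a non-empty value, in priority order.
fn first_non_empty(candidates: &[Option<&str>]) -> Option<String> {
    candidates
        .iter()
        .flatten()                    // skip missing sources (None)
        .map(|s| s.trim())
        .find(|s| !s.is_empty())      // skip blank values
        .map(str::to_string)
}

fn main() {
    // og:title missing, twitter:title blank, <title> present.
    let title = first_non_empty(&[None, Some("  "), Some("My Article - Blog Name")]);
    assert_eq!(title.as_deref(), Some("My Article - Blog Name"));
}
```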

Acknowledgements

Inspired by defuddle by Steph Ango. The selector lists, scoring heuristics, and extraction pipeline are adapted from defuddle's approach.

Test fixtures include pages from defuddle's test suite and Mozilla's Readability.js test suite (via readabilityrs).

License

MIT