decruft 0.1.3

Extract clean, readable content from web pages. A Rust port of defuddle.


What it does

Given a noisy HTML page (ads, navigation, sidebars, popups, tracking pixels, cookie banners...), decruft extracts the main content and metadata:

  • Content — the article/post body as clean HTML
  • Metadata — title, author, published date, description, image, language, site name, favicon
  • Schema.org — parsed JSON-LD data

Install

cargo install decruft

Or add to your Cargo.toml:

[dependencies]
decruft = "0.1"

CLI

# Fetch a URL (auto-detected)
decruft https://example.com/article

# From file
decruft page.html --url https://example.com/page

# From stdin
cat page.html | decruft --url https://example.com

# Output formats: json (default), html, text, markdown
decruft page.html -f html
decruft page.html -f text
decruft page.html -f markdown

# Debug mode (shows what was removed and why)
decruft page.html --debug | jq '.debug.removals'

Options

Usage: decruft [OPTIONS] [INPUT]

Arguments:
  [INPUT]  Path to an HTML file to process. Use - for stdin [default: -]

Options:
  -u, --url <URL>             URL (for resolving relative URLs and metadata)
  -s, --selector <SELECTOR>   CSS selector to override content root detection
  -f, --format <FORMAT>       Output format: json, html, text, or markdown [default: json]
  -d, --debug                 Include removal details in output
      --no-images             Strip all images
      --no-exact-selectors    Disable exact CSS selector removal
      --no-partial-selectors  Disable partial class/id pattern removal
      --no-hidden             Disable hidden element removal
      --no-scoring            Disable content scoring removal
      --no-patterns           Disable content pattern removal
      --no-standardize        Disable content standardization
      --no-replies            Exclude replies/comments from extracted content
  -h, --help                  Print help
  -V, --version               Print version

Library

Quick start

// One-liner with defaults
let result = decruft::parse_with_defaults("<html><body><article><p>Content</p></article></body></html>");
assert!(result.content.contains("Content"));

With options

use decruft::{parse, DecruftOptions};

let html = r#"<html>
  <head><title>My Article - Blog Name</title></head>
  <body>
    <nav><a href="/">Home</a></nav>
    <article>
      <h1>My Article</h1>
      <p>The actual content you want.</p>
    </article>
    <footer>Copyright 2025</footer>
  </body>
</html>"#;

let mut options = DecruftOptions::default();
options.url = Some("https://example.com/article".into());

let result = parse(html, &options);

assert_eq!(result.title, "My Article - Blog Name");
assert!(result.content.contains("actual content"));
assert!(!result.content.contains("Copyright"));

What gets removed

  Category          Examples
  Ads               .ad, [data-ad-wrapper], .adsense, .promo
  Navigation        <nav>, .menu, .navbar, [role="navigation"]
  Sidebars          <aside>, .sidebar, [role="complementary"]
  Social            .share, .social, share buttons, follow widgets
  Comments          #comments, .comments-section
  Footers           <footer>, copyright notices
  Popups            .modal, .overlay, .popup, cookie banners
  Hidden            display:none, visibility:hidden, [hidden]
  Metadata clutter  Bylines, read time, breadcrumbs, tags, TOC
  Related content   "You might also like", "More stories", card grids
  Newsletter CTAs   Subscribe forms, email signup blocks
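
Partial matching flags an element when any of its class or id tokens contains a blocked substring. A minimal sketch of the idea in plain Rust (the names `PARTIAL_PATTERNS` and `looks_like_clutter` are illustrative, not decruft's internals, and the real pattern list has ~500 entries):

```rust
/// A few illustrative substring patterns that flag clutter.
const PARTIAL_PATTERNS: &[&str] = &["sidebar", "promo", "newsletter", "share"];

/// Returns true if any class token or the id contains a blocked substring.
fn looks_like_clutter(class_attr: &str, id_attr: &str) -> bool {
    class_attr
        .split_whitespace()
        .chain(std::iter::once(id_attr))
        .any(|token| {
            let token = token.to_ascii_lowercase();
            PARTIAL_PATTERNS.iter().any(|p| token.contains(*p))
        })
}

fn main() {
    assert!(looks_like_clutter("widget promo-box", ""));
    assert!(!looks_like_clutter("article-body", "main"));
}
```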

Extraction pipeline

  1. Parse HTML and extract schema.org JSON-LD
  2. Extract metadata (title, author, date, etc.) via priority chains across meta tags, schema.org, and DOM
  3. Try site-specific extractors (GitHub, Reddit, Hacker News, X/Twitter, Substack, C2 Wiki, BBCode, AI chat conversations)
  4. Find main content element using scored entry-point selectors
  5. Standardize math, footnotes, callouts, and code blocks into canonical formats
  6. Remove ads, navigation, sidebars, and other clutter via CSS selectors
  7. Remove elements matching ~500 partial class/id patterns
  8. Score and remove non-content blocks (link-dense, nav indicators)
  9. Remove content patterns (bylines, read time, boilerplate, related posts)
  10. Standardize output (clean attributes, normalize headings, resolve URLs, deduplicate images)
  11. Retry with progressively relaxed filters if too little content was extracted
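
The link-density part of step 8 can be sketched as follows; the 0.5 threshold and both function names are illustrative assumptions, not decruft's actual tuning:

```rust
/// Ratio of link text to total text in a block. Blocks that are mostly
/// links (nav bars, tag clouds, related-post lists) score high.
fn link_density(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 1.0; // an empty block carries no content
    }
    link_text_len as f64 / text_len as f64
}

/// Illustrative threshold: treat a block as non-content when more than
/// half of its text sits inside links.
fn is_link_dense(text_len: usize, link_text_len: usize) -> bool {
    link_density(text_len, link_text_len) > 0.5
}

fn main() {
    assert!(is_link_dense(100, 80));  // nav-like block: 80% link text
    assert!(!is_link_dense(500, 40)); // paragraph with a few inline links
}
```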

Metadata priority chains

Each metadata field is extracted using a fallback chain:

  • Title: og:title > twitter:title > schema.org headline > <meta name="title"> > <title>
  • Author: <meta name="author"> > schema.org author.name > [itemprop="author"] > .author
  • Published: schema.org datePublished > article:published_time > <time> element
  • Description: <meta name="description"> > og:description > twitter:description > schema.org
  • Image: og:image > twitter:image > schema.org image
  • Language: <html lang> > content-language meta > og:locale
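
A fallback chain like the ones above maps naturally onto `Option` iteration: walk the candidates in priority order and take the first non-empty value. A sketch (the function `first_non_empty` is a hypothetical helper, not part of decruft's API):

```rust
/// Pick the first candidate that yields a non-empty value, in priority order.
fn first_non_empty(candidates: &[Option<&str>]) -> Option<String> {
    candidates
        .iter()
        .flatten()                    // skip missing sources (None)
        .map(|s| s.trim())
        .find(|s| !s.is_empty())      // skip blank values
        .map(str::to_string)
}

fn main() {
    // og:title missing, twitter:title blank, <title> present.
    let title = first_non_empty(&[None, Some("  "), Some("My Article - Blog Name")]);
    assert_eq!(title.as_deref(), Some("My Article - Blog Name"));
}
```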

Acknowledgements

Inspired by defuddle by Steph Ango. The selector lists, scoring heuristics, and extraction pipeline are adapted from defuddle's approach.

Test fixtures include pages from defuddle's test suite and Mozilla's Readability.js test suite (via readabilityrs).

License

MIT