Crate decruft

§decruft

Extract clean, readable content from web pages.

Given a noisy HTML page (ads, navigation, sidebars, popups, cookie banners…), decruft extracts the main content and metadata.

§Quick start

use decruft::{parse, DecruftOptions};

let html = r#"<html>
  <head><title>My Post - Blog</title></head>
  <body>
    <nav><a href="/">Home</a></nav>
    <article><h1>My Post</h1><p>The content.</p></article>
    <footer>Copyright 2025</footer>
  </body>
</html>"#;

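// Run the extraction pipeline with the default options.
let result = parse(html, &DecruftOptions::default());
// The article body is kept; the footer boilerplate is dropped.
assert!(result.content.contains("The content."));
assert!(!result.content.contains("Copyright"));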

Or, when the default options are enough, use parse_with_defaults:

let html = "<html><body><article><p>Hello</p></article></body></html>";
let result = decruft::parse_with_defaults(html);
assert!(result.content.contains("Hello"));

§Structs

DebugInfo
Debug information about the extraction process.
DecruftOptions
Options for configuring the decruft extraction pipeline.
DecruftResult
Result of the decruft extraction pipeline.
MetaTag
A meta tag from the page.
Removal
A record of a removed element.

§Enums

FetchError
Error returned by fetch_page.

§Functions

fetch_page
Fetch a web page with a 30-second timeout and a browser-like User-Agent header.
parse
Parse HTML and extract clean, readable content.
parse_with_defaults
Parse HTML with default options.
strip_html_tags
Strip HTML tags and decode common HTML entities, producing plain text suitable for display.
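
To make the strip_html_tags description concrete, here is a minimal standard-library sketch of the general technique: drop everything between angle brackets, then decode a few named entities. It is not decruft's implementation, and the signature of the real function is not shown on this page. The names strip_tags_sketch and decode_entities_sketch and the specific entity list are assumptions made for illustration.

```rust
/// Hypothetical stand-in, NOT decruft's implementation: removes anything
/// between '<' and '>' and then decodes a handful of common entities.
fn strip_tags_sketch(input: &str) -> String {
    let mut text = String::with_capacity(input.len());
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            // An opening bracket starts a tag; its content is skipped.
            '<' => in_tag = true,
            // A closing bracket ends the tag we are currently inside.
            '>' if in_tag => in_tag = false,
            // Characters outside tags (including a stray '>') are kept.
            _ if !in_tag => text.push(ch),
            // Characters inside a tag are discarded.
            _ => {}
        }
    }
    decode_entities_sketch(&text)
}

/// Decodes a small, assumed set of entities. "&amp;" is handled last so
/// that input such as "&amp;lt;" becomes "&lt;" rather than "<".
fn decode_entities_sketch(text: &str) -> String {
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&nbsp;", " ")
        .replace("&amp;", "&")
}
```

Under these assumptions, strip_tags_sketch("<p>Fish &amp; <b>chips</b> &lt;3</p>") returns "Fish & chips <3". A character scan like this does not handle comments, script or style bodies, or attribute values that contain '>', which is one reason to prefer the crate's own function over a hand-rolled version.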