Crate decruft

§decruft

Extract clean, readable content from web pages.

Given a noisy HTML page (ads, navigation, sidebars, popups, cookie banners…), decruft extracts the main content and metadata.

§Quick start

use decruft::{parse, DecruftOptions};

let html = r#"<html>
  <head><title>My Post - Blog</title></head>
  <body>
    <nav><a href="/">Home</a></nav>
    <article><h1>My Post</h1><p>The content.</p></article>
    <footer>Copyright 2025</footer>
  </body>
</html>"#;

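// Extract the main content using the default options.
let result = parse(html, &DecruftOptions::default());
// The article text is kept; boilerplate such as the footer is dropped.
assert!(result.content.contains("The content."));
assert!(!result.content.contains("Copyright"));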

Or, more simply, use parse_with_defaults:

let html = "<html><body><article><p>Hello</p></article></body></html>";
let result = decruft::parse_with_defaults(html);
assert!(result.content.contains("Hello"));

Structs§

DebugInfo: Debug information about the extraction process.
DecruftOptions: Options for configuring the decruft extraction pipeline.
DecruftResult: Result of the decruft extraction pipeline.
MetaTag: A meta tag from the page.
Removal: A record of a removed element.

Enums§

FetchError: Error returned by fetch_page.

Functions§

fetch_page: Fetch a web page (30-second timeout, browser-like User-Agent).
parse: Parse HTML and extract clean, readable content.
parse_with_defaults: Parse HTML with default options.
strip_html_tags: Strip HTML tags and decode HTML entities, producing plain text.
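
To make "strip HTML tags and decode HTML entities" concrete, here is a deliberately naive, standalone sketch of that kind of transformation. It is not decruft's implementation and does not call the crate: naive_strip_tags is a hypothetical name introduced only for illustration, it decodes just five entities, and it ignores comments, script content, and a '>' inside attribute values. Prefer the crate's own strip_html_tags for real input.

```rust
/// Hypothetical helper, NOT part of decruft: drops everything between
/// '<' and '>' and decodes a handful of common entities.
fn naive_strip_tags(input: &str) -> String {
    let mut text = String::with_capacity(input.len());
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            // Entering a tag: stop copying characters.
            '<' => in_tag = true,
            // Leaving a tag: resume copying characters.
            '>' if in_tag => in_tag = false,
            // Ordinary text outside any tag is kept.
            _ if !in_tag => text.push(ch),
            // Characters inside a tag are discarded.
            _ => {}
        }
    }
    // Decode "&amp;" last so that "&amp;lt;" becomes "&lt;" rather than "<".
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}
```

With this sketch, naive_strip_tags("<p>Fish &amp; <b>chips</b></p>") yields "Fish & chips". A real implementation needs an HTML parser to handle malformed markup and the full entity table.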
