§decruft
Extract clean, readable content from web pages.
Given a noisy HTML page (ads, navigation, sidebars, popups, cookie banners…), decruft extracts the main content and metadata.
§Quick start
use decruft::{parse, DecruftOptions};
let html = r#"<html>
<head><title>My Post - Blog</title></head>
<body>
<nav><a href="/">Home</a></nav>
<article><h1>My Post</h1><p>The content.</p></article>
<footer>Copyright 2025</footer>
</body>
</html>"#;
let result = parse(html, &DecruftOptions::default());
assert!(result.content.contains("The content."));
assert!(!result.content.contains("Copyright"));

Or even simpler with parse_with_defaults:
let html = "<html><body><article><p>Hello</p></article></body></html>";
let result = decruft::parse_with_defaults(html);
assert!(result.content.contains("Hello"));

Structs§
- DebugInfo - Debug information about the extraction process.
- DecruftOptions - Options for configuring the decruft extraction pipeline.
- DecruftResult - Result of the decruft extraction pipeline.
- MetaTag - A meta tag from the page.
- Removal - A record of a removed element.
Enums§
- FetchError - Error returned by fetch_page.
Functions§
- fetch_page - Fetch a web page (30s timeout, browser-like UA).
- parse - Parse HTML and extract clean, readable content.
- parse_with_defaults - Parse HTML with default options.
- strip_html_tags - Strip HTML tags and decode common HTML entities, producing plain text suitable for display.
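
To make the description of strip_html_tags concrete, here is a minimal, standard-library-only sketch of the general idea: drop everything between `<` and `>`, then decode a few named entities. It is not decruft's implementation and does not call the crate. The function name `strip_tags_sketch` and the particular entity list are assumptions chosen for illustration, and the real function may handle many more cases (comments, scripts, numeric entities).

```rust
/// Hypothetical helper for illustration only; NOT decruft's `strip_html_tags`.
/// Removes anything between `<` and `>` and decodes a small, assumed set of
/// named HTML entities.
fn strip_tags_sketch(input: &str) -> String {
    let mut text = String::with_capacity(input.len());
    let mut in_tag = false;

    for ch in input.chars() {
        match ch {
            // Entering a tag: stop copying characters.
            '<' => in_tag = true,
            // Leaving a tag: resume copying after this character.
            '>' if in_tag => in_tag = false,
            // Ordinary text outside a tag is kept as-is.
            _ if !in_tag => text.push(ch),
            // Characters inside a tag are dropped.
            _ => {}
        }
    }

    // Decode `&amp;` last so that `&amp;lt;` becomes `&lt;`, not `<`.
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&nbsp;", " ")
        .replace("&amp;", "&")
}

fn main() {
    let plain = strip_tags_sketch("<p>Fish &amp; <b>chips</b> &lt;3</p>");
    println!("{plain}");
}
```

A character-level scan like this is enough for short display strings such as titles or excerpts. It does not understand `<script>` bodies, comments, or attribute values that contain `>`, so it is no substitute for a real HTML parser.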