# decruft
Extract clean, readable content from web pages. A Rust port of [defuddle](https://github.com/kepano/defuddle).
[![CI](https://github.com/jamtur01/decruft/actions/workflows/ci.yml/badge.svg)](https://github.com/jamtur01/decruft/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/decruft.svg)](https://crates.io/crates/decruft)
[![docs.rs](https://docs.rs/decruft/badge.svg)](https://docs.rs/decruft)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
## What it does
Given a noisy HTML page (ads, navigation, sidebars, popups, tracking pixels, cookie banners...), decruft extracts the main content and metadata:
- **Content** — the article/post body as clean HTML
- **Metadata** — title, author, published date, description, image, language, site name, favicon
- **Schema.org** — parsed JSON-LD data
## Install
```sh
cargo install decruft
```
Or add to your `Cargo.toml`:
```toml
[dependencies]
decruft = "0.1"
```
## CLI
```sh
# Fetch a URL (auto-detected)
decruft https://example.com/article

# From a file
decruft page.html --url https://example.com/page

# From stdin
cat page.html | decruft - --url https://example.com/page

# Output formats: json (default), html, text, markdown
decruft page.html -f html
decruft page.html -f text
decruft page.html -f markdown

# Debug mode (shows what was removed and why)
decruft page.html -d
```
### Options
```
Usage: decruft [OPTIONS] [INPUT]
Arguments:
[INPUT] Path to an HTML file to process. Use - for stdin [default: -]
Options:
-u, --url <URL> URL (for resolving relative URLs and metadata)
-s, --selector <SELECTOR> CSS selector to override content root detection
-f, --format <FORMAT> Output format: json, html, text, or markdown [default: json]
-d, --debug Include removal details in output
--no-images Strip all images
--no-exact-selectors Disable exact CSS selector removal
--no-partial-selectors Disable partial class/id pattern removal
--no-hidden Disable hidden element removal
--no-scoring Disable content scoring removal
--no-patterns Disable content pattern removal
--no-standardize Disable content standardization
--no-replies Exclude replies/comments from extracted content
-h, --help Print help
-V, --version Print version
```
## Library
### Quick start
```rust
// One-liner with defaults
let result = decruft::parse_with_defaults("<html><body><article><p>Content</p></article></body></html>");
assert!(result.content.contains("Content"));
```
### With options
```rust
use decruft::{parse, DecruftOptions};
let html = r#"<html>
<head><title>My Article - Blog Name</title></head>
<body>
<nav><a href="/">Home</a></nav>
<article>
<h1>My Article</h1>
<p>The actual content you want.</p>
</article>
<footer>Copyright 2025</footer>
</body>
</html>"#;
let mut options = DecruftOptions::default();
options.url = Some("https://example.com/article".into());
let result = parse(html, &options);
assert_eq!(result.title, "My Article - Blog Name");
assert!(result.content.contains("actual content"));
assert!(!result.content.contains("Copyright"));
```
### What gets removed
| Category | Examples |
|---|---|
| **Ads** | `.ad`, `[data-ad-wrapper]`, `.adsense`, `.promo` |
| **Navigation** | `<nav>`, `.menu`, `.navbar`, `[role="navigation"]` |
| **Sidebars** | `<aside>`, `.sidebar`, `[role="complementary"]` |
| **Social** | `.share`, `.social`, share buttons, follow widgets |
| **Comments** | `#comments`, `.comments-section` |
| **Footers** | `<footer>`, copyright notices |
| **Popups** | `.modal`, `.overlay`, `.popup`, cookie banners |
| **Hidden** | `display:none`, `visibility:hidden`, `[hidden]` |
| **Metadata clutter** | Bylines, read time, breadcrumbs, tags, TOC |
| **Related content** | "You might also like", "More stories", card grids |
| **Newsletter CTAs** | Subscribe forms, email signup blocks |
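To make the "Hidden" row concrete, here is a minimal sketch of how hidden-element detection can work on a raw `style` attribute plus the `hidden` attribute. This is illustrative only, not decruft's internal code; the function name and threshold-free logic are assumptions.

```rust
// Illustrative sketch: detect elements hidden via inline styles or the
// `hidden` attribute. Not decruft's actual implementation.
fn is_hidden(style: &str, has_hidden_attr: bool) -> bool {
    // Normalize: lowercase and strip whitespace so "display: NONE"
    // matches "display:none".
    let s: String = style.to_ascii_lowercase().split_whitespace().collect();
    has_hidden_attr
        || s.contains("display:none")
        || s.contains("visibility:hidden")
}

fn main() {
    assert!(is_hidden("display: none;", false));
    assert!(is_hidden("", true));
    assert!(!is_hidden("color: red;", false));
}
```

A real pass would also have to consider computed styles and ancestor visibility, which a string check alone cannot capture.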
### Extraction pipeline
1. Parse HTML and extract schema.org JSON-LD
2. Extract metadata (title, author, date, etc.) via priority chains across meta tags, schema.org, and DOM
3. Try site-specific extractors (GitHub, Reddit, Hacker News, X/Twitter, Substack, C2 Wiki, BBCode, AI chat conversations)
4. Find main content element using scored entry-point selectors
5. Standardize math, footnotes, callouts, and code blocks into canonical formats
6. Remove ads, navigation, sidebars, and other clutter via CSS selectors
7. Remove elements matching ~500 partial class/id patterns
8. Score and remove non-content blocks (link-dense, nav indicators)
9. Remove content patterns (bylines, read time, boilerplate, related posts)
10. Standardize output (clean attributes, normalize headings, resolve URLs, deduplicate images)
11. Retry with progressively relaxed filters if too little content was extracted
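Step 11's retry idea can be sketched as a loop over decreasing strictness levels: run extraction with the full filter set, and if too little content survives, rerun with filters relaxed. The names, levels, and threshold below are illustrative assumptions, not decruft's API.

```rust
// Hedged sketch of progressive filter relaxation (step 11).
// `extract_at` stands in for the real pipeline: higher strictness
// removes more content.
const MIN_CONTENT_LEN: usize = 200; // assumed minimum-content threshold

fn extract_at(strictness: u8, html: &str) -> String {
    match strictness {
        2 => String::new(),                  // strictest: everything filtered
        1 => html.chars().take(50).collect(), // partial result
        _ => html.to_string(),               // most relaxed: keep everything
    }
}

fn extract_with_retry(html: &str) -> String {
    // Try strictest first, relaxing until enough content survives.
    for strictness in (0..=2).rev() {
        let content = extract_at(strictness, html);
        if content.len() >= MIN_CONTENT_LEN {
            return content;
        }
    }
    extract_at(0, html) // fall back to the most relaxed pass
}

fn main() {
    let html = "x".repeat(300);
    assert_eq!(extract_with_retry(&html).len(), 300);
}
```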
## Metadata priority chains
Each metadata field is extracted using a fallback chain:
- **Title**: `og:title` > `twitter:title` > schema.org `headline` > `<meta name="title">` > `<title>`
- **Author**: `<meta name="author">` > schema.org `author.name` > `[itemprop="author"]` > `.author`
- **Published**: schema.org `datePublished` > `article:published_time` > `<time>` element
- **Description**: `<meta name="description">` > `og:description` > `twitter:description` > schema.org
- **Image**: `og:image` > `twitter:image` > schema.org `image`
- **Language**: `<html lang>` > `content-language` meta > `og:locale`
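These chains map naturally onto `Option` combinators: try each source in priority order and take the first usable value. The struct and field names below are illustrative, not decruft's internal types; the title chain is shown as an example.

```rust
// Hedged sketch of a metadata fallback chain using Option::or.
// Candidate sources for the title, in priority order (names assumed).
struct MetaSources<'a> {
    og_title: Option<&'a str>,
    twitter_title: Option<&'a str>,
    schema_headline: Option<&'a str>,
    meta_title: Option<&'a str>,
    title_tag: Option<&'a str>,
}

fn resolve_title(m: &MetaSources) -> Option<String> {
    m.og_title
        .or(m.twitter_title)
        .or(m.schema_headline)
        .or(m.meta_title)
        .or(m.title_tag)
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty()) // treat blank values as missing
}

fn main() {
    let m = MetaSources {
        og_title: None,
        twitter_title: None,
        schema_headline: Some("My Article"),
        meta_title: Some("ignored"),
        title_tag: None,
    };
    assert_eq!(resolve_title(&m), Some("My Article".to_string()));
}
```

One caveat of plain `.or()` chaining: a present-but-empty higher-priority source wins before the emptiness filter runs; a production chain would filter each candidate individually.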
## Acknowledgements
Inspired by [defuddle](https://github.com/kepano/defuddle) by Steph Ango. The selector lists, scoring heuristics, and extraction pipeline are adapted from defuddle's approach.
Test fixtures include pages from defuddle's test suite and Mozilla's [Readability.js](https://github.com/mozilla/readability) test suite (via [readabilityrs](https://github.com/theiskaa/readabilityrs)).
## License
MIT