# ReadabilityRS
readabilityrs extracts article content from HTML web pages using Mozilla's Readability algorithm. The library identifies and isolates the main article text, removing navigation, advertisements, and other clutter. It also extracts metadata, including the title, author, and publication date, and generates an excerpt.
This is a Rust port of [Mozilla's Readability.js](https://github.com/mozilla/readability), which powers Firefox's Reader View. The implementation passes 93.8% of Mozilla's test suite (122/130 tests) with full document preprocessing support. Built in Rust for performance, memory safety, and zero-cost abstractions.
The library provides both DOM-based content extraction and metadata parsing from multiple sources, including JSON-LD, OpenGraph, Twitter Cards, and semantic HTML attributes. Content-scoring algorithms identify the main article container while filtering out sidebars, comments, and related-content sections. It supports standard web article formats, including news sites, blogs, documentation, and long-form content.
## Install
Add to your project:
```bash
cargo add readabilityrs
```
Or add it to your `Cargo.toml`:
```toml
[dependencies]
readabilityrs = "0.1.0"
```
## Usage
The library provides a simple API for parsing HTML documents. Create a `Readability` instance with HTML content, an optional base URL for resolving relative links, and optional configuration. Call `parse()` to extract the article.
```rust
use readabilityrs::Readability;

let html = r#"
<html>
  <head><title>Example Article</title></head>
  <body>
    <article>
      <h1>Article Title</h1>
      <p>This is the main article content.</p>
    </article>
  </body>
</html>
"#;

let readability = Readability::new(html, None, None)?;
if let Some(article) = readability.parse() {
    println!("Title: {}", article.title.unwrap_or_default());
    println!("Content: {}", article.content.unwrap_or_default());
    println!("Length: {} chars", article.length);
}
```
## Content Extraction
The library uses Mozilla's proven content-scoring algorithm to identify the main article container. Scoring considers element type (paragraph and article tags score higher than generic divs), text density (character count and comma frequency), link density (to filter navigation-heavy sections), and class-name patterns (positive indicators like "article" or "content" versus negative patterns like "sidebar" or "advertisement").
Document preprocessing removes scripts and styles before extraction, unwraps noscript tags to reveal lazy-loaded images, and replaces deprecated font tags with span elements. This preprocessing step improves extraction accuracy by 2.3 percentage points compared to parsing raw HTML. Content cleaning keeps data tables while removing layout tables, preserves images and videos within the article, and filters out hidden elements and empty paragraphs.
```rust
use readabilityrs::Readability;

// `fetch_article` is a placeholder for your own HTTP client; the library
// itself does not fetch pages.
let html = fetch_article("https://example.com/article");

let readability = Readability::new(&html, Some("https://example.com"), None)?;
if let Some(article) = readability.parse() {
    let clean_html = article.content.unwrap_or_default(); // sanitized article HTML
    let plain_text = article.text_content.unwrap_or_default(); // text with tags stripped
    let char_count = article.length;
}
```
## Metadata Extraction
Metadata extraction follows a priority chain starting with JSON-LD structured data, then OpenGraph meta tags, Twitter Cards, Dublin Core, and standard meta tags. Byline detection searches for rel="author" links, itemprop="author" elements, and common CSS classes like "byline" or "author". Title extraction removes site names using separator detection, handling both pipe and colon separators intelligently.
Excerpt generation selects the first substantial paragraph while filtering navigation menus, hatnotes, and other noise. The system skips paragraphs under 25 characters and validates content quality before selection. Language detection uses the html lang attribute or Content-Language meta tags. Publication time parsing supports ISO 8601 timestamps from article:published_time and datePublished fields.
```rust
use readabilityrs::Readability;

let readability = Readability::new(&html, None, None)?;
if let Some(article) = readability.parse() {
    println!("Title: {:?}", article.title);
    println!("Author: {:?}", article.byline);
    println!("Excerpt: {:?}", article.excerpt);
    println!("Published: {:?}", article.published_time);
}
```
## Configuration
Configuration controls parsing behavior through `ReadabilityOptions`. Debug mode enables detailed logging for troubleshooting extraction issues. Character thresholds determine minimum article length to accept. The builder pattern provides a fluent API for setting options including element parsing limits, candidate selection count, class preservation rules, and link density scoring adjustments.
```rust
use readabilityrs::{Readability, ReadabilityOptions};
let options = ReadabilityOptions::builder()
    .debug(true) // detailed logging for troubleshooting extraction
    .char_threshold(500) // minimum article length to accept
    .nb_top_candidates(5) // number of top-scoring candidates to consider
    .keep_classes(false)
    .classes_to_preserve(vec!["page".to_string()])
    .disable_json_ld(false)
    .link_density_modifier(0.0) // adjustment to link-density scoring
    .build();

let readability = Readability::new(&html, None, Some(options))?;
```
## URL Handling
Base URL resolution converts relative links to absolute URLs. Image sources, anchors, and embedded content maintain correct paths for display outside the original context. URL validation ensures proper format before parsing, returning errors rather than failing silently during extraction.
```rust
let readability = Readability::new(
    &html,
    Some("https://example.com/articles/2024/post"),
    None,
)?;

if let Some(article) = readability.parse() {
    // Relative URLs in the content have been resolved to absolute ones.
    println!("{}", article.content.unwrap_or_default());
}
```
## Error Handling
The library returns `Result` types for operations that can fail. Common errors include invalid URLs, malformed HTML, and parsing failures. The `NoContentFound` error indicates the algorithm could not identify article content.
```rust
use readabilityrs::{Readability, error::ReadabilityError};
fn extract_article(html: &str, url: &str) -> Result<String, ReadabilityError> {
    let readability = Readability::new(html, Some(url), None)?;
    let article = readability
        .parse()
        .ok_or(ReadabilityError::NoContentFound)?;
    Ok(article.content.unwrap_or_default())
}
```
## Algorithm Details
The scoring system assigns points to elements by tag type: `article` and `section` tags receive 8 points, paragraph tags receive 5, and `div` tags receive 2-5 depending on whether they contain block-level children. Class and ID patterns add or subtract 25 points based on positive keywords like "article" and "content" versus negative keywords like "sidebar" and "comment". Content metrics include comma count, as a signal of substantial text, and character-length bonuses of up to 3 points for paragraphs over 300 characters. Link-density penalties reduce scores for navigation-heavy sections.
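As a rough illustration of the arithmetic, the sketch below combines the tag and class weights described above. The function, its signature, and the substring matching are illustrative only, not the library's internal API:
```rust
// Illustrative scoring arithmetic only; not the library's internal API.
// The weights follow the description in this section.
fn base_score(tag: &str, class_attr: &str) -> i32 {
    let mut score = match tag {
        "article" | "section" => 8,
        "p" => 5,
        "div" => 2, // up to 5 when the div contains block-level children
        _ => 0,
    };
    // Class/ID patterns add or subtract 25 points.
    if class_attr.contains("article") || class_attr.contains("content") {
        score += 25;
    }
    if class_attr.contains("sidebar") || class_attr.contains("comment") {
        score -= 25;
    }
    score
}

fn main() {
    assert_eq!(base_score("div", "article-body"), 27);
    assert_eq!(base_score("div", "sidebar"), -23);
}
```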
Document preprocessing occurs before content extraction in a fixed sequence. Scripts and styles are removed completely to eliminate noise, font tags are replaced with span elements for consistent parsing, noscript tags are unwrapped to reveal lazy-loaded images that would otherwise remain hidden, and form elements are removed because they rarely contribute to article content. After preprocessing, the HTML is reparsed to produce a clean DOM for extraction.
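The order of these steps can be sketched as a simple pipeline. The `Step` enum is a hypothetical stand-in for illustration; the library performs these operations internally on a real DOM:
```rust
// Hypothetical illustration of the preprocessing order described above;
// the library applies these steps internally, not via this type.
#[derive(Debug)]
enum Step {
    RemoveScriptsAndStyles,
    ReplaceFontWithSpan,
    UnwrapNoscript,
    RemoveForms,
    Reparse, // reparsing happens last, on the cleaned markup
}

fn main() {
    let pipeline = [
        Step::RemoveScriptsAndStyles,
        Step::ReplaceFontWithSpan,
        Step::UnwrapNoscript,
        Step::RemoveForms,
        Step::Reparse,
    ];
    for (i, step) in pipeline.iter().enumerate() {
        println!("{}. {:?}", i + 1, step);
    }
}
```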
Metadata extraction follows a strict priority chain where JSON-LD structured data takes highest priority, followed by OpenGraph meta tags, then Twitter Card meta tags, Dublin Core meta tags, and standard meta tags. Later sources only fill missing fields without overriding earlier sources. DOM-based byline extraction can override meta tags when confidence is high, such as finding rel="author" links or itemprop="author" elements with author-like text content.
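The fill-don't-override merge can be pictured with a small sketch. The `Metadata` struct and `merge` function here are hypothetical stand-ins for illustration, not the library's types:
```rust
// Hypothetical sketch of the priority-chain merge described above:
// a later source only fills fields the earlier source left empty.
#[derive(Debug)]
struct Metadata {
    title: Option<String>,
    byline: Option<String>,
}

fn merge(primary: Metadata, fallback: Metadata) -> Metadata {
    Metadata {
        title: primary.title.or(fallback.title),
        byline: primary.byline.or(fallback.byline),
    }
}

fn main() {
    let json_ld = Metadata { title: Some("From JSON-LD".into()), byline: None };
    let open_graph = Metadata {
        title: Some("From OpenGraph".into()),
        byline: Some("Jane Doe".into()),
    };

    // JSON-LD wins the title; OpenGraph only fills the missing byline.
    let merged = merge(json_ld, open_graph);
    println!("{merged:?}");
}
```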
### Test Compatibility
The implementation passes 122 of 130 tests from Mozilla's test suite, achieving 93.8% compatibility. The 8 failing tests reflect editorial judgment differences rather than implementation errors. Four involve choices where our implementation is arguably more sensible, such as avoiding bylines extracted from related-article sidebars and preferring author names over timestamps. The other four involve subjective paragraph selection for excerpts, where both the reference and our implementation make valid choices. Full document preprocessing is enabled, matching Mozilla's production behavior.
## Performance
The library is built in Rust for performance and memory safety, with zero-cost abstractions enabling compiler optimizations without runtime overhead. Efficient string handling and DOM traversal keep allocations minimal during parsing. Typical news articles are processed in milliseconds on modern hardware, and memory usage scales with document size, typically staying under 10 MB for standard web pages.
## Credits
This is a Rust port of [Mozilla's Readability](https://github.com/mozilla/readability), originally based on Arc90's readability.js. The test suite and algorithm design are from Mozilla's implementation under Apache 2.0 license.
## Contributing
For contribution guidelines, see the [CONTRIBUTING.md](CONTRIBUTING.md) file.
## License
Licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for details.