Function extract 

pub fn extract(
    html_str: &str,
    doc_uri: &str,
    options: ExtractOptions,
) -> Product

Extract the main article content from an HTML page.

This is the crate's primary entry point. It implements the Readability algorithm: it scores candidate nodes by content density, prunes navigation and other boilerplate, and returns the best content subtree along with any metadata (title, byline, etc.) that could be extracted.
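To make "scoring candidate nodes by content density" concrete, here is a toy sketch loosely modeled on the heuristics commonly described for Readability-style extractors. It is not this crate's implementation: the `Candidate` struct, its fields, and the exact weights are assumptions chosen for illustration only.

```rust
/// Hypothetical per-node statistics; not a type exported by this crate.
#[derive(Debug, Clone, Copy)]
pub struct Candidate {
    /// Characters of visible text inside the node.
    pub text_len: usize,
    /// Characters of that text that sit inside `<a>` elements.
    pub link_text_len: usize,
    /// Number of commas in the text, a rough proxy for prose.
    pub commas: usize,
}

/// Fraction of the node's text that is link text, in the range 0.0..=1.0.
pub fn link_density(c: &Candidate) -> f64 {
    if c.text_len == 0 {
        return 0.0;
    }
    c.link_text_len.min(c.text_len) as f64 / c.text_len as f64
}

/// Assumed weights: one base point, one point per comma, and up to three
/// points for length. The total is scaled down by link density so that
/// navigation-like nodes score low.
pub fn score(c: &Candidate) -> f64 {
    let base = 1.0 + c.commas as f64 + (c.text_len / 100).min(3) as f64;
    base * (1.0 - link_density(c))
}

/// Index of the highest-scoring candidate, or `None` for an empty slice.
pub fn best_candidate(candidates: &[Candidate]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .max_by(|a, b| score(a.1).total_cmp(&score(b.1)))
        .map(|(index, _)| index)
}
```

In this sketch a link-heavy navigation block scores far below a long, comma-rich paragraph, which is the intuition behind pruning boilerplate before picking the best subtree.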

§Arguments

  • html_str – the raw HTML source of the page.
  • doc_uri – the URL the page was fetched from. Used to resolve relative URLs in <a href>, <img src>, srcset, etc.
  • options – tuning knobs for the extraction algorithm. ExtractOptions::default() is a sensible starting point.

§Returns

A Product whose content field is Some if article content was found, or None if the page did not contain extractable content.
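Because content is optional, callers need a path for the None case. The sketch below shows one way to handle it using a stand-in type. `ProductSketch` and `content_or` are hypothetical names, and the `String` payload is an assumption: this page states only that content is Some or None, not what type it wraps.

```rust
/// Stand-in for the crate's `Product`. Only the `content` field is named on
/// this page; the `String` payload type is an assumption for illustration.
pub struct ProductSketch {
    pub content: Option<String>,
}

/// Returns the extracted content, or the caller-supplied fallback when
/// nothing was extracted.
pub fn content_or<'a>(product: &'a ProductSketch, fallback: &'a str) -> &'a str {
    product.content.as_deref().unwrap_or(fallback)
}
```

A caller that wants to branch instead of substituting a default could match on the field directly with `if let Some(content) = &product.content { ... }`.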

§Examples

use readable_rs::{extract, ExtractOptions};

let html = "<html><body><p>Short.</p></body></html>";
let product = extract(html, "https://example.com", ExtractOptions::default());
// product.content may be None here: the paragraph is below the default char_threshold.