pub struct Readability { /* private fields */ }
The main readability parser that extracts clean content from HTML.
Uses Mozilla’s Readability.js algorithm running in an embedded JavaScript engine. Create one instance and reuse it for multiple extractions, because initializing the JS context is expensive.
§Examples
use readability_js::{Readability, ReadabilityOptions};
// Create the parser once (expensive) and reuse it
let reader = Readability::new()?;
// Basic extraction with URL context
let article = reader.parse_with_url(html, "https://example.com")?;
// With custom options
let options = ReadabilityOptions::new()
    .char_threshold(500);
let article = reader.parse_with_options(html, Some("https://example.com"), Some(options))?;
§Thread Safety
Readability instances are not thread-safe (!Send + !Sync). Each instance contains an embedded JavaScript engine that cannot be moved to another thread or shared between threads.
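One common way to work with a `!Send + !Sync` type in a multi-threaded program is to give each thread its own lazily created instance through `thread_local!`. The sketch below illustrates that pattern only: `StandInReader`, `parse_on_this_thread`, and `run_workers` are hypothetical names, and the stand-in type replaces the real parser so the block compiles without the crate. Whether this fits a given application depends on how many threads it runs, since each thread pays the initialization cost once.

```rust
use std::rc::Rc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Counts how many stand-in parsers have been constructed, across all threads.
static CREATED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `readability_js::Readability`.
/// The `Rc` field makes the type `!Send + !Sync`, mirroring the real parser.
struct StandInReader {
    _engine: Rc<()>,
}

impl StandInReader {
    fn new() -> Self {
        CREATED.fetch_add(1, Ordering::SeqCst);
        StandInReader { _engine: Rc::new(()) }
    }

    /// Placeholder for a real extraction call; it only reports the input length.
    fn parse(&self, html: &str) -> usize {
        html.len()
    }
}

thread_local! {
    // Initialized on first use in each thread, then reused by that thread.
    static READER: StandInReader = StandInReader::new();
}

/// Runs the placeholder extraction on the calling thread's own parser.
fn parse_on_this_thread(html: &str) -> usize {
    READER.with(|reader| reader.parse(html))
}

/// Spawns `workers` threads that each process every document in `docs`,
/// and returns how many stand-in parsers were created while doing so.
fn run_workers(workers: usize, docs: &'static [&'static str]) -> usize {
    let before = CREATED.load(Ordering::SeqCst);
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            thread::spawn(move || {
                docs.iter().map(|d| parse_on_this_thread(d)).sum::<usize>()
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    CREATED.load(Ordering::SeqCst) - before
}
```

With the real crate, the `thread_local!` initializer would need to handle the `Result` returned by `Readability::new()`, for example by storing the `Result` itself and checking it on each use.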
Implementations§
impl Readability
pub fn new() -> Result<Self, ReadabilityError>
Creates a new readability parser.
§Performance
This operation is expensive (50-100ms) as it initializes a JavaScript engine and loads the Readability.js library. Create one instance and reuse it for multiple extractions.
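The timing figure above will vary with hardware and build profile, so it can be worth measuring on the target machine. The helper below is a minimal sketch for doing that with the standard library only; `timed` is a hypothetical name, not part of this crate.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}
```

Assuming the crate is available, `let (reader, init_cost) = timed(Readability::new);` would capture the one-time initialization cost, and wrapping later extraction calls the same way shows how the per-call cost compares.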
§JavaScript Engine
This method initializes an embedded QuickJS runtime. The JavaScript code executed is Mozilla’s Readability.js library and is considered safe for processing untrusted HTML input.
pub fn parse(&self, html: &str) -> Result<Article, ReadabilityError>
Extract readable content from HTML.
This is the main extraction method. It processes the HTML to remove ads, navigation, sidebars and other clutter, leaving just the main article content.
§Arguments
html - The HTML content to process. Should be a complete HTML document.
§Examples
use readability_js::Readability;
let html = r#"
<html>
<body>
<article>
<h1>Breaking News</h1>
<p>Important news content here...</p>
</article>
<nav>Navigation menu</nav>
<aside>Advertisement</aside>
</body>
</html>
"#;
let reader = Readability::new()?;
let article = reader.parse(html)?;
assert_eq!(article.title, "Breaking News");
assert!(article.content.contains("Important news content"));
// Navigation and ads are removed from the output
§Errors
Returns ReadabilityError if:
- The HTML is malformed or empty (HtmlParseError)
- The page fails readability checks (ReadabilityCheckFailed)
- JavaScript evaluation fails (JsEvaluation)
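Callers often want to react differently to each of these failure modes. The sketch below shows one possible policy using a local stand-in enum: `ExtractError`, `Action`, and `action_for` are hypothetical names, and the stand-in omits any payloads the real `ReadabilityError` variants may carry, so the match arms would need adjusting against the actual type.

```rust
/// Local stand-in mirroring the variant names listed above.
/// Assumption: payloads are omitted; the real `ReadabilityError` may carry data.
#[derive(Debug, Clone, PartialEq)]
enum ExtractError {
    HtmlParseError,
    ReadabilityCheckFailed,
    JsEvaluation,
}

/// What a caller might decide to do after a failed extraction.
#[derive(Debug, PartialEq)]
enum Action {
    /// The input is unusable, so skip this page.
    Skip,
    /// The page is not article-like, so keep the original HTML instead.
    UseRawHtml,
    /// The engine itself failed, so stop and report the error.
    Abort,
}

/// Maps each failure mode to a follow-up action (one possible policy, not the only one).
fn action_for(err: &ExtractError) -> Action {
    match err {
        ExtractError::HtmlParseError => Action::Skip,
        ExtractError::ReadabilityCheckFailed => Action::UseRawHtml,
        ExtractError::JsEvaluation => Action::Abort,
    }
}
```

Keeping the policy in one function makes it easy to change later, for instance to retry with a fresh parser after an engine failure.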
§Performance
This method is fast (typically <10ms) once the Readability instance has been created. The expensive operation is Readability::new(), which should be called once, with the resulting instance reused.
pub fn parse_with_url(&self, html: &str, base_url: &str) -> Result<Article, ReadabilityError>
Extract readable content from HTML with URL context.
The URL improves link resolution and metadata extraction.
§Arguments
html - The HTML content to extract from
base_url - The original URL of the page for link resolution
§Examples
use readability_js::Readability;
let reader = Readability::new()?;
let article = reader.parse_with_url(html, "https://example.com/article")?;
// Links in the article will be properly resolved
§Errors
This function will return an error if:
- The HTML is malformed or cannot be parsed (ReadabilityError::HtmlParseError)
- The base URL is invalid (ReadabilityError::InvalidOptions)
- The content fails internal readability checks (ReadabilityError::ReadabilityCheckFailed)
- JavaScript evaluation fails (ReadabilityError::JsEvaluation)
pub fn parse_with_options(&self, html: &str, base_url: Option<&str>, options: Option<ReadabilityOptions>) -> Result<Article, ReadabilityError>
Extract readable content with custom parsing options.
§Arguments
html - The HTML content to extract from
base_url - Optional URL for link resolution
options - Custom parsing options
§Examples
use readability_js::{Readability, ReadabilityOptions};
let options = ReadabilityOptions::new()
    .char_threshold(500);
let reader = Readability::new()?;
let article = reader.parse_with_options(html, Some("https://example.com"), Some(options))?;
§Errors
This function will return an error if:
- The HTML is malformed or cannot be parsed (ReadabilityError::HtmlParseError)
- The base URL is invalid (ReadabilityError::InvalidOptions)
- The content fails internal readability checks (ReadabilityError::ReadabilityCheckFailed)
- JavaScript evaluation fails (ReadabilityError::JsEvaluation)