DOM_SMOOTHIE


A Rust crate for extracting readable content from web pages.
dom_smoothie closely follows the implementation of readability.js, bringing its functionality to Rust.
Examples
// Basic usage: parse a page and print the extracted article fields.
use std::error::Error;

use dom_smoothie::{Article, Config, Readability};

fn main() -> Result<(), Box<dyn Error>> {
    let html = include_str!("../test-pages/rustwiki_2024.html");
    let document_url = "https://en.wikipedia.org/wiki/Rust_(programming_language)";

    // Override a single setting and keep the defaults for the rest.
    let cfg = Config {
        max_elements_to_parse: 9000,
        ..Default::default()
    };

    let mut readability = Readability::new(html, Some(document_url), Some(cfg))?;
    let article: Article = readability.parse()?;

    println!("{:<15} {}", "Title:", article.title);
    println!("{:<15} {:?}", "Byline:", article.byline);
    println!("{:<15} {}", "Length:", article.length);
    println!("{:<15} {:?}", "Excerpt:", article.excerpt);
    println!("{:<15} {:?}", "Site Name:", article.site_name);
    println!("{:<15} {:?}", "Dir:", article.dir);
    println!("{:<15} {:?}", "Published Time:", article.published_time);
    println!("{:<15} {:?}", "Modified Time:", article.modified_time);
    println!("{:<15} {:?}", "Image:", article.image);
    println!("{:<15} {:?}", "URL:", article.url);

    Ok(())
}

// Metadata only: read JSON-LD and article metadata without calling `parse`.
use std::error::Error;

use dom_smoothie::{Config, Metadata, Readability};

fn main() -> Result<(), Box<dyn Error>> {
    let html = include_str!("../test-pages/rustwiki_2024.html");

    // Keep JSON-LD parsing enabled.
    let cfg = Config {
        disable_json_ld: false,
        ..Default::default()
    };

    let readability = Readability::new(html, None, Some(cfg))?;

    // Metadata found in JSON-LD, if any.
    let ld_meta: Option<Metadata> = readability.parse_json_ld();
    if let Some(ref meta) = ld_meta {
        println!("LD META: {:#?}", meta);
    }

    println!("\n=============\n");

    // Article metadata, built with the JSON-LD result passed in.
    let meta = readability.get_article_metadata(ld_meta);
    println!("META: {:#?}", &meta);

    Ok(())
}

// Title only: build Readability from an existing dom_query::Document.
// This example names `dom_query` and `tendril` types directly, so both crates must be available.
use std::error::Error;

use dom_query::Document;
use dom_smoothie::Readability;

fn main() -> Result<(), Box<dyn Error>> {
    let html = include_str!("../test-pages/rustwiki_2024.html");
    let doc: Document = Document::from(html);

    let readability: Readability = Readability::with_document(doc, None, None)?;

    let title: tendril::Tendril<tendril::fmt::UTF8> = readability.get_article_title();
    assert_eq!(title, "Rust (programming language) - Wikipedia".into());
    println!("Title: {}", title);

    Ok(())
}

// Readability check: parse only when the page looks like an article.
use std::error::Error;

use dom_smoothie::{Article, Config, Readability};

fn main() -> Result<(), Box<dyn Error>> {
    let html = include_str!("../test-pages/rustwiki_2024.html");

    // Thresholds used by the readability check.
    let cfg = Config {
        readable_min_score: 20.0,
        readable_min_content_length: 140,
        ..Default::default()
    };

    let mut readability = Readability::new(html, None, Some(cfg))?;

    if readability.is_probably_readable() {
        let article: Article = readability.parse()?;
        println!("{:<15} {}", "Title:", article.title);
        println!("{:<15} {:?}", "Byline:", article.byline);
        println!("{:<15} {:?}", "Site Name:", article.site_name);
        println!("{:<15} {:?}", "URL:", article.url);
    }

    Ok(())
}
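
The two thresholds in the example above can be illustrated with a small standalone sketch. It is modelled on how readability.js scores nodes in its `isProbablyReaderable` check, and it is an assumption that dom_smoothie behaves the same way in every detail. The function name and the idea of passing in precomputed text lengths are hypothetical simplifications; the real check works on DOM nodes and also filters out hidden or unlikely elements.

```rust
/// Simplified scoring sketch (not the dom_smoothie API).
/// `text_lengths` stands in for the trimmed text length of each candidate node.
fn is_probably_readable_sketch(
    text_lengths: &[usize],
    min_score: f64,
    min_content_length: usize,
) -> bool {
    let mut score = 0.0;
    for &len in text_lengths {
        // Nodes shorter than the minimum length contribute nothing.
        if len < min_content_length {
            continue;
        }
        // Longer nodes add the square root of their surplus length.
        score += ((len - min_content_length) as f64).sqrt();
        // The page counts as readable once the running score passes the threshold.
        if score > min_score {
            return true;
        }
    }
    false
}

fn main() {
    // Two medium paragraphs are enough to pass with the thresholds 20.0 and 140.
    println!("{}", is_probably_readable_sketch(&[200, 300], 20.0, 140));
    // Many short fragments never accumulate any score.
    println!("{}", is_probably_readable_sketch(&[100, 120, 139], 20.0, 140));
}
```

Under this model, raising readable_min_content_length ignores more short nodes, while raising readable_min_score requires more long text before the page is considered readable.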
The approach used in mozilla/readability does not always produce the desired
result when extracting meaningful content. It sometimes discards part of the
content simply because the best candidate had fewer than three alternative candidates.
The method works well in general, but it relies on too many magic numbers.
After @emschwartz discovered this issue, I added an alternative implementation
for finding the common candidate. It may currently produce a less "clean" result
than mozilla/readability, but in return it can capture more of the meaningful
content in cases where the original approach fails.
That said, this approach is not necessarily superior to the original, and there is
still room for improvement.
// Alternative candidate selection: switch from the default mode to DomSmoothie.
use std::error::Error;

use dom_smoothie::{Article, CandidateSelectMode, Config, Readability};

fn main() -> Result<(), Box<dyn Error>> {
    let html = include_str!("../test-pages/alt/arstechnica/source.html");

    let cfg = Config {
        candidate_select_mode: CandidateSelectMode::DomSmoothie,
        ..Default::default()
    };

    let mut readability = Readability::new(html, None, Some(cfg))?;
    let article: Article = readability.parse()?;
    println!("Text Content: {}", article.text_content);

    Ok(())
}
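
To make the "fewer than three alternative candidates" problem concrete, here is a toy model of the promotion step as I understand it from readability.js. The constants 3 and 0.75, the parent-index tree, and every name in the block are assumptions made for illustration; none of this is dom_smoothie code, and it does not show how CandidateSelectMode::DomSmoothie works.

```rust
const MINIMUM_TOP_CANDIDATES: usize = 3; // assumed, after readability.js
const SCORE_RATIO: f64 = 0.75; // assumed, after readability.js

/// Returns the ancestors of `node`, nearest first. `parents[i]` is the parent of node `i`.
fn ancestors(parents: &[Option<usize>], node: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut cur = parents[node];
    while let Some(p) = cur {
        out.push(p);
        cur = parents[p];
    }
    out
}

/// `candidates` holds (node, score) pairs sorted by score, best first.
/// Returns the node that would be used as the content root.
fn select_common_candidate(parents: &[Option<usize>], candidates: &[(usize, f64)]) -> usize {
    let (top, top_score) = candidates[0];

    // Keep only alternatives that score close to the best candidate.
    let alt_ancestors: Vec<Vec<usize>> = candidates[1..]
        .iter()
        .filter(|(_, score)| *score / top_score >= SCORE_RATIO)
        .map(|(node, _)| ancestors(parents, *node))
        .collect();

    // With too few alternatives, no promotion happens at all.
    if alt_ancestors.len() < MINIMUM_TOP_CANDIDATES {
        return top;
    }

    // Otherwise promote to the nearest ancestor shared by enough alternatives.
    for ancestor in ancestors(parents, top) {
        // Stop before the root, which plays the role of <body> here.
        if parents[ancestor].is_none() {
            break;
        }
        let hits = alt_ancestors.iter().filter(|a| a.contains(&ancestor)).count();
        if hits >= MINIMUM_TOP_CANDIDATES {
            return ancestor;
        }
    }
    top
}

fn main() {
    // 0 = body, 1 = article wrapper, 2..=5 = sections inside it, 6 = sidebar.
    let parents: [Option<usize>; 7] =
        [None, Some(0), Some(1), Some(1), Some(1), Some(1), Some(0)];

    // Two close alternatives: selection stays on section 2.
    println!("{}", select_common_candidate(&parents, &[(2, 100.0), (3, 90.0), (4, 80.0)]));
    // Three close alternatives: selection is promoted to wrapper 1.
    println!(
        "{}",
        select_common_candidate(&parents, &[(2, 100.0), (3, 90.0), (4, 80.0), (5, 76.0)])
    );
}
```

In this model a page with one fewer strong section keeps a single section as its root instead of the shared wrapper, which is the kind of hard cut-off the text above describes.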
By default, the text content is output as-is, without formatting,
preserving whitespace from the original HTML document.
Depending on the document's markup, this output can be verbose and inconvenient.
To retrieve formatted text content, set text_mode: TextMode::Formatted in the config.
This formatting does not preserve table structures, so table data may be output as plain text without column alignment.
It is less structured than Markdown but cleaner than the raw text.
To get Markdown output instead, set text_mode: TextMode::Markdown.
// Formatted text: request cleaned-up text content instead of the raw text.
use std::error::Error;

use dom_smoothie::{Article, Config, Readability, TextMode};

fn main() -> Result<(), Box<dyn Error>> {
    let html = include_str!("../test-pages/hacker_news.html");

    let cfg = Config {
        text_mode: TextMode::Formatted,
        ..Default::default()
    };

    let mut readability = Readability::new(html, None, Some(cfg))?;
    let article: Article = readability.parse()?;
    println!("Text Content: {}", article.text_content);

    Ok(())
}
The Readability::parse_with_policy method allows parsing content with a specific policy.
This method follows the same steps as Readability::parse but makes only a single attempt using the specified ParsePolicy.
As a result, it doesn't store the best attempt, leading to significantly lower memory consumption. Some policies may also be faster than others.
Typically, ParsePolicy::Strict is the slowest but provides the cleanest result. ParsePolicy::Moderate can also yield a good result, while the others may be less accurate.
In some cases, using certain policies (e.g., ParsePolicy::Strict) may result in a ReadabilityError::GrabFailed error, whereas Readability::parse might succeed.
This happens because Readability::parse attempts parsing with different policies (essentially a set of grab flags) until it either succeeds or exhausts all options.
// Single-policy parsing: on this page only ParsePolicy::Raw succeeds.
use std::error::Error;

use dom_smoothie::{ParsePolicy, Readability};

fn main() -> Result<(), Box<dyn Error>> {
    let html = include_str!("../test-pages/readability/lazy-image-3/source.html");

    // Each entry is (policy, whether parsing is expected to succeed).
    let cases: [(ParsePolicy, bool); 4] = [
        (ParsePolicy::Strict, false),
        (ParsePolicy::Moderate, false),
        (ParsePolicy::Clean, false),
        (ParsePolicy::Raw, true),
    ];

    for (policy, expected) in cases {
        // Each policy gets a fresh Readability instance and exactly one attempt.
        let mut r = Readability::new(html, None, None)?;
        let article = r.parse_with_policy(policy);
        assert_eq!(article.is_ok(), expected);
    }

    Ok(())
}
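
The difference between a single attempt and the fallback behavior described above can be sketched with a toy model. Everything in this block is a hypothetical stand-in: the `Policy` enum, the `GrabFailed` struct, and the `try_grab` function are not dom_smoothie types, and the real `Readability::parse` also keeps track of the best attempt, which this sketch leaves out.

```rust
// Hypothetical stand-ins; these are NOT the dom_smoothie types.
// Declaration order models strictest -> most permissive.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Policy {
    Strict,
    Moderate,
    Clean,
    Raw,
}

#[derive(Debug, PartialEq)]
struct GrabFailed;

/// Toy extractor: succeeds only when `policy` is at least as permissive
/// as the loosest policy the page needs.
fn try_grab(policy: Policy, needed: Policy) -> Result<String, GrabFailed> {
    if (policy as u8) >= (needed as u8) {
        Ok(format!("content extracted with {:?}", policy))
    } else {
        Err(GrabFailed)
    }
}

/// One attempt with the given policy, in the spirit of `parse_with_policy`.
fn parse_with_policy(policy: Policy, needed: Policy) -> Result<String, GrabFailed> {
    try_grab(policy, needed)
}

/// Tries each policy in turn until one succeeds, in the spirit of `parse`.
fn parse(needed: Policy) -> Result<String, GrabFailed> {
    for policy in [Policy::Strict, Policy::Moderate, Policy::Clean, Policy::Raw] {
        if let Ok(content) = try_grab(policy, needed) {
            return Ok(content);
        }
    }
    Err(GrabFailed)
}

fn main() {
    // A page that only the most permissive policy can handle.
    println!("{:?}", parse_with_policy(Policy::Strict, Policy::Raw));
    println!("{:?}", parse(Policy::Raw));
}
```

In this model the single strict attempt fails while the fallback loop eventually succeeds, matching the behavior the lazy-image-3 example asserts for the real crate.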
Crate Features
serde: Enables the serde::Serialize and serde::Deserialize traits for the Article, Metadata, and Config structures.
See Also
Changelog
License
Licensed under MIT (LICENSE or http://opensource.org/licenses/MIT).
Contribution
Any contribution intentionally submitted for inclusion in this project will be licensed under the MIT license, without any additional terms or conditions.