Paragraph-level boilerplate removal for HTML.
justext classifies HTML paragraphs as content or boilerplate using
stopword density, link density, and text length — then refines
classifications using neighbor context.
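The first, context-free pass described above can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the crate's actual code: the threshold constants follow the defaults commonly cited for the original jusText algorithm, and the names `Label`, `stopword_density`, and `classify` are hypothetical. The neighbor-context refinement step is left out.

```rust
use std::collections::HashSet;

// Assumed thresholds, modeled on the defaults of the original jusText
// algorithm. The crate's `Config` may use different names or values.
const LENGTH_LOW: usize = 70;
const LENGTH_HIGH: usize = 200;
const STOPWORDS_LOW: f64 = 0.30;
const STOPWORDS_HIGH: f64 = 0.32;
const MAX_LINK_DENSITY: f64 = 0.2;

/// Hypothetical context-free labels assigned before neighbor refinement.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Label {
    Good,
    NearGood,
    Short,
    Bad,
}

/// Fraction of words (lowercased, surrounding punctuation trimmed)
/// that appear in the stoplist.
fn stopword_density(text: &str, stoplist: &HashSet<&str>) -> f64 {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words.iter().filter(|w| stoplist.contains(w.as_str())).count();
    hits as f64 / words.len() as f64
}

/// Classify one paragraph from its text and the number of characters
/// that sit inside links.
fn classify(text: &str, chars_in_links: usize, stoplist: &HashSet<&str>) -> Label {
    let length = text.chars().count();
    let link_density = if length == 0 {
        0.0
    } else {
        chars_in_links as f64 / length as f64
    };
    let density = stopword_density(text, stoplist);

    if link_density > MAX_LINK_DENSITY {
        // Mostly link text: navigation, menus, footers.
        Label::Bad
    } else if length < LENGTH_LOW {
        // Too short to judge alone; short paragraphs with links are rejected.
        if chars_in_links > 0 { Label::Bad } else { Label::Short }
    } else if density >= STOPWORDS_HIGH {
        // Reads like running prose; long paragraphs are accepted outright.
        if length > LENGTH_HIGH { Label::Good } else { Label::NearGood }
    } else if density >= STOPWORDS_LOW {
        Label::NearGood
    } else {
        Label::Bad
    }
}
```

In this sketch, `Short` and `NearGood` are the undecided labels. A second pass would resolve them by looking at the labels of the surrounding paragraphs.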
§Quick start
use justext::{justext, get_stoplist, Config};
let html = "<html><body><p>This is the main content.</p></body></html>";
let stoplist = get_stoplist("English").unwrap();
let config = Config::default();
let paragraphs = justext(html, &stoplist, &config);
for p in &paragraphs {
    if !p.is_boilerplate() {
        println!("{}", p.text);
    }
}
Re-exports§
pub use stoplists::available_languages;
pub use stoplists::get_all_stoplists;
pub use stoplists::get_stoplist;
Modules§
Structs§
- Config
- Configuration for the JusText algorithm.
- Paragraph
- A classified text paragraph extracted from HTML.
Enums§
- ClassType
- Classification label for a paragraph.
- JustextError
Functions§
- extract_text
- Convenience: extract only the good paragraph text.
- justext
- Classify paragraphs in HTML as content or boilerplate.