Crate justext

Paragraph-level boilerplate removal for HTML.

justext classifies each HTML paragraph as content or boilerplate based on its stopword density, link density, and text length, then refines those classifications using the context of neighboring paragraphs.
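To make the three signals concrete, here is a minimal, self-contained sketch of how stopword density, link density, and text length might be combined into a first-pass label. It is not this crate's implementation: `Label`, `stopword_density`, `link_density`, `classify`, and all threshold constants are hypothetical names and illustrative values, and the neighbor-context refinement step is left out.

```rust
use std::collections::HashSet;

/// Hypothetical first-pass labels (not the crate's `ClassType`).
#[derive(Debug, PartialEq)]
enum Label {
    Good,
    NearGood,
    Short,
    Bad,
}

/// Fraction of whitespace-separated words found in the stoplist.
fn stopword_density(text: &str, stoplist: &HashSet<&str>) -> f64 {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words
        .iter()
        .filter(|w| stoplist.contains(w.to_lowercase().as_str()))
        .count();
    hits as f64 / words.len() as f64
}

/// Fraction of the paragraph's characters that sit inside links.
fn link_density(text: &str, chars_in_links: usize) -> f64 {
    let total = text.chars().count();
    if total == 0 {
        return 0.0;
    }
    chars_in_links as f64 / total as f64
}

/// Context-free classification with illustrative (assumed) thresholds.
fn classify(text: &str, chars_in_links: usize, stoplist: &HashSet<&str>) -> Label {
    const MAX_LINK_DENSITY: f64 = 0.2; // assumption
    const LENGTH_LOW: usize = 70; // assumption
    const LENGTH_HIGH: usize = 200; // assumption
    const STOPWORDS_LOW: f64 = 0.30; // assumption
    const STOPWORDS_HIGH: f64 = 0.32; // assumption

    let len = text.chars().count();
    let density = stopword_density(text, stoplist);

    if link_density(text, chars_in_links) > MAX_LINK_DENSITY {
        Label::Bad // mostly link text, e.g. navigation
    } else if len < LENGTH_LOW {
        // Too short to judge alone; any link text tips it to boilerplate.
        if chars_in_links > 0 { Label::Bad } else { Label::Short }
    } else if density >= STOPWORDS_HIGH {
        if len > LENGTH_HIGH { Label::Good } else { Label::NearGood }
    } else if density >= STOPWORDS_LOW {
        Label::NearGood
    } else {
        Label::Bad
    }
}
```

In a scheme like this, the `Short` and `NearGood` labels are the ones a later pass would resolve by looking at the labels of surrounding paragraphs.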

Quick start

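use justext::{justext, get_stoplist, Config};

let html = "<html><body><p>This is the main content.</p></body></html>";
// Look up the English stoplist; `unwrap` panics if the lookup fails.
let stoplist = get_stoplist("English").unwrap();
let config = Config::default();
// Classify every paragraph in the document.
let paragraphs = justext(html, &stoplist, &config);

// Print only the paragraphs that were not classified as boilerplate.
for p in &paragraphs {
    if !p.is_boilerplate() {
        println!("{}", p.text);
    }
}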

Re-exports

pub use stoplists::available_languages;
pub use stoplists::get_all_stoplists;
pub use stoplists::get_stoplist;

Modules

stoplists

Structs

Config
Configuration for the JusText algorithm.
Paragraph
A classified text paragraph extracted from HTML.

Enums

ClassType
Classification label for a paragraph.
JustextError

Functions

extract_text
Convenience function: extracts only the text of paragraphs classified as good.
justext
Classify paragraphs in HTML as content or boilerplate.