# justext

Paragraph-level boilerplate removal for HTML.

A Rust port of jusText (v3.0.2) — classifies every paragraph in an HTML page as content or boilerplate using stopword density, link density, and text length, then refines classifications using neighbor context.
## Usage

Add to your `Cargo.toml`:

```toml
[dependencies]
justext = "0.2"
```

```rust
use justext::extract_text_lang;

let html = r#"<html><body>
<nav>Menu | About | Contact</nav>
<article>
<p>This is the main article body with enough text to be classified
as content by the stopword density algorithm.</p>
</article>
<footer>Copyright 2024</footer>
</body></html>"#;

let text = extract_text_lang(html, "English").unwrap();
println!("{text}");
```
For access to the full paragraph classification:

```rust
use justext::justext_lang;

let paragraphs = justext_lang(html, "English").unwrap();
for p in &paragraphs {
    println!("{:?}\t{}", p.class_type, p.text);
}
```

The `_lang` variants look up the stoplist internally. If you already have a stoplist, use `justext()` / `extract_text()` directly:

```rust
use justext::{extract_text, get_stoplist, Config};

let stoplist = get_stoplist("English").unwrap();
let text = extract_text(html, &stoplist, &Config::default());
```
## How it works

Each paragraph goes through two stages:

1. **Context-free classification** — each paragraph is classified independently using link density, stopword density, and character length: `Good`, `Bad`, `NearGood`, or `Short`.
2. **Context-sensitive revision** — four passes use neighboring paragraph classes to promote or demote ambiguous paragraphs, resolving `Short` and `NearGood` into `Good` or `Bad`.

Paragraphs classified `Good` are content; everything else is boilerplate.
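The context-free stage can be sketched as follows. This is a simplified illustration of the heuristic with the 3.0.2 default thresholds, not the crate's exact code: the real algorithm also handles special cases such as copyright symbols, `<select>` elements, and link characters inside short paragraphs.

```rust
#[derive(Debug, PartialEq)]
enum Class {
    Good,
    NearGood,
    Short,
    Bad,
}

/// Simplified context-free classification (jusText 3.0.2 default thresholds).
/// `stopword_density` = stopwords / words; `link_density` = chars in <a> / chars.
fn classify_context_free(chars: usize, stopword_density: f64, link_density: f64) -> Class {
    const LENGTH_LOW: usize = 70;
    const LENGTH_HIGH: usize = 200;
    const STOPWORDS_LOW: f64 = 0.30;
    const STOPWORDS_HIGH: f64 = 0.32;
    const MAX_LINK_DENSITY: f64 = 0.20;

    if link_density > MAX_LINK_DENSITY {
        // Mostly link text: almost certainly navigation.
        Class::Bad
    } else if chars < LENGTH_LOW {
        // Too little text to judge; resolved from neighbors in stage two.
        Class::Short
    } else if stopword_density >= STOPWORDS_HIGH {
        if chars > LENGTH_HIGH { Class::Good } else { Class::NearGood }
    } else if stopword_density >= STOPWORDS_LOW {
        Class::NearGood
    } else {
        Class::Bad
    }
}
```

The revision stage then walks the paragraph sequence and resolves each `Short`/`NearGood` based on the nearest non-ambiguous neighbors on either side.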
## What the paragraph struct contains

| Field | Description |
|---|---|
| `text` | Normalized text content |
| `class_type` | Final classification (`Good`, `Bad`, `NearGood`, `Short`) |
| `initial_class` | Context-free classification (before revision) |
| `dom_path` | Dot-separated DOM path, e.g. `"body.div.p"` |
| `xpath` | XPath with ordinals, e.g. `"/html[1]/body[1]/div[2]/p[1]"` |
| `words_count` | Whitespace-split word count |
| `chars_count_in_links` | Character count inside `<a>` tags |
| `tags_count` | Count of inline tags within the paragraph |
| `heading` | Whether the paragraph is a heading (`h0`–`h9` in `dom_path`) |
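Taken together, the fields above imply a shape roughly like the following. This is a sketch inferred from the table, not the crate's exact definition, and `Classification` stands in for the real class enum:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Classification {
    Good,
    NearGood,
    Short,
    Bad,
}

/// Approximate shape of a classified paragraph, per the field table above.
pub struct Paragraph {
    pub text: String,                  // normalized text content
    pub class_type: Classification,    // final classification
    pub initial_class: Classification, // context-free classification
    pub dom_path: String,              // e.g. "body.div.p"
    pub xpath: String,                 // e.g. "/html[1]/body[1]/div[2]/p[1]"
    pub words_count: usize,            // whitespace-split word count
    pub chars_count_in_links: usize,   // characters inside <a> tags
    pub tags_count: usize,             // inline tags within the paragraph
    pub heading: bool,                 // h0–h9 in dom_path
}
```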
## Stoplists

100 languages are bundled and embedded at compile time. Retrieve one by name (case-insensitive):

```rust
use justext::get_stoplist;

let stoplist = get_stoplist("English").unwrap(); // HashSet<String>
let stoplist = get_stoplist("english").unwrap(); // case-insensitive
```
For language-independent extraction, pass an empty stoplist and zero thresholds:

```rust
use std::collections::HashSet;
use justext::{extract_text, Config};

let config = Config::default()
    .with_stopwords_low(0.0)
    .with_stopwords_high(0.0);
let text = extract_text(html, &HashSet::new(), &config);
```

To get the merged set of all stopwords across every language:

```rust
let all = get_all_stoplists(); // &'static HashSet<String>
```
Available languages:
Afrikaans, Albanian, Arabic, Aragonese, Armenian, Aromanian, Asturian, Azerbaijani, Basque, Belarusian, Belarusian_Taraskievica, Bengali, Bishnupriya_Manipuri, Bosnian, Breton, Bulgarian, Catalan, Cebuano, Chuvash, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Hebrew, Hindi, Hungarian, Icelandic, Ido, Igbo, Indonesian, Irish, Italian, Javanese, Kannada, Kazakh, Korean, Kurdish, Kyrgyz, Latin, Latvian, Lithuanian, Lombard, Low_Saxon, Luxembourgish, Macedonian, Malay, Malayalam, Maltese, Marathi, Neapolitan, Nepali, Newar, Norwegian_Bokmal, Norwegian_Nynorsk, Occitan, Persian, Piedmontese, Polish, Portuguese, Quechua, Romanian, Russian, Samogitian, Serbian, Serbo_Croatian, Sicilian, Simple_English, Slovak, Slovenian, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tamil, Telugu, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Volapuk, Walloon, Waray_Waray, Welsh, West_Frisian, Western_Panjabi, Yoruba
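Stopword density, the core signal, is simply the fraction of a paragraph's words found in the stoplist. A minimal sketch (the crate's internal implementation may differ, e.g. in tokenization or case folding):

```rust
use std::collections::HashSet;

/// Fraction of whitespace-split words that appear in the stoplist
/// (lowercased before lookup).
fn stopword_density(text: &str, stoplist: &HashSet<String>) -> f64 {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words
        .iter()
        .filter(|w| stoplist.contains(&w.to_lowercase()))
        .count();
    hits as f64 / words.len() as f64
}
```

With an empty stoplist this is always 0.0, which is why language-independent mode also needs the thresholds zeroed.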
## Configuration

All parameters default to the Python jusText 3.0.2 values.

```rust
let config = Config::default()
    .with_length_low(70)             // min chars for a non-short paragraph
    .with_length_high(200)           // min chars for a stopword-dense paragraph to be Good
    .with_stopwords_low(0.30)        // min stopword density for NearGood
    .with_stopwords_high(0.32)       // min stopword density for the Good/NearGood branch
    .with_max_link_density(0.2)      // max link-char ratio before Bad
    .with_max_heading_distance(200)  // max chars to scan ahead when promoting headings
    .with_no_headings(false);        // set true to disable heading detection
```
## Optional features

| Feature | Description |
|---|---|
| `tracing` | Enable debug/trace logging (zero-cost when disabled) |

```toml
justext = { version = "0.2", features = ["tracing"] }
```
## Comparison to readability

| | justext | libreadability |
|---|---|---|
| Unit of extraction | Paragraphs | DOM subtree |
| Output | `Vec<Paragraph>` (plain text) | Cleaned HTML |
| Approach | Stopword/link density heuristics | DOM scoring |
| Works best on | Pages without clear `<article>` structure | Standard news/blog articles |
The two are complementary. trafilatura uses readability first and falls back to JusText — this crate enables the same pattern in Rust.
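That fallback pattern can be sketched like this; `readability_extract` and `justext_extract` are illustrative stand-ins for the two crates' actual entry points, stubbed here so the shape of the logic is clear:

```rust
/// Illustrative stand-in for a readability-style extractor (DOM scoring).
fn readability_extract(html: &str) -> String {
    let _ = html; // stubbed: returns nothing, as on pages it handles poorly
    String::new()
}

/// Illustrative stand-in for this crate's paragraph-based extraction.
fn justext_extract(html: &str) -> String {
    let _ = html;
    "fallback content".to_string()
}

/// Try readability first; fall back to jusText when the result is too thin.
fn extract_with_fallback(html: &str, min_chars: usize) -> String {
    let primary = readability_extract(html);
    if primary.trim().chars().count() >= min_chars {
        primary
    } else {
        justext_extract(html)
    }
}
```

The only design decision is the `min_chars` cutoff: too low and thin readability output wins; too high and good extractions are discarded.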
## Benchmarks

Rust vs Python (jusText) — full pipeline (parse + classify + revise):
| Input | Rust | Python | Speedup |
|---|---|---|---|
| small (2 paragraphs, 733 B) | 21 µs | 202 µs | 10x |
| medium (20 paragraphs, 5 KB) | 98 µs | 1.4 ms | 14x |
| large (100 paragraphs, 34 KB) | 604 µs | 11.1 ms | 18x |
### Output comparison

On a 925-file dataset (the trafilatura comparison corpus), Rust and Python produce identical extracted text on 99.4% of files (919/925).

Measured on Apple M4 Max, Rust 1.93, macOS 15.7.
Reproduce:
## License
BSD-2-Clause