#[non_exhaustive]
pub struct Parser {
    pub max_elems_to_parse: usize,
    pub n_top_candidates: usize,
    pub char_threshold: usize,
    pub classes_to_preserve: Vec<String>,
    pub keep_classes: bool,
    pub tags_to_score: Vec<String>,
    pub disable_jsonld: bool,
    pub allowed_video_regex: Option<Regex>,
    /* private fields */
}
Port of Parser — the core readability extraction engine.
Create with Parser::new(), configure public fields as needed, then call
parse() to extract an article.
A single Parser can be reused for multiple documents: internal state is
fully reset at the start of each parse call. However, a Parser cannot be
used from several threads at once: parsing requires &mut self, so sharing
one instance across threads needs external synchronization (for example a Mutex).
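The external synchronization mentioned above could look like the following sketch. The `Parser` here is a hypothetical stand-in (its `docs_parsed` counter, the simplified `parse` signature, and the `parse_all_shared` helper are not part of the crate); only the `&mut self` receiver mirrors the documented API. It also assumes the real `Parser` is `Send`, which this page does not state.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the crate's `Parser`; only the `&mut self`
// receiver of `parse` matters for this illustration.
#[derive(Default)]
struct Parser {
    docs_parsed: usize,
}

impl Parser {
    fn new() -> Self {
        Self::default()
    }

    // Stand-in body: the real method extracts an article from `html`.
    fn parse(&mut self, html: &str) -> usize {
        self.docs_parsed += 1;
        html.len()
    }
}

// Parse each document on its own thread while sharing one parser.
// The mutex serializes the `&mut self` calls, so parses never overlap.
fn parse_all_shared(docs: Vec<String>) -> usize {
    let parser = Arc::new(Mutex::new(Parser::new()));

    let handles: Vec<_> = docs
        .into_iter()
        .map(|doc| {
            let parser = Arc::clone(&parser);
            thread::spawn(move || {
                let mut guard = parser.lock().expect("parser mutex poisoned");
                guard.parse(&doc)
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // Report how many documents went through the shared parser.
    let count = parser.lock().expect("parser mutex poisoned").docs_parsed;
    count
}
```

Since the lock is held for the whole parse, this gives no parallel speed-up; creating one parser per thread is likely the simpler choice when throughput matters.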
Fields (Non-exhaustive)
This struct is marked as non-exhaustive
Non-exhaustive structs could have additional fields added in future. Therefore, non-exhaustive structs cannot be constructed in external crates using the traditional Struct { .. } syntax; cannot be matched against without a wildcard ..; and struct update syntax will not work.
max_elems_to_parse: usize
Max DOM nodes to process. 0 = unlimited. Port of MaxElemsToParse.
n_top_candidates: usize
Number of top candidates to compare during scoring. Port of NTopCandidates.
char_threshold: usize
Minimum character count for accepted article content. Port of CharThreshold.
classes_to_preserve: Vec<String>
CSS class names to preserve when keep_classes is false. Port of ClassesToPreserve.
keep_classes: bool
If true, keep all class attributes. Port of KeepClasses.
tags_to_score: Vec<String>
Tag names eligible for content scoring. Port of TagsToScore.
disable_jsonld: bool
Disable JSON-LD metadata extraction. Port of DisableJSONLD.
allowed_video_regex: Option<Regex>
Optional regex for video URLs to allow. Port of AllowedVideoRegex.
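Because the struct is non-exhaustive, code outside the crate cannot build it with a struct literal; the documented route is Parser::new() followed by assignments to the public fields. A minimal sketch of that pattern, using a hypothetical stand-in type with only a few of the fields (the stand-in's defaults and the `configured_parser` helper are placeholders, not the crate's actual values or API):

```rust
// Hypothetical stand-in mirroring a few of the documented public fields.
// In the real crate the struct is #[non_exhaustive] and has private fields,
// so `Parser { .. }` literals are rejected outside the crate.
#[derive(Debug, Default)]
pub struct Parser {
    pub max_elems_to_parse: usize,
    pub char_threshold: usize,
    pub classes_to_preserve: Vec<String>,
    pub keep_classes: bool,
}

impl Parser {
    // Placeholder defaults; the real `Parser::new()` may choose others.
    pub fn new() -> Self {
        Self::default()
    }
}

// Start from `new()` and assign only the public fields that need changing.
pub fn configured_parser() -> Parser {
    let mut parser = Parser::new();
    parser.max_elems_to_parse = 10_000; // 0 would mean "unlimited"
    parser.char_threshold = 250;
    parser.classes_to_preserve = vec!["caption".to_string()];
    parser
}
```

Fields left unassigned keep whatever `new()` set, which is what lets the crate add fields later without breaking callers.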
Implementations
impl Parser
pub fn parse(
    &mut self,
    html: &str,
    page_url: Option<&Url>,
) -> Result<Article, Error>
Port of Parse — parse an HTML string and return the article.
pub fn check_html(&self, html: &str) -> bool
Convenience wrapper: parse html and check readability without a pre-parsed Document.
Equivalent to CheckDocument(html.Parse(html)) in Go tests.
impl Parser
pub fn with_max_elems_to_parse(self, n: usize) -> Self
Set the maximum number of DOM elements to parse. 0 = unlimited.
pub fn with_n_top_candidates(self, n: usize) -> Self
Set the number of top candidates to compare during scoring.
pub fn with_char_threshold(self, n: usize) -> Self
Set the minimum character count for accepted article content.
pub fn with_classes_to_preserve(
    self,
    classes: impl IntoIterator<Item = impl Into<String>>,
) -> Self
Set CSS class names to preserve when keep_classes is false.
pub fn with_keep_classes(self, keep: bool) -> Self
If true, keep all class attributes on extracted content.
Set tag names eligible for content scoring.
pub fn with_disable_jsonld(self, disable: bool) -> Self
Disable JSON-LD metadata extraction.
pub fn with_allowed_video_regex(self, re: Regex) -> Self
Set a regex for video URLs to allow during cleaning.
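The with_* methods take self by value and return Self, so they can be chained from Parser::new(). The sketch below shows how such consuming setters might be written and used. It is a hypothetical stand-in covering four of the documented methods; the method bodies, the defaults, and the `build_parser` helper are assumptions for illustration, not the crate's source.

```rust
// Hypothetical stand-in with a subset of the documented fields.
#[derive(Debug, Default)]
pub struct Parser {
    pub max_elems_to_parse: usize,
    pub char_threshold: usize,
    pub classes_to_preserve: Vec<String>,
    pub keep_classes: bool,
}

impl Parser {
    // Placeholder defaults; the real constructor may differ.
    pub fn new() -> Self {
        Self::default()
    }

    // Each setter consumes the parser, updates one field, and hands it back.
    pub fn with_max_elems_to_parse(mut self, n: usize) -> Self {
        self.max_elems_to_parse = n;
        self
    }

    pub fn with_char_threshold(mut self, n: usize) -> Self {
        self.char_threshold = n;
        self
    }

    // Accepts anything iterable whose items convert into `String`,
    // for example a `Vec<&str>` or a `Vec<String>`.
    pub fn with_classes_to_preserve(
        mut self,
        classes: impl IntoIterator<Item = impl Into<String>>,
    ) -> Self {
        self.classes_to_preserve = classes.into_iter().map(Into::into).collect();
        self
    }

    pub fn with_keep_classes(mut self, keep: bool) -> Self {
        self.keep_classes = keep;
        self
    }
}

// Chain the setters to configure a parser in a single expression.
pub fn build_parser() -> Parser {
    Parser::new()
        .with_max_elems_to_parse(5_000)
        .with_char_threshold(300)
        .with_classes_to_preserve(vec!["caption", "credit"])
        .with_keep_classes(false)
}
```

Chaining and direct field assignment reach the same state; the builder form is mostly convenient when the parser is constructed inline, for instance as a function argument.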