Function clean_html_fragment 

pub fn clean_html_fragment(
    html_fragment: &str,
    base_url: &Url,
) -> Result<CleanedHtml, FullTextParserError>

Re-use crate internals to clean HTML fragments of articles before further processing:

  • replace H1 with H2
  • rename all font nodes to span
  • unwrap noscript images
  • strip noscript tags
  • fix lazy-load images
  • fix iframe size
  • remove onclick of links
  • strip elements using Readability.com and Instapaper.com ignore class names
  • strip elements that contain style="display: none;"
  • strip styles
  • strip input elements
  • strip comments
  • strip scripts
  • strip all external css and fonts
  • complete relative urls
  • simplify nested elements
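
To make two of the steps above concrete ("replace H1 with H2" and "strip comments"), here is a minimal, std-only sketch. It is not the crate's implementation: the helper names `strip_comments` and `demote_h1` are hypothetical, and the sketch works on raw strings purely for illustration, whereas a real cleaner would more likely operate on a parsed document.

```rust
/// Hypothetical helper (not part of the crate): removes every
/// `<!-- ... -->` comment from `html`. As a simplification, an
/// unterminated comment drops the rest of the input.
fn strip_comments(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find("<!--") {
        // Keep everything before the comment opener.
        out.push_str(&rest[..start]);
        match rest[start + 4..].find("-->") {
            // Continue scanning right after the comment closer.
            Some(end) => rest = &rest[start + 4 + end + 3..],
            None => return out,
        }
    }
    out.push_str(rest);
    out
}

/// Hypothetical helper (not part of the crate): renames `<h1>` tags to
/// `<h2>`. Only lowercase tags are matched; attributes are preserved.
fn demote_h1(html: &str) -> String {
    html.replace("<h1>", "<h2>")
        .replace("<h1 ", "<h2 ")
        .replace("</h1>", "</h2>")
}

fn main() {
    let fragment = r#"<h1 class="title">Hello</h1><!-- tracking --><p>World</p>"#;
    let cleaned = demote_h1(&strip_comments(fragment));
    // prints: <h2 class="title">Hello</h2><p>World</p>
    println!("{cleaned}");
}
```

String matching like this misses uppercase tags and comment-like text inside attributes, which is one reason to prefer a cleaner built on a real HTML parser, such as this function, over ad-hoc replacements.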

§Arguments

  • html_fragment - HTML fragment to clean
  • base_url - URL used to complete relative URLs