html-filter
Parse HTML into a typed tree, then search by tag, attribute, or class, filter out comments, or extract exactly the data you want with a short builder pattern. Zero dependencies, zero overhead.
Why use this crate?
- For HTML parsing and filtering
- Lightweight, with no dependencies
- Public and accessible HTML tree representation
- Easy interface to filter HTML
- Extract information from HTML in just a few lines
- Lenient parsing that does not crash on invalid HTML
- Contextual filtering: retrieve the ancestors of matched nodes to keep a node based on its child content
Installation
cargo add html_filter
You first need to parse the HTML data:
use html_filter::*;
let html = parse.unwrap;
let filter = new.except_attribute_value;
assert_eq!;
Filtering
Filter uses a builder pattern: start with Filter::new() and chain as many conditions as you need. Call .filter() on a parsed tree to get back an Html containing every node that matched, or .find() to get only the first node that matches the conditions. Both consume the Html; to keep the original tree, use to_filtered or to_found instead, which clone only what is necessary. Here are a few examples:
Select by tag name
use html_filter::*;
let html = parse.unwrap;
let filter = new.tag_name;
let result = html.filter;
// All three <a> tags are collected into an Html::Vec.
assert_eq!;
// If you want to access the text, you can unwrap this html:
let link_text = result.as_vec.unwrap.iter.map.;
assert_eq!;
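To make the difference between filter and find concrete without depending on the crate, here is a minimal, self-contained sketch of a builder over a toy tree. Node, ToyFilter, tag, and text are hypothetical names invented for this illustration. They are not html_filter's types, and the real Filter may differ in its details (for example, in how nested matches are handled).

```rust
/// Hypothetical toy tree, NOT the crate's `Html` type.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Tag { name: String, children: Vec<Node> },
}

/// Illustrative constructors to keep examples short.
pub fn tag(name: &str, children: Vec<Node>) -> Node {
    Node::Tag { name: name.to_owned(), children }
}

pub fn text(content: &str) -> Node {
    Node::Text(content.to_owned())
}

/// Hypothetical builder: each method takes `self` and returns it, so calls chain.
#[derive(Default)]
pub struct ToyFilter {
    tag_names: Vec<String>,
}

impl ToyFilter {
    pub fn new() -> Self {
        Self::default()
    }

    /// Accept tags with this name; may be chained several times.
    pub fn tag_name(mut self, name: &str) -> Self {
        self.tag_names.push(name.to_owned());
        self
    }

    fn matches(&self, node: &Node) -> bool {
        matches!(node, Node::Tag { name, .. } if self.tag_names.iter().any(|n| n == name))
    }

    /// Collect every matching node, in document order.
    pub fn filter<'a>(&self, root: &'a Node) -> Vec<&'a Node> {
        let mut found = Vec::new();
        self.collect(root, &mut found);
        found
    }

    /// Return only the first matching node, if any.
    pub fn find<'a>(&self, root: &'a Node) -> Option<&'a Node> {
        self.filter(root).into_iter().next()
    }

    fn collect<'a>(&self, node: &'a Node, found: &mut Vec<&'a Node>) {
        if self.matches(node) {
            // A matched node is kept whole, so the sketch does not descend into it.
            found.push(node);
            return;
        }
        if let Node::Tag { children, .. } = node {
            for child in children {
                self.collect(child, found);
            }
        }
    }
}
```

With a toy document holding three a tags, ToyFilter::new().tag_name("a").filter(&doc) yields all three, while find yields only the first one in document order.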
Select by attribute value
use html_filter::*;
let html = parse.unwrap;
// Find only the submit button.
let filter = new.attribute_value;
let result = html.find;
if let Tag = result else
Select by CSS class (space-separated values)
use html_filter::*;
let html = parse.unwrap;
// Grab only the featured items.
let filter = new.attribute_value_contains;
let result = html.filter;
let items = result.as_vec.unwrap;
assert_eq!;
assert_eq!;
assert_eq!;
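The reason a class needs a "contains" match rather than an exact attribute match is that the class attribute holds a list of space-separated tokens. The sketch below shows token-wise matching using only the standard library. has_class is a hypothetical helper written for this illustration, and it is an assumption that attribute_value_contains compares whole tokens in this way.

```rust
/// Returns true when `class` appears as one whole token in a
/// whitespace-separated attribute value such as `item featured`.
/// Hypothetical helper for illustration, not part of html_filter.
pub fn has_class(attr_value: &str, class: &str) -> bool {
    attr_value.split_whitespace().any(|token| token == class)
}
```

Matching whole tokens means that "item featured" matches featured, while "item featured-old" does not, which a plain substring search would get wrong.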
Exclude tags or attributes, and trim whitespace
use html_filter::*;
let html = parse.unwrap;
let filter = new
.trim // removes whitespace after removing tags
.except_tag_name
.except_tag_name
.except_attribute_value_contains;
let result = html.filter;
assert_eq!;
let = result.as_tag.unwrap;
assert_eq!;
let = inner.as_tag.unwrap;
assert_eq!;
assert_eq!;
assert_eq!;
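As a rough mental model of exclusion combined with trimming, the sketch below removes excluded tags from a toy tree, trims the remaining text, and drops text nodes that end up empty. Node, tag, text, and prune are hypothetical names for this illustration only; the crate's actual trimming rules are not specified here and may differ.

```rust
/// Hypothetical toy tree, NOT the crate's `Html` type.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Tag { name: String, children: Vec<Node> },
}

pub fn tag(name: &str, children: Vec<Node>) -> Node {
    Node::Tag { name: name.to_owned(), children }
}

pub fn text(content: &str) -> Node {
    Node::Text(content.to_owned())
}

/// Rebuild the tree without the excluded tags. Text is trimmed, and
/// text nodes that are empty after trimming are dropped.
pub fn prune(node: &Node, excluded: &[&str]) -> Option<Node> {
    match node {
        Node::Text(content) => {
            let trimmed = content.trim();
            if trimmed.is_empty() {
                None
            } else {
                Some(Node::Text(trimmed.to_owned()))
            }
        }
        // An excluded tag disappears together with everything inside it.
        Node::Tag { name, .. } if excluded.contains(&name.as_str()) => None,
        Node::Tag { name, children } => Some(Node::Tag {
            name: name.clone(),
            children: children.iter().filter_map(|c| prune(c, excluded)).collect(),
        }),
    }
}
```

Trimming after exclusion matters: removing a tag often leaves only indentation and newlines between its former siblings, and those whitespace-only text nodes would otherwise remain in the result.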
The depth option: retrieve context around a match
By default, filter returns exactly the nodes that matched. Setting depth(n) tells the filter to also keep up to n levels of ancestors around each match. This is useful when you want to keep a tag based on its content rather than on the tag itself.
use html_filter::*;
let html = parse.unwrap;
// depth(0) — default: return only the matched <li>.
let filter = new.attribute_value.depth;
if let Vec = html.to_filtered
// depth(1): return the <ul> that contains the matched <li>.
let filter = new.attribute_value.depth;
if let Tag = html.to_filtered
// depth(2): return the <nav>.
let filter = new.attribute_value.depth;
if let Tag = html.filter
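The idea behind depth can be sketched independently of the crate: find the matching node, then step back up to n ancestors and return that ancestor instead. The example below does this on a toy tree for the first match only. Node, node, and find_with_depth are hypothetical names for this illustration, and the real filter, which handles every match and rebuilds an Html, is certainly more involved.

```rust
/// Hypothetical toy element, NOT the crate's `Html` type.
#[derive(Debug, PartialEq)]
pub struct Node {
    pub name: String,
    pub id: Option<String>,
    pub children: Vec<Node>,
}

pub fn node(name: &str, id: Option<&str>, children: Vec<Node>) -> Node {
    Node { name: name.to_owned(), id: id.map(str::to_owned), children }
}

/// Find the first node with the given id and return the ancestor `depth`
/// levels above it. If the match has fewer ancestors, return the root.
pub fn find_with_depth<'a>(root: &'a Node, id: &str, depth: usize) -> Option<&'a Node> {
    let mut path = Vec::new();
    if walk(root, id, &mut path) {
        // `path` holds every node from the root down to the match (inclusive).
        let index = (path.len() - 1).saturating_sub(depth);
        Some(path[index])
    } else {
        None
    }
}

/// Depth-first search that records the current root-to-node path.
fn walk<'a>(current: &'a Node, id: &str, path: &mut Vec<&'a Node>) -> bool {
    path.push(current);
    if current.id.as_deref() == Some(id) {
        return true;
    }
    for child in &current.children {
        if walk(child, id, path) {
            return true;
        }
    }
    path.pop();
    false
}
```

For a nav containing a ul containing the matched li, depth 0 returns the li, depth 1 the ul, and depth 2 the nav. Larger depths are clamped at the root in this sketch.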
Filtering node types
You can strip or keep comments, doctype declarations, and text nodes independently of tag filtering.
use html_filter::*;
let html = parse.unwrap;
let filter = new
.tag_name
.text // remove text even if in tag
.doctype // force keep doctype even if not in tag
;
assert_eq!;
Convenience methods for common cases:
use html_filter::*;
let html = parse.unwrap;
// Strip all `<!-- -->` comments.
let r = html.to_filtered;
assert_eq!;
// Strip all `<!…>` doctype nodes.
let r = html.to_filtered;
assert_eq!;
// Strip all bare text nodes.
let r = html.to_filtered;
assert_eq!;
// Keep everything except comments.
let r = html.to_filtered;
assert_eq!;
// Keep only text nodes (no comments or doctypes).
let r = html.to_filtered;
assert_eq!;
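Filtering by node type is orthogonal to filtering by tag, which the following self-contained sketch tries to show: three independent switches decide whether comments, doctypes, and text survive, while tags are always kept and processed recursively. Node, Keep, and retain are hypothetical names for this illustration and do not reflect the crate's internal design.

```rust
/// Hypothetical toy tree with one variant per node type.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Comment(String),
    Doctype(String),
    Text(String),
    Tag { name: String, children: Vec<Node> },
}

/// One independent switch per non-tag node type.
#[derive(Debug, Clone, Copy)]
pub struct Keep {
    pub comments: bool,
    pub doctypes: bool,
    pub text: bool,
}

/// Copy the nodes, dropping the node types that are switched off.
/// Tags are always kept and their children are filtered recursively.
pub fn retain(nodes: &[Node], keep: Keep) -> Vec<Node> {
    nodes
        .iter()
        .filter_map(|node| match node {
            Node::Comment(_) => keep.comments.then(|| node.clone()),
            Node::Doctype(_) => keep.doctypes.then(|| node.clone()),
            Node::Text(_) => keep.text.then(|| node.clone()),
            Node::Tag { name, children } => Some(Node::Tag {
                name: name.clone(),
                children: retain(children, keep),
            }),
        })
        .collect()
}
```

Setting only comments to false strips every comment at any nesting level and leaves doctypes and text untouched, which is the behaviour the "keep everything except comments" case describes.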
Text extraction
use html_filter::*;
let html = parse.unwrap;
let text = html.filter;
assert_eq!;
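Conceptually, text extraction is a depth-first walk that concatenates text nodes in document order. The sketch below shows that walk on a toy tree. Node and text_of are hypothetical names for this illustration, and it is an assumption that the crate joins text without inserting separators.

```rust
/// Hypothetical toy tree, NOT the crate's `Html` type.
pub enum Node {
    Text(String),
    Tag { name: String, children: Vec<Node> },
}

/// Concatenate all text below `node`, in document order.
pub fn text_of(node: &Node) -> String {
    let mut out = String::new();
    push_text(node, &mut out);
    out
}

/// Recursive helper that appends into one shared buffer to avoid
/// allocating a new String per node.
fn push_text(node: &Node, out: &mut String) {
    match node {
        Node::Text(content) => out.push_str(content),
        Node::Tag { children, .. } => {
            for child in children {
                push_text(child, out);
            }
        }
    }
}
```

For a p tag holding "Hello, ", a b tag with "world", and "!", text_of returns "Hello, world!".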
Inspecting tags and attributes
Once you have an Html::Tag, you can inspect its Tag and Attributes directly.
use html_filter::*;
let html = parse.unwrap;
let = html.as_tag.unwrap;
// Tag name
assert_eq!;
// Tag attributes
assert_eq!;
assert_eq!;
// Tag content
assert_eq!;
into_attr_value consumes the tag and returns the value as an owned String, useful when you want to move the string out without cloning:
use html_filter::*;
if let Tag = parse.unwrap
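The benefit of a consuming accessor can be shown with a small stand-alone type: a borrowing lookup hands back a reference, whereas a method that takes self can move the String out of the attribute list with no clone. ToyTag and its fields are hypothetical and only mirror the name into_attr_value from the text above. The crate's real Tag layout and signatures may differ.

```rust
/// Hypothetical tag with a flat attribute list, NOT the crate's `Tag`.
#[derive(Debug, Clone)]
pub struct ToyTag {
    pub name: String,
    /// Attribute name and optional value (for example `hidden` has none).
    pub attrs: Vec<(String, Option<String>)>,
}

impl ToyTag {
    /// Borrowing lookup: the tag stays usable, the caller gets a reference.
    pub fn find_attr_value(&self, name: &str) -> Option<&str> {
        self.attrs
            .iter()
            .find(|(key, _)| key.as_str() == name)
            .and_then(|(_, value)| value.as_deref())
    }

    /// Consuming lookup: takes `self` by value, so the matching String
    /// is moved out of the attribute list instead of being cloned.
    pub fn into_attr_value(self, name: &str) -> Option<String> {
        self.attrs
            .into_iter()
            .find(|(key, _)| key.as_str() == name)
            .and_then(|(_, value)| value)
    }
}
```

After calling into_attr_value the tag is gone, so this form fits the common case where the tag was only needed to read one attribute, such as an href.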