Parse HTML tag chunks
§Overview
The primary type in this crate is a Node.
You can create one by parsing some HTML.
Nodes can have at most one parent node, zero or more sibling nodes and zero or more child nodes.
There are 5 kinds of nodes, and methods to convert between a node and its underlying kind.
Element is probably the most common; it represents an HTML tag.
All node kinds implement Deref<Target = Node>, and so inherit all the Node methods.
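To make the Deref relationship concrete, here is a minimal standard-library sketch of the pattern. The Node and Element types below are hypothetical stand-ins written for this illustration (including the depth field and is_root method); htmlite's real definitions are not shown in this page and may be organised differently.

```rust
use std::ops::Deref;

// Hypothetical stand-in for the shared node type (not htmlite's real `Node`).
struct Node {
    depth: usize,
}

impl Node {
    // A method defined once, on the shared node type.
    fn is_root(&self) -> bool {
        self.depth == 0
    }
}

// Hypothetical stand-in for one node kind.
struct Element {
    node: Node,
    name: String,
}

impl Deref for Element {
    type Target = Node;

    fn deref(&self) -> &Node {
        &self.node
    }
}

fn main() {
    let el = Element {
        node: Node { depth: 1 },
        name: "p".to_string(),
    };
    // `is_root` is not defined on `Element`; auto-deref finds it on `Node`.
    assert!(!el.is_root());
    // Fields specific to the kind are still accessed directly.
    assert_eq!(el.name, "p");
}
```

Because method lookup follows Deref automatically, anything callable on the shared type is callable on each kind, which is why the examples in the Quickstart can call Node methods on an Element.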
§Quickstart
This introduction is heavily based on that of Python’s Beautiful Soup.
We’ll be using the following HTML fragment as an example throughout:
let html = r#"
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie </a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie </a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
"#;
The HTML will be parsed into an arena allocator, which we will need to create next. The lifetimes of all the parsed nodes & elements will be tied to this object. Once it goes out of scope, everything is cleaned up.
let arena = htmlite::NodeArena::new();
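If the arena idea is unfamiliar, the following toy model may help. It is only a sketch using the standard library: ToyArena, ToyData and ToyNode are hypothetical names invented here, and htmlite's NodeArena is almost certainly implemented differently. The point it illustrates is that node handles borrow the arena, so they cannot outlive it, and that all nodes are freed together when the arena is dropped.

```rust
// Toy model only: these types are hypothetical and are not htmlite's API.
struct ToyArena {
    // Every node lives in this vector; dropping the arena frees them all at once.
    nodes: Vec<ToyData>,
}

struct ToyData {
    label: &'static str,
    parent: Option<usize>,
    children: Vec<usize>,
}

// A cheap handle that borrows the arena, so it cannot outlive it.
#[derive(Clone, Copy)]
struct ToyNode<'a> {
    arena: &'a ToyArena,
    index: usize,
}

impl ToyArena {
    // Builds a root with one child per label (a stand-in for real parsing).
    fn with_children(labels: &[&'static str]) -> ToyArena {
        let mut nodes = vec![ToyData { label: "root", parent: None, children: Vec::new() }];
        for &label in labels {
            let index = nodes.len();
            nodes.push(ToyData { label, parent: Some(0), children: Vec::new() });
            nodes[0].children.push(index);
        }
        ToyArena { nodes }
    }

    fn root(&self) -> ToyNode<'_> {
        ToyNode { arena: self, index: 0 }
    }
}

impl<'a> ToyNode<'a> {
    fn label(self) -> &'static str {
        self.arena.nodes[self.index].label
    }

    // At most one parent per node.
    fn parent(self) -> Option<ToyNode<'a>> {
        let arena = self.arena;
        arena.nodes[self.index].parent.map(|index| ToyNode { arena, index })
    }

    // Zero or more children per node.
    fn children(self) -> impl Iterator<Item = ToyNode<'a>> {
        let arena = self.arena;
        arena.nodes[self.index]
            .children
            .iter()
            .map(move |&index| ToyNode { arena, index })
    }
}

fn main() {
    let arena = ToyArena::with_children(&["head", "body"]);
    let root = arena.root();
    let labels: Vec<_> = root.children().map(|n| n.label()).collect();
    assert_eq!(labels, ["head", "body"]);
    let body = root.children().last().unwrap();
    assert_eq!(body.parent().unwrap().label(), "root");
    // `root` and `body` borrow `arena`; dropping or moving `arena` while they are
    // still in use would be rejected by the borrow checker.
}
```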
Now we can parse the HTML. You’ll get back the root node, which contains all the top-level nodes in the fragment.
The root node contains the <head> and <body> elements.
It also contains 3 whitespace text nodes: before <head>, between <head> and <body>, and after <body>.
let root = arena.parse(html).unwrap();
let child_elements = root
.children()
.filter_map(htmlite::Node::as_element)
.collect::<Vec<_>>();
let child_text = root
.children()
.filter_map(htmlite::Node::as_text)
.collect::<Vec<_>>();
assert_eq!(child_elements.len(), 2);
assert_eq!(child_text.len(), 3);
assert_eq!(child_elements[0].name, "head");
assert_eq!(child_elements[1].name, "body");
Here are some ways to navigate the data structure:
let title = root.select("title").next().unwrap();
assert_eq!(title.html(), "<title>The Dormouse's story</title>");
assert_eq!(title.name, "title");
assert_eq!(title.text_content().collect::<String>(), "The Dormouse's story");
assert_eq!(title.parent().and_then(htmlite::Node::as_element).unwrap().name, "head");
let first_p = root.select("p").next().unwrap();
assert_eq!(first_p.html(), r#"<p class="title"><b>The Dormouse's story</b></p>"#);
assert_eq!(first_p["class"].value(), "title");
One common task is extracting all the URLs found within a page’s <a> tags:
let mut links = Vec::new();
for anchor in root.select("a[href]") {
links.push(anchor["href"].value());
}
assert_eq!(
links,
vec![
"http://example.com/elsie",
"http://example.com/lacie",
"http://example.com/tillie",
]
);
Another common task is extracting all the text from a page:
assert_eq!(
root.text_content().collect::<Vec<_>>(),
[
"\n",
"\n ",
"The Dormouse's story",
"\n",
"\n",
"\n ",
"The Dormouse's story",
"\n ",
"\n Once upon a time there were three little sisters; and their names were\n ",
"Elsie ",
",\n ",
"Lacie ",
"and\n ",
"Tillie",
"; and they lived at the bottom of a well.\n ",
"\n ",
"...",
"\n",
"\n",
],
);
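As the output above shows, the raw text chunks carry all the whitespace of the source markup. If you want a single readable string, one option is to join the chunks and collapse whitespace yourself. The helper below is a sketch using only the standard library; normalized_text is a hypothetical name, not part of htmlite, and it assumes the items yielded by text_content() can be viewed as &str (hence the AsRef<str> bound).

```rust
/// Joins text chunks and collapses every run of whitespace into a single space.
/// Hypothetical helper for illustration; not part of htmlite.
fn normalized_text<I, S>(chunks: I) -> String
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    // Concatenate first so that text split across chunk boundaries stays intact.
    let mut joined = String::new();
    for chunk in chunks {
        joined.push_str(chunk.as_ref());
    }
    // `split_whitespace` drops leading, trailing and repeated whitespace.
    joined.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    // A few chunks shaped like the ones in the listing above.
    let chunks = [
        "\n ",
        "Tillie",
        "; and they lived at the bottom of a well.\n ",
        "\n ",
        "...",
        "\n",
    ];
    assert_eq!(
        normalized_text(chunks),
        "Tillie; and they lived at the bottom of a well. ..."
    );
}
```

Under the assumption stated above, a call such as normalized_text(root.text_content()) should work the same way. Note that collapsing whitespace is lossy, so it is best kept for display or search rather than for round-tripping the document.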
§Structs
- Ancestors: See Node::ancestors
- Comment: A comment node
- Descendants: See Node::descendants
- Doctype: A doctype node
- Element: An HTML element like <p> or <div>.
- Err: An error that occurs while tokenizing or parsing a chunk of html
- Following: See Node::following
- Node: Represents a node in an HTML document.
- NodeArena: Storage for parsed HTML nodes.
- Preceding: See Node::preceding
- Selected: See Node::select
- Text: A text node