Crate htmlite

Parse HTML tag chunks

§Overview

The primary type in this crate is a Node. You can create one by parsing some HTML. Nodes can have at most one parent node, zero or more sibling nodes and zero or more child nodes.

There are 5 types of nodes, and methods to convert between a node and its underlying kind. Element is probably the most common; it represents an HTML tag. All node kinds implement Deref<Target = Node>, and so inherit all the Node methods.
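
To illustrate how a Deref<Target = Node> implementation lets a node kind reuse the base type's methods, here is a minimal standalone sketch. The Node and Element types below are hypothetical stand-ins written only with the standard library; they are not htmlite's actual definitions, and the depth field and is_root method are assumptions made for the example.

```rust
use std::ops::Deref;

// Hypothetical stand-in for a base node type (not htmlite's real Node).
#[derive(Debug)]
pub struct Node {
    pub depth: usize,
}

impl Node {
    /// A method defined once, on the base type only.
    pub fn is_root(&self) -> bool {
        self.depth == 0
    }
}

// Hypothetical "kind" wrapper that adds element-specific data to a Node.
#[derive(Debug)]
pub struct Element {
    pub name: String,
    node: Node,
}

impl Element {
    pub fn new(name: &str, depth: usize) -> Self {
        Element {
            name: name.to_string(),
            node: Node { depth },
        }
    }
}

// With this impl, `element.is_root()` and `element.depth` resolve through
// auto-deref to the inner Node, so Element needs no forwarding methods.
impl Deref for Element {
    type Target = Node;

    fn deref(&self) -> &Node {
        &self.node
    }
}
```

With these definitions, calling is_root() on an Element compiles even though Element never defines that method; the compiler follows the Deref impl to the inner Node.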

§Quickstart

This introduction is heavily based on that of Python’s Beautiful Soup.

We’ll be using the following HTML fragment as an example throughout:

let html = r#"
<head>
 <title>The Dormouse's story</title>
</head>
<body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">
  Once upon a time there were three little sisters; and their names were
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie </a>,
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie </a>and
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
 </p>
 <p class="story">...</p>
</body>
"#;

The HTML is parsed into an arena allocator, which we need to create first. The lifetimes of all parsed nodes and elements are tied to this object; once it goes out of scope, everything is cleaned up.

let arena = htmlite::NodeArena::new();
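
For readers unfamiliar with arenas, the following is a rough standalone sketch of the general idea: one object owns every node, and dropping it frees the whole tree at once. ToyArena, NodeData, and their methods are hypothetical names invented for this illustration. Unlike htmlite, which ties node lifetimes to the arena, this simplified version refers to nodes by index.

```rust
// Hypothetical node record; htmlite's real Node has a different layout.
pub struct NodeData {
    pub label: String,
    pub parent: Option<usize>,
    pub children: Vec<usize>,
}

// A toy arena: it owns every node, and callers refer to nodes by index.
// Dropping the arena drops all nodes together.
#[derive(Default)]
pub struct ToyArena {
    nodes: Vec<NodeData>,
}

impl ToyArena {
    pub fn new() -> Self {
        Self::default()
    }

    /// Stores a node, links it under `parent` if one is given, and returns
    /// the new node's index. Panics if `parent` is not a valid index.
    pub fn add(&mut self, label: &str, parent: Option<usize>) -> usize {
        let id = self.nodes.len();
        self.nodes.push(NodeData {
            label: label.to_string(),
            parent,
            children: Vec::new(),
        });
        if let Some(p) = parent {
            self.nodes[p].children.push(id);
        }
        id
    }

    /// Borrows a node; the reference cannot outlive the arena.
    pub fn get(&self, id: usize) -> &NodeData {
        &self.nodes[id]
    }

    /// Siblings are the parent's other children, in insertion order.
    pub fn siblings(&self, id: usize) -> Vec<usize> {
        match self.nodes[id].parent {
            Some(p) => self.nodes[p]
                .children
                .iter()
                .copied()
                .filter(|&c| c != id)
                .collect(),
            None => Vec::new(),
        }
    }
}
```

Each node here has at most one parent and any number of children and siblings, matching the shape described in the overview.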

Now we can parse the HTML. You’ll get back the root node, which contains all the top-level nodes in the fragment. Here, the root node contains the <head> and <body> elements. It also contains 3 whitespace text nodes: before <head>, between <head> and <body>, and after <body>.

let root = arena.parse(html).unwrap();
let child_elements = root
    .children()
    .filter_map(htmlite::Node::as_element)
    .collect::<Vec<_>>();
let child_text = root
    .children()
    .filter_map(htmlite::Node::as_text)
    .collect::<Vec<_>>();
assert_eq!(child_elements.len(), 2);
assert_eq!(child_text.len(), 3);
assert_eq!(child_elements[0].name, "head");
assert_eq!(child_elements[1].name, "body");

Here are some ways to navigate the data structure:

let title = root.select("title").next().unwrap();
assert_eq!(title.html(), "<title>The Dormouse&#39;s story</title>");
assert_eq!(title.name, "title");
assert_eq!(title.text_content().collect::<String>(), "The Dormouse's story");
assert_eq!(title.parent().and_then(htmlite::Node::as_element).unwrap().name, "head");

let first_p = root.select("p").next().unwrap();
assert_eq!(first_p.html(), r#"<p class="title"><b>The Dormouse&#39;s story</b></p>"#);
assert_eq!(first_p["class"].value(), "title");

One common task is extracting all the URLs found within a page’s <a> tags:

let mut links = Vec::new();
for anchor in root.select("a[href]") {
    links.push(anchor["href"].value());
}
assert_eq!(
    links,
    vec![
        "http://example.com/elsie",
        "http://example.com/lacie",
        "http://example.com/tillie",
    ]
);

Another common task is extracting all the text from a page:

assert_eq!(
    root.text_content().collect::<Vec<_>>(),
    [
        "\n",
        "\n ",
        "The Dormouse's story",
        "\n",
        "\n",
        "\n ",
        "The Dormouse's story",
        "\n ",
        "\n  Once upon a time there were three little sisters; and their names were\n  ",
        "Elsie ",
        ",\n  ",
        "Lacie ",
        "and\n  ",
        "Tillie",
        "; and they lived at the bottom of a well.\n ",
        "\n ",
        "...",
        "\n",
        "\n",
    ],
);
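
As the output above shows, the extracted chunks keep all of the source whitespace. If you want readable prose instead, one possible post-processing step is sketched below. The collapse_whitespace helper is hypothetical and not part of htmlite; it uses only the standard library and assumes the chunks can be viewed as string slices via AsRef<str>.

```rust
/// Joins text chunks and collapses every run of whitespace into one space.
/// Hypothetical helper for illustration; not part of htmlite.
pub fn collapse_whitespace<I>(chunks: I) -> String
where
    I: IntoIterator,
    I::Item: AsRef<str>,
{
    // Concatenate first, so that text split across chunk boundaries
    // (for example "Tillie" followed by "; and") stays attached.
    let mut joined = String::new();
    for chunk in chunks {
        joined.push_str(chunk.as_ref());
    }
    // split_whitespace drops leading, trailing, and repeated whitespace.
    joined.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Joining before splitting matters: collapsing each chunk separately and then joining with spaces would insert a space between "Tillie" and the semicolon that follows it.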

Structs§

Ancestors
See Node::ancestors
Comment
A comment node
Descendants
See Node::descendants
Doctype
A doctype node
Element
An HTML element like <p> or <div>.
Err
An error that occurs while tokenizing or parsing a chunk of HTML
Following
See Node::following
Node
Represents a node in an HTML document.
NodeArena
Storage for parsed HTML nodes.
Preceding
See Node::preceding
Selected
See Node::select
Text
A text node

Enums§

Attribute
An HTML attribute
NodeKind
Types of nodes that might exist in an HTML document.