Crate htmlite

§Quickstart

This introduction is heavily based on that of Python’s Beautiful Soup.

We’ll be using the following HTML fragment as an example throughout:

let html = r#"
<head>
 <title>The Dormouse's story</title>
</head>
<body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="intro">
  Once upon a time there were three little sisters; and their names were
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie </a>,
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie </a>and
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
 </p>
 <p class="story">...</p>
</body>
"#;

The HTML will be parsed into an arena allocator, which we will need to create next. The lifetimes of all the parsed nodes & elements will be tied to this object. Once it goes out of scope, everything is cleaned up.

let arena = htmlite::NodeArena::new();
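
To illustrate why node lifetimes end up tied to the arena, here is a minimal sketch of the general arena pattern using only the standard library. It is not htmlite's implementation, which is not shown on this page; `ToyArena`, `ToyNode`, and `parent_name` are hypothetical names invented for this illustration, and the sketch hands out indices rather than node references.

```rust
/// A node in the toy tree: a tag name plus the index of its parent, if any.
/// (Hypothetical type, not part of htmlite.)
struct ToyNode {
    name: String,
    parent: Option<usize>,
}

/// Hypothetical arena: it owns every node, and callers hold plain indices.
struct ToyArena {
    nodes: Vec<ToyNode>,
}

impl ToyArena {
    fn new() -> Self {
        ToyArena { nodes: Vec::new() }
    }

    /// Stores a node in the arena and returns its index.
    fn alloc(&mut self, name: &str, parent: Option<usize>) -> usize {
        self.nodes.push(ToyNode { name: name.to_string(), parent });
        self.nodes.len() - 1
    }

    /// The returned reference borrows from the arena, so the compiler
    /// rejects any attempt to keep it alive after the arena is dropped.
    fn get(&self, id: usize) -> &ToyNode {
        &self.nodes[id]
    }
}

/// Looks up the name of a node's parent, if it has one.
fn parent_name(arena: &ToyArena, id: usize) -> Option<&str> {
    arena.get(id).parent.map(|p| arena.get(p).name.as_str())
}
```

With this sketch, allocating a `head` node and then a `title` node whose parent is `head` lets `parent_name` return `Some("head")` for the title. Because every node lives inside the arena's `Vec`, dropping the arena frees all of them at once, which is the same ownership idea the paragraph above describes.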

Next, we’ll parse the HTML string. You’ll get back a fragment node that contains the top-level <head> and <body> elements. It also contains three whitespace-only text nodes: one before <head>, one between <head> and <body>, and one after <body>.

let fragment = htmlite::parse(&arena, html).unwrap();
let child_elements = fragment
    .children()
    .filter_map(htmlite::Node::as_element)
    .collect::<Vec<_>>();
let child_text = fragment
    .children()
    .filter_map(htmlite::Node::as_text)
    .collect::<Vec<_>>();
assert_eq!(child_elements.len(), 2);
assert_eq!(child_text.len(), 3);
assert_eq!(child_elements[0].name, "head");
assert_eq!(child_elements[1].name, "body");

Here are some ways to navigate around:

let title = fragment.select("title").next().unwrap();
assert_eq!(title.outer_html(), "<title>The Dormouse&#39;s story</title>");
assert_eq!(title.name, "title");
assert_eq!(title.text_content().collect::<String>(), "The Dormouse's story");
assert_eq!(title.parent().and_then(htmlite::Node::as_element).unwrap().name, "head");

let first_p = fragment.select("p").next().unwrap();
assert_eq!(first_p.outer_html(), r#"<p class="title"><b>The Dormouse&#39;s story</b></p>"#);
assert_eq!(first_p["class"].value(), "title");

One common task is extracting all the URLs found within a page’s <a> tags:

let mut links = Vec::new();
for anchor in fragment.select("a[href]") {
    links.push(anchor["href"].value());
}
assert_eq!(
    links,
    vec![
        "http://example.com/elsie",
        "http://example.com/lacie",
        "http://example.com/tillie",
    ]
);

Individual nodes within the tree are immutable, but you can create new nodes and detach others, allowing you to manipulate the tree structure.

let new_story = arena.fragment([
    arena.p(htmlite::Attr::Class("story"), arena.text("`What did they live on?' said Alice, who always took a great interest in questions of eating and drinking.")),
    arena.footer(None, arena.text("The end"))
]);

let old_story = fragment.select(".story").next().unwrap();

old_story.insert_after(new_story);
old_story.detach();

assert_eq!(
    fragment.text_content().collect::<Vec<_>>(),
    [
        "\n",
        "\n ",
        "The Dormouse's story",
        "\n",
        "\n",
        "\n ",
        "The Dormouse's story",
        "\n ",
        "\n  Once upon a time there were three little sisters; and their names were\n  ",
        "Elsie ",
        ",\n  ",
        "Lacie ",
        "and\n  ",
        "Tillie",
        "; and they lived at the bottom of a well.\n ",
        "\n ",
        "`What did they live on?' said Alice, who always took a great interest in questions of eating and drinking.",
        "The end",
        "\n",
        "\n",
    ],
);

Structs§

Ancestors
See Node::ancestors
Comment
A comment node
Descendants
See Node::descendants
Doctype
A doctype node
Element
An HTML element like <p> or <div>.
Err
An error that occurs while tokenizing or parsing a chunk of HTML
Following
See Node::following
Node
Represents a node in an HTML document.
NodeArena
Storage for parsed HTML nodes.
Preceding
See Node::preceding
Selected
See Node::select
Text
A text node

Enums§

Attr
An HTML attribute
NodeKind
Types of nodes that might exist in an HTML document.

Functions§

parse
Parses the given HTML fragment and returns a fragment containing the parsed nodes.