Crate htmlite

§Quickstart

This introduction is heavily based on that of Python’s Beautiful Soup.

We’ll be using the following HTML fragment as an example throughout:

let html = r#"
<head>
 <title>The Dormouse's story</title>
</head>
<body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="intro">
  Once upon a time there were three little sisters; and their names were
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie </a>,
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie </a>and
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
 </p>
 <p class="story">...</p>
</body>
"#;

The HTML will be parsed into an arena allocator, which we create first. The lifetimes of all parsed nodes and elements are tied to this object: once it goes out of scope, everything is cleaned up.

use htmlite::{NodeArena, Html};

let arena = NodeArena::new();
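
To illustrate what "lifetimes tied to the arena" means, here is a minimal, std-only sketch. `TinyArena`, `Handle`, and `demo` are hypothetical names invented for this illustration; this is not how `NodeArena` is implemented, only a model of the borrowing relationship.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for an arena; not htmlite's `NodeArena`.
struct TinyArena {
    names: RefCell<Vec<String>>,
}

/// A handle that borrows the arena, so it cannot outlive it.
#[derive(Clone, Copy)]
struct Handle<'a> {
    arena: &'a TinyArena,
    index: usize,
}

impl TinyArena {
    fn new() -> Self {
        TinyArena { names: RefCell::new(Vec::new()) }
    }

    /// Allocation takes `&self`, so many handles can coexist.
    fn alloc(&self, name: &str) -> Handle<'_> {
        let mut names = self.names.borrow_mut();
        names.push(name.to_string());
        Handle { arena: self, index: names.len() - 1 }
    }

    /// Number of values stored so far.
    fn len(&self) -> usize {
        self.names.borrow().len()
    }
}

impl<'a> Handle<'a> {
    /// Looks the value up in the arena this handle borrows from.
    fn name(&self) -> String {
        self.arena.names.borrow()[self.index].clone()
    }
}

fn demo() -> (String, String, usize) {
    let arena = TinyArena::new();
    let head = arena.alloc("head");
    let body = arena.alloc("body");
    // Only owned values leave the function; the handles cannot.
    (head.name(), body.name(), arena.len())
    // `arena` is dropped here, releasing every stored value at once.
}
```

Returning `head` or `body` from `demo` would be rejected by the borrow checker, because both borrow `arena`. The same reasoning explains why nodes parsed into a `NodeArena` must not outlive it.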

Next, we’ll parse the HTML string. You’ll get back a fragment node which contains the top-level <head> and <body> elements. It also contains three whitespace text nodes: one before <head>, one between <head> and <body>, and one after <body>.

let Html { root, .. } = htmlite::parse(&arena, html).unwrap();
let child_elements = root
    .children()
    .filter_map(htmlite::Node::as_element)
    .collect::<Vec<_>>();
let child_text = root
    .children()
    .filter_map(htmlite::Node::as_text)
    .collect::<Vec<_>>();
assert_eq!(child_elements.len(), 2);
assert_eq!(child_text.len(), 3);
assert_eq!(child_elements[0].name, "head");
assert_eq!(child_elements[1].name, "body");

Here are some ways to navigate around:

let title = root.select("title").next().unwrap();
assert_eq!(title.outer_html(), "<title>The Dormouse's story</title>");
assert_eq!(title.name, "title");
assert_eq!(title.text_content().collect::<String>(), "The Dormouse's story");
assert_eq!(title.parent().and_then(htmlite::Node::as_element).unwrap().name, "head");

let first_p = root.select("p").next().unwrap();
assert_eq!(first_p.outer_html(), r#"<p class="title"><b>The Dormouse's story</b></p>"#);
assert_eq!(&first_p["class"], "title");

One common task is extracting all the URLs found within a page’s <a> tags:

let mut links = Vec::new();
for anchor in root.select("a[href]") {
    links.push(&anchor["href"]);
}
assert_eq!(
    links,
    vec![
        "http://example.com/elsie",
        "http://example.com/lacie",
        "http://example.com/tillie",
    ]
);

Individual nodes within the tree are immutable, but you can create new nodes and detach others, allowing you to manipulate the tree structure.

let new_story = arena.fragment([
    arena.p([("class", "story")], arena.text("`What did they live on?' said Alice, who always took a great interest in questions of eating and drinking.")),
    arena.footer(None, arena.text("The end"))
]);

let old_story = root.select(".story").next().unwrap();

old_story.insert_after(new_story);
old_story.detach();

assert_eq!(
    root.text_content().collect::<Vec<_>>(),
    [
        "\n",
        "\n ",
        "The Dormouse's story",
        "\n",
        "\n",
        "\n ",
        "The Dormouse's story",
        "\n ",
        "\n  Once upon a time there were three little sisters; and their names were\n  ",
        "Elsie ",
        ",\n  ",
        "Lacie ",
        "and\n  ",
        "Tillie",
        "; and they lived at the bottom of a well.\n ",
        "\n ",
        "`What did they live on?' said Alice, who always took a great interest in questions of eating and drinking.",
        "The end",
        "\n",
        "\n",
    ],
);
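
How can a tree be restructured through shared references when its nodes are immutable? One common approach is interior mutability: the node's content stays fixed while its links live in `Cell`s. The std-only sketch below models a sibling list this way. The `Node` type, its fields, and the `texts` and `demo_relink` helpers are hypothetical and written for this illustration; htmlite's actual internals may differ.

```rust
use std::cell::Cell;

/// Hypothetical node: the text is immutable, the sibling links use `Cell`.
struct Node<'a> {
    text: &'static str,
    next: Cell<Option<&'a Node<'a>>>,
    prev: Cell<Option<&'a Node<'a>>>,
}

impl<'a> Node<'a> {
    fn new(text: &'static str) -> Self {
        Node { text, next: Cell::new(None), prev: Cell::new(None) }
    }

    /// Links `new` directly after `self` using only shared references.
    fn insert_after(&'a self, new: &'a Node<'a>) {
        let old_next = self.next.get();
        new.prev.set(Some(self));
        new.next.set(old_next);
        if let Some(n) = old_next {
            n.prev.set(Some(new));
        }
        self.next.set(Some(new));
    }

    /// Unlinks `self` and joins its former neighbours together.
    fn detach(&self) {
        let (prev, next) = (self.prev.take(), self.next.take());
        if let Some(p) = prev {
            p.next.set(next);
        }
        if let Some(n) = next {
            n.prev.set(prev);
        }
    }
}

/// Collects the text of `first` and every sibling after it.
fn texts<'a>(first: &'a Node<'a>) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut cur = Some(first);
    while let Some(n) = cur {
        out.push(n.text);
        cur = n.next.get();
    }
    out
}

fn demo_relink() -> Vec<&'static str> {
    let intro = Node::new("intro");
    let old = Node::new("old");
    let new = Node::new("new");
    intro.insert_after(&old);
    // Same order as the example above: insert the replacement, then detach.
    old.insert_after(&new);
    old.detach();
    texts(&intro)
}
```

In this sketch `demo_relink()` yields `["intro", "new"]`: the old node is gone from the chain even though no `&mut` reference was ever taken.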

Modules§

tokenizer

Structs§

Ancestors
See Node::ancestors
Comment
A comment node
Descendants
See Node::descendants
Doctype
A doctype node
Element
An HTML element like <p> or <div>.
Following
See Node::following
Html
An HTML tree.
Node
Represents a node in an HTML document.
NodeArena
Storage for parsed HTML nodes.
Preceding
See Node::preceding
Selected
See Node::select
Text
A text node
TreeConstructionError
An error that occurs while constructing an HTML tree.

Enums§

NodeKind
Types of nodes that might exist in an HTML document.

Functions§

parse
Parses the given HTML fragment.