htmlite 0.12.0

An HTML manipulation toolkit

htmlite is a lightweight HTML toolkit for parsing, manipulating, and generating HTML.

Examples

Parsing a fragment of html

use htmlite::NodeArena;
let arena = NodeArena::new();
htmlite::parse(&arena, "<h1>Hello, <i>world!</i></h1>").unwrap();

Selecting elements

use htmlite::{NodeArena, Node};
let html = r#"
    <ul>
        <li>Foo</li>
        <li>Bar</li>
        <li>Baz</li>
    </ul>
"#;

let arena = NodeArena::new();
let root = htmlite::parse(&arena, html).unwrap();

for element in root.descendants().select("li") {
    assert_eq!(&*element.name(), "li");
}

Accessing element attributes

use htmlite::{NodeArena, Node};
let arena = NodeArena::new();
let root = htmlite::parse(&arena, r#"<input name="foo" value="bar" readonly>"#).unwrap();

let element = root.descendants().select(r#"input[name="foo"]"#).next().unwrap();
assert_eq!(element.attr("value").as_deref(), Some("bar"));
assert_eq!(element.attr("readonly").as_deref(), Some(""));

Serializing HTML and inner HTML

use htmlite::NodeArena;
let arena = NodeArena::new();
let root = htmlite::parse(&arena, "<h1>Hello, <i>world!</i></h1>").unwrap();
let h1 = root.descendants().select("h1").next().unwrap();
assert_eq!(h1.html(), "<h1>Hello, <i>world!</i></h1>");
assert_eq!(h1.inner_html(), "Hello, <i>world!</i>");

Manipulating the DOM

use htmlite::NodeArena;
let html = "<html><body>hello<p class=\"hello\">REMOVE ME</p></body></html>";
let arena = NodeArena::new();
let root = htmlite::parse(&arena, html).unwrap();
for el in root.descendants().select(".hello") {
    el.detach();
}
assert_eq!(root.html(), "<html><body>hello</body></html>");

Generating HTML

let h = htmlite::NodeArena::new();
let form = h.form(
    [("method", "POST")],
    [
        h.input([("value", "hello"), ("type", "text")], None),
        h.button(None, h.text("Submit"))
    ]
);
assert_eq!(form.html(), r#"<form method="POST"><input value="hello" type="text"><button>Submit</button></form>"#);

When should you use this?

This is not a "browser-grade" HTML parser, but it is close!

Specifically, the tokenizer is spec compliant and passes all the html5lib tokenizer tests. So htmlite will accept any valid HTML "construct" like numeric & named character references and void elements.

However, the tree-builder does not follow the spec. This was done on purpose. A spec-compliant tree-builder may restructure your markup for a multitude of reasons: badly nested tags, child elements that don't conform to the content model of their parent, missing end tags, and so on. The tree-builder in this library takes a simpler approach: it will parse any well-balanced HTML and output a tree that corresponds to that markup, exactly as written.

So this library will work well when you are parsing the output of HTML-generating tools like static site generators (SSGs) or Markdown parsers. Tools like these don't forget to add end tags :)

On the other hand, parsing random web content is more of a gamble. For example, many sites rely on the fact that you do not need to close your <p> tags. This library will fail on such markup.
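To make "well-balanced" concrete, here is a toy checker written for this README. It is not htmlite's tree-builder and does not use its API; `check_balanced` and `VOID_ELEMENTS` are hypothetical names. The sketch assumes no comments containing `>`, no `>` inside attribute values, and no raw-text elements such as `<script>`.

```rust
/// The HTML void elements, which never take an end tag.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
];

/// Returns `Ok(())` when every start tag is closed by a matching end tag
/// in the right order. Illustrative only: a plain string scan, not a tokenizer.
pub fn check_balanced(html: &str) -> Result<(), String> {
    let mut open: Vec<String> = Vec::new(); // stack of currently open elements
    let mut rest = html;

    while let Some(start) = rest.find('<') {
        let after = &rest[start + 1..];
        let end = after
            .find('>')
            .ok_or_else(|| "unterminated tag".to_string())?;
        let tag = &after[..end];
        rest = &after[end + 1..];

        if tag.starts_with('!') {
            // Doctype or simple comment: not an element, so skip it.
            continue;
        }

        if let Some(name) = tag.strip_prefix('/') {
            // End tag: it must match the most recently opened element.
            let name = name.trim().to_ascii_lowercase();
            match open.pop() {
                Some(top) => {
                    if top != name {
                        return Err(format!("expected </{}>, found </{}>", top, name));
                    }
                }
                None => return Err(format!("unexpected </{}>", name)),
            }
        } else {
            // Start tag: the name runs up to the first whitespace or '/'.
            let name = tag
                .split(|c: char| c.is_ascii_whitespace() || c == '/')
                .next()
                .unwrap_or("")
                .to_ascii_lowercase();
            if !VOID_ELEMENTS.contains(&name.as_str()) {
                open.push(name);
            }
        }
    }

    // Anything still on the stack is an element that was never closed.
    match open.pop() {
        Some(name) => Err(format!("missing </{}>", name)),
        None => Ok(()),
    }
}
```

Under these assumptions, `check_balanced("<h1>Hello, <i>world!</i></h1>")` returns `Ok(())`, while `check_balanced("<p>one<p>two")` returns an error because neither `<p>` is closed. That second input is the kind of markup a spec-compliant tree-builder would silently repair.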

TL;DR: If your HTML looks like well-formed XML if you squint, this library's HTML parser is for you.

Adjacent crates

scraper: An inspiration for this crate. Uses html5ever. You get browser-grade HTML parsing with a browser-grade dependency tree.

kuchiki: As far as I understand, this was the predecessor to scraper. It also uses html5ever, so the same trade-off applies.

tl: A bit too lenient, while also failing on valid HTML. Additionally, it does some weird error recovery that I did not want.

html5gum: Only tokenizes. I could have used this instead of writing my own tokenizer ... but where is the fun in that?

lol-html: Very odd API. A bit too dependency-heavy for my liking. Different use case.

Thank you

This crate would not be possible without SimonSapin's rust-forest experiment. The combination of using an Arena allocator and Cell-wrapped references is at the root of why this API is as ergonomic as it is. Brilliant design. Thank you for your work!
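For readers unfamiliar with the pattern, here is a rough sketch of what Cell-wrapped references look like in general. It is not htmlite's or rust-forest's actual code: `TreeNode` and its methods are hypothetical names, and plain local variables stand in for the arena that would normally hand out the `&'a TreeNode<'a>` references.

```rust
use std::cell::Cell;

/// A tree node whose links are shared references wrapped in `Cell`,
/// so the tree can be rewired through `&self` without `RefCell` or `unsafe`.
pub struct TreeNode<'a> {
    pub name: &'static str,
    parent: Cell<Option<&'a TreeNode<'a>>>,
    first_child: Cell<Option<&'a TreeNode<'a>>>,
    last_child: Cell<Option<&'a TreeNode<'a>>>,
    next_sibling: Cell<Option<&'a TreeNode<'a>>>,
}

impl<'a> TreeNode<'a> {
    /// Creates a node with no parent, children, or siblings.
    pub fn new(name: &'static str) -> Self {
        TreeNode {
            name,
            parent: Cell::new(None),
            first_child: Cell::new(None),
            last_child: Cell::new(None),
            next_sibling: Cell::new(None),
        }
    }

    /// Appends `child` as the last child of `self`.
    /// Only shared references are involved; the `Cell`s do the mutation.
    pub fn append(&'a self, child: &'a TreeNode<'a>) {
        child.parent.set(Some(self));
        match self.last_child.replace(Some(child)) {
            Some(prev) => prev.next_sibling.set(Some(child)),
            None => self.first_child.set(Some(child)),
        }
    }

    /// Returns the parent node, if any.
    pub fn parent(&self) -> Option<&'a TreeNode<'a>> {
        self.parent.get()
    }

    /// Collects the names of the direct children, in document order.
    pub fn child_names(&self) -> Vec<&'static str> {
        let mut names = Vec::new();
        let mut current = self.first_child.get();
        while let Some(node) = current {
            names.push(node.name);
            current = node.next_sibling.get();
        }
        names
    }
}
```

With this shape, a caller can create `body`, `h1`, and `p` nodes, call `body.append(&h1)` and `body.append(&p)`, and then walk both down (`body.child_names()`) and up (`h1.parent()`) while holding only shared references. Because every node lives for the same lifetime `'a`, parent and child can point at each other freely, which is what an arena provides at scale.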