htmlite

An HTML manipulation toolkit

htmlite is a lightweight HTML toolkit for tokenizing, parsing, manipulating, and generating HTML.

What's in the box

- A spec-compliant HTML tokenizer.
  - Recognizes all the dark corners of HTML syntax (e.g. character references, HTML script comments (don't ask), and much more ...)
- A simple HTML tree builder that builds on the tokenizer.
  - "Simple" means that it will parse HTML as long as the tags are well-balanced.
- A node API for navigating the parsed AST/nodes.
- An API for generating HTML.

Examples

Parsing a fragment of html

let arena = htmlite::NodeArena::new();
htmlite::parse(&arena, "<h1>Hello, <i>world!</i></h1>").unwrap();

Selecting elements

let html = r#"
    <ul>
        <li>Foo</li>
        <li>Bar</li>
        <li>Baz</li>
    </ul>
"#;

let arena = htmlite::NodeArena::new();
let htmlite::Html { root, .. } = htmlite::parse(&arena, html).unwrap();

for element in root.select("li") {
    assert_eq!(element.name, "li");
}

Accessing element attributes

let arena = htmlite::NodeArena::new();
let htmlite::Html { root, .. } = htmlite::parse(&arena, r#"<input name="foo" value="bar" readonly>"#).unwrap();

let element = root.select(r#"input[name="foo"]"#).next().unwrap();
assert_eq!(&element["value"], "bar");
assert_eq!(&element["readonly"], "");

Serializing HTML and inner HTML

let arena = htmlite::NodeArena::new();
let htmlite::Html { root, .. } = htmlite::parse(&arena, "<h1>Hello, <i>world!</i></h1>").unwrap();
let h1 = root.select("h1").next().unwrap();
assert_eq!(h1.outer_html(), "<h1>Hello, <i>world!</i></h1>");
assert_eq!(h1.inner_html(), "Hello, <i>world!</i>");

Manipulating the DOM

let html = "<html><body>hello<p class=\"hello\">REMOVE ME</p></body></html>";
let arena = htmlite::NodeArena::new();
let htmlite::Html { root, .. } = htmlite::parse(&arena, html).unwrap();
for el in root.select(".hello") {
    el.detach();
}
assert_eq!(root.outer_html(), "<html><body>hello</body></html>");

Generating HTML

let h = htmlite::NodeArena::new();
let form = h.form(
    [("method", "POST")],
    [
        h.input([("value", "hello"), ("type", "text")], None),
        h.button(None, h.text("Submit"))
    ]
);
assert_eq!(form.outer_html(), r#"<form method="POST"><input value="hello" type="text"><button>Submit</button></form>"#);

When should you use this?

This crate's tokenizer implements spec-compliant error handling and does not stop on the first error. The tree builder, on the other hand, is more naive: it fails if it detects unbalanced tags.

The combination is a parser that will successfully parse any HTML that is output from tools like JSX, markdown-to-html generators and templating languages. Basically: "if it looks right, the parser should be able to handle it".
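
To make "well-balanced" concrete, here is a minimal sketch of a stack-based balance check. It is not htmlite's tree builder: the `Token` enum and the `check_balanced` function are hypothetical names invented for this illustration, and the sketch ignores void elements such as `<input>`.

```rust
/// Simplified token type, assumed for illustration only.
/// A real tokenizer emits much richer tokens (attributes, comments, ...).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
    Text,
}

/// Returns Ok(()) when every opening tag is closed in the right order,
/// otherwise a message describing the first problem found.
fn check_balanced(tokens: &[Token<'_>]) -> Result<(), String> {
    // Names of the elements that are currently open, innermost last.
    let mut stack: Vec<&str> = Vec::new();

    for token in tokens {
        match *token {
            Token::Open(name) => stack.push(name),
            Token::Close(name) => match stack.pop() {
                // The closing tag matches the innermost open element.
                Some(open) if open == name => {}
                Some(open) => {
                    return Err(format!("expected </{open}>, found </{name}>"));
                }
                None => return Err(format!("unexpected </{name}>")),
            },
            // Text never affects nesting.
            Token::Text => {}
        }
    }

    // Anything left on the stack was opened but never closed.
    match stack.pop() {
        Some(open) => Err(format!("unclosed <{open}>")),
        None => Ok(()),
    }
}
```

Under these assumptions, the token sequence for `<h1>Hello, <i>world!</i></h1>` passes the check, while `<b><i></b>` is rejected because `</b>` arrives while `<i>` is still open.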

Motivation

Take this html fragment:

<td>
    <strong>
        <div>
            Hello
        </div>
    </strong>
</td>

According to the spec, this is a "non-conforming" HTML fragment, roughly because:

There is a <td> without a <table>
<div> is not allowed in <strong>

Spec-compliant parsers handle this by restructuring your markup as part of the spec's error recovery algorithms. This is useful when parsing arbitrary HTML from the web, but not so much when you just want an AST out of what is obviously a well-nested HTML fragment.

This crate fills that gap. If you feed that fragment to this library's parser, you will get an AST with exactly those elements.

Adjacent crates

scraper: An inspiration for this crate. Uses html5ever. You get browser-grade html parsing with a browser-grade dependency tree.

kuchiki: As far as I understand this was the predecessor to scraper. Same thing about html5ever.

tl: A bit too lenient, while also failing on valid html. Additionally it does some weird error recovery that I did not want.

html5gum: Only tokenizes. I could have used this instead of writing my own tokenizer ... but where is the fun in that?

lol-html: Very odd API. A bit too dependency-heavy for my liking. Different use case.

Thank you

This crate would not be possible without SimonSapin's rust-forest experiment. The combination of using an arena allocator and Cell-wrapped references is at the root of why this API is as ergonomic as it is. Brilliant design. Thank you for your work!
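
For readers unfamiliar with the pattern, the following is a rough sketch of what "Cell-wrapped references" can look like. It is not htmlite's actual node type: `Node`, `append`, `parent`, and `child_names` are hypothetical names for this illustration, and the nodes here are assumed to be owned by whatever outlives the tree (an arena in the real design, plain local variables when trying this out).

```rust
use std::cell::Cell;

/// Illustrative tree node. Links are shared references stored in `Cell`s,
/// so the tree can be rewired through `&Node` without any `&mut` borrow.
struct Node<'a> {
    name: &'static str,
    parent: Cell<Option<&'a Node<'a>>>,
    first_child: Cell<Option<&'a Node<'a>>>,
    last_child: Cell<Option<&'a Node<'a>>>,
    next_sibling: Cell<Option<&'a Node<'a>>>,
}

impl<'a> Node<'a> {
    /// Creates a node with no parent, children, or siblings.
    fn new(name: &'static str) -> Self {
        Node {
            name,
            parent: Cell::new(None),
            first_child: Cell::new(None),
            last_child: Cell::new(None),
            next_sibling: Cell::new(None),
        }
    }

    /// Appends `child` as the last child of `self`.
    /// Only shared references are needed because every link is a `Cell`.
    fn append(&'a self, child: &'a Node<'a>) {
        child.parent.set(Some(self));
        match self.last_child.replace(Some(child)) {
            // There was already a last child: chain the new one after it.
            Some(previous) => previous.next_sibling.set(Some(child)),
            // First child ever appended to this node.
            None => self.first_child.set(Some(child)),
        }
    }

    /// Returns the parent node, if any.
    fn parent(&self) -> Option<&'a Node<'a>> {
        self.parent.get()
    }

    /// Collects the names of the direct children, in document order.
    fn child_names(&self) -> Vec<&'static str> {
        let mut names = Vec::new();
        let mut cursor = self.first_child.get();
        while let Some(node) = cursor {
            names.push(node.name);
            cursor = node.next_sibling.get();
        }
        names
    }
}
```

Because every node is only ever handed out as a shared reference, a handle to one element can stay alive while another part of the tree is modified, which is what makes a call like el.detach() inside a selection loop possible without fighting the borrow checker.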