htmlite

An HTML manipulation toolkit

htmlite is a lightweight library implementing an HTML parser that recognizes the bare HTML syntax without enforcing its semantics. The returned AST reflects exactly what you wrote, not what the spec says it should be. The library also includes a declarative macro for building an HTML AST in Rust, as well as methods for manipulating and navigating the AST.

Examples

Parsing a chunk of HTML

let arena = htmlite::NodeArena::new();
let chunk = htmlite::parse(&arena, "<h1>Hello, <i>world!</i></h1>").unwrap();

Selecting elements

let html = r#"
    <ul>
        <li>Foo</li>
        <li>Bar</li>
        <li>Baz</li>
    </ul>
"#;

let arena = htmlite::NodeArena::new();
let chunk = htmlite::parse(&arena, html).unwrap();

for element in chunk.select("li") {
    assert_eq!(element.name, "li");
}

Accessing element attributes

let arena = htmlite::NodeArena::new();
let chunk = htmlite::parse(&arena, r#"<input name="foo" value="bar" readonly>"#).unwrap();

let element = chunk.select(r#"input[name="foo"]"#).next().unwrap();
assert_eq!(element["value"], htmlite::Attribute::Regular("value", "bar"));
assert_eq!(element["readonly"], htmlite::Attribute::Boolean("readonly"));

Serializing HTML and inner HTML

let arena = htmlite::NodeArena::new();
let chunk = htmlite::parse(&arena, "<h1>Hello, <i>world!</i></h1>").unwrap();
let h1 = chunk.select("h1").next().unwrap();
assert_eq!(h1.outer_html(), "<h1>Hello, <i>world!</i></h1>");
assert_eq!(h1.inner_html(), "Hello, <i>world!</i>");

Accessing descendant text

let arena = htmlite::NodeArena::new();
let chunk = htmlite::parse(&arena, "<h1>Hello, <i>world!</i></h1>").unwrap();
let h1 = chunk.select("h1").next().unwrap();
assert_eq!(h1.text_content().collect::<String>(), "Hello, world!");

Manipulating the DOM

let html = "<html><body>hello<p class=\"hello\">REMOVE ME</p></body></html>";
let arena = htmlite::NodeArena::new();
let chunk = htmlite::parse(&arena, html).unwrap();
for el in chunk.select(".hello") {
    el.detach();
}
assert_eq!(chunk.outer_html(), "<html><body>hello</body></html>");

When should you use this?

This crate's parser bails out on the first parse error, so it is likely to be useless for general web scraping: most HTML in the wild has parse errors.

This crate is only useful when you are sure the HTML you are feeding it has no syntax errors.
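
As a hedged sketch of that behaviour, using only the API shown above (assuming an unterminated attribute quote counts as a syntax error and that parse returns a Result):

let arena = htmlite::NodeArena::new();
// Unterminated attribute quote: the parser gives up instead of recovering.
let result = htmlite::parse(&arena, r#"<div class="oops>Hello</div>"#);
assert!(result.is_err());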

Examples:

  • In static site generation pipelines (e.g. processing HTML emitted by markdown parsers)
  • As a base for implementing additional syntax on top of html-like text (e.g. HTML templating languages)

Motivation

Take this html fragment:

<td>
    <strong>
        <div>
            Hello
        </div>
    </strong>
</td>

According to the spec, this is an "invalid" html fragment roughly because:

  • There is <td> without a <table>
  • <div> is not allowed in <strong>

Spec-compliant parsers handle this by restructuring your markup as part of the spec's error recovery algorithms. This is useful when parsing arbitrary HTML from the Web. Not so much when you just want an AST out of what is obviously a well-nested html chunk.

This crate fills that gap. If you feed that chunk to this library's parser, you will get an AST with exactly those elements.
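
A sketch of that claim using the API from the examples above (whitespace removed so the outer_html comparison stays exact; exact whitespace handling is an assumption here):

let html = "<td><strong><div>Hello</div></strong></td>";
let arena = htmlite::NodeArena::new();
let chunk = htmlite::parse(&arena, html).unwrap();

// The nesting is preserved as written: <div> stays inside <strong> inside <td>.
assert_eq!(chunk.select("td").next().unwrap().name, "td");
assert_eq!(chunk.select("strong").next().unwrap().name, "strong");
assert_eq!(chunk.select("div").next().unwrap().name, "div");
assert_eq!(chunk.select("td").next().unwrap().outer_html(), html);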

Why not ...

... an XML parser?

HTML and XML are different enough to be annoying:

  • XML does not allow unquoted attribute values. This is a syntactically valid HTML tag that I'd like to properly parse: <div attr=unquoted>
  • I want to support serializing the parsed fragment back to a string. XML parsers tend to inject a bunch of XML-namespace stuff that I do not want.
  • HTML has void elements that cannot be handled by an XML parser: <link>. For that to be valid XML, you have to "self-close" it like <link />.
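
For illustration, a sketch of both cases with this crate (assuming unquoted values come back as regular attributes and that <link> is treated as a void element):

let arena = htmlite::NodeArena::new();

// Unquoted attribute values are accepted as-is.
let chunk = htmlite::parse(&arena, "<div attr=unquoted></div>").unwrap();
let div = chunk.select("div").next().unwrap();
assert_eq!(div["attr"], htmlite::Attribute::Regular("attr", "unquoted"));

// Void elements need no closing tag and no XML-style self-closing slash.
let chunk = htmlite::parse(&arena, r#"<link rel="stylesheet" href="style.css">"#).unwrap();
assert!(chunk.select("link").next().is_some());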

... a browser-grade HTML parser in "fragment mode"?

Spec-compliant "fragment parsing" will still restructure your html. Example:

fn main() {
    let frag = scraper::Html::parse_fragment("<td>blah</td>");
    dbg!(frag.html());
    // [src/main.rs:4:5] frag.html() = "<html>blah</html>"
}

Adjacent crates

scraper: An inspiration for this crate. Uses html5ever. You get browser-grade html parsing with a browser-grade dependency tree.

kuchiki: As far as I understand this was the predecessor to scraper. Same thing about html5ever.

tl: A bit too lenient, while also failing on valid html. Additionally it does some weird error recovery that I did not want.

html5gum: Only tokenizes. I could have used this instead of writing my own tokenizer ... but where is the fun in that?

lol-html: Very odd API. A bit too dependency-heavy for my liking. Different use case.

Thank you

This crate would not be possible without SimonSapin's rust-forest experiment. The combination of an arena allocator and Cell-wrapped references is at the root of why this API is as ergonomic as it is. Brilliant design. Thank you for your work!