htmlite 0.1.0

Parsing properly nested HTML tags

htmlite

For when you want to parse a chunk of html tags.

This crate implements a parser that recognizes the "bare" HTML syntax without enforcing HTML semantics.

The library will correctly tokenize the following according to the HTML spec:

  • Start and end tags (e.g. <div>...</div>)
  • Void tags (<img> or <img/>)
  • Tag attributes: boolean, unquoted values, single quoted and double quoted
  • Character references: Named, decimal and hexadecimal

However, the AST returned reflects exactly what you wrote, not what the spec says it should be. This is a "What-You-See-Is-What-You-Get" HTML parser.
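To make the three character-reference forms listed above concrete, here is a minimal standalone sketch that decodes one reference. The function name `decode_char_ref` is hypothetical and is not part of htmlite's API. It only knows a handful of named references and skips the spec's special cases for certain code points.

```rust
/// Decodes one character reference such as `&amp;`, `&#65;` or `&#x41;`.
/// Hypothetical helper for illustration only; NOT htmlite's API.
/// Simplified: only five named references, no spec replacement table.
fn decode_char_ref(reference: &str) -> Option<char> {
    // A reference is `&` + body + `;`.
    let body = reference.strip_prefix('&')?.strip_suffix(';')?;

    if let Some(num) = body.strip_prefix('#') {
        // Hexadecimal form: `&#x41;` or `&#X41;`.
        if let Some(hex) = num.strip_prefix('x').or_else(|| num.strip_prefix('X')) {
            if hex.is_empty() || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
                return None;
            }
            return char::from_u32(u32::from_str_radix(hex, 16).ok()?);
        }
        // Decimal form: `&#65;`.
        if num.is_empty() || !num.chars().all(|c| c.is_ascii_digit()) {
            return None;
        }
        return char::from_u32(num.parse::<u32>().ok()?);
    }

    // Named form: a tiny subset of the full named-reference table.
    match body {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        _ => None,
    }
}
```

A real tokenizer would carry the full named-reference table from the HTML spec. The sketch only shows that the three forms differ in what follows the `&`.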

Use cases:

  • In static site generation pipelines (e.g. processing HTML emitted by markdown parsers)
  • As a base for implementing additional syntax on top of html-like text.

Motivation

Take this html fragment:

<td>
    <strong>
        <div>
            Hello
        </div>
    </strong>
</td>

According to the spec, this is an "invalid" html fragment roughly because:

  • There is <td> without a <table>
  • <div> is not allowed in <strong>

Spec-compliant parsers handle this by restructuring your markup as part of the spec's error recovery algorithms. This is useful when parsing arbitrary HTML from the Web. It is much less useful when you just want an AST out of what is obviously a well-nested HTML chunk.

This crate fills that gap. If you feed that chunk to this library's parser, you will get an AST with exactly those elements.

Why not ...

... an XML parser?

HTML and XML are different enough to be annoying. Here are some of the differences:

  • XML does not allow unquoted attribute values. This is a syntactically valid HTML tag that I'd like to properly parse: <div attr=unquoted>
  • I want to support stringifying the parsed fragment back to a string. XML parsers tend to inject a bunch of XML-namespace stuff that I do not want.
  • HTML has void elements that cannot be handled by an XML parser: <link>. For that to be valid XML, you have to "self-close" it like <link />.
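As a rough illustration of the void-element point, an HTML-aware parser needs to know which tag names never take an end tag. The sketch below hard-codes the void elements from the HTML Living Standard. The names `VOID_ELEMENTS` and `is_void_element` are hypothetical, and whether htmlite uses this exact approach internally is an assumption.

```rust
/// Void elements as listed in the HTML Living Standard.
/// Hypothetical constant for illustration; not taken from htmlite.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
];

/// Returns true if `tag_name` is a void element.
/// HTML tag names are ASCII case-insensitive, so `<LINK>` counts too.
fn is_void_element(tag_name: &str) -> bool {
    VOID_ELEMENTS
        .iter()
        .any(|void| void.eq_ignore_ascii_case(tag_name))
}
```

With a check like this, a parser can treat `<link>` and `<link />` the same way instead of waiting for a `</link>` that never comes. A generic XML parser has no such table.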

... a browser-grade HTML parser in "fragment mode"?

Spec-compliant "fragment parsing" will still restructure your HTML. Example:

fn main() {
    let frag = scraper::Html::parse_fragment("<td>blah</td>");
    dbg!(frag.html());
    // [src/main.rs:4:5] frag.html() = "<html>blah</html>"
}

Error recovery

This crate performs none of it. The first error encountered is returned.
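The fail-fast behavior can be pictured with a small standalone nesting checker. Everything here (`Token`, `NestingError`, `check_nesting`) is hypothetical and does not reflect htmlite's actual types or error values. It is only a sketch of "stop at the first problem, repair nothing".

```rust
/// Hypothetical token type for illustration; NOT htmlite's API.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
}

/// Hypothetical error type for illustration; NOT htmlite's API.
#[derive(Debug, PartialEq)]
enum NestingError {
    /// An end tag did not match the innermost open tag (if any).
    UnexpectedClose { found: String, expected: Option<String> },
    /// Input ended while a tag was still open.
    Unclosed { tag: String },
}

/// Checks that tags are properly nested and returns the first error found.
/// No recovery is attempted: nothing is reordered, inserted, or dropped.
fn check_nesting(tokens: &[Token<'_>]) -> Result<(), NestingError> {
    let mut stack: Vec<&str> = Vec::new();
    for token in tokens {
        match token {
            Token::Open(name) => stack.push(*name),
            Token::Close(name) => match stack.pop() {
                Some(open) if open == *name => {}
                other => {
                    // Bail out immediately on the first mismatch.
                    return Err(NestingError::UnexpectedClose {
                        found: (*name).to_string(),
                        expected: other.map(str::to_string),
                    });
                }
            },
        }
    }
    // Anything left on the stack was never closed; report the innermost one.
    match stack.pop() {
        Some(tag) => Err(NestingError::Unclosed { tag: tag.to_string() }),
        None => Ok(()),
    }
}
```

Given `<div></span></p>`, this sketch reports the `</span>` mismatch and never looks at `</p>`. A spec-compliant parser would instead keep going and patch the tree.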

Adjacent crates

tl: A bit too lenient, while also failing on valid html. Additionally it does some weird error recovery that I did not want.

[html5gum](https://crates.io/crates/html5gum): Only tokenizes. I could have used this instead of writing my own tokenizer ... but where is the fun in that?

[lol-html](https://crates.io/crates/lol_html): Very odd API. A bit too dependency-heavy for my liking. Different use case.