htmlite
An HTML manipulation toolkit
htmlite is a lightweight HTML toolkit for tokenizing, parsing, manipulating, and generating HTML.
What's in the box
- A spec-compliant HTML tokenizer.
  - Recognizes all the dark corners of HTML syntax (e.g. character references, html script comments (don't ask), and much more ...)
- A simple html tree-builder that builds on the tokenizer.
  - "simple" means that it will parse HTML as long as the tags are well-balanced.
- A node API for navigating the parsed AST/nodes.
- An API for generating HTML.
Examples
Parsing a fragment of html
let arena = new;
parse.unwrap;
Selecting elements
let html = r#"
<ul>
<li>Foo</li>
<li>Bar</li>
<li>Baz</li>
</ul>
"#;
let arena = new;
let Html = parse.unwrap;
for element in root.select
Accessing element attributes
let arena = new;
let Html = parse.unwrap;
let element = root.select.next.unwrap;
assert_eq!;
assert_eq!;
Serializing HTML and inner HTML
let arena = new;
let Html = parse.unwrap;
let h1 = root.select.next.unwrap;
assert_eq!;
assert_eq!;
Manipulating the DOM
let html = "<html><body>hello<p class=\"hello\">REMOVE ME</p></body></html>";
let arena = new;
let Html = parse.unwrap;
for el in root.select
assert_eq!
Generating HTML
let h = new;
let form = h.form;
assert_eq!;
When should you use this?
This crate's tokenizer implements spec-compliant error handling and does not stop on the first error. The tree builder, on the other hand, is more naive: it will fail if it detects unbalanced tags.
The combination is a parser that will successfully parse any HTML that is output from tools like JSX, markdown-to-html generators and templating languages. Basically: "if it looks right, the parser should be able to handle it".
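To make "well-balanced" concrete, here is a minimal sketch of the kind of check a naive tree builder performs. It is not htmlite's implementation: `check_balanced` and `BalanceError` are hypothetical names invented for this illustration, and the scanning is deliberately crude (no comments, void elements, self-closing tags, or `>` inside attribute values).

```rust
/// Errors a naive, balance-checking tree builder might report.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum BalanceError {
    /// A closing tag did not match the innermost open element.
    Mismatched { expected: String, found: String },
    /// A closing tag appeared while no element was open.
    UnexpectedClose(String),
    /// The input ended while an element was still open.
    Unclosed(String),
}

/// Hypothetical helper: succeeds only if every opening tag is closed
/// in strict last-opened, first-closed order.
fn check_balanced(html: &str) -> Result<(), BalanceError> {
    // Stack of currently open element names.
    let mut open: Vec<String> = Vec::new();
    let mut rest = html;

    while let Some(start) = rest.find('<') {
        let after = &rest[start + 1..];
        // Stop scanning if a '<' is never terminated by '>'.
        let Some(end) = after.find('>') else { break };
        let tag = &after[..end];
        rest = &after[end + 1..];

        if let Some(name) = tag.strip_prefix('/') {
            // Closing tag: it must match the innermost open element.
            let name = name.trim().to_ascii_lowercase();
            match open.pop() {
                Some(expected) if expected == name => {}
                Some(expected) => {
                    return Err(BalanceError::Mismatched { expected, found: name })
                }
                None => return Err(BalanceError::UnexpectedClose(name)),
            }
        } else {
            // Opening tag: the name is everything up to the first
            // whitespace; attributes are ignored.
            let name = tag
                .split_whitespace()
                .next()
                .unwrap_or("")
                .to_ascii_lowercase();
            open.push(name);
        }
    }

    // Anything still on the stack was never closed.
    match open.pop() {
        Some(name) => Err(BalanceError::Unclosed(name)),
        None => Ok(()),
    }
}
```

With this sketch, `check_balanced("<ul><li>Foo</li></ul>")` succeeds, while `check_balanced("<p><b>hi</p></b>")` fails at the first closing tag instead of trying to re-structure the markup.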
Motivation
Take this html fragment:
Hello
According to the spec, this is a "non-conforming" html fragment, roughly because:
- There is a <td> without a <table>
- <div> is not allowed in <strong>
Spec-compliant parsers handle this by re-structuring your markup as part of the spec's error recovery algorithms. This is useful when parsing arbitrary HTML from the Web. Not so much when you just want an AST out of what is obviously a well-nested html fragment.
This crate fills that gap. If you feed that fragment to this library's parser, you will get an AST with exactly those elements.
Adjacent crates
scraper: An inspiration for this crate. Uses html5ever. You get browser-grade html parsing with a browser-grade dependency tree.
kuchiki: As far as I understand, this was the predecessor to scraper. Same thing about html5ever.
tl: A bit too lenient, while also failing on valid html. Additionally it does some weird error recovery that I did not want.
html5gum: Only tokenizes. I could have used this instead of writing my own tokenizer ... but where is the fun in that.
lol-html: Very odd API. A bit too dependency-heavy for my liking. Different use case.
Thank you
This crate would not be possible without SimonSapin's rust-forest experiment. The combination of using an Arena allocator and Cell-wrapped references is at the root of why this API is as ergonomic as it is. Brilliant design. Thank you for your work!
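For readers unfamiliar with the pattern, here is a minimal sketch of Cell-wrapped node references using only the standard library. It is an illustration of the general idea, not htmlite's or rust-forest's actual code: the `Node` type, its fields, and its methods are all hypothetical. In a real design the nodes would be allocated in an arena so they share one lifetime; in this sketch, plain local variables in the same scope stand in for the arena.

```rust
use std::cell::Cell;

/// Hypothetical tree node. Every link is a shared reference wrapped in a
/// `Cell`, so the tree can be rewired through `&Node` alone, with no
/// `&mut` access and no `RefCell` runtime borrow checks.
struct Node<'a> {
    name: &'static str,
    parent: Cell<Option<&'a Node<'a>>>,
    first_child: Cell<Option<&'a Node<'a>>>,
    last_child: Cell<Option<&'a Node<'a>>>,
    next_sibling: Cell<Option<&'a Node<'a>>>,
}

impl<'a> Node<'a> {
    /// Creates a detached node with no parent, children, or siblings.
    fn new(name: &'static str) -> Self {
        Node {
            name,
            parent: Cell::new(None),
            first_child: Cell::new(None),
            last_child: Cell::new(None),
            next_sibling: Cell::new(None),
        }
    }

    /// Appends `child` as the last child of `self`.
    /// Note that both arguments are shared references.
    fn append(&'a self, child: &'a Node<'a>) {
        child.parent.set(Some(self));
        // Swap in the new last child and link it after the previous one.
        match self.last_child.replace(Some(child)) {
            Some(prev) => prev.next_sibling.set(Some(child)),
            None => self.first_child.set(Some(child)),
        }
    }

    /// Collects the names of the direct children in document order.
    fn child_names(&self) -> Vec<&'static str> {
        let mut names = Vec::new();
        let mut cur = self.first_child.get();
        while let Some(node) = cur {
            names.push(node.name);
            cur = node.next_sibling.get();
        }
        names
    }
}
```

Because `Option<&Node>` is `Copy`, reading a link is just `cell.get()`, which is what keeps navigation and mutation code short. For example, after `let ul = Node::new("ul"); let li = Node::new("li"); ul.append(&li);`, both `ul.child_names()` and `li.parent.get()` work without any mutable borrow.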