htmlite
An HTML manipulation toolkit
htmlite is a lightweight HTML toolkit for parsing, manipulating, and generating HTML.
Examples
Parsing a fragment of HTML
use NodeArena;
let arena = new;
parse.unwrap;
Selecting elements
use ;
let html = r#"
<ul>
<li>Foo</li>
<li>Bar</li>
<li>Baz</li>
</ul>
"#;
let arena = new;
let root = parse.unwrap;
for element in root.descendants.select
Accessing element attributes
use ;
let arena = new;
let root = parse.unwrap;
let element = root.descendants.select.next.unwrap;
assert_eq!;
assert_eq!;
Serializing HTML and inner HTML
use ;
let arena = new;
let root = parse.unwrap;
let h1 = root.descendants.select.next.unwrap;
assert_eq!;
assert_eq!;
Manipulating the DOM
use ;
let html = "<html><body>hello<p class=\"hello\">REMOVE ME</p></body></html>";
let arena = new;
let root = parse.unwrap;
for el in root.descendants.select
assert_eq!
Generating HTML
let h = new;
let form = h.form;
assert_eq!;
When should you use this?
This is not a "browser-grade" HTML parser, but it is close!
Specifically, the tokenizer is spec-compliant and passes all the html5lib tokenizer tests.
So htmlite will accept any valid HTML "construct", such as numeric and named character references and void elements.
However, the tree-builder does not follow the spec. This was done on purpose. A spec-compliant tree-builder may restructure your markup for a multitude of reasons: badly nested tags, child elements that don't conform to the content model of their parent, missing end tags, and so on. The tree-builder in this library takes a simpler approach: it will parse any well-balanced HTML and output a tree that corresponds to that markup, exactly as written.
So this library will work well when you are parsing the output of HTML-generating tools like SSGs or Markdown parsers. Tools like these don't forget to add end tags :)
On the other hand, parsing random web content is more of a gamble.
For example, many sites rely on the fact that you do not need to close your <p> tags.
This library will fail on such markup.
TL;DR: If your HTML looks like well-formed XML when you squint, this library's HTML parser is for you.
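To make "well-balanced" a bit more concrete, here is a rough, std-only sketch of the kind of check the term implies. It is not htmlite's tree-builder, and `is_well_balanced` and `VOID_ELEMENTS` are hypothetical names made up for this illustration. The idea is a stack: push every non-void start tag, pop on every end tag, and reject the input if the names ever disagree or something is left open at the end.

```rust
/// Elements that never take an end tag in HTML.
const VOID_ELEMENTS: &[&str] = &[
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
];

/// Hypothetical helper, not part of htmlite: returns true when every
/// non-void start tag is closed by a matching end tag, in the right order.
/// Deliberately naive: comments and `>` inside attribute values are not
/// handled properly.
fn is_well_balanced(html: &str) -> bool {
    let mut open: Vec<String> = Vec::new();
    let mut rest = html;

    while let Some(start) = rest.find('<') {
        let after = &rest[start + 1..];
        let end = match after.find('>') {
            Some(end) => end,
            None => return false, // a tag that is never terminated
        };
        let tag = &after[..end];
        rest = &after[end + 1..];

        if let Some(name) = tag.strip_prefix('/') {
            // End tag: it must match the most recently opened element.
            let name = name.trim().to_ascii_lowercase();
            if open.pop().as_deref() != Some(name.as_str()) {
                return false;
            }
        } else {
            // Start tag: the name is the first word, minus a trailing `/`.
            let name = tag
                .split_whitespace()
                .next()
                .unwrap_or("")
                .trim_end_matches('/')
                .to_ascii_lowercase();
            // Skip empty names and `<!...>` constructs such as a doctype.
            if name.is_empty() || name.starts_with('!') {
                continue;
            }
            // Void elements are never pushed, so they need no end tag.
            if !VOID_ELEMENTS.contains(&name.as_str()) {
                open.push(name);
            }
        }
    }

    // Anything still open at the end was never closed.
    open.is_empty()
}
```

Under this sketch, `<ul><li>Foo</li></ul>` and `<p>line<br>break</p>` count as well-balanced, while `<p>one<p>two` (unclosed `<p>` tags) and `<b><i>x</b></i>` (badly nested tags) do not.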
Adjacent crates
scraper: An inspiration for this crate. Uses html5ever. You get browser-grade html parsing with a browser-grade dependency tree.
kuchiki: As far as I understand this was the predecessor to scraper. Same thing about html5ever.
tl: A bit too lenient, while also failing on valid html. Additionally it does some weird error recovery that I did not want.
html5gum: Only tokenizes. I could have used this instead of writing my own tokenizer ... but where is the fun in that?
lol-html: Very odd API. A bit too dependency-heavy for my liking. Different use case.
Thank you
This crate would not be possible without SimonSapin's rust-forest experiment. The combination of using an arena allocator and Cell-wrapped references is at the root of why this API is as ergonomic as it is. Brilliant design. Thank you for your work!