htmlite
For when you want to parse & manipulate a chunk of html tags.
A lightweight library implementing an HTML parser that recognizes the "bare" HTML syntax without enforcing its semantics.
The AST returned reflects exactly what you wrote, not what the spec says it should be.
This is a "What-You-See-Is-What-You-Get" HTML parser.
It also implements methods to navigate the resulting node tree.
Examples
Parsing a chunk of html
let arena = new;
let root = arena.parse.unwrap;
Selecting elements
let html = r#"
<ul>
<li>Foo</li>
<li>Bar</li>
<li>Baz</li>
</ul>
"#;
let arena = new;
let root = arena.parse.unwrap;
for element in root.select
Accessing element attributes
let arena = new;
let root = arena.parse.unwrap;
let element = root.select.next.unwrap;
assert_eq!;
assert_eq!;
Serializing HTML and inner HTML
let arena = new;
let root = arena.parse.unwrap;
let h1 = root.select.next.unwrap;
assert_eq!;
assert_eq!;
Accessing descendent text
let arena = new;
let root = arena.parse.unwrap;
let h1 = root.select.next.unwrap;
assert_eq!;
Manipulating the DOM
let html = "<html><body>hello<p class=\"hello\">REMOVE ME</p></body></html>";
let arena = new;
let root = arena.parse.unwrap;
for el in root.select
assert_eq!
Use cases:
- In static site generation pipelines (e.g. processing HTML emitted by markdown parsers)
- As a base for implementing additional syntax on top of html-like text (e.g. HTML templating languages)
Motivation
Take this html fragment:
Hello
According to the spec, this is an "invalid" html fragment roughly because:
- There is a <td> without a <table>
- <div> is not allowed in <strong>
Spec-compliant parsers handle this by restructuring your markup as part of the spec's error recovery algorithms. This is useful when parsing arbitrary HTML from the Web, but not so much when you just want an AST out of what is obviously a well-nested html chunk.
This crate fills that gap. If you feed that chunk to this library's parser, you will get an AST with exactly those elements.
Why not ...
... an XML parser?
HTML and XML are different enough to be annoying:
- XML does not allow unquoted attribute values. This is a syntactically valid HTML tag that I'd like to properly parse:
<div attr=unquoted>
- I want to support stringifying the parsed fragment back to a string. XML parsers tend to inject a bunch of XML-namespace stuff that I do not want.
- HTML has void elements that cannot be handled by an XML parser: <link>. For that to be valid XML, you have to "self-close" it like <link />.
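To make two of the syntactic differences above concrete, here is a minimal std-only sketch. It is not this crate's implementation: `is_void_element` and `read_attr_value` are hypothetical helpers written for illustration, and the unquoted-value rule is simplified (the HTML spec also forbids a few other characters in unquoted values). The void element list is the one from the HTML Living Standard.

```rust
/// Void elements as listed in the HTML Living Standard.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source",
    "track", "wbr",
];

/// Returns true if `tag_name` names a void element (no end tag, no children).
fn is_void_element(tag_name: &str) -> bool {
    VOID_ELEMENTS.iter().any(|v| v.eq_ignore_ascii_case(tag_name))
}

/// Reads one attribute value from the start of `input` and returns the value
/// plus the remaining input. Quoted and unquoted forms are both accepted.
fn read_attr_value(input: &str) -> Option<(&str, &str)> {
    let first = input.chars().next()?;
    if first == '"' || first == '\'' {
        // Quoted: everything up to the matching quote character.
        let rest = &input[1..];
        let end = rest.find(first)?;
        Some((&rest[..end], &rest[end + 1..]))
    } else {
        // Unquoted (simplified): the value runs until whitespace or `>`.
        let end = input
            .find(|c: char| c.is_ascii_whitespace() || c == '>')
            .unwrap_or(input.len());
        if end == 0 {
            None
        } else {
            Some((&input[..end], &input[end..]))
        }
    }
}

fn main() {
    // The value part of `<div attr=unquoted>`, which an XML parser would reject.
    println!("{:?}", read_attr_value("unquoted>"));
    // `<link>` needs no end tag and no trailing slash in HTML.
    println!("{}", is_void_element("link"));
}
```

An HTML-syntax parser needs both pieces: without the void element check, `<link>` would be treated as an element that is never closed.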
... a browser-grade HTML parser in "fragment mode"?
Spec-compliant "fragment parsing" will still restructure your html. Example:
fn main() {
let frag = scraper::Html::parse_fragment(" blah ");
dbg!(frag.html());
// [src/main.rs:4:5] frag.html() = " blah "
}
Error recovery
This crate performs none of it. The first error encountered is returned.
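As an illustration of what "no error recovery" means in practice, the following std-only sketch checks tag nesting and stops at the first problem instead of repairing it. It is not this crate's code: `Token`, `NestingError`, and `check_nesting` are hypothetical names, and a real parser would work on source text rather than a pre-made token list.

```rust
/// A pre-tokenized tag (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
}

#[derive(Debug, PartialEq)]
enum NestingError {
    /// A closing tag did not match the innermost open element.
    Mismatched { expected: String, found: String },
    /// A closing tag appeared while no element was open.
    UnexpectedClose(String),
    /// Input ended while this (innermost) element was still open.
    Unclosed(String),
}

/// Checks that tags are well nested. Returns the first error found;
/// nothing is skipped, reordered, or auto-closed.
fn check_nesting(tokens: &[Token<'_>]) -> Result<(), NestingError> {
    let mut open: Vec<&str> = Vec::new();
    for token in tokens {
        match token {
            Token::Open(name) => open.push(*name),
            Token::Close(name) => match open.pop() {
                Some(top) if top == *name => {}
                Some(top) => {
                    return Err(NestingError::Mismatched {
                        expected: top.to_string(),
                        found: name.to_string(),
                    })
                }
                None => return Err(NestingError::UnexpectedClose(name.to_string())),
            },
        }
    }
    match open.pop() {
        Some(name) => Err(NestingError::Unclosed(name.to_string())),
        None => Ok(()),
    }
}

fn main() {
    // Well nested, so accepted. No content-model rules are applied here.
    let ok = [Token::Open("div"), Token::Open("td"), Token::Close("td"), Token::Close("div")];
    println!("{:?}", check_nesting(&ok));

    // Badly nested: the first mismatch is reported and checking stops.
    let bad = [Token::Open("p"), Token::Open("b"), Token::Close("p")];
    println!("{:?}", check_nesting(&bad));
}
```

Note that the check is purely syntactic: a `td` outside of a `table` passes, because only nesting is verified, not which elements are allowed where.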
Adjacent crates
scraper: An inspiration for this crate. Uses html5ever. You get browser-grade html parsing with a browser-grade dependency tree.
kuchiki: As far as I understand this was the predecessor to scraper. Same thing about html5ever.
tl: A bit too lenient, while also failing on valid html. Additionally it does some weird error recovery that I did not want.
html5gum: Only tokenizes. I could have used this instead of writing my own tokenizer ... but where is the fun in that?
lol-html: Very odd API. A bit too dependency-heavy for my liking. Different use case.
Thank you
This crate would not be possible without SimonSapin's rust-forest experiment. The combination of using an Arena allocator and Cell-wrapped references is at the root of why this API is as ergonomic as it is. Brilliant design. Thank you for your work!
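For readers unfamiliar with the pattern, here is a rough std-only sketch of the general idea behind arena-allocated nodes linked through `Cell`-wrapped references. It is neither this crate's code nor rust-forest's: `Node`, `append`, and `child_names` are hypothetical, and a fixed array stands in for a real arena allocator so that the example needs no dependencies.

```rust
use std::cell::Cell;

/// A tree node whose links are shared references wrapped in `Cell`.
/// All nodes live in the same backing store and share the lifetime `'a`.
struct Node<'a> {
    name: &'a str,
    parent: Cell<Option<&'a Node<'a>>>,
    first_child: Cell<Option<&'a Node<'a>>>,
    next_sibling: Cell<Option<&'a Node<'a>>>,
}

impl<'a> Node<'a> {
    fn new(name: &'a str) -> Self {
        Node {
            name,
            parent: Cell::new(None),
            first_child: Cell::new(None),
            next_sibling: Cell::new(None),
        }
    }

    /// Appends `child` as the last child. Only `&self` is needed:
    /// `Cell` lets the links be rewritten through shared references.
    fn append(&'a self, child: &'a Node<'a>) {
        child.parent.set(Some(self));
        match self.first_child.get() {
            None => self.first_child.set(Some(child)),
            Some(mut last) => {
                // Walk to the current last child, then link the new one after it.
                while let Some(next) = last.next_sibling.get() {
                    last = next;
                }
                last.next_sibling.set(Some(child));
            }
        }
    }

    /// Collects the names of the direct children, in order.
    fn child_names(&self) -> Vec<&'a str> {
        let mut names = Vec::new();
        let mut cursor = self.first_child.get();
        while let Some(node) = cursor {
            names.push(node.name);
            cursor = node.next_sibling.get();
        }
        names
    }
}

fn main() {
    // The array plays the role of the arena: it owns every node.
    let arena = [Node::new("ul"), Node::new("first"), Node::new("second")];
    let (ul, first, second) = (&arena[0], &arena[1], &arena[2]);
    ul.append(first);
    ul.append(second);
    println!("{:?}", ul.child_names());
    println!("{:?}", second.parent.get().map(|p| p.name));
}
```

Because node handles are plain `&Node` values, they are `Copy`, can be held in several places at once, and can still be relinked, with no `Rc<RefCell<...>>` wrappers or index lookups in the way.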