html5gum
html5gum is a WHATWG-compliant HTML tokenizer.
    use std::fmt::Write;
    use html5gum::{Tokenizer, Token};

    let html = "<title >hello world</title>";
    let mut new_html = String::new();

    for token in Tokenizer::new(html).infallible() {
        match token {
            Token::StartTag(tag) => {
                write!(new_html, "<{}>", String::from_utf8_lossy(&tag.name)).unwrap();
            }
            Token::String(hello_world) => {
                write!(new_html, "{}", String::from_utf8_lossy(&hello_world)).unwrap();
            }
            Token::EndTag(tag) => {
                write!(new_html, "</{}>", String::from_utf8_lossy(&tag.name)).unwrap();
            }
            _ => panic!("unexpected input"),
        }
    }

    assert_eq!(new_html, "<title>hello world</title>");
What a tokenizer does and what it does not do
html5gum fully implements section 13.2.5 of the WHATWG HTML spec, i.e. it is able to tokenize HTML documents and passes html5lib’s tokenizer test suite. Since it is just a tokenizer, this means:
- html5gum does not implement charset detection. This implementation takes and returns bytes, but assumes UTF-8. It recovers gracefully from invalid UTF-8.
- html5gum does not correct mis-nested tags.
- html5gum does not recognize implicitly self-closing elements like <img>; as a tokenizer it will simply emit a start token. It does, however, emit a self-closing tag for <img .. />.
- html5gum doesn’t implement the DOM, and unfortunately in the HTML spec, constructing the DOM (“tree construction”) influences how tokenization is done. For an example of the problems this causes, see this example code.
- html5gum does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This can change in the future, see issue 21.
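Because the tokenizer does not treat elements like <img> as implicitly self-closing, a caller that needs this distinction has to check tag names itself. The following is a minimal sketch of such a check, not something html5gum provides: `VOID_ELEMENTS` and `is_void_element` are hypothetical names, the list is the set of void elements defined by the HTML spec, and the sketch assumes tag names arrive as lowercase bytes.

```rust
/// Names of the void elements listed in the HTML spec, as bytes.
/// Hypothetical helper constant, not part of html5gum.
const VOID_ELEMENTS: [&[u8]; 13] = [
    b"area", b"base", b"br", b"col", b"embed", b"hr", b"img",
    b"input", b"link", b"meta", b"source", b"track", b"wbr",
];

/// Returns true if a start tag with this name never has a matching end tag.
/// Assumption: `name` is already lowercase, so no case folding is done here.
fn is_void_element(name: &[u8]) -> bool {
    VOID_ELEMENTS.contains(&name)
}
```

With a helper like this, a start tag named `img` can be treated as closed immediately, while a start tag named `title` still waits for its end tag.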
With those caveats in mind, html5gum can pretty much tokenize anything that browsers can.
The Emitter trait
A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait.
This allows you to:
- Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
- Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if any plaintext between tokens is not of interest to you, you can implement the respective trait methods as no-ops and thereby avoid any overhead of creating plaintext tokens.
See the custom_emitter example for what this looks like in practice.
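The idea of skipping work through no-op methods can be illustrated without the library. The sketch below is a toy under stated assumptions: `ToyEmitter`, `TagNameCollector` and `toy_tokenize` are hypothetical names, the trait is far smaller than html5gum's actual Emitter trait and does not mirror its methods, and the driver is not a WHATWG tokenizer (it only splits on `<` and `>`).

```rust
/// Hypothetical, heavily simplified stand-in for an emitter interface.
/// html5gum's real `Emitter` trait is different and has more methods.
trait ToyEmitter {
    fn start_tag(&mut self, name: &[u8]);
    fn end_tag(&mut self, name: &[u8]);
    /// No-op by default: emitters that ignore plaintext never allocate for it.
    fn text(&mut self, _bytes: &[u8]) {}
}

/// Collects only start tag names; end tags and plaintext are dropped.
#[derive(Default)]
struct TagNameCollector {
    names: Vec<String>,
}

impl ToyEmitter for TagNameCollector {
    fn start_tag(&mut self, name: &[u8]) {
        self.names.push(String::from_utf8_lossy(name).into_owned());
    }

    fn end_tag(&mut self, _name: &[u8]) {}
}

/// Toy driver that feeds borrowed byte slices to the emitter.
/// It handles neither attributes, comments, nor any error recovery.
fn toy_tokenize<E: ToyEmitter>(input: &[u8], emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        if rest[0] == b'<' {
            match rest.iter().position(|&b| b == b'>') {
                Some(end) => {
                    let inner = &rest[1..end];
                    if inner.first() == Some(&b'/') {
                        emitter.end_tag(&inner[1..]);
                    } else {
                        emitter.start_tag(inner);
                    }
                    rest = &rest[end + 1..];
                }
                None => {
                    // Unterminated tag: hand the remainder over as text.
                    emitter.text(rest);
                    rest = &[];
                }
            }
        } else {
            // Plaintext runs until the next '<' or the end of input.
            let end = rest.iter().position(|&b| b == b'<').unwrap_or(rest.len());
            emitter.text(&rest[..end]);
            rest = &rest[end..];
        }
    }
}
```

In this sketch the driver only passes borrowed slices, so an emitter decides for itself what to copy. `TagNameCollector` allocates one `String` per start tag and nothing for the text between tags.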
Other features
- No unsafe Rust
- The only dependency is jetscii, and it can be disabled via crate features (see Cargo.toml)
Alternative HTML parsers
html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:
- use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For some (rather large) set of HTML input this works well (particularly quick-xml can be configured to be very lenient about parsing errors) and parsing speed is stellar. But neither can parse all HTML.

  For my own usecase html5gum is about 2x slower than quick-xml.
- use html5ever’s own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own usecase (10-15x slower than quick-xml).
- use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn’t manage to get working for my usecase.
Etymology
Why is this library called html5gum?
- G.U.M: Giant Unreadable Match-statement
- <insert “how it feels to ~~chew 5 gum~~ parse HTML” meme here>
License
Licensed under the MIT license, see ./LICENSE.
Modules
- Module of helper functions for integration tests.
Structs
- The default implementation of crate::Emitter, used to produce (“emit”) tokens.
- A doctype. Some examples:
- An HTML end/close tag, such as </p> or </a>.
- A wrapper around a bytestring.
- A kind of tokenizer that directly yields tokens when used as an iterator, so Token instead of Result<Token, _>.
- An IoReader can be used to construct a tokenizer from any type that implements std::io::Read.
- An HTML start/open tag, such as <p> or <a>.
- A helper struct to seek forwards and backwards in strings. Used by the tokenizer to read HTML from strings.
- An HTML tokenizer. See crate-level docs for basic usage.
Enums
- All parsing errors this tokenizer can emit.
- Tokenizer states that the tokenizer can be switched to from within the emitter.
- The token type used by default. You can define your own token type by implementing the crate::Emitter trait and using crate::Tokenizer::new_with_emitter.
Traits
- An emitter is an object providing methods to the tokenizer to produce tokens.
- An object that can be converted into a crate::Reader.
- An object that provides characters to the tokenizer.
Functions
- Take an educated guess at the next state using the name of a just-now emitted start tag.