Crate html5gum

§html5gum

html5gum is a WHATWG-compliant HTML tokenizer.

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for Ok(token) in Tokenizer::new(html) {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", String::from_utf8_lossy(&hello_world)).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
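The example above turns byte payloads into text with `String::from_utf8_lossy`. As a minimal, standard-library-only sketch of how that conversion behaves on malformed input (the helper name `bytes_to_text` is made up for illustration and is not part of html5gum):

```rust
use std::borrow::Cow;

/// Hypothetical helper: convert a raw byte payload (such as a tag name or a
/// text run) to text, replacing invalid UTF-8 with U+FFFD instead of failing.
fn bytes_to_text(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // Valid UTF-8 is borrowed as-is, without allocating.
    let valid = bytes_to_text(b"hello world");
    assert!(matches!(valid, Cow::Borrowed(_)));
    assert_eq!(valid, "hello world");

    // A stray 0xFF byte is never valid UTF-8; it is replaced by U+FFFD.
    let invalid = bytes_to_text(b"caf\xFF");
    assert!(matches!(invalid, Cow::Owned(_)));
    assert_eq!(invalid, "caf\u{FFFD}");
}
```

If invalid input should be rejected rather than repaired, `std::str::from_utf8` returns a `Result` that can be checked instead.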

html5gum provides multiple kinds of APIs:

Iterating over tokens as shown above.
Implementing your own Emitter for maximum performance; see the custom_emitter.rs example.
A callbacks-based API as a middle ground between convenience and performance; see the callback_emitter.rs example.
With the tree-builder feature, html5gum can be integrated with html5ever and scraper. See the scraper.rs example.
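To give a rough idea of why a visitor-style API can be cheaper than iterating over tokens, here is a toy sketch using only the standard library. Everything in it (`Event`, `ToyVisitor`, `drive`, `TextLen`, `text_len`) is hypothetical and invented for illustration; html5gum's real `Emitter` trait has a different and larger set of methods, so consult the custom_emitter.rs example for the actual interface.

```rust
/// Toy event type for this sketch only; not part of html5gum.
enum Event<'a> {
    StartTag(&'a [u8]),
    Text(&'a [u8]),
    EndTag(&'a [u8]),
}

/// Hypothetical visitor interface with no-op defaults, so an implementor
/// only overrides the events it cares about.
trait ToyVisitor {
    fn start_tag(&mut self, _name: &[u8]) {}
    fn text(&mut self, _bytes: &[u8]) {}
    fn end_tag(&mut self, _name: &[u8]) {}
}

/// Stand-in for a tokenizer: pushes each event straight into the visitor.
fn drive<V: ToyVisitor>(events: &[Event<'_>], visitor: &mut V) {
    for event in events {
        match event {
            Event::StartTag(name) => visitor.start_tag(name),
            Event::Text(bytes) => visitor.text(bytes),
            Event::EndTag(name) => visitor.end_tag(name),
        }
    }
}

/// Counts text bytes without storing any token.
#[derive(Default)]
struct TextLen {
    bytes: usize,
}

impl ToyVisitor for TextLen {
    fn text(&mut self, bytes: &[u8]) {
        self.bytes += bytes.len();
    }
}

/// Runs the toy driver with a `TextLen` visitor and returns the byte count.
fn text_len(events: &[Event<'_>]) -> usize {
    let mut visitor = TextLen::default();
    drive(events, &mut visitor);
    visitor.bytes
}

fn main() {
    let events = [
        Event::StartTag(b"title"),
        Event::Text(b"hello world"),
        Event::EndTag(b"title"),
    ];
    assert_eq!(text_len(&events), 11);
}
```

The visitor here never builds an owned token or string; it only keeps the one counter it needs, which is the kind of saving a custom emitter is meant to allow.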

§What a tokenizer does and what it does not do

html5gum fully implements section 13.2.5 (“Tokenization”) of the WHATWG HTML spec, i.e. it is able to tokenize HTML documents and passes html5lib’s tokenizer test suite. Since it is just a tokenizer, this means:

html5gum does not implement charset detection. This implementation takes and returns bytes, but assumes UTF-8. It recovers gracefully from invalid UTF-8.
html5gum does not correct mis-nested tags.
html5gum doesn’t implement the DOM, and unfortunately in the HTML spec, constructing the DOM (“tree construction”) influences how tokenization is done. For an example of which problems this causes see this example code.
html5gum does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This can change in the future, see issue 21.
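Because mis-nested tags are passed through unchanged, a consumer that cares about nesting has to check it on its own. The following is a minimal sketch of such a check with a stack, assuming a toy token type (`Tok`) and a made-up function name (`first_misnested`); neither is part of html5gum. It deliberately ignores void elements such as `<br>` and the implied end tags that real tree construction handles.

```rust
/// Toy token type for this sketch only; it is not html5gum's `Token`.
#[allow(dead_code)]
#[derive(Debug)]
enum Tok<'a> {
    Start(&'a str),
    End(&'a str),
    Text(&'a str),
}

/// Returns the index of the first end tag that does not match the innermost
/// open start tag, or `None` if every end tag matches.
/// Start tags still open at the end of input are not reported.
fn first_misnested(tokens: &[Tok<'_>]) -> Option<usize> {
    let mut open: Vec<&str> = Vec::new();
    for (i, tok) in tokens.iter().enumerate() {
        match tok {
            Tok::Start(name) => open.push(*name),
            Tok::End(name) => {
                // The innermost open element must be the one being closed.
                if open.pop() != Some(*name) {
                    return Some(i);
                }
            }
            Tok::Text(_) => {}
        }
    }
    None
}

fn main() {
    // <b><i>x</b></i>: the </b> at index 3 arrives while <i> is still open.
    let bad = [
        Tok::Start("b"),
        Tok::Start("i"),
        Tok::Text("x"),
        Tok::End("b"),
        Tok::End("i"),
    ];
    assert_eq!(first_misnested(&bad), Some(3));

    let good = [Tok::Start("b"), Tok::Text("x"), Tok::End("b")];
    assert_eq!(first_misnested(&good), None);
}
```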

With those caveats in mind, html5gum can pretty much ~~parse~~ tokenize anything that browsers can. However, using the experimental tree-builder feature, html5gum can be integrated with html5ever and scraper. See the scraper.rs example.

§Other features

No unsafe Rust
The only dependency is jetscii, which can be disabled via crate features (see Cargo.toml)

§Alternative HTML parsers

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For some (rather large) set of HTML input this works well (particularly quick-xml can be configured to be very lenient about parsing errors) and parsing speed is stellar. But neither can parse all HTML.

For my own use case html5gum is about 2x slower than quick-xml.
use html5ever’s own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own use case (10-15x slower than quick-xml).
use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn’t manage to get working for my use case.

§Etymology

Why is this library called html5gum?

G.U.M: Giant Unreadable Match-statement
<insert “how it feels to ~~chew 5 gum~~ parse HTML” meme here>

§License

Licensed under the MIT license, see ./LICENSE.

Re-exports§

pub use emitters::default::DefaultEmitter;
pub use emitters::default::Doctype;
pub use emitters::default::EndTag;
pub use emitters::default::StartTag;
pub use emitters::default::Token;
pub use emitters::naive_next_state;
pub use emitters::Emitter;

Modules§

emitters: Emitter is a “visitor” on the underlying token stream.

Structs§

HtmlString: A wrapper around a bytestring.
IoReader: An IoReader can be used to construct a tokenizer from any type that implements std::io::Read.
StringReader: A helper struct to seek forwards and backwards in strings. Used by the tokenizer to read HTML from strings.
Tokenizer: An HTML tokenizer. See crate-level docs for basic usage.

Enums§

Error: All parsing errors this tokenizer can emit.
State: A tokenizer state that the tokenizer can be switched to from within the emitter.

Traits§

Readable: An object that can be converted into a crate::Reader.
Reader: An object that provides characters to the tokenizer.
