html5tokenizer 0.2.0

The HTML5 tokenizer from html5ever repackaged with its dependencies removed
html5tokenizer-0.2.0 has been yanked.

html5tokenizer

This crate provides the tokenizer from html5ever, repackaged with all of its dependencies removed. The following dependencies were removed:

  • markup5ever
    buffer_queue, smallcharset and the entity data were merged into the source code.

  • tendril
    According to its README, it contains "a substantial amount of unsafe code". This fork replaces the tendril strings with plain old std::string::String values.

  • mac
    The only macros actually needed (format_if and test_eq) were merged into the source code.

  • log
    Was only used for debug output.

If you want to parse HTML into a tree (DOM), you should by all means use html5ever. This crate is merely for those who only want an HTML5 tokenizer and seek to minimize their compile dependencies (html5ever pulls in 56 of them).

To efficiently resolve named entities like &amp;, the tokenizer uses phf for a compile-time static map. If you don't need to resolve named entities, you can avoid the phf dependency by disabling the named-entities feature (which is enabled by default).
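
A minimal sketch of how the feature could be switched off in a dependent crate's Cargo.toml, assuming the crate is pulled from crates.io at the version named at the top of this page. Because named-entities is a default feature, it is disabled by opting out of default features. Note that this turns off every default feature, not only named-entities; whether the crate defines other default features is not stated here.

```toml
[dependencies]
# Opting out of default features drops named-entities and, with it, phf.
html5tokenizer = { version = "0.2.0", default-features = false }
```

With this setting, named character references are presumably left unresolved by the tokenizer, so any decoding of them would have to happen in your own code.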

Credits

Thanks to the developers of html5ever for their awesome parser!