html5tokenizer 0.5.0

An HTML5 tokenizer with code span support.
Documentation

html5tokenizer

docs.rs crates.io

Spec-compliant HTML parsing requires both tokenization and tree-construction. While this crate implements a spec-compliant HTML tokenizer it does not implement any tree-construction. Instead it just provides a NaiveParser that may be used as follows:

use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

Compared to html5gum

html5tokenizer was forked from html5gum 0.2.1.

  • Code span support has been added.
  • The API has been revised.

html5gum has since switched its parsing to operate on bytes, which html5tokenizer doesn't yet support. html5tokenizer does not implement charset detection. This implementation requires all input to be Rust strings and therefore valid UTF-8.

Both crates pass the html5lib tokenizer test suite.

Both crates have an Emitter trait that lets you bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

  • Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

  • Efficiently filter out uninteresting categories data without ever allocating for it. For example if any plaintext between tokens is not of interest to you, you can implement the respective trait methods as noop and therefore avoid any overhead creating plaintext tokens.

License

Licensed under the MIT license, see the LICENSE file.