Crate html5tokenizer

html5tokenizer

html5tokenizer is a WHATWG-compliant HTML tokenizer (forked from html5gum with added code span support).

use std::fmt::Write;
use html5tokenizer::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in Tokenizer::new(html).infallible() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

What a tokenizer does and what it does not do

html5tokenizer fully implements section 13.2.5 (Tokenization) of the WHATWG HTML spec: it can tokenize HTML documents and passes html5lib’s tokenizer test suite. Because it is only a tokenizer:

  • html5tokenizer does not implement charset detection. This implementation requires all input to be Rust strings and therefore valid UTF-8.
  • html5tokenizer does not correct mis-nested tags.
  • html5tokenizer does not recognize implicitly self-closing elements like <img>. As a tokenizer, it simply emits a start tag token for them. It does, however, emit a self-closing tag for <img .. />.
  • html5tokenizer does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This can change in the future.

With those caveats in mind, html5tokenizer can pretty much ~parse~ tokenize anything that browsers can.
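
Because tags are reported exactly as written, deciding that an element such as <img> never gets an end tag is left to whatever consumes the tokens. The following is a minimal sketch of that step, not part of html5tokenizer: the Tok enum is a local stand-in for the crate's token type, and is_void and open_elements are hypothetical helper names. The list of void elements is taken from the WHATWG HTML spec.

```rust
/// Stand-in for a tokenizer's tag tokens (hypothetical; not the crate's `Token`).
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Start(String),
    End(String),
}

/// Void elements per the WHATWG HTML spec: they never take an end tag.
fn is_void(name: &str) -> bool {
    matches!(
        name,
        "area" | "base" | "br" | "col" | "embed" | "hr" | "img" | "input"
            | "link" | "meta" | "source" | "track" | "wbr"
    )
}

/// Returns the names of the elements still open after consuming `tokens`.
/// End tags that do not match the innermost open element are ignored;
/// a real tree builder recovers from mis-nesting in far more detail.
fn open_elements(tokens: &[Tok]) -> Vec<String> {
    let mut stack: Vec<String> = Vec::new();
    for tok in tokens {
        match tok {
            // A void element is complete as soon as its start tag is seen.
            Tok::Start(name) if is_void(name) => {}
            Tok::Start(name) => stack.push(name.clone()),
            Tok::End(name) => {
                if stack.last() == Some(name) {
                    stack.pop();
                }
            }
        }
    }
    stack
}

fn main() {
    // Token sequence a tokenizer could produce for `<p><img>`.
    let tokens = vec![Tok::Start("p".into()), Tok::Start("img".into())];
    // Only `p` is still open; `img` was never pushed.
    assert_eq!(open_elements(&tokens), vec!["p".to_string()]);
}
```

The sketch keeps tokenizing and nesting apart on purpose: the token stream stays a faithful record of the input, and the caller layers whatever interpretation it needs on top.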

The Emitter trait

A distinguishing feature of html5tokenizer is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

  • Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

  • Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if the plaintext between tags is of no interest to you, you can implement the respective trait methods as no-ops and avoid the overhead of creating plaintext tokens.
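
The filtering idea can be illustrated without the crate. The sketch below is a heavily simplified stand-in, not the real Emitter trait: MiniEmitter, TagCounter and the toy drive function are hypothetical names, and the actual trait has more methods and different signatures. It only shows the pattern of a driver calling into an emitter whose text handler is a no-op.

```rust
/// Hypothetical, simplified stand-in for an emitter-style trait.
/// The crate's real `Emitter` trait is larger and shaped differently.
trait MiniEmitter {
    /// Called once for every start tag, with the tag name.
    fn start_tag(&mut self, name: &str);
    /// Called for every run of plain text. The default is a no-op.
    fn text(&mut self, _text: &str) {}
}

/// Counts start tags and never stores text, so text costs no allocation.
#[derive(Default)]
struct TagCounter {
    count: usize,
}

impl MiniEmitter for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.count += 1;
    }
    // `text` keeps the no-op default.
}

/// Toy driver standing in for a tokenizer. It only understands `<name>`,
/// `</name>` and plain text, which is enough to show the control flow.
fn drive<E: MiniEmitter>(input: &str, emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix('<') {
            // Everything up to the next `>` is treated as the tag.
            let end = after.find('>').unwrap_or(after.len());
            let tag = &after[..end];
            if !tag.starts_with('/') {
                emitter.start_tag(tag.trim());
            }
            rest = after.get(end + 1..).unwrap_or("");
        } else {
            // Everything up to the next `<` is plain text.
            let end = rest.find('<').unwrap_or(rest.len());
            emitter.text(&rest[..end]);
            rest = &rest[end..];
        }
    }
}

fn main() {
    let mut counter = TagCounter::default();
    drive("<p>hello <b>world</b></p>", &mut counter);
    // Two start tags (`p` and `b`); the text runs were passed by and dropped.
    assert_eq!(counter.count, 2);
}
```

Because the driver hands text to the emitter as a borrowed slice, an emitter that ignores it pays nothing for it, which is the same reasoning behind implementing the plaintext methods as no-ops.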

License

Licensed under the MIT license, see ./LICENSE.

Modules

Source code spans.

Structs

An HTML attribute value (plus spans).

A BufReadReader can be used to construct a tokenizer from any type that implements BufRead.

The default implementation of crate::Emitter, used to produce (“emit”) tokens.

A doctype.

An HTML end/close tag, such as </p> or </a>.

A kind of tokenizer that directly yields tokens when used as an iterator, so Token instead of Result<Token, _>.

An HTML start/open tag, such as <p> or <a>.

A helper struct to seek forwards and backwards in strings. Used by the tokenizer to read HTML from strings.

An HTML tokenizer. See crate-level docs for basic usage.

Enums

All parsing errors this tokenizer can emit.

Definition of an empty enum.

The states you can set the tokenizer to.

The token type used by default. You can define your own token type by implementing the crate::Emitter trait and using crate::Tokenizer::new_with_emitter.

Traits

An emitter is an object providing methods to the tokenizer to produce tokens.

An object that can be converted into a crate::Reader.

An object that provides characters to the tokenizer.