Crate html5tokenizer


html5tokenizer

html5tokenizer is a WHATWG-compliant HTML tokenizer (forked from html5gum with added code span support).

use std::fmt::Write;
use html5tokenizer::{Tokenizer, Token};

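// Note the stray whitespace inside the opening tag.
let html = "<title   >hello world</title>";
let mut new_html = String::new();

// `infallible()` makes the iterator yield `Token` directly instead of `Result<Token, _>`.
for token in Tokenizer::new(html).infallible() {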
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

What a tokenizer does and what it does not do

html5tokenizer fully implements section 13.2.5 (Tokenization) of the WHATWG HTML spec, so it can tokenize HTML documents, and it passes html5lib’s tokenizer test suite. Since it is just a tokenizer, this means:

html5tokenizer does not implement charset detection. This implementation requires all input to be Rust strings and therefore valid UTF-8.
html5tokenizer does not correct mis-nested tags.
html5tokenizer does not recognize implicitly self-closing elements like <img>. As a tokenizer, it simply emits a start tag token for them. It does, however, emit a self-closing tag for <img .. />.
html5tokenizer does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This may change in the future.

With those caveats in mind, html5tokenizer can pretty much ~parse~ tokenize anything that browsers can.

The `Emitter` trait

A distinguishing feature of html5tokenizer is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if the plaintext between tags is of no interest to you, you can implement the respective trait methods as no-ops and thereby avoid the overhead of creating plaintext tokens.
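
The filtering idea can be sketched without the crate at all. The block below is a minimal, self-contained illustration and makes several assumptions: `ToyEmitter`, `TagNamesOnly`, `drive` and `collect_tag_names` are hypothetical names invented for this sketch, and the real Emitter trait has a different and larger set of methods. The driver is a toy that only understands `<name>`, `</name>` and text, not a spec-compliant tokenizer. It only shows how a no-op method lets one category of data pass through without any allocation.

```rust
/// Hypothetical, simplified stand-in for an emitter trait.
/// This is NOT the signature of `html5tokenizer::Emitter`.
trait ToyEmitter {
    /// Called with each run of plain text between tags.
    fn emit_text(&mut self, text: &str);
    /// Called with the name of each start tag.
    fn emit_start_tag(&mut self, name: &str);
}

/// An emitter that keeps start tag names and ignores everything else.
struct TagNamesOnly {
    names: Vec<String>,
}

impl ToyEmitter for TagNamesOnly {
    // No-op: text is never copied, so it costs no allocation.
    fn emit_text(&mut self, _text: &str) {}

    fn emit_start_tag(&mut self, name: &str) {
        self.names.push(name.to_owned());
    }
}

/// Toy driver (an assumption for illustration): splits the input into
/// `<...>` chunks and text chunks and reports them to the emitter.
fn drive<E: ToyEmitter>(input: &str, emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // Everything up to the next `>` is treated as the tag body.
            let end = after_lt.find('>').unwrap_or(after_lt.len());
            let inside = after_lt[..end].trim();
            // End tags are skipped; only start tags are reported.
            if !inside.starts_with('/') {
                emitter.emit_start_tag(inside);
            }
            rest = after_lt.get(end + 1..).unwrap_or("");
        } else {
            // Everything up to the next `<` is treated as text.
            let end = rest.find('<').unwrap_or(rest.len());
            emitter.emit_text(&rest[..end]);
            rest = &rest[end..];
        }
    }
}

/// Convenience wrapper: returns the start tag names found in `input`.
fn collect_tag_names(input: &str) -> Vec<String> {
    let mut emitter = TagNamesOnly { names: Vec::new() };
    drive(input, &mut emitter);
    emitter.names
}
```

Calling `collect_tag_names("<title   >hello world</title>")` on this toy yields a single name, `title`; the text `hello world` reaches `emit_text` only as a borrowed slice and is dropped there.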

License

Licensed under the MIT license, see ./LICENSE.

Modules

spans

Source code spans.

Structs

Attribute

An HTML attribute value (plus spans).

BufReadReader

A BufReadReader can be used to construct a tokenizer from any type that implements BufRead.

DefaultEmitter

The default implementation of crate::Emitter, used to produce (“emit”) tokens.

Doctype

A doctype. Some examples:

EndTag

An HTML end/close tag, such as </p> or </a>.

InfallibleTokenizer

A kind of tokenizer that directly yields tokens when used as an iterator, so Token instead of Result<Token, _>.

StartTag

An HTML start/open tag, such as <p> or <a>.

StringReader

A helper struct to seek forwards and backwards in strings. Used by the tokenizer to read HTML from strings.

Tokenizer

An HTML tokenizer. See crate-level docs for basic usage.

Enums

Error

All parsing errors this tokenizer can emit.

Never

Definition of an empty enum.
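
An empty enum is what makes an "infallible" iterator sound: a type with no variants has no values, so a `Result<T, Never>` can only ever be `Ok`. The sketch below shows the general Rust technique under stated assumptions. It redefines `Never` locally, and `unwrap_infallible` and `infallible` are hypothetical helpers written for this illustration; how the crate connects `Never` to InfallibleTokenizer internally is not shown here.

```rust
/// Local re-definition for illustration; the crate exports its own `Never`.
enum Never {}

/// Hypothetical helper: unwraps a `Result` whose error type has no values.
fn unwrap_infallible<T>(result: Result<T, Never>) -> T {
    match result {
        Ok(value) => value,
        // `Never` has no variants, so this empty match is exhaustive
        // and the compiler knows this arm can never run.
        Err(never) => match never {},
    }
}

/// Hypothetical adapter: turns an iterator of `Result<T, Never>`
/// into an iterator that yields `T` directly.
fn infallible<T>(
    items: impl Iterator<Item = Result<T, Never>>,
) -> impl Iterator<Item = T> {
    items.map(unwrap_infallible)
}
```

No `unwrap()` or `panic!` is needed anywhere: the impossibility of the error case is checked by the type system rather than at run time.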

State

The states you can set the tokenizer to.

Token

The token type used by default. You can define your own token type by implementing the crate::Emitter trait and using crate::Tokenizer::new_with_emitter.

Traits

Emitter

An emitter is an object providing methods to the tokenizer to produce tokens.

Readable

An object that can be converted into a crate::Reader.

Reader

An object that provides characters to the tokenizer.