Crate html5tokenizer
html5tokenizer
html5tokenizer is a WHATWG-compliant HTML tokenizer (forked from html5gum with added code span support).
    use std::fmt::Write;
    use html5tokenizer::{Tokenizer, Token};

    let html = "<title   >hello world</title>";
    let mut new_html = String::new();

    for token in Tokenizer::new(html).infallible() {
        match token {
            Token::StartTag(tag) => {
                write!(new_html, "<{}>", tag.name).unwrap();
            }
            Token::String(hello_world) => {
                write!(new_html, "{}", hello_world).unwrap();
            }
            Token::EndTag(tag) => {
                write!(new_html, "</{}>", tag.name).unwrap();
            }
            _ => panic!("unexpected input"),
        }
    }

    assert_eq!(new_html, "<title>hello world</title>");
What a tokenizer does and what it does not do
html5tokenizer fully implements 13.2.5 of the WHATWG HTML spec, i.e. is able to tokenize HTML documents and passes html5lib’s tokenizer test suite. Since it is just a tokenizer, this means:
- html5tokenizer does not implement charset detection. This implementation requires all input to be Rust strings and therefore valid UTF-8.
- html5tokenizer does not correct mis-nested tags.
- html5tokenizer does not recognize implicitly self-closing elements like <img>; as a tokenizer it will simply emit a start token. It does however emit a self-closing tag for <img .. />.
- html5tokenizer does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This can change in the future.
With those caveats in mind, html5tokenizer can pretty much ~parse~ tokenize anything that browsers can.
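Two of the caveats above leave work to the caller: input must already be valid UTF-8, and elements such as <img> arrive as ordinary start tags. The following is a minimal sketch of how a caller might cover both using only the standard library. It does not touch html5tokenizer's API; `decode_input` and `is_void_element` are hypothetical helper names introduced for illustration, and the lossy decoding assumes the bytes are meant to be UTF-8 (it is not charset detection).

```rust
use std::borrow::Cow;

/// Hypothetical helper: turn raw bytes into text that can be handed to a
/// string-based tokenizer. Assumes the bytes are intended to be UTF-8;
/// invalid sequences are replaced with U+FFFD. No charset sniffing is done.
fn decode_input(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Hypothetical helper: true for the void elements defined by the HTML
/// standard. A consumer building a tree could use this to treat a plain
/// start tag such as `<img>` as having no end tag.
/// Expects a lowercase tag name.
fn is_void_element(name: &str) -> bool {
    matches!(
        name,
        "area" | "base" | "br" | "col" | "embed" | "hr" | "img"
            | "input" | "link" | "meta" | "source" | "track" | "wbr"
    )
}
```

A consumer would call `decode_input` before tokenizing and consult `is_void_element` for each start tag whose self-closing flag is not set.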
The Emitter trait
A distinguishing feature of html5tokenizer is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait.
This allows you to:
- Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
- Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if plaintext between tags is not of interest to you, you can implement the respective trait methods as no-ops and thereby avoid the overhead of creating plaintext tokens.
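The idea of filtering through no-op callbacks can be sketched without the crate. The example below is an illustration of the pattern only: `MiniEmitter`, `TagCounter`, and `drive` are hypothetical names, and `MiniEmitter` is a heavily simplified stand-in, not the real Emitter trait, whose methods differ. The toy driver understands only `<name>`, `</name>`, and plain text.

```rust
/// Hypothetical, heavily simplified stand-in for an emitter-style trait.
/// The crate's real `Emitter` trait has a different set of methods.
trait MiniEmitter {
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    fn text(&mut self, text: &str);
}

/// Counts tags and ignores text: the text callback is a no-op, so nothing
/// is ever allocated for plaintext.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
    end_tags: usize,
}

impl MiniEmitter for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.start_tags += 1;
    }
    fn end_tag(&mut self, _name: &str) {
        self.end_tags += 1;
    }
    fn text(&mut self, _text: &str) {}
}

/// Toy driver standing in for a tokenizer. It handles only `<name>`,
/// `</name>`, and plain text (no attributes, comments, or entities).
fn drive<E: MiniEmitter>(input: &str, emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // A tag runs up to the next '>' (or to the end of the input).
            let end = after_lt.find('>').unwrap_or(after_lt.len());
            let inner = &after_lt[..end];
            match inner.strip_prefix('/') {
                Some(name) => emitter.end_tag(name.trim()),
                None => emitter.start_tag(inner.trim()),
            }
            rest = after_lt.get(end + 1..).unwrap_or("");
        } else {
            // Plain text runs up to the next '<' (or to the end of the input).
            let end = rest.find('<').unwrap_or(rest.len());
            emitter.text(&rest[..end]);
            rest = &rest[end..];
        }
    }
}
```

Because the driver passes borrowed slices and the emitter decides what to keep, an emitter that discards text never pays for building text tokens.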
License
Licensed under the MIT license, see ./LICENSE.
Modules
Source code spans.
Structs
A HTML attribute value (plus spans).
A BufReadReader can be used to construct a tokenizer from any type that implements BufRead.
The default implementation of crate::Emitter, used to produce (“emit”) tokens.
A doctype. Some examples:
A HTML end/close tag, such as </p> or </a>.
A kind of tokenizer that directly yields tokens when used as an iterator, so Token instead of Result<Token, _>.
A HTML start/open tag, such as <p> or <a>.
A helper struct to seek forwards and backwards in strings. Used by the tokenizer to read HTML from strings.
A HTML tokenizer. See crate-level docs for basic usage.
Enums
All parsing errors this tokenizer can emit.
Definition of an empty enum.
The states you can set the tokenizer to.
The token type used by default. You can define your own token type by implementing the crate::Emitter trait and using crate::Tokenizer::new_with_emitter.
Traits
An emitter is an object providing methods to the tokenizer to produce tokens.
An object that can be converted into a crate::Reader.
An object that provides characters to the tokenizer.
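The `.infallible()` call in the crate-level example and the "empty enum" entry under Enums both rest on a common Rust idiom: when the error type has no values, a `Result<Token, _>` can be unwrapped without any possibility of panicking. The following is a minimal sketch of that idiom, not the crate's implementation; the `Never` type is defined locally here, and `unwrap_infallible` and `infallible` are hypothetical helper names.

```rust
/// An enum with no variants: no value of this type can ever exist.
#[derive(Debug)]
enum Never {}

/// Hypothetical helper: unwrap a `Result` whose error type is uninhabited.
fn unwrap_infallible<T>(result: Result<T, Never>) -> T {
    match result {
        Ok(value) => value,
        // `Never` has no variants, so this empty match is exhaustive and
        // the `Err` arm can never actually run.
        Err(never) => match never {},
    }
}

/// Hypothetical adapter: turn an iterator of `Result<T, Never>` into an
/// iterator of plain `T`.
fn infallible<T>(
    iter: impl Iterator<Item = Result<T, Never>>,
) -> impl Iterator<Item = T> {
    iter.map(unwrap_infallible)
}
```

The benefit of this design is that the absence of errors is checked by the compiler rather than asserted at runtime with `unwrap()`.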