Crate html5tokenizer

§html5tokenizer

Spec-compliant HTML parsing requires both tokenization and tree construction. This crate implements a spec-compliant HTML tokenizer, but it does not implement tree construction. Instead, it provides a NaiveParser that can be used as follows:

use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::Char(c) => {
            write!(new_html, "{c}").unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        Token::EndOfFile => {},
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

This library can provide source spans. For an example, see examples/spans.rs, which produces the following output:

note:
  ┌─ file.html:1:2
  │
1 │ <img src=example.jpg alt="some description">
  │  ^^^ ^^^ ^^^^^^^^^^^ ^^^  ^^^^^^^^^^^^^^^^ attr value
  │  │   │   │           │
  │  │   │   │           attr name
  │  │   │   attr value
  │  │   attr name
  │  tag name
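
To make the idea of a source span concrete, here is a minimal, standard-library-only sketch of how byte ranges can be turned into a caret line like the one above. It does not use this crate's API: the `underline` helper and the hand-written ranges are assumptions made for illustration, and the sketch assumes single-line ASCII input so that byte offsets coincide with columns.

```rust
use std::ops::Range;

/// Builds a line of `^` characters underlining each byte range of `source`.
/// Hypothetical helper: assumes `source` is a single ASCII line, so one byte
/// corresponds to exactly one column.
fn underline(source: &str, spans: &[Range<usize>]) -> String {
    let mut marks = vec![' '; source.len()];
    for span in spans {
        for i in span.clone() {
            // Ignore offsets that fall outside the source.
            if i < marks.len() {
                marks[i] = '^';
            }
        }
    }
    // Drop trailing spaces so the line ends at the last caret.
    marks.into_iter().collect::<String>().trim_end().to_string()
}

fn main() {
    let source = "<img src=example.jpg>";
    // Hand-written spans for the tag name, attribute name and attribute value.
    // A real tokenizer would report these; here they are assumed.
    let spans = [1..4, 5..8, 9..20];
    println!("{source}");
    println!("{}", underline(source, &spans));
}
```

Slicing the source with one of the ranges (for example `&source[9..20]`) recovers the text the span refers to, which is what makes spans useful for diagnostics.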

§Limitations

  • This crate does not yet implement tree construction
    (which is necessary for spec-compliant HTML parsing).

  • This crate does not yet implement character encoding detection.
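
The first limitation can be illustrated with a small sketch of what happens when tokens are nested by a simple stack instead of the tree-construction rules of the HTML specification. The `Tok` and `Node` types and the `build` function below are hypothetical stand-ins written for this example; they are not this crate's types, and the sketch makes no claim about how NaiveParser works internally.

```rust
/// Simplified stand-in for tokenizer output (NOT this crate's `Token` type).
#[derive(Debug)]
enum Tok {
    Start(String),
    End(String),
    Text(String),
}

#[derive(Debug, PartialEq)]
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Nests tokens using a plain stack of open elements.
/// This deliberately ignores the spec's tree-construction rules
/// (implied end tags, void elements, error recovery, and so on).
fn build(tokens: &[Tok]) -> Vec<Node> {
    // Each entry is (element name, children so far); the bottom entry is the root.
    let mut stack: Vec<(String, Vec<Node>)> = vec![(String::new(), Vec::new())];
    for tok in tokens {
        match tok {
            Tok::Start(name) => stack.push((name.clone(), Vec::new())),
            Tok::Text(text) => stack.last_mut().unwrap().1.push(Node::Text(text.clone())),
            Tok::End(name) => {
                // Close only when the innermost open element matches; otherwise ignore.
                if stack.len() > 1 && stack.last().unwrap().0 == *name {
                    let (name, children) = stack.pop().unwrap();
                    stack.last_mut().unwrap().1.push(Node::Element { name, children });
                }
            }
        }
    }
    // Close whatever is still open at the end of the input.
    while stack.len() > 1 {
        let (name, children) = stack.pop().unwrap();
        stack.last_mut().unwrap().1.push(Node::Element { name, children });
    }
    stack.pop().unwrap().1
}

fn main() {
    // Tokens for the input `<p>one<p>two</p>`.
    let tokens = [
        Tok::Start("p".into()),
        Tok::Text("one".into()),
        Tok::Start("p".into()),
        Tok::Text("two".into()),
        Tok::End("p".into()),
    ];
    println!("{:#?}", build(&tokens));
}
```

The stack-based builder nests the second `p` inside the first, whereas the HTML specification requires the second `<p>` start tag to close the open `p` element, producing two siblings. Differences of this kind are why tokenization alone does not amount to spec-compliant parsing.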

§Compliance & testing

The tokenizer passes the html5lib tokenizer test suite. The library has not yet been fuzz-tested.

§Credits

html5tokenizer was forked from html5gum 0.2.1, which was created by Markus Unterwaditzer, who deserves major props for implementing all 80 (!) tokenizer states. Since the fork:

  • Code span support has been added.
  • The API has been revised.

For details please refer to the changelog.

§License

Licensed under the MIT license, see the LICENSE file.

Re-exports§

pub use token::Doctype;
pub use token::EndTag;
pub use token::StartTag;
pub use token::Token;

Modules§

attr
Types for HTML attributes.
offset
Source code offsets.
reader
Provides the Reader trait (and implementations) used by the tokenizer.
token
Provides the Token type.
trace
Provides the Trace type (byte offsets and syntax information about tokens).

Structs§

BasicEmitter
An Emitter implementation that yields Token.
NaiveParser
A naive HTML parser (not spec-compliant since it doesn’t do tree construction).
Tokenizer
An HTML tokenizer.
TracingEmitter
The default implementation of Emitter, used to produce tokens.

Enums§

Error
All parse errors this tokenizer can emit.
Event
An event yielded by the Iterator implementation for the Tokenizer.
State
The states you can set the tokenizer to.

Traits§

Emitter
An emitter is an object providing methods to the tokenizer to produce (“emit”) tokens.