TagSoup is a small, fast, fairly forgiving HTML-ish parser with zero required dependencies.
It is built for the boringly useful jobs:
- Parse real-world markup without immediately fainting.
- Walk the resulting tree.
- Query it with a compact CSS-style selector API.
- Pull out text, attributes, and spans.
It is not trying to impersonate a browser engine. It just wants to turn messy markup into something workable, quickly.
§Highlights
- Optional serde support, enabled by default.
- Preserves source spans for nodes and parse errors.
- Handles raw-text elements like script and style sensibly.
- Supports query_selector and query_selector_all.
- Supports tree walking with a small visitor API.
- Tries to recover from malformed markup instead of giving up immediately.
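To make the idea of a source span with line and column information concrete, here is a minimal, self-contained sketch that turns a byte offset into a 1-based line and column. The helper name line_col is hypothetical and is not part of TagSoup; how the crate's own span types compute or expose this information is not shown here.

```rust
/// Hypothetical helper, not part of TagSoup: converts a byte offset into a
/// 1-based (line, column) pair. Columns are counted in characters, and
/// only '\n' starts a new line.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let mut line = 1;
    let mut col = 1;
    for (i, ch) in source.char_indices() {
        // Stop once we reach the character that starts at `offset`.
        if i >= offset {
            break;
        }
        if ch == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}
```

For the source "<div>\n  <p>Hi</p>\n</div>", the p tag starts at byte offset 8, which this sketch maps to line 2, column 3.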
§Examples
// Parse an HTML tag soup.
let doc = tagsoup::Document::parse("<div><p id=here>Hello, world!</p></div>");
// Check for parsing errors.
assert!(doc.errors.is_empty());
// Query the document for an element using a CSS selector.
let element = doc.query_selector("#here").unwrap();
assert_eq!(element.text_content(), "Hello, world!");
§Querying The Tree
let doc = tagsoup::Document::parse(r#"
<article id="main">
<p class="lead">Hello</p>
<p data-kind="feature card">world</p>
</article>
"#);
assert_eq!(doc.query_selector("#main .lead").unwrap().text_content(), "Hello");
assert_eq!(doc.query_selector_all("[data-kind*=feature]").len(), 1);
§Notes
- Whitespace is preserved by default.
- Call Document::trimmed if you want leading and trailing ASCII whitespace removed from text nodes.
- Element::text_content decodes HTML entities, except inside raw-text elements like script and style.
- Invalid selectors currently panic in Document::query_selector and Document::query_selector_all.
This is not a full WHATWG-compliant HTML parser. It is a pragmatic parser for documents that are mostly HTML, occasionally cursed, and still need to be dealt with.
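Because invalid selectors panic, code that accepts selectors from untrusted input may want to contain the panic. The sketch below shows one possible approach using std::panic::catch_unwind. Both catch_query_panic and toy_select are hypothetical names for illustration; toy_select is only a stand-in for a selector call that panics on bad input, and this approach only works when the program is built with the default unwinding panic strategy.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical helper, not part of TagSoup: runs `query` and turns a
/// panic into `None` instead of unwinding into the caller.
fn catch_query_panic<T>(query: impl FnOnce() -> T) -> Option<T> {
    panic::catch_unwind(AssertUnwindSafe(query)).ok()
}

/// Stand-in for a selector call that panics on invalid input.
/// It is NOT the crate's selector parser; it only mimics the panic.
fn toy_select(selector: &str) -> usize {
    if selector.is_empty() || selector.contains("[[") {
        panic!("invalid selector: {:?}", selector);
    }
    selector.len()
}
```

With a real document, the closure would presumably wrap the query call itself, along the lines of catch_query_panic(|| doc.query_selector(user_input)). The default panic hook still prints the panic message to stderr.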
Structs§
- Attribute: Attribute of an element.
- AttributeValue: Value of an attribute.
- CommentNode: Comment node in the DOM tree.
- DoctypeNode: Doctype node in the DOM tree.
- Document: Document represents the entire parsed HTML document fragment.
- Element: Element in the DOM tree.
- Lexer: TagSoup lexer.
- ParseError: Document parse error.
- ProcessingInstructionNode: Processing instruction node in the DOM tree.
- ResolvedSpan: Span of the information in the parsed source, with line and column information.
- SourceSpan: Span of the information in the parsed source.
- TextNode: Text node in the DOM tree.
- Token: TagSoup Token.
Enums§
- ElementKind: Kind of element.
- Node: Node in the DOM tree.
- ParseErrorKind: Document parse error kinds.
- TokenKind: TagSoup Token kind.
- VisitControl: Visitor control flow for tree traversal.
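As an illustration of what visitor control flow during tree traversal can look like, here is a small self-contained sketch over a toy tree. ToyNode, Flow, walk, and visible_text are all hypothetical names; the real Node and VisitControl types may have different variants and a different API.

```rust
/// Toy node type for illustration only; TagSoup's real `Node` differs.
enum ToyNode {
    Element { name: &'static str, children: Vec<ToyNode> },
    Text(&'static str),
}

/// Hypothetical control-flow values; the variants of the real
/// `VisitControl` are not listed in this page.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Flow {
    Continue,
    SkipChildren,
    Stop,
}

/// Pre-order walk. Returns `Flow::Stop` if the visitor asked to stop,
/// and `Flow::Continue` otherwise.
fn walk(node: &ToyNode, visit: &mut dyn FnMut(&ToyNode) -> Flow) -> Flow {
    match visit(node) {
        Flow::Stop => return Flow::Stop,
        Flow::SkipChildren => return Flow::Continue,
        Flow::Continue => {}
    }
    if let ToyNode::Element { children, .. } = node {
        for child in children {
            if walk(child, visit) == Flow::Stop {
                return Flow::Stop;
            }
        }
    }
    Flow::Continue
}

/// Collects text in document order, skipping the contents of `script`.
fn visible_text(root: &ToyNode) -> String {
    let mut out = String::new();
    walk(root, &mut |node: &ToyNode| match node {
        ToyNode::Element { name, .. } if *name == "script" => Flow::SkipChildren,
        ToyNode::Element { .. } => Flow::Continue,
        ToyNode::Text(text) => {
            out.push_str(text);
            Flow::Continue
        }
    });
    out
}
```

Returning a control value from the visitor lets a caller prune subtrees or stop early without the walker needing to know why.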
Functions§
- normalize_whitespace: Collapses runs of ASCII whitespace into a single space.
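The behaviour described for normalize_whitespace can be sketched as follows. This is an illustrative re-implementation under a different, hypothetical name (collapse_ascii_whitespace), not the crate's code; in particular, whether the real function keeps or trims a leading or trailing space is an assumption this sketch does not settle.

```rust
/// Illustrative sketch, not TagSoup's implementation: replaces every run
/// of ASCII whitespace with a single space. Non-ASCII whitespace such as
/// U+00A0 is left untouched.
fn collapse_ascii_whitespace(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_run = false;
    for ch in input.chars() {
        if ch.is_ascii_whitespace() {
            // Emit one space for the first character of a run only.
            if !in_run {
                out.push(' ');
            }
            in_run = true;
        } else {
            out.push(ch);
            in_run = false;
        }
    }
    out
}
```

In this sketch, "Hello,\n\t world!  " becomes "Hello, world! ", with the trailing run reduced to one space rather than removed.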