Crate tagsoup

TagSoup is a small, fast, fairly forgiving HTML-ish parser with zero required dependencies.

It is built for the boringly useful jobs:

Parse real-world markup without immediately fainting.
Walk the resulting tree.
Query it with a compact CSS-style selector API.
Pull out text, attributes, and spans.

It is not trying to impersonate a browser engine. It just wants to turn messy markup into something workable, quickly.

§Highlights

Optional serde support, enabled by default.
Preserves source spans for nodes and parse errors.
Handles raw-text elements like script and style sensibly.
Supports query_selector and query_selector_all.
Supports tree walking with a small visitor API.
Tries to recover from malformed markup instead of giving up immediately.

§Examples

// Parse an HTML tag soup.
let doc = tagsoup::Document::parse("<div><p id=here>Hello, world!</p></div>");

// Check for parsing errors.
assert!(doc.errors.is_empty());

// Query the document for an element using a CSS selector.
let element = doc.query_selector("#here").unwrap();
assert_eq!(element.text_content(), "Hello, world!");

§Querying The Tree

let doc = tagsoup::Document::parse(r#"
	<article id="main">
		<p class="lead">Hello</p>
		<p data-kind="feature card">world</p>
	</article>
"#);

assert_eq!(doc.query_selector("#main .lead").unwrap().text_content(), "Hello");
assert_eq!(doc.query_selector_all("[data-kind*=feature]").len(), 1);
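
The `[data-kind*=feature]` selector above relies on substring matching. As a rough illustration of how the standard CSS attribute operators compare an attribute value against the selector value, here is a minimal standalone sketch. The helper `attr_matches` is hypothetical and is not part of tagsoup. Which operators tagsoup actually supports beyond `*=` is not stated here, so treat the list as an assumption based on the CSS Selectors specification.

```rust
/// Hypothetical helper (not a tagsoup API): applies one CSS attribute
/// operator to the actual attribute value and the value in the selector.
fn attr_matches(op: &str, actual: &str, expected: &str) -> bool {
    match op {
        // [attr=value]: exact match.
        "=" => actual == expected,
        // [attr*=value]: substring match; an empty value matches nothing.
        "*=" => !expected.is_empty() && actual.contains(expected),
        // [attr^=value]: prefix match.
        "^=" => !expected.is_empty() && actual.starts_with(expected),
        // [attr$=value]: suffix match.
        "$=" => !expected.is_empty() && actual.ends_with(expected),
        // [attr~=value]: one of the whitespace-separated words equals value.
        "~=" => {
            !expected.is_empty()
                && !expected.contains(|c: char| c.is_ascii_whitespace())
                && actual.split_ascii_whitespace().any(|word| word == expected)
        }
        // Unknown operators never match in this sketch.
        _ => false,
    }
}

fn main() {
    // Mirrors the example: data-kind="feature card" against *=feature.
    assert!(attr_matches("*=", "feature card", "feature"));
    // The same value does not match exactly, but does match as a word.
    assert!(!attr_matches("=", "feature card", "feature"));
    assert!(attr_matches("~=", "feature card", "card"));
}
```

This explains why the query in the example finds exactly one element: only the second paragraph carries a `data-kind` value containing `feature`.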

§Notes

Whitespace is preserved by default.
Call Document::trimmed if you want leading and trailing ASCII whitespace removed from text nodes.
Element::text_content decodes HTML entities, except inside raw-text elements like script and style.
Invalid selectors currently panic in Document::query_selector and Document::query_selector_all.
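
Because invalid selectors panic, code that accepts selectors from untrusted input may want a guard. One possible approach, sketched below with only the standard library, is to run the query inside `std::panic::catch_unwind`. The helper name `catch_query_panic` is hypothetical, the closures are stand-ins for a real call such as `doc.query_selector_all(input).len()`, and the sketch assumes the binary is built with the default unwinding panic strategy. A closure that borrows a document may additionally need `std::panic::AssertUnwindSafe`, depending on the types involved.

```rust
use std::panic::{self, UnwindSafe};

/// Hypothetical helper (not a tagsoup API): runs `query` and converts a
/// panic into `None` instead of unwinding into the caller.
fn catch_query_panic<T>(query: impl FnOnce() -> T + UnwindSafe) -> Option<T> {
    panic::catch_unwind(query).ok()
}

fn main() {
    // Stand-in for a query with a valid selector.
    let ok = catch_query_panic(|| 2 + 2);
    assert_eq!(ok, Some(4));

    // Stand-in for a query whose selector fails to parse and panics.
    let bad = catch_query_panic(|| -> usize { panic!("invalid selector") });
    assert_eq!(bad, None);
}
```

The default panic hook still prints the panic message to standard error, and the guard has no effect when the crate is compiled with `panic = "abort"`.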

This is not a full WHATWG-compliant HTML parser. It is a pragmatic parser for documents that are mostly HTML, occasionally cursed, and still need to be dealt with.

Structs§

Attribute: Attribute of an element.
AttributeValue: Value of an attribute.
CommentNode: Comment node in the DOM tree.
DoctypeNode: Doctype node in the DOM tree.
Document: Document represents the entire parsed HTML document fragment.
Element: Element in the DOM tree.
Lexer: TagSoup lexer.
ParseError: Document parse error.
ProcessingInstructionNode: Processing instruction node in the DOM tree.
ResolvedSpan: Span of the information in the parsed source, with line and column information.
SourceSpan: Span of the information in the parsed source.
TextNode: Text node in the DOM tree.
Token: TagSoup Token.

Enums§

ElementKind: Kind of element.
Node: Node in the DOM tree.
ParseErrorKind: Document parse error kinds.
TokenKind: TagSoup Token kind.
VisitControl: Visitor control flow for tree traversal.

Functions§

normalize_whitespace: Collapses runs of ASCII whitespace into a single space.
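
To make the collapsing rule concrete, here is a minimal standalone sketch of the same idea. It is not the crate's implementation, and the signature of `normalize_whitespace` is not shown on this page, so the function below uses a hypothetical name. It also assumes that leading and trailing runs are collapsed to a single space rather than removed.

```rust
/// Hypothetical stand-in (not the tagsoup function): replaces every run of
/// ASCII whitespace with one space and leaves other characters untouched.
fn collapse_ascii_whitespace(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_run = false;
    for ch in input.chars() {
        if ch.is_ascii_whitespace() {
            // Emit a single space for the first character of a run only.
            if !in_run {
                out.push(' ');
            }
            in_run = true;
        } else {
            out.push(ch);
            in_run = false;
        }
    }
    out
}

fn main() {
    assert_eq!(collapse_ascii_whitespace("Hello,\n\t world!"), "Hello, world!");
    // Assumption of this sketch: edge runs collapse but are not trimmed.
    assert_eq!(collapse_ascii_whitespace("  a  b  "), " a b ");
    // A non-breaking space is not ASCII whitespace, so it is kept as-is.
    assert_eq!(collapse_ascii_whitespace("a\u{00A0}b"), "a\u{00A0}b");
}
```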
