Crate tagsoup


TagSoup is a small, fast, fairly forgiving HTML-ish parser with zero required dependencies.

It is built for the boringly useful jobs:

  • Parse real-world markup without immediately fainting.
  • Walk the resulting tree.
  • Query it with a compact CSS-style selector API.
  • Pull out text, attributes, and spans.

It is not trying to impersonate a browser engine. It just wants to turn messy markup into something workable, quickly.

§Highlights

  • Optional serde support, enabled by default.
  • Preserves source spans for nodes and parse errors.
  • Handles raw-text elements like script and style sensibly.
  • Supports query_selector and query_selector_all.
  • Supports tree walking with a small visitor API.
  • Tries to recover from malformed markup instead of giving up immediately.

§Examples

// Parse a snippet of tag-soup HTML (note the unquoted attribute value).
let doc = tagsoup::Document::parse("<div><p id=here>Hello, world!</p></div>");

// Check for parsing errors.
assert!(doc.errors.is_empty());

// Query the document for an element using a CSS selector.
let element = doc.query_selector("#here").unwrap();
assert_eq!(element.text_content(), "Hello, world!");

§Querying The Tree

let doc = tagsoup::Document::parse(r#"
	<article id="main">
		<p class="lead">Hello</p>
		<p data-kind="feature card">world</p>
	</article>
"#);

assert_eq!(doc.query_selector("#main .lead").unwrap().text_content(), "Hello");
assert_eq!(doc.query_selector_all("[data-kind*=feature]").len(), 1);
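The `[data-kind*=feature]` selector above uses the CSS substring operator, which is why it matches the value `feature card`. As a rough illustration of how the standard attribute operators differ, here is a standalone sketch that uses only the standard library. `AttrOp` and `attr_matches` are hypothetical names invented for this example and are not part of tagsoup. The page above only demonstrates `*=`, so support for the other operators is an assumption.

```rust
/// CSS attribute-selector operators as defined by the Selectors specification.
/// Hypothetical type for illustration; not part of the tagsoup API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AttrOp {
    Equals,    // [attr=value]   exact match
    Includes,  // [attr~=value]  one whitespace-separated word matches
    Prefix,    // [attr^=value]  value starts with the operand
    Suffix,    // [attr$=value]  value ends with the operand
    Substring, // [attr*=value]  value contains the operand
}

/// Returns true when `actual` (the attribute's value) satisfies `op` against `expected`.
fn attr_matches(actual: &str, op: AttrOp, expected: &str) -> bool {
    // Per the Selectors specification, an empty operand never matches
    // for the ~=, ^=, $= and *= operators.
    if expected.is_empty() && op != AttrOp::Equals {
        return false;
    }
    match op {
        AttrOp::Equals => actual == expected,
        AttrOp::Includes => actual
            .split_ascii_whitespace()
            .any(|word| word == expected),
        AttrOp::Prefix => actual.starts_with(expected),
        AttrOp::Suffix => actual.ends_with(expected),
        AttrOp::Substring => actual.contains(expected),
    }
}
```

With the value `feature card` from the example document, `Substring` with `feature` and `Includes` with `card` both match, while `Equals` with `feature` does not.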

§Notes

This is not a full WHATWG-compliant HTML parser. It is a pragmatic parser for documents that are mostly HTML, occasionally cursed, and still need to be dealt with.

Structs§

Attribute
Attribute of an element.
AttributeValue
Value of an attribute.
CommentNode
Comment node in the DOM tree.
DoctypeNode
Doctype node in the DOM tree.
Document
Represents the entire parsed HTML document or fragment.
Element
Element in the DOM tree.
Lexer
TagSoup lexer.
ParseError
Document parse error.
ProcessingInstructionNode
Processing instruction node in the DOM tree.
ResolvedSpan
Span of the information in the parsed source, with line and column information.
SourceSpan
Span of the information in the parsed source.
TextNode
Text node in the DOM tree.
Token
TagSoup Token.
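
The difference between `SourceSpan` and `ResolvedSpan` in the list above is that the latter carries line and column information. The sketch below shows one way such a resolution could work. It assumes a span is a pair of byte offsets with an exclusive end, and that lines and columns are 1-based with columns counted in characters; none of this is confirmed by the page. `ByteSpan`, `LineCol`, `resolve_offset`, and `resolve_span` are hypothetical stand-ins, not the crate's types or methods.

```rust
/// Hypothetical byte-offset span (end exclusive); not tagsoup's `SourceSpan`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ByteSpan {
    start: usize,
    end: usize,
}

/// Hypothetical 1-based line/column position; not tagsoup's `ResolvedSpan`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LineCol {
    line: usize,
    column: usize,
}

/// Converts a byte offset into a 1-based line and column by scanning the source.
/// Columns count characters, not bytes. Offsets past the end resolve to the
/// position just after the last character.
fn resolve_offset(source: &str, offset: usize) -> LineCol {
    let mut line = 1;
    let mut column = 1;
    for (index, ch) in source.char_indices() {
        if index >= offset {
            break;
        }
        if ch == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    LineCol { line, column }
}

/// Resolves both ends of a span.
fn resolve_span(source: &str, span: ByteSpan) -> (LineCol, LineCol) {
    (
        resolve_offset(source, span.start),
        resolve_offset(source, span.end),
    )
}
```

Scanning from the start is linear in the offset. A parser that resolves many spans would more likely precompute a table of line-start offsets and binary-search it.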

Enums§

ElementKind
Kind of element.
Node
Node in the DOM tree.
ParseErrorKind
Document parse error kinds.
TokenKind
TagSoup Token kind.
VisitControl
Visitor control flow for tree traversal.

Functions§

normalize_whitespace
Collapses runs of ASCII whitespace into a single space.
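
To make the one-line description of `normalize_whitespace` concrete, here is a standard-library sketch of the same idea. It is not the crate's implementation, and `collapse_ascii_whitespace` is a hypothetical name. The description does not say whether leading and trailing whitespace is trimmed, so this sketch assumes it is not: a run at either edge collapses to a single space like any other run.

```rust
/// Collapses each run of ASCII whitespace (space, tab, LF, FF, CR) into one space.
/// Non-ASCII whitespace such as U+00A0 is left untouched.
/// Illustrative only; not tagsoup's `normalize_whitespace`.
fn collapse_ascii_whitespace(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_run = false;
    for ch in input.chars() {
        if ch.is_ascii_whitespace() {
            // Emit a single space for the first character of a run only.
            if !in_run {
                out.push(' ');
                in_run = true;
            }
        } else {
            out.push(ch);
            in_run = false;
        }
    }
    out
}
```

For instance, `collapse_ascii_whitespace("Hello,\n\t world!")` yields `"Hello, world!"`.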