TagSoup
TagSoup is a small, fast, fairly forgiving HTML-ish parser written in Rust.
It is built for the boringly useful jobs:
- Parse real-world markup without immediately fainting.
- Walk the resulting tree.
- Query it with a compact CSS-style selector API.
- Pull out text, attributes, and spans.
It is not trying to impersonate a browser engine. It just wants to turn messy markup into something workable, quickly.
Loosely based on the HTML Living Standard.
Features
- Zero required dependencies.
- Optional serde support, enabled by default.
- Preserves source spans for nodes and parse errors.
- Handles raw-text elements like script and style sensibly.
- Supports query_selector and query_selector_all.
- Supports tree walking with a small visitor API.
- Tries to recover from malformed markup instead of giving up immediately.
Installation
[]
= "0.1.1"
If you want to keep it dependency-free all the way down:
[]
= { = "0.1.1", = false }
Usage
// Parse an HTML tag soup.
let doc = parse;
// Check for parsing errors.
assert!;
// Query the document for an element using a CSS selector.
let element = doc.query_selector.unwrap;
assert_eq!;
If you want to collect data from the tree directly, use the visitor API:
let html = r#"
<ul>
<li><a href="/one">One</a></li>
<li><a href="/two">Two</a></li>
</ul>
"#;
let doc = parse;
let mut hrefs = Vec::new();
doc.visit;
assert_eq!;
If you prefer selectors over walking the tree yourself:
let doc = parse;
assert_eq!;
assert_eq!;
Selector Support
The selector engine is intentionally compact, but it covers the selectors you usually want for scraping and document inspection:
- Tag selectors: div
- ID selectors: #main
- Class selectors: .hero
- Attribute presence: [href]
- Attribute equality: [lang=en]
- Attribute contains: [data-kind*=feature]
- Attribute prefix and suffix: [src^=http], [src$=.png]
- Whitespace-separated attribute matching: [rel~=nofollow]
- Descendant, child, and sibling combinators: article .lead, ul > li, h2 + p, h2 ~ p
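If the attribute operators are unfamiliar, the following minimal sketch models what each one means in standard CSS. It does not use TagSoup: AttrOp and attr_matches are hypothetical names invented for illustration, and TagSoup's own matching may differ in edge cases such as empty operands or case sensitivity.

```rust
/// Attribute operators from the list above, modeled on standard CSS semantics.
/// Hypothetical type for illustration; not part of TagSoup.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AttrOp {
    Equals,   // [attr=value]
    Contains, // [attr*=value]
    Prefix,   // [attr^=value]
    Suffix,   // [attr$=value]
    Word,     // [attr~=value]
}

/// Returns true if the attribute value `actual` satisfies `op` against `expected`.
fn attr_matches(op: AttrOp, actual: &str, expected: &str) -> bool {
    match op {
        AttrOp::Equals => actual == expected,
        // In CSS, the substring operators match nothing when the operand is empty.
        AttrOp::Contains => !expected.is_empty() && actual.contains(expected),
        AttrOp::Prefix => !expected.is_empty() && actual.starts_with(expected),
        AttrOp::Suffix => !expected.is_empty() && actual.ends_with(expected),
        // `~=` compares against each whitespace-separated word of the value,
        // so an operand that is empty or contains whitespace can never match.
        AttrOp::Word => {
            !expected.is_empty()
                && !expected.contains(char::is_whitespace)
                && actual.split_ascii_whitespace().any(|word| word == expected)
        }
    }
}
```

The practical difference to remember is between the contains and word operators: a rel value of "noopener nofollow" matches [rel~=nofollow] and [rel*=follow], but not [rel~=follow].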
Invalid selectors currently panic in query_selector and query_selector_all, so if the selector is user input, validate or sanitize it first.
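When validating up front is impractical, another option is to contain the panic at the call site. The sketch below uses the standard library's std::panic::catch_unwind. The names catch_selector_panic and stand_in_query are hypothetical, and stand_in_query only imitates a panicking selector call so that the example is self-contained; in real code the closure would wrap the call to query_selector or query_selector_all.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `query` and turns a panic into `None`.
/// This only works when the program is built with the default
/// `panic = "unwind"` strategy; with `panic = "abort"` the process still dies.
fn catch_selector_panic<T>(query: impl FnOnce() -> T) -> Option<T> {
    panic::catch_unwind(AssertUnwindSafe(query)).ok()
}

/// Hypothetical stand-in for a selector call that panics on bad input.
/// It is NOT part of TagSoup; it exists only to make this sketch runnable.
fn stand_in_query(selector: &str) -> usize {
    if selector.trim().is_empty() {
        panic!("invalid selector: {:?}", selector);
    }
    selector.len()
}
```

Calling catch_selector_panic(|| stand_in_query("   ")) returns None instead of unwinding into the caller. The default panic hook still prints the panic message to stderr, so this is a safety net rather than a substitute for validating user input.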
Parsing Notes
- Whitespace is preserved by default.
- Call trimmed() if you want leading and trailing ASCII whitespace removed from text nodes.
- text_content() decodes HTML entities, except inside raw-text elements like script and style.
- Parse errors are collected in document.errors instead of stopping the parse.
- Errors include source spans, so reporting decent diagnostics is straightforward.
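To illustrate the diagnostics point, here is a minimal sketch that turns a span's start position into a line and column for an error message. It assumes the span start is a byte offset into the original source string, which this README does not specify, so check the actual span type before relying on it. The function names line_col and format_diagnostic are hypothetical.

```rust
/// Converts a byte offset into a 1-based (line, column) pair.
/// Columns count characters, not bytes. Offsets past the end are clamped.
/// Assumption: spans are byte offsets into `source`.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let mut offset = offset.min(source.len());
    // Back up to a char boundary so the slice below cannot panic.
    while !source.is_char_boundary(offset) {
        offset -= 1;
    }
    let before = &source[..offset];
    let line = before.matches('\n').count() + 1;
    // Byte index where the current line begins.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let col = before[line_start..].chars().count() + 1;
    (line, col)
}

/// Formats a message in the common `line:column: message` style.
fn format_diagnostic(source: &str, start: usize, message: &str) -> String {
    let (line, col) = line_col(source, start);
    format!("{}:{}: {}", line, col, message)
}
```

For larger documents with many errors, precomputing a table of line-start offsets and binary-searching it would avoid rescanning the source for every diagnostic.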
There is also a tiny example CLI that reads HTML from stdin and dumps JSON:
What This Is Not
- Not a browser DOM implementation.
- Not a full WHATWG-compliant HTML parser.
- Not trying to perfectly reproduce browser error recovery in every bizarre corner case.
It is a pragmatic parser for documents that are mostly HTML, occasionally cursed, and still need to be dealt with.
License
Licensed under the MIT License; see license.txt.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, shall be licensed as above, without any additional terms or conditions.