/*!
TagSoup is a small, fast, fairly forgiving HTML-ish parser with zero required dependencies.
It is built for the boringly useful jobs:
- Parse real-world markup without immediately fainting.
- Walk the resulting tree.
- Query it with a compact CSS-style selector API.
- Pull out text, attributes, and spans.

It is not trying to impersonate a browser engine. It just wants to turn messy markup into something workable, quickly.

Loosely based on the [HTML Living Standard](https://html.spec.whatwg.org/multipage/syntax.html).

# Highlights
- Optional `serde` support, enabled by default.
- Preserves source spans for nodes and parse errors.
- Handles raw-text elements like `script` and `style` sensibly.
- Supports `query_selector` and `query_selector_all`.
- Supports tree walking with a small visitor API.
- Tries to recover from malformed markup instead of giving up immediately.
# Example
```
// Parse an HTML tag soup.
let doc = tagsoup::Document::parse("<div><p id=here>Hello, world!</p></div>");
// Check for parsing errors.
assert!(doc.errors.is_empty());
// Query the document for an element using a CSS selector.
let element = doc.query_selector("#here").unwrap();
assert_eq!(element.text_content(), "Hello, world!");
```
# Visiting The Tree
```
let html = r#"
<ul>
<li><a href="/one">One</a></li>
<li><a href="/two">Two</a></li>
</ul>
"#;
let doc = tagsoup::Document::parse(html);
let mut hrefs = Vec::new();
doc.visit(&mut |element| {
    if element.tag.eq_ignore_ascii_case("a") {
        if let Some(href) = element.get_attribute_value("href") {
            hrefs.push(href);
        }
    }
    tagsoup::VisitControl::Descend
});
assert_eq!(hrefs, vec!["/one", "/two"]);
```
# Notes
- Whitespace is preserved by default.
- Call [`Document::trimmed`] if you want leading and trailing ASCII whitespace removed from text nodes.
- [`Element::text_content`] decodes HTML entities, except inside raw-text elements like `script` and `style`.
- Invalid selectors currently panic in [`Document::query_selector`] and [`Document::query_selector_all`].

This is not a full WHATWG-compliant HTML parser. It is a pragmatic parser for documents that are mostly HTML, occasionally cursed, and still need to be dealt with.
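
The text-handling notes above can be pictured with a minimal, std-only sketch. It is not TagSoup's implementation: `trim_ascii_ws`, `is_raw_text`, and `decode_text` are hypothetical names invented for illustration, and the entity table is deliberately tiny. It only shows what "ASCII whitespace" trimming covers and why raw-text contents are left undecoded.

```rust
// Hypothetical helpers for illustration only; none of these names are part of TagSoup's API.

/// Trims leading and trailing ASCII whitespace (space, tab, LF, FF, CR) and nothing else.
fn trim_ascii_ws(text: &str) -> &str {
    text.trim_matches(|c: char| c.is_ascii_whitespace())
}

/// Returns true for the raw-text elements named in the notes above.
fn is_raw_text(tag: &str) -> bool {
    tag.eq_ignore_ascii_case("script") || tag.eq_ignore_ascii_case("style")
}

/// Decodes a handful of named entities, unless the text belongs to a raw-text element.
fn decode_text(tag: &str, text: &str) -> String {
    if is_raw_text(tag) {
        // Raw-text contents are returned verbatim.
        return text.to_string();
    }
    // `&amp;` is replaced last so that `&amp;lt;` decodes to `&lt;`, not `<`.
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&amp;", "&")
}
```

Unlike `str::trim`, which strips all Unicode whitespace, the ASCII-only trim in this sketch leaves characters such as U+00A0 (no-break space) in place.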
*/
use std::collections::HashMap;
use std::borrow::Cow;
use ;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;
pub use *;