§scrape-core
High-performance HTML parsing library with CSS selector support.
This crate provides the core functionality for parsing HTML documents and querying them using CSS selectors. It is designed to be fast, memory-efficient, and spec-compliant.
§Quick Start
use scrape_core::{Html5everParser, Parser, Soup, SoupConfig};
// Parse HTML using Soup (high-level API)
let html = "<html><body><div class=\"product\">Hello</div></body></html>";
let soup = Soup::parse(html);
// Find elements using CSS selectors
if let Ok(Some(div)) = soup.find("div.product") {
    assert_eq!(div.text(), "Hello");
}
// Or use the parser directly (low-level API)
let parser = Html5everParser;
let document = parser.parse(html).unwrap();
assert!(document.root().is_some());
§Features
- Fast parsing: Built on html5ever for spec-compliant HTML5 parsing
- CSS selectors: Full CSS selector support via the selectors crate
- Memory efficient: Arena-based allocation for DOM nodes
- SIMD acceleration: Optional SIMD support for faster byte scanning
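The arena-based allocation mentioned in the feature list can be illustrated with a minimal standalone sketch. Every name below (`Arena`, `NodeData`, `append_child`, and `NodeId` as a plain index wrapper) is hypothetical and is not taken from the crate's actual API; the sketch only shows the general technique of storing all nodes in one `Vec` and linking them by index rather than by pointer.

```rust
/// Hypothetical node handle: an index into the arena's backing vector.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(usize);

/// Hypothetical node payload that links to other nodes by index.
#[derive(Debug)]
pub struct NodeData {
    pub name: String,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

/// All nodes live in one contiguous allocation.
#[derive(Debug, Default)]
pub struct Arena {
    nodes: Vec<NodeData>,
}

impl Arena {
    pub fn new() -> Self {
        Self::default()
    }

    /// Allocate a detached node and return its handle.
    pub fn create(&mut self, name: &str) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(NodeData {
            name: name.to_string(),
            parent: None,
            first_child: None,
            last_child: None,
            next_sibling: None,
        });
        id
    }

    /// Link `child` as the last child of `parent`.
    pub fn append_child(&mut self, parent: NodeId, child: NodeId) {
        let previous_last = self.nodes[parent.0].last_child;
        self.nodes[child.0].parent = Some(parent);
        match previous_last {
            Some(prev) => self.nodes[prev.0].next_sibling = Some(child),
            None => self.nodes[parent.0].first_child = Some(child),
        }
        self.nodes[parent.0].last_child = Some(child);
    }

    /// Borrow the data stored for a node.
    pub fn get(&self, id: NodeId) -> &NodeData {
        &self.nodes[id.0]
    }

    /// Iterate over direct children by following sibling links.
    pub fn children(&self, parent: NodeId) -> impl Iterator<Item = NodeId> + '_ {
        let mut next = self.nodes[parent.0].first_child;
        std::iter::from_fn(move || {
            let current = next?;
            next = self.nodes[current.0].next_sibling;
            Some(current)
        })
    }
}
```

Because handles are plain indices, they are `Copy` and carry no lifetime, which is one common reason DOM libraries choose this layout; whether scrape-core makes the same trade-offs internally is not stated on this page.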
§CSS Selector Support
The query engine supports most CSS3 selectors:
use scrape_core::Soup;
let html = r#"
<div class="container">
<ul id="list">
<li class="item active">One</li>
<li class="item">Two</li>
<li class="item">Three</li>
</ul>
</div>
"#;
let soup = Soup::parse(html);
// Type selector
let divs = soup.find_all("div").unwrap();
// Class selector
let items = soup.find_all(".item").unwrap();
// ID selector
let list = soup.find("#list").unwrap();
// Compound selector
let active = soup.find("li.item.active").unwrap();
// Descendant combinator
let nested = soup.find_all("div li").unwrap();
// Child combinator
let direct = soup.find_all("ul > li").unwrap();
// Attribute selectors
let with_id = soup.find_all("[id]").unwrap();
Re-exports§
pub use query::CompiledSelector;
pub use query::Filter;
pub use query::OptimizationHint;
pub use query::QueryError;
pub use query::QueryResult;
pub use query::SelectorExplanation;
pub use query::Specificity;
pub use query::TextNodesIter;
pub use query::compile_selector;
pub use query::explain;
pub use query::explain_with_document;
pub use serialize::HtmlSerializer;
pub use serialize::collect_text;
pub use serialize::serialize_inner_html;
pub use serialize::serialize_node;
pub use utils::escape_attr;
pub use utils::escape_text;
pub use utils::is_void_element;
Modules§
- query
- Query engine for finding elements in the DOM.
- serialize
- HTML serialization utilities.
- utils
- Shared utility functions for HTML processing.
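The utils module is described only as shared helpers for HTML processing, and this page does not show any signatures. As a rough illustration of what escaping and void-element helpers conventionally do, here is a standalone sketch. The `_sketch` function names are hypothetical, the set of escaped characters is an assumption, and the void-element list follows the HTML standard rather than anything confirmed about the crate.

```rust
/// Hypothetical text escaper: replaces `&`, `<`, and `>` with entities.
pub fn escape_text_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Hypothetical attribute escaper: replaces `&` and `"` so the value
/// can be placed safely inside double quotes.
pub fn escape_attr_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Hypothetical check for elements that never have a closing tag.
/// Expects a lowercase tag name.
pub fn is_void_element_sketch(name: &str) -> bool {
    matches!(
        name,
        "area" | "base" | "br" | "col" | "embed" | "hr" | "img"
            | "input" | "link" | "meta" | "source" | "track" | "wbr"
    )
}
```

A serializer would typically call the text escaper for text nodes, the attribute escaper for attribute values, and skip the closing tag when the void-element check returns true.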
Structs§
- AncestorsIter
- Iterator over ancestors of a node (parent, grandparent, …).
- Building
- Document is being constructed.
- ChildrenIter
- Iterator over direct children of a node.
- CommentMarker
- Marker type for comment nodes.
- DescendantsIter
- Iterator over descendants in depth-first pre-order.
- DocumentImpl
- An HTML document containing a tree of nodes.
- DocumentIndex
- Index for fast element lookup by ID and class.
- ElementAncestorsIter
- Iterator over element ancestors only.
- ElementChildrenIter
- Iterator over element children only.
- ElementDescendantsIter
- Iterator over element descendants only.
- ElementMarker
- Marker type for element nodes.
- ElementNextSiblingsIter
- Iterator over next element siblings only.
- ElementPrevSiblingsIter
- Iterator over previous element siblings only.
- ElementSiblingsIter
- Iterator over all element siblings (excluding self).
- Html5everParser
- HTML5 spec-compliant parser using html5ever.
- NextSiblingsIter
- Iterator over siblings following a node.
- Node
- A node in the DOM tree.
- NodeId
- A node ID in the DOM tree.
- ParseConfig
- Configuration for HTML parsing behavior.
- ParseResultWithWarnings
- Result of parsing with warnings collected.
- ParseWarning
- A warning or error encountered during parsing.
- PrevSiblingsIter
- Iterator over siblings preceding a node.
- Queryable
- Document is built and ready for querying.
- Sealed
- Document is sealed and fully immutable.
- SiblingsIter
- Iterator over all siblings of a node (excluding the node itself).
- Soup
- A parsed HTML document.
- SoupConfig
- Configuration options for HTML parsing.
- SourcePosition
- A position in source text (1-indexed line and column).
- SourceSpan
- A span in source text with start and end positions.
- SpanContext
- Context around an error for display purposes.
- Tag
- A reference to an element in the document.
- TextMarker
- Marker type for text nodes.
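The source position entry in the list above describes 1-indexed lines and columns. A minimal standalone sketch of how a byte offset can be mapped to such a position is shown here. The `Position` struct and `position_at` function are hypothetical names, and counting columns in `char`s (rather than bytes or grapheme clusters) is an assumption, not something this page specifies.

```rust
/// Hypothetical 1-indexed position in source text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Position {
    pub line: usize,
    pub column: usize,
}

/// Map a byte offset to a 1-indexed (line, column) pair.
/// Offsets past the end of `source` yield the position just after the last character.
pub fn position_at(source: &str, byte_offset: usize) -> Position {
    let mut line = 1;
    let mut column = 1;
    for (idx, ch) in source.char_indices() {
        if idx >= byte_offset {
            break;
        }
        if ch == '\n' {
            // A newline moves to the start of the next line.
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    Position { line, column }
}
```

Both fields start at 1, so the first character of the input is reported as line 1, column 1, which matches how editors and compilers usually display locations.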
Enums§
- Error
- Errors that can occur during HTML parsing and querying.
- NodeKind
- Types of nodes in the DOM tree.
- ParseError
- Errors that can occur during HTML parsing.
- TagId
- Interned tag identifier for common HTML5 elements.
- WarningSeverity
- Severity level for parse warnings.
Traits§
- DocumentState
- Marker trait for document states.
- MutableState
- Trait for states that support modification.
- NodeType
- Trait implemented by all node type markers.
- Parser
- A sealed trait for HTML parsers.
- QueryableState
- Trait for states that support querying.
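The state markers (Building, Queryable, Sealed) and the state traits listed above suggest a typestate design, where the document's lifecycle stage is encoded in its type. The following standalone sketch shows that general pattern. The `Doc` type, the trait names `State`, `Mutable`, and `CanQuery`, and all method names are hypothetical; which states permit modification or querying in the actual crate is an assumption here.

```rust
use std::marker::PhantomData;

// Hypothetical state traits, loosely modeled on the ones listed on this page.
pub trait State {}
pub trait Mutable: State {}
pub trait CanQuery: State {}

// Zero-sized state markers.
pub struct Building;
pub struct Queryable;
pub struct Sealed;

impl State for Building {}
impl State for Queryable {}
impl State for Sealed {}

// Assumption: only `Building` allows modification,
// while `Queryable` and `Sealed` allow querying.
impl Mutable for Building {}
impl CanQuery for Queryable {}
impl CanQuery for Sealed {}

/// Hypothetical document whose state lives in the type parameter.
pub struct Doc<S: State> {
    tags: Vec<String>,
    _state: PhantomData<S>,
}

impl Doc<Building> {
    pub fn new() -> Self {
        Doc { tags: Vec::new(), _state: PhantomData }
    }

    /// Consume the builder and move to the queryable state.
    pub fn finish(self) -> Doc<Queryable> {
        Doc { tags: self.tags, _state: PhantomData }
    }
}

impl<S: Mutable> Doc<S> {
    /// Available only in states that implement `Mutable`.
    pub fn push(&mut self, tag: &str) {
        self.tags.push(tag.to_string());
    }
}

impl<S: CanQuery> Doc<S> {
    /// Available only in states that implement `CanQuery`.
    pub fn count(&self, tag: &str) -> usize {
        self.tags.iter().filter(|t| t.as_str() == tag).count()
    }
}

impl Doc<Queryable> {
    /// Consume the document and move to the sealed state.
    pub fn seal(self) -> Doc<Sealed> {
        Doc { tags: self.tags, _state: PhantomData }
    }
}
```

In this sketch, calling `push` on a `Doc<Queryable>` or `count` on a `Doc<Building>` is a compile-time error, so misuse is caught before the program runs and the markers add no runtime cost.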
Type Aliases§
- Document
- Public alias for backward compatibility.
- ParseResult
- Result type for parser operations.
- Result
- Result type alias using Error.