Crate scrape_core


§scrape-core

High-performance HTML parsing library with CSS selector support.

This crate provides the core functionality for parsing HTML documents and querying them using CSS selectors. It is designed to be fast, memory-efficient, and spec-compliant.

§Quick Start

use scrape_core::{Html5everParser, Parser, Soup};

// Parse HTML using Soup (high-level API)
let html = "<html><body><div class=\"product\">Hello</div></body></html>";
let soup = Soup::parse(html);

// Find elements using CSS selectors
if let Ok(Some(div)) = soup.find("div.product") {
    assert_eq!(div.text(), "Hello");
}

// Or use the parser directly (low-level API)
let parser = Html5everParser;
let document = parser.parse(html).unwrap();
assert!(document.root().is_some());

§Features

  • Fast parsing: Built on html5ever for spec-compliant HTML5 parsing
  • CSS selectors: Support for most CSS3 selectors via the selectors crate
  • Memory efficient: Arena-based allocation for DOM nodes
  • SIMD acceleration: Optional SIMD support for faster byte scanning
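
To make the "arena-based allocation" point concrete, here is a minimal, self-contained sketch of the general technique. It is not this crate's implementation: `SketchArena`, `SketchNode`, and `SketchId` are hypothetical names invented for illustration, and the crate's own `Node` and `NodeId` types may be laid out differently. The idea is that all nodes live in one contiguous `Vec`, and links between nodes are plain indices instead of individually heap-allocated pointers.

```rust
/// Index of a node inside the arena (hypothetical; not the crate's `NodeId`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SketchId(usize);

/// A node that refers to its relatives by index rather than by pointer.
#[derive(Debug)]
pub struct SketchNode {
    pub name: String,
    pub parent: Option<SketchId>,
    pub children: Vec<SketchId>,
}

/// All nodes are stored in a single vector, so building a tree
/// costs one growing allocation instead of one allocation per node.
#[derive(Debug, Default)]
pub struct SketchArena {
    nodes: Vec<SketchNode>,
}

impl SketchArena {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append a node and, if a parent is given, link it as that parent's last child.
    pub fn add(&mut self, name: &str, parent: Option<SketchId>) -> SketchId {
        let id = SketchId(self.nodes.len());
        self.nodes.push(SketchNode {
            name: name.to_string(),
            parent,
            children: Vec::new(),
        });
        if let Some(p) = parent {
            self.nodes[p.0].children.push(id);
        }
        id
    }

    /// Look up a node by its index; returns `None` for an out-of-range id.
    pub fn get(&self, id: SketchId) -> Option<&SketchNode> {
        self.nodes.get(id.0)
    }

    /// Number of nodes stored in the arena.
    pub fn len(&self) -> usize {
        self.nodes.len()
    }

    pub fn is_empty(&self) -> bool {
        self.nodes.is_empty()
    }
}
```

Because ids are small `Copy` values, they can be handed out freely without borrowing the arena, which is one common reason DOM libraries choose this layout.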

§CSS Selector Support

The query engine supports most CSS3 selectors:

use scrape_core::Soup;

let html = r#"
    <div class="container">
        <ul id="list">
            <li class="item active">One</li>
            <li class="item">Two</li>
            <li class="item">Three</li>
        </ul>
    </div>
"#;
let soup = Soup::parse(html);

// Type selector
let divs = soup.find_all("div").unwrap();

// Class selector
let items = soup.find_all(".item").unwrap();

// ID selector
let list = soup.find("#list").unwrap();

// Compound selector
let active = soup.find("li.item.active").unwrap();

// Descendant combinator
let nested = soup.find_all("div li").unwrap();

// Child combinator
let direct = soup.find_all("ul > li").unwrap();

// Attribute selectors
let with_id = soup.find_all("[id]").unwrap();

Re-exports§

pub use query::Filter;
pub use query::QueryError;
pub use query::QueryResult;

Modules§

query
Query engine for finding elements in the DOM.

Structs§

AncestorsIter
Iterator over ancestors of a node (parent, grandparent, …).
ChildrenIter
Iterator over direct children of a node.
DescendantsIter
Iterator over descendants in depth-first pre-order.
Document
An HTML document containing a tree of nodes.
Html5everParser
HTML5 spec-compliant parser using html5ever.
Node
A node in the DOM tree.
NodeId
A node ID in the DOM tree.
ParseConfig
Configuration for HTML parsing behavior.
Soup
A parsed HTML document.
SoupConfig
Configuration options for HTML parsing.
Tag
A reference to an element in the document.
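
The three iterator structs above describe standard tree walks. The sketch below illustrates those traversal orders on a toy index-based tree; `ToyTree` and `ToyDescendants` are hypothetical names, not part of this crate, and the choice to exclude the starting node from `descendants` is an assumption made for the example.

```rust
/// A tiny index-based tree (hypothetical stand-in for a parsed document).
#[derive(Default)]
pub struct ToyTree {
    parent: Vec<Option<usize>>,
    children: Vec<Vec<usize>>,
}

impl ToyTree {
    pub fn new() -> Self {
        Self::default()
    }

    /// Add a node under `parent` (or as a root when `None`) and return its index.
    pub fn add(&mut self, parent: Option<usize>) -> usize {
        let id = self.parent.len();
        self.parent.push(parent);
        self.children.push(Vec::new());
        if let Some(p) = parent {
            self.children[p].push(id);
        }
        id
    }

    /// Parent, grandparent, and so on up to the root; the start node is excluded.
    pub fn ancestors<'a>(&'a self, id: usize) -> impl Iterator<Item = usize> + 'a {
        std::iter::successors(self.parent[id], move |&p| self.parent[p])
    }

    /// Direct children in insertion order.
    pub fn children<'a>(&'a self, id: usize) -> impl Iterator<Item = usize> + 'a {
        self.children[id].iter().copied()
    }

    /// Descendants in depth-first pre-order; the start node is excluded (assumption).
    pub fn descendants<'a>(&'a self, id: usize) -> ToyDescendants<'a> {
        let mut stack = self.children[id].clone();
        stack.reverse(); // so the first child is popped first
        ToyDescendants { tree: self, stack }
    }
}

/// Explicit-stack iterator: no recursion, so deep trees cannot overflow the call stack.
pub struct ToyDescendants<'a> {
    tree: &'a ToyTree,
    stack: Vec<usize>,
}

impl<'a> Iterator for ToyDescendants<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        let id = self.stack.pop()?;
        // Push children in reverse so the leftmost child is visited next.
        self.stack.extend(self.tree.children[id].iter().rev().copied());
        Some(id)
    }
}
```

Pre-order means each node is yielded before any of its children, which matches document order for an HTML tree.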

Enums§

Error
Errors that can occur during HTML parsing and querying.
NodeKind
Types of nodes in the DOM tree.
ParseError
Errors that can occur during HTML parsing.

Traits§

Parser
A sealed trait for HTML parsers.
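
For readers unfamiliar with the term, a sealed trait is one that downstream crates can call but cannot implement. The sketch below shows the usual way this is done in Rust, with a private supertrait. `ToyParser`, `TagCounter`, and the `sealed` module are hypothetical and only illustrate the pattern; how this crate actually seals `Parser` is not shown in this page.

```rust
mod sealed {
    /// Lives in a private module: code outside the crate cannot name this trait,
    /// and therefore cannot satisfy the supertrait bound on `ToyParser`.
    pub trait Sealed {}
}

/// Public trait that anyone may call, but only this crate may implement.
pub trait ToyParser: sealed::Sealed {
    /// Toy "parse": returns the number of opening tags, or an error for empty input.
    fn parse(&self, input: &str) -> Result<usize, String>;
}

/// The only implementor in this sketch.
pub struct TagCounter;

impl sealed::Sealed for TagCounter {}

impl ToyParser for TagCounter {
    fn parse(&self, input: &str) -> Result<usize, String> {
        if input.is_empty() {
            return Err("empty input".to_string());
        }
        // Every "</" contains one '<', so the subtraction cannot underflow.
        let all_tags = input.matches('<').count();
        let closing_tags = input.matches("</").count();
        Ok(all_tags - closing_tags)
    }
}
```

Sealing lets a library add methods to the trait later without breaking external implementors, since there cannot be any.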

Type Aliases§

ParseResult
Result type for parser operations.
Result
Result type alias using Error.
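
The Quick Start pattern `if let Ok(Some(div)) = soup.find(...)` suggests that lookups return a result wrapping an option. The following sketch shows how a crate-level `Result` alias combines with `Option` to give three distinct outcomes. It is a toy model under stated assumptions: `ToyError`, its `EmptySelector` variant, `ToyResult`, and `toy_find` are all hypothetical and do not reflect this crate's actual `Error` variants or signatures.

```rust
/// Hypothetical error type; the crate's real `Error` enum may have different variants.
#[derive(Debug, PartialEq, Eq)]
pub enum ToyError {
    /// The selector string was empty or only whitespace.
    EmptySelector,
}

/// Alias that fixes the error type, so signatures only spell out the success type.
pub type ToyResult<T> = std::result::Result<T, ToyError>;

/// Toy lookup over a list of tag names.
///
/// - `Err(..)`: the query itself was invalid.
/// - `Ok(None)`: the query was valid but nothing matched.
/// - `Ok(Some(i))`: index of the first match.
pub fn toy_find(tags: &[&str], selector: &str) -> ToyResult<Option<usize>> {
    if selector.trim().is_empty() {
        return Err(ToyError::EmptySelector);
    }
    Ok(tags.iter().position(|t| *t == selector))
}
```

Keeping "invalid query" and "no match" separate lets callers use `?` for the former while still treating an absent element as an ordinary case.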