Crate scrape_core


§scrape-core

High-performance HTML parsing library with CSS selector support.

This crate provides the core functionality for parsing HTML documents and querying them using CSS selectors. It is designed to be fast, memory-efficient, and spec-compliant.

§Quick Start

use scrape_core::{Html5everParser, Parser, Soup};

// Parse HTML using Soup (high-level API)
let html = "<html><body><div class=\"product\">Hello</div></body></html>";
let soup = Soup::parse(html);

// Find elements using CSS selectors
if let Ok(Some(div)) = soup.find("div.product") {
    assert_eq!(div.text(), "Hello");
}

// Or use the parser directly (low-level API)
let parser = Html5everParser;
let document = parser.parse(html).unwrap();
assert!(document.root().is_some());

§Features

  • Fast parsing: Built on html5ever for spec-compliant HTML5 parsing
  • CSS selectors: Support for most CSS3 selectors via the selectors crate
  • Memory efficient: Arena-based allocation for DOM nodes
  • SIMD acceleration: Optional SIMD support for faster byte scanning
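
To make the arena-based allocation bullet concrete, here is a minimal sketch of the general technique: nodes live in one `Vec` and refer to each other by index instead of by pointer. The names `Arena`, `ArenaNode`, and `Id` are hypothetical stand-ins written for this illustration; they are not the crate's `DocumentImpl`, `Node`, or `NodeId`, whose internals are not shown on this page.

```rust
/// Index of a node inside the arena (hypothetical stand-in, not the crate's `NodeId`).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Id(usize);

/// One tree node; links to other nodes are stored as indices.
#[derive(Debug)]
pub struct ArenaNode {
    pub name: String,
    pub parent: Option<Id>,
    pub children: Vec<Id>,
}

/// All nodes are stored contiguously in a single vector.
#[derive(Debug, Default)]
pub struct Arena {
    nodes: Vec<ArenaNode>,
}

impl Arena {
    pub fn new() -> Self {
        Self::default()
    }

    /// Appends a node, links it under `parent` if given, and returns its index.
    pub fn add(&mut self, name: &str, parent: Option<Id>) -> Id {
        let id = Id(self.nodes.len());
        self.nodes.push(ArenaNode {
            name: name.to_string(),
            parent,
            children: Vec::new(),
        });
        if let Some(Id(p)) = parent {
            self.nodes[p].children.push(id);
        }
        id
    }

    pub fn get(&self, id: Id) -> &ArenaNode {
        &self.nodes[id.0]
    }

    /// Names of `root` and its descendants in depth-first pre-order.
    pub fn preorder(&self, root: Id) -> Vec<&str> {
        let mut out = Vec::new();
        let mut stack = vec![root];
        while let Some(id) = stack.pop() {
            let node = self.get(id);
            out.push(node.name.as_str());
            // Push children in reverse so the first child is popped first.
            stack.extend(node.children.iter().rev().copied());
        }
        out
    }

    /// Names of the ancestors of `id`, nearest first (parent, grandparent, ...).
    pub fn ancestors(&self, id: Id) -> Vec<&str> {
        let mut out = Vec::new();
        let mut current = self.get(id).parent;
        while let Some(pid) = current {
            let node = self.get(pid);
            out.push(node.name.as_str());
            current = node.parent;
        }
        out
    }
}
```

Because an `Id` is a plain `Copy` index, traversal helpers can hand out node references without reference counting, and the whole tree is freed in one step when the arena is dropped.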

§CSS Selector Support

The query engine supports most CSS3 selectors:

use scrape_core::Soup;

let html = r#"
    <div class="container">
        <ul id="list">
            <li class="item active">One</li>
            <li class="item">Two</li>
            <li class="item">Three</li>
        </ul>
    </div>
"#;
let soup = Soup::parse(html);

// Type selector
let divs = soup.find_all("div").unwrap();

// Class selector
let items = soup.find_all(".item").unwrap();

// ID selector
let list = soup.find("#list").unwrap();

// Compound selector
let active = soup.find("li.item.active").unwrap();

// Descendant combinator
let nested = soup.find_all("div li").unwrap();

// Child combinator
let direct = soup.find_all("ul > li").unwrap();

// Attribute selectors
let with_id = soup.find_all("[id]").unwrap();

Re-exports§

pub use query::CompiledSelector;
pub use query::Filter;
pub use query::OptimizationHint;
pub use query::QueryError;
pub use query::QueryResult;
pub use query::SelectorExplanation;
pub use query::Specificity;
pub use query::TextNodesIter;
pub use query::compile_selector;
pub use query::explain;
pub use query::explain_with_document;
pub use serialize::HtmlSerializer;
pub use serialize::collect_text;
pub use serialize::serialize_inner_html;
pub use serialize::serialize_node;
pub use utils::escape_attr;
pub use utils::escape_text;
pub use utils::is_void_element;

Modules§

query
Query engine for finding elements in the DOM.
serialize
HTML serialization utilities.
utils
Shared utility functions for HTML processing.
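
The re-exported helpers `escape_text`, `escape_attr`, and `is_void_element` are listed here without signatures. As a rough illustration of what helpers of this kind conventionally do, the sketch below implements the usual HTML escaping rules and the void-element check with the standard library only. The function names ending in `_sketch` are hypothetical, and the exact set of characters the crate escapes is an assumption.

```rust
/// Escapes text content for HTML output.
/// Assumption: `&`, `<`, and `>` are the characters replaced.
pub fn escape_text_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Escapes a value for use inside a double-quoted attribute.
/// Assumption: same as text escaping, plus `"`.
pub fn escape_attr_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Returns true for HTML void elements, which never have a closing tag.
/// The comparison is ASCII case-insensitive.
pub fn is_void_sketch(tag: &str) -> bool {
    matches!(
        tag.to_ascii_lowercase().as_str(),
        "area" | "base" | "br" | "col" | "embed" | "hr" | "img"
            | "input" | "link" | "meta" | "source" | "track" | "wbr"
    )
}
```

A serializer needs both pieces: escaping keeps text and attribute values from being read as markup, and the void-element check decides whether to emit a closing tag.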

Structs§

AncestorsIter
Iterator over ancestors of a node (parent, grandparent, …).
Building
Document is being constructed.
ChildrenIter
Iterator over direct children of a node.
CommentMarker
Marker type for comment nodes.
DescendantsIter
Iterator over descendants in depth-first pre-order.
DocumentImpl
An HTML document containing a tree of nodes.
DocumentIndex
Index for fast element lookup by ID and class.
ElementAncestorsIter
Iterator over element ancestors only.
ElementChildrenIter
Iterator over element children only.
ElementDescendantsIter
Iterator over element descendants only.
ElementMarker
Marker type for element nodes.
ElementNextSiblingsIter
Iterator over next element siblings only.
ElementPrevSiblingsIter
Iterator over previous element siblings only.
ElementSiblingsIter
Iterator over all element siblings (excluding self).
Html5everParser
HTML5 spec-compliant parser using html5ever.
NextSiblingsIter
Iterator over siblings following a node.
Node
A node in the DOM tree.
NodeId
A node ID in the DOM tree.
ParseConfig
Configuration for HTML parsing behavior.
ParseResultWithWarnings
Result of parsing with warnings collected.
ParseWarning
A warning or error encountered during parsing.
PrevSiblingsIter
Iterator over siblings preceding a node.
Queryable
Document is built and ready for querying.
Sealed
Document is sealed and fully immutable.
SiblingsIter
Iterator over all siblings of a node (excluding the node itself).
Soup
A parsed HTML document.
SoupConfig
Configuration options for HTML parsing.
SourcePosition
A position in source text (1-indexed line and column).
SourceSpan
A span in source text with start and end positions.
SpanContext
Context around an error for display purposes.
Tag
A reference to an element in the document.
TextMarker
Marker type for text nodes.

Enums§

Error
Errors that can occur during HTML parsing and querying.
NodeKind
Types of nodes in the DOM tree.
ParseError
Errors that can occur during HTML parsing.
TagId
Interned tag identifier for common HTML5 elements.
WarningSeverity
Severity level for parse warnings.

Traits§

DocumentState
Marker trait for document states.
MutableState
Trait for states that support modification.
NodeType
Trait implemented by all node type markers.
Parser
A sealed trait for HTML parsers.
QueryableState
Trait for states that support querying.
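
The state structs (`Building`, `Queryable`, `Sealed`) and the marker traits above suggest a typestate design, where the document's type parameter records which operations are allowed. The sketch below shows the general pattern with the standard library only. `Doc`, its methods, and the locally defined `Building` and `Queryable` are hypothetical stand-ins for illustration; the crate's actual signatures are not shown on this page.

```rust
use std::marker::PhantomData;

/// Local stand-in state markers (zero-sized types, never instantiated).
pub struct Building;
pub struct Queryable;

/// A toy document whose `State` parameter selects the available methods.
pub struct Doc<State> {
    tags: Vec<String>,
    _state: PhantomData<State>,
}

impl Doc<Building> {
    pub fn new() -> Self {
        Doc { tags: Vec::new(), _state: PhantomData }
    }

    /// Mutation exists only in the `Building` state.
    pub fn push_tag(&mut self, tag: &str) {
        self.tags.push(tag.to_string());
    }

    /// Consumes the builder, so the mutable handle cannot be used afterwards.
    pub fn finish(self) -> Doc<Queryable> {
        Doc { tags: self.tags, _state: PhantomData }
    }
}

impl Doc<Queryable> {
    /// Querying exists only in the `Queryable` state.
    pub fn count(&self, tag: &str) -> usize {
        self.tags.iter().filter(|t| t.as_str() == tag).count()
    }
}
```

With this layout, calling `push_tag` on a `Doc<Queryable>` or `count` on a `Doc<Building>` is a compile-time error rather than a runtime check.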

Type Aliases§

Document
Public alias for backward compatibility.
ParseResult
Result type for parser operations.
Result
Result type alias using Error.