crawlberg – A Rust crawling engine for turning websites into structured data.
Re-exports
- pub use budget::BudgetError;
- pub use budget::DefaultPageBudget;
- pub use budget::PageBudget;
- pub use interact::MAX_ACTIONS;
- pub use interact::MAX_SCRIPT_LEN;
- pub use interact::MAX_SCROLL_AMOUNT;
- pub use interact::MAX_SELECTOR_LEN;
- pub use interact::MAX_SINGLE_WAIT_MS;
- pub use interact::MAX_TEXT_LEN;
- pub use interact::MAX_TOTAL_WAIT_SECS;
- pub use interact::PageAction;
- pub use interact::ScrollDirection;
- pub use interact::validate_actions;
- pub use net::ssrf::HostMatcher;
- pub use net::ssrf::SsrfError;
- pub use net::ssrf::SsrfPolicy;
- pub use net::ssrf::validate_url;
- pub use proxy::ProxyProvider;
- pub use proxy::StaticProxyProvider;
- pub use sink::EventSink;
- pub use sink::MultiEventSink;
- pub use sink::TracingEventSink;
- pub use telemetry::current_traceparent;
- pub use telemetry::with_traceparent;
Modules
- budget
- Pluggable page budget hook for controlling crawl extent.
- http
- HTTP fetching with redirect handling, retry logic, and cookie extraction.
- interact
- Page interaction module for action-based browser automation.
- net
- Network utilities: SSRF policy, validation, and security.
- proxy
- Proxy provider trait + baseline impl.
- robots
- Robots.txt parsing and path-matching logic.
- sink
- Pluggable event sink for streaming crawl events.
- sitemap
- Sitemap XML parsing and recursive fetching.
- telemetry
- OpenTelemetry foundations for crawlberg.
- traits
- Trait-based extension points for the crawl engine.
Structs
- ActionResult
- Result from a single page action execution.
- AdaptiveStrategy
- Adaptive crawling strategy that stops when content saturation is detected.
- ArticleMetadata
- Article metadata extracted from article:* Open Graph tags.
- AttemptOutcome
- Rich context passed to RetryPolicy::decide on each attempt.
- BatchCrawlResult
- Result from a single URL in a batch crawl operation.
- BatchCrawlResults
- Aggregate result of a batch crawl, exposing per-URL results plus precomputed counts.
- BatchCrawlStreamRequest
- Request to begin a multi-URL streaming crawl.
- BatchScrapeResult
- Result from a single URL in a batch scrape operation.
- BatchScrapeResults
- Aggregate result of a batch scrape, exposing per-URL results plus precomputed counts.
- BestFirstStrategy
- A best-first crawl strategy.
- BfsStrategy
- A breadth-first crawl strategy.
- BrowserConfig
- Browser fallback configuration.
- BrowserExtras
- Browser-specific extras populated when the native browser backend was used.
- BudgetExhausted
- Returned by EscalationBudget::try_consume when no budget remains.
- BypassResponse
- Response returned by a BypassProvider::fetch call.
- CachedPage
- Cached page data for HTTP response caching.
- CitationReference
- A single numbered reference in a citation list, produced by the citation extractor when content uses inline [N]-style markers.
- CitationResult
- Result of citation conversion.
- ContentConfig
- Content extraction and conversion configuration.
- CookieInfo
- Information about an HTTP cookie received from a response.
- CrawlConfig
- Configuration for crawl, scrape, and map operations.
- CrawlConfigBuilder
- Fluent builder for CrawlConfig.
- CrawlEngine
- The main crawl engine, composed of pluggable trait implementations.
- CrawlEngineBuilder
- Builder for CrawlEngine.
- CrawlEngineHandle
- Opaque handle to a configured crawl engine.
- CrawlPageResult
- The result of crawling a single page during a crawl operation.
- CrawlResult
- The result of a multi-page crawl operation.
- CrawlStreamRequest
- Request to begin a single-URL streaming crawl.
- DefaultAntibotStrategy
- Default AntibotStrategy that mirrors the pre-Cluster-5 engine behaviour.
- DfsStrategy
- A depth-first crawl strategy.
- DispatchProfile
- Bundle of pluggable dispatch components attached to crate::types::CrawlConfig.
- DispatchProfileBuilder
- Fluent builder for DispatchProfile.
- DomainObservation
- Single fetch outcome reported to DomainStatePort::observe. The backend turns these into its own state model (EWMA, rule-based, histogram, etc.).
- DomainRecommendation
- Recommendation returned by DomainStatePort::recommend for the next fetch attempt against a domain. Generic over the backend’s internal model: the only data the engine needs to act on is which tier to start at and how confident the backend is in that choice.
- DownloadedAsset
- A downloaded asset from a page.
- DownloadedDocument
- A downloaded non-HTML document (PDF, DOCX, image, code file, etc.).
- EwmaDomainState
- Process-local domain state backed by an EWMA block-rate model. DashMap-backed and ephemeral: no persistence across restarts. For multi-process / multi-tenant learning, use xberg-enterprise’s PostgresDomainState.
- EwmaTracker
- Pure-math EWMA with promote/demote thresholds. Stateless: the caller supplies the prior and the observation.
- ExtractionMeta
- Metadata about an LLM extraction pass.
- FaviconInfo
- Information about a favicon or icon link.
- FeedInfo
- Information about a feed link found on a page.
- FixedBudget
- EscalationBudget backed by an atomic counter. Decrements on each try_consume; returns Err(BudgetExhausted) once the remaining budget can’t cover the request. Useful for self-hosters that want per-process spend caps without a database.
- HeadingInfo
- A heading element extracted from the page.
- HreflangEntry
- An hreflang alternate link entry.
- ImageInfo
- Information about an image found on a page.
- InMemoryFrontier
- A simple in-memory URL frontier with deduplication.
- InteractionResult
- Result of executing a sequence of page interaction actions.
- JsonLdEntry
- A JSON-LD structured data entry found on a page.
- LearningRetryPolicy
- Retry policy that consults a DomainStatePort for the per-domain prior on each decision. Falls back to SimpleRetryPolicy semantics when no state is available for the domain.
- LinkInfo
- Information about a link found on a page.
- MapResult
- The result of a map operation, containing discovered URLs.
- MarkdownResult
- Rich markdown conversion result from HTML processing.
- NoopCache
- No-op cache that never stores or returns anything.
- NoopEmitter
- An event emitter that does nothing: all events are silently discarded.
- NoopFilter
- A content filter that passes everything through without modification.
- NoopStore
- A store that does nothing: crawl results are discarded.
- PageMetadata
- Metadata extracted from an HTML page’s <meta> tags and <title> element.
- PerDomainThrottle
- A per-domain token bucket rate limiter.
- ProxyConfig
- Proxy configuration for HTTP requests.
- ResponseMeta
- Response metadata extracted from HTTP headers.
- ScrapeResult
- The result of a single-page scrape operation.
- SimpleRetryPolicy
- Per-error mapping with no learning. The simplest possible RetryPolicy, useful as a baseline and as a fallback when no state backend is configured.
- SitemapUrl
- A URL entry from a sitemap.
- TomlClassifier
- Default WafClassifier backed by a TOML fingerprint corpus.
- UnlimitedBudget
- EscalationBudget that always permits escalation. Used by default when no budget is configured on CrawlConfig.
- WafRules
- Compiled WAF rules: fingerprint list + single Aho-Corasick automaton.
- WafSignal
- Output of a WAF classifier: a single fingerprint match.
- WafWatchHandle
- Drop-on-shutdown handle returned by TomlClassifier::watch.
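The EwmaTracker entry above describes a stateless EWMA with promote/demote thresholds, where the caller supplies the prior and the observation. The following is a minimal sketch of that idea, not crawlberg's actual type: the names `EwmaSketch`, `Advice`, `update`, and `advise`, the field names, and the reading of "promote" as "block rate is high, start at a heavier tier" are all assumptions made for illustration.

```rust
/// Hypothetical advice derived from a smoothed block rate (not a crawlberg type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Advice {
    Promote,
    Hold,
    Demote,
}

/// Holds tuning constants only; no per-domain state lives here.
#[derive(Debug, Clone, Copy)]
pub struct EwmaSketch {
    /// Smoothing factor in (0, 1]; larger values weight recent outcomes more.
    pub alpha: f64,
    /// Assumed threshold: at or above this block rate, advise promotion.
    pub promote_above: f64,
    /// Assumed threshold: at or below this block rate, advise demotion.
    pub demote_below: f64,
}

impl EwmaSketch {
    /// Standard EWMA step: new = alpha * observation + (1 - alpha) * prior.
    /// A blocked fetch counts as 1.0, a successful one as 0.0.
    pub fn update(&self, prior: f64, blocked: bool) -> f64 {
        let observation = if blocked { 1.0 } else { 0.0 };
        self.alpha * observation + (1.0 - self.alpha) * prior
    }

    /// Compare a smoothed rate against the two thresholds.
    pub fn advise(&self, rate: f64) -> Advice {
        if rate >= self.promote_above {
            Advice::Promote
        } else if rate <= self.demote_below {
            Advice::Demote
        } else {
            Advice::Hold
        }
    }
}
```

Because the helper keeps no state, the caller decides where the prior is stored (for example a per-domain map), which matches the "caller supplies the prior" wording in the entry.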
Enums
- AntibotError
- Errors from AntibotStrategy hook invocations.
- AssetCategory
- The category of a downloaded asset.
- AuthConfig
- Authentication configuration.
- BrowserBackend
- Browser backend used for JavaScript rendering.
- BrowserMode
- When to use the headless browser fallback.
- BrowserWait
- Wait strategy for browser page rendering.
- CrawlError
- Errors that can occur during crawling, scraping, or mapping operations.
- CrawlEvent
- An event emitted during a streaming crawl operation.
- Decision
- What the dispatch loop does after the antibot post_response hook runs.
- EscalationReason
- Why the dispatcher should escalate to the next tier.
- EscalationStrategy
- Defines the escalation chain when a tier produces a block signal.
- FeedType
- The type of a feed (RSS, Atom, or JSON Feed).
- ImageSource
- The source of an image reference.
- LinkType
- The classification of a link.
- ObservedOutcome
- Classification of a single fetch outcome.
- RetryDirective
- What the dispatcher does next, returned by RetryPolicy::decide.
- Tier
- Which tier produced the current attempt’s outcome.
- WafClassifyError
- Errors returned by WafClassifier::classify.
- WafRulesError
- Error returned when loading or validating a rules file.
- WafWatchError
- Error returned when setting up a WatchHandle.
Traits
- AntibotStrategy
- Pluggable antibot hook pair.
- BypassProvider
- Caller-supplied bypass backend. Implementations are responsible for vendor authentication, request shaping, response decoding, and mapping vendor errors into CrawlError.
- DomainStatePort
- Persistent per-domain dispatch state.
- EscalationBudget
- Pluggable per-job escalation budget.
- RetryPolicy
- Pluggable per-attempt decision policy.
- WafClassifier
- Pluggable WAF detection.
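The FixedBudget struct is described as an EscalationBudget backed by an atomic counter that decrements on each try_consume and fails once the remaining budget cannot cover the request. The sketch below shows one way such a counter could work using only the standard library. It does not reproduce crawlberg's trait or signatures: `FixedBudgetSketch`, `Exhausted`, the `cost` parameter, and the `u64` unit are hypothetical names and choices made for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the "budget exhausted" error.
#[derive(Debug, PartialEq, Eq)]
pub struct Exhausted;

/// Process-local spend cap backed by a single atomic counter.
pub struct FixedBudgetSketch {
    remaining: AtomicU64,
}

impl FixedBudgetSketch {
    pub fn new(units: u64) -> Self {
        Self { remaining: AtomicU64::new(units) }
    }

    /// Atomically subtract `cost` if enough budget remains.
    /// Returns the budget left after the deduction, or `Exhausted`
    /// (leaving the counter untouched) when `cost` exceeds what is left.
    pub fn try_consume(&self, cost: u64) -> Result<u64, Exhausted> {
        self.remaining
            // `checked_sub` yields None on underflow, which aborts the update.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                current.checked_sub(cost)
            })
            // On success `fetch_update` returns the previous value.
            .map(|previous| previous - cost)
            .map_err(|_| Exhausted)
    }

    pub fn remaining(&self) -> u64 {
        self.remaining.load(Ordering::Acquire)
    }
}
```

Using `fetch_update` with `checked_sub` keeps the check and the decrement in one atomic step, so concurrent callers cannot overspend the cap, and a rejected request does not change the remaining budget.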
Functions
- batch_crawl
- Crawl multiple seed URLs concurrently, each following links to configured depth.
- batch_crawl_stream
- Stream a multi-URL crawl, yielding CrawlEvents across all seeds.
- batch_scrape
- Scrape multiple URLs concurrently.
- crawl
- Crawl a website starting from url, following links up to the configured depth.
- crawl_stream
- Stream a single-URL crawl, yielding CrawlEvents as pages are processed.
- create_engine
- Create a new crawl engine with the given configuration.
- default_retry_policy
- Convenience constructor: Arc<dyn RetryPolicy> for the default policy.
- generate_citations
- Convert markdown links to numbered citations.
- in_memory_domain_state
- Convenience constructor: Arc<dyn DomainStatePort> backed by an in-memory EWMA map.
- interact
- Execute browser actions on a single page.
- map_urls
- Discover all pages on a website by following links and sitemaps.
- scrape
- Scrape a single URL, returning extracted page data.
- unlimited_budget
- Convenience constructor: Arc<dyn EscalationBudget> that never blocks.
- waf_rules_from_path
- Load and compile rules from a TOML file on disk.
- waf_rules_from_str
- Load and compile rules from a TOML string.
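To make "convert markdown links to numbered citations" concrete, here is a small standalone sketch of the transformation. It is not the signature or output format of generate_citations, which this page does not show: the function name `cite`, the `text[N]` output shape, the tuple return type, and the choice to reuse one number for a repeated URL are all assumptions made for illustration.

```rust
/// Rewrite inline `[text](url)` links as `text[N]` and collect the URLs.
/// Returns the rewritten text plus the reference list, where entry `i`
/// (0-based) corresponds to marker `[i + 1]`.
pub fn cite(markdown: &str) -> (String, Vec<String>) {
    let mut out = String::new();
    let mut refs: Vec<String> = Vec::new();
    let mut rest = markdown;

    while let Some(open) = rest.find('[') {
        let after = &rest[open + 1..];
        // Try to read `text](url)` immediately after the opening bracket.
        let parsed = after.find("](").and_then(|mid| {
            let text = &after[..mid];
            let tail = &after[mid + 2..];
            let close = tail.find(')')?;
            // Keep the sketch simple: skip nested brackets and multi-line text.
            if text.contains('[') || text.contains('\n') {
                return None;
            }
            Some((text, &tail[..close], mid + 2 + close + 1))
        });

        match parsed {
            Some((text, url, consumed)) => {
                out.push_str(&rest[..open]);
                // Assumption: a URL seen before reuses its existing number.
                let n = match refs.iter().position(|r| r == url) {
                    Some(i) => i + 1,
                    None => {
                        refs.push(url.to_string());
                        refs.len()
                    }
                };
                out.push_str(text);
                out.push_str(&format!("[{n}]"));
                rest = &after[consumed..];
            }
            None => {
                // Not a link: copy through the bracket and keep scanning.
                out.push_str(&rest[..=open]);
                rest = after;
            }
        }
    }
    out.push_str(rest);
    (out, refs)
}
```

This sketch does not handle image syntax, reference-style links, or URLs containing parentheses; a production converter would typically work from a parsed markdown tree instead of string scanning.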
Type Aliases
- DynAntibotStrategy
- Convenience alias.
- DynBypassProvider
- Convenience type alias used on CrawlConfig.bypass.
- DynDomainStatePort
- Convenience alias for an owned, type-erased domain-state backend.
- DynEscalationBudget
- Convenience alias for an owned, type-erased budget.
- DynRetryPolicy
- Convenience alias for an owned, type-erased retry policy on crate::types::CrawlConfig.
- DynWafClassifier
- Convenience alias for an owned, type-erased WAF classifier.