Crate crawlberg

crawlberg – A Rust crawling engine for turning websites into structured data.

Re-exports

pub use budget::BudgetError;
pub use budget::DefaultPageBudget;
pub use budget::PageBudget;
pub use interact::MAX_ACTIONS;
pub use interact::MAX_SCRIPT_LEN;
pub use interact::MAX_SCROLL_AMOUNT;
pub use interact::MAX_SELECTOR_LEN;
pub use interact::MAX_SINGLE_WAIT_MS;
pub use interact::MAX_TEXT_LEN;
pub use interact::MAX_TOTAL_WAIT_SECS;
pub use interact::PageAction;
pub use interact::ScrollDirection;
pub use interact::validate_actions;
pub use net::ssrf::HostMatcher;
pub use net::ssrf::SsrfError;
pub use net::ssrf::SsrfPolicy;
pub use net::ssrf::validate_url;
pub use proxy::ProxyProvider;
pub use proxy::StaticProxyProvider;
pub use sink::EventSink;
pub use sink::MultiEventSink;
pub use sink::TracingEventSink;
pub use telemetry::current_traceparent;
pub use telemetry::with_traceparent;

Modules

budget
Pluggable page budget hook for controlling crawl extent.
http
HTTP fetching with redirect handling, retry logic, and cookie extraction.
interact
Page interaction module for action-based browser automation.
net
Network utilities: SSRF policy, validation, and security.
proxy
Proxy provider trait and a baseline implementation.
robots
Robots.txt parsing and path-matching logic.
sink
Pluggable event sink for streaming crawl events.
sitemap
Sitemap XML parsing and recursive fetching.
telemetry
OpenTelemetry foundations for crawlberg.
traits
Trait-based extension points for the crawl engine.
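
The net module's SSRF policy is not spelled out on this page. As a rough illustration of the kind of address check such a policy usually performs, the sketch below rejects loopback, private, and link-local targets using only the standard library. The function name `is_internal_ip` is hypothetical and is not part of crawlberg; the crate's own `validate_url` and `SsrfPolicy` may work differently.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical helper (not crawlberg's API): returns true when `ip`
/// is an address a crawler should normally refuse to fetch.
pub fn is_internal_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_internal_v4(v4),
        IpAddr::V6(v6) => is_internal_v6(v6),
    }
}

fn is_internal_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()           // 127.0.0.0/8
        || ip.is_private()     // 10/8, 172.16/12, 192.168/16
        || ip.is_link_local()  // 169.254.0.0/16, used by cloud metadata endpoints
        || ip.is_unspecified() // 0.0.0.0
        || ip.is_broadcast()   // 255.255.255.255
}

fn is_internal_v6(ip: Ipv6Addr) -> bool {
    // IPv4-mapped addresses (::ffff:a.b.c.d) are judged by their IPv4 part.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_internal_v4(v4);
    }
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}
```

A check like this only covers literal addresses. A complete policy would also need to resolve hostnames before checking and re-validate every redirect target.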

Structs

ActionResult
Result from a single page action execution.
AdaptiveStrategy
Adaptive crawling strategy that stops when content saturation is detected.
ArticleMetadata
Article metadata extracted from article:* Open Graph tags.
AttemptOutcome
Rich context passed to RetryPolicy::decide on each attempt.
BatchCrawlResult
Result from a single URL in a batch crawl operation.
BatchCrawlResults
Aggregate result of a batch crawl, exposing per-URL results plus precomputed counts.
BatchCrawlStreamRequest
Request to begin a multi-URL streaming crawl.
BatchScrapeResult
Result from a single URL in a batch scrape operation.
BatchScrapeResults
Aggregate result of a batch scrape, exposing per-URL results plus precomputed counts.
BestFirstStrategy
A best-first crawl strategy.
BfsStrategy
A breadth-first crawl strategy.
BrowserConfig
Browser fallback configuration.
BrowserExtras
Browser-specific extras populated when the native browser backend was used.
BudgetExhausted
Returned by EscalationBudget::try_consume when no budget remains.
BypassResponse
Response returned by a BypassProvider::fetch call.
CachedPage
Cached page data for HTTP response caching.
CitationReference
A single numbered reference in a citation list — produced by the citation extractor when content uses inline [N]-style markers.
CitationResult
Result of citation conversion.
ContentConfig
Content extraction and conversion configuration.
CookieInfo
Information about an HTTP cookie received from a response.
CrawlConfig
Configuration for crawl, scrape, and map operations.
CrawlConfigBuilder
Fluent builder for CrawlConfig.
CrawlEngine
The main crawl engine, composed of pluggable trait implementations.
CrawlEngineBuilder
Builder for CrawlEngine.
CrawlEngineHandle
Opaque handle to a configured crawl engine.
CrawlPageResult
The result of crawling a single page during a crawl operation.
CrawlResult
The result of a multi-page crawl operation.
CrawlStreamRequest
Request to begin a single-URL streaming crawl.
DefaultAntibotStrategy
Default AntibotStrategy that mirrors the pre-Cluster-5 engine behaviour.
DfsStrategy
A depth-first crawl strategy.
DispatchProfile
Bundle of pluggable dispatch components attached to crate::types::CrawlConfig.
DispatchProfileBuilder
Fluent builder for DispatchProfile.
DomainObservation
Single fetch outcome reported to DomainStatePort::observe. The backend turns these into its own state model (EWMA, rule-based, histogram, etc.).
DomainRecommendation
Recommendation returned by DomainStatePort::recommend for the next fetch attempt against a domain. Generic over the backend’s internal model — the only data the engine needs to act on is which tier to start at and how confident the backend is in that choice.
DownloadedAsset
A downloaded asset from a page.
DownloadedDocument
A downloaded non-HTML document (PDF, DOCX, image, code file, etc.).
EwmaDomainState
Process-local domain state backed by an EWMA block-rate model. DashMap-backed, ephemeral — no persistence across restarts. For multi-process / multi-tenant learning, use xberg-enterprise’s PostgresDomainState.
EwmaTracker
Pure-math EWMA with promote/demote thresholds. Stateless — caller supplies the prior and the observation.
ExtractionMeta
Metadata about an LLM extraction pass.
FaviconInfo
Information about a favicon or icon link.
FeedInfo
Information about a feed link found on a page.
FixedBudget
EscalationBudget backed by an atomic counter. Decrements on each try_consume; returns Err(BudgetExhausted) once the remaining budget can’t cover the request. Useful for self-hosters that want per-process spend caps without a database.
HeadingInfo
A heading element extracted from the page.
HreflangEntry
An hreflang alternate link entry.
ImageInfo
Information about an image found on a page.
InMemoryFrontier
A simple in-memory URL frontier with deduplication.
InteractionResult
Result of executing a sequence of page interaction actions.
JsonLdEntry
A JSON-LD structured data entry found on a page.
LearningRetryPolicy
Retry policy that consults a DomainStatePort for the per-domain prior on each decision. Falls back to SimpleRetryPolicy semantics when no state is available for the domain.
LinkInfo
Information about a link found on a page.
MapResult
The result of a map operation, containing discovered URLs.
MarkdownResult
Rich markdown conversion result from HTML processing.
NoopCache
No-op cache that never stores or returns anything.
NoopEmitter
An event emitter that does nothing – all events are silently discarded.
NoopFilter
A content filter that passes everything through without modification.
NoopStore
A store that does nothing – crawl results are discarded.
PageMetadata
Metadata extracted from an HTML page’s <meta> tags and <title> element.
PerDomainThrottle
A per-domain token bucket rate limiter.
ProxyConfig
Proxy configuration for HTTP requests.
ResponseMeta
Response metadata extracted from HTTP headers.
ScrapeResult
The result of a single-page scrape operation.
SimpleRetryPolicy
Per-error mapping with no learning. The simplest possible RetryPolicy — useful as a baseline and as a fallback when no state backend is configured.
SitemapUrl
A URL entry from a sitemap.
TomlClassifier
Default WafClassifier backed by a TOML fingerprint corpus.
UnlimitedBudget
EscalationBudget that always permits escalation. Used by default when no budget is configured on CrawlConfig.
WafRules
Compiled WAF rules: fingerprint list + single Aho-Corasick automaton.
WafSignal
Output of a WAF classifier — a single fingerprint match.
WafWatchHandle
Drop-on-shutdown handle returned by TomlClassifier::watch.
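
The EwmaTracker entry above describes a stateless EWMA with promote/demote thresholds, where the caller supplies the prior and the observation. A minimal sketch of that idea follows. The type names `Ewma` and `Signal`, the field names, and the reading of "promote" as "block rate is high, move to a heavier tier" are all assumptions made for illustration, not the crate's actual definitions.

```rust
/// Hypothetical tier-change signal; crawlberg's real types may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Signal {
    Promote,
    Hold,
    Demote,
}

/// Holds tuning constants only. The running rate lives with the caller,
/// which is what keeps the helper stateless.
#[derive(Debug, Clone, Copy)]
pub struct Ewma {
    /// Weight of the newest observation, in (0, 1].
    pub alpha: f64,
    /// Block rate above which the caller should promote.
    pub promote_above: f64,
    /// Block rate below which the caller may demote.
    pub demote_below: f64,
}

impl Ewma {
    /// Blend the prior block rate with one observation
    /// (blocked counts as 1.0, not blocked as 0.0).
    pub fn update(&self, prior: f64, blocked: bool) -> f64 {
        let x = if blocked { 1.0 } else { 0.0 };
        self.alpha * x + (1.0 - self.alpha) * prior
    }

    /// Map a block rate onto a promote/hold/demote signal.
    pub fn classify(&self, rate: f64) -> Signal {
        if rate > self.promote_above {
            Signal::Promote
        } else if rate < self.demote_below {
            Signal::Demote
        } else {
            Signal::Hold
        }
    }
}
```

Keeping a gap between the two thresholds gives hysteresis, so a domain hovering near one cut-off does not flip tiers on every request.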

Enums

AntibotError
Errors from AntibotStrategy hook invocations.
AssetCategory
The category of a downloaded asset.
AuthConfig
Authentication configuration.
BrowserBackend
Browser backend used for JavaScript rendering.
BrowserMode
When to use the headless browser fallback.
BrowserWait
Wait strategy for browser page rendering.
CrawlError
Errors that can occur during crawling, scraping, or mapping operations.
CrawlEvent
An event emitted during a streaming crawl operation.
Decision
What the dispatch loop does after the antibot post_response hook runs.
EscalationReason
Why the dispatcher should escalate to the next tier.
EscalationStrategy
Defines the escalation chain when a tier produces a block signal.
FeedType
The type of a feed (RSS, Atom, or JSON Feed).
ImageSource
The source of an image reference.
LinkType
The classification of a link.
ObservedOutcome
Classification of a single fetch outcome.
RetryDirective
What the dispatcher does next, returned by RetryPolicy::decide.
Tier
Which tier produced the current attempt’s outcome.
WafClassifyError
Errors returned by WafClassifier::classify.
WafRulesError
Error returned when loading or validating a rules file.
WafWatchError
Error returned when setting up a WafWatchHandle.
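
Several of the enums above (ObservedOutcome, RetryDirective, EscalationReason) describe a loop in which each fetch outcome is mapped to a next step. Their variants are not listed on this page, so the sketch below invents its own `Outcome` and `Directive` enums purely to show the shape of a per-error mapping with no learning. None of these names or variants should be read as crawlberg's.

```rust
/// Hypothetical outcome classification (variants are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Blocked,
    Timeout,
    NotFound,
}

/// Hypothetical next-step directive (variants are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Directive {
    Accept,
    RetrySameTier,
    Escalate,
    GiveUp,
}

/// Fixed per-error mapping. `attempt` is 1-based.
pub fn decide(outcome: Outcome, attempt: u32, max_attempts: u32) -> Directive {
    match outcome {
        Outcome::Success => Directive::Accept,
        // A missing page will not appear on retry.
        Outcome::NotFound => Directive::GiveUp,
        // Out of attempts: stop regardless of the error kind.
        _ if attempt >= max_attempts => Directive::GiveUp,
        // Timeouts are often transient, so try the same tier again.
        Outcome::Timeout => Directive::RetrySameTier,
        // A block suggests the current tier is being fingerprinted.
        Outcome::Blocked => Directive::Escalate,
    }
}
```

A learning policy would replace the fixed arms with a lookup into per-domain state, falling back to a mapping like this one when no state exists.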

Traits

AntibotStrategy
Pluggable antibot hook pair.
BypassProvider
Caller-supplied bypass backend. Implementations are responsible for vendor authentication, request shaping, response decoding, and mapping vendor errors into CrawlError.
DomainStatePort
Persistent per-domain dispatch state.
EscalationBudget
Pluggable per-job escalation budget.
RetryPolicy
Pluggable per-attempt decision policy.
WafClassifier
Pluggable WAF detection.
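
The FixedBudget entry describes an escalation budget backed by an atomic counter whose try_consume fails once the remaining budget cannot cover the request. The sketch below shows one way that pattern can be written with the standard library. The trait `Budget`, the struct `AtomicBudget`, the error type `Exhausted`, and the `u64` unit are stand-ins chosen for illustration; the real EscalationBudget signature is not shown on this page and may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for an "out of budget" error.
#[derive(Debug, PartialEq, Eq)]
pub struct Exhausted;

/// Hypothetical trait shape, assumed for this sketch only.
pub trait Budget: Send + Sync {
    fn try_consume(&self, units: u64) -> Result<(), Exhausted>;
}

/// Per-process spend cap held in a single atomic counter.
pub struct AtomicBudget {
    remaining: AtomicU64,
}

impl AtomicBudget {
    pub fn new(units: u64) -> Self {
        Self { remaining: AtomicU64::new(units) }
    }

    pub fn remaining(&self) -> u64 {
        self.remaining.load(Ordering::Acquire)
    }
}

impl Budget for AtomicBudget {
    fn try_consume(&self, units: u64) -> Result<(), Exhausted> {
        // fetch_update retries the closure under contention. Returning
        // None (checked_sub underflow) aborts and leaves the counter as is,
        // so a request that is too large never drives it below zero.
        self.remaining
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
                cur.checked_sub(units)
            })
            .map(|_| ())
            .map_err(|_| Exhausted)
    }
}
```

Because the check and the decrement happen in one atomic update, concurrent callers cannot jointly overspend the cap.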

Functions

batch_crawl
Crawl multiple seed URLs concurrently, each following links to configured depth.
batch_crawl_stream
Stream a multi-URL crawl, yielding CrawlEvents across all seeds.
batch_scrape
Scrape multiple URLs concurrently.
crawl
Crawl a website starting from url, following links up to the configured depth.
crawl_stream
Stream a single-URL crawl, yielding CrawlEvents as pages are processed.
create_engine
Create a new crawl engine with the given configuration.
default_retry_policy
Convenience constructor: Arc<dyn RetryPolicy> for the default policy.
generate_citations
Convert markdown links to numbered citations.
in_memory_domain_state
Convenience constructor: Arc<dyn DomainStatePort> backed by an in-memory EWMA map.
interact
Execute browser actions on a single page.
map_urls
Discover all pages on a website by following links and sitemaps.
scrape
Scrape a single URL, returning extracted page data.
unlimited_budget
Convenience constructor: Arc<dyn EscalationBudget> that never blocks.
waf_rules_from_path
Load and compile rules from a TOML file on disk.
waf_rules_from_str
Load and compile rules from a TOML string.
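
generate_citations is described as converting markdown links to numbered citations. To make that transformation concrete, here is a small standard-library sketch that rewrites inline `[text](url)` links as `text[N]` and collects the URLs in order. The function names `to_citations` and `parse_link`, the output format, the reuse of a number for a repeated URL, and the decision to leave images untouched are all assumptions; the crate's own function and its CitationResult type may behave differently.

```rust
/// Hypothetical sketch, not crawlberg's implementation: rewrites inline
/// markdown links `[text](url)` as `text[N]` and returns the URL list.
pub fn to_citations(markdown: &str) -> (String, Vec<String>) {
    let mut out = String::with_capacity(markdown.len());
    let mut refs: Vec<String> = Vec::new();
    let mut rest = markdown;

    while let Some(open) = rest.find('[') {
        // `![alt](src)` is an image, which this sketch leaves as is.
        let is_image = rest[..open].ends_with('!');
        match parse_link(&rest[open..]) {
            Some((text, url, len)) if !is_image => {
                out.push_str(&rest[..open]);
                // Reuse the number when the same URL appears again.
                let n = match refs.iter().position(|r| r.as_str() == url) {
                    Some(i) => i + 1,
                    None => {
                        refs.push(url.to_string());
                        refs.len()
                    }
                };
                out.push_str(text);
                out.push_str(&format!("[{}]", n));
                rest = &rest[open + len..];
            }
            _ => {
                // Not a link: copy through the `[` and keep scanning.
                out.push_str(&rest[..=open]);
                rest = &rest[open + 1..];
            }
        }
    }
    out.push_str(rest);
    (out, refs)
}

/// Parses `[text](url)` at the start of `s` (which must begin with `[`).
/// Returns the text, the URL, and the byte length of the whole link.
fn parse_link(s: &str) -> Option<(&str, &str, usize)> {
    let mid = s.find("](")?;
    let text = &s[1..mid];
    let close = s[mid + 2..].find(')')?;
    let url = &s[mid + 2..mid + 2 + close];
    // Reject spans that cross another bracket, a line break, or a
    // URL containing whitespace; these are not simple inline links.
    if text.contains('[')
        || text.contains('\n')
        || url.is_empty()
        || url.contains(char::is_whitespace)
    {
        return None;
    }
    Some((text, url, mid + 2 + close + 1))
}
```

This handles only simple inline links. Reference-style links, link titles, and URLs containing parentheses would need a real markdown parser.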

Type Aliases

DynAntibotStrategy
Convenience alias for a type-erased AntibotStrategy.
DynBypassProvider
Convenience type alias used on CrawlConfig.bypass.
DynDomainStatePort
Convenience alias for an owned, type-erased domain-state backend.
DynEscalationBudget
Convenience alias for an owned, type-erased budget.
DynRetryPolicy
Convenience alias for an owned, type-erased retry policy on crate::types::CrawlConfig.
DynWafClassifier
Convenience alias for an owned, type-erased WAF classifier.