Crate crawlberg

Source

Expand description

crawlberg – A Rust crawling engine for turning websites into structured data.

Re-exports§

pub use budget::BudgetError;
pub use budget::DefaultPageBudget;
pub use budget::PageBudget;
pub use interact::MAX_ACTIONS;
pub use interact::MAX_SCRIPT_LEN;
pub use interact::MAX_SCROLL_AMOUNT;
pub use interact::MAX_SELECTOR_LEN;
pub use interact::MAX_SINGLE_WAIT_MS;
pub use interact::MAX_TEXT_LEN;
pub use interact::MAX_TOTAL_WAIT_SECS;
pub use interact::PageAction;
pub use interact::ScrollDirection;
pub use interact::validate_actions;
pub use net::ssrf::HostMatcher;
pub use net::ssrf::SsrfError;
pub use net::ssrf::SsrfPolicy;
pub use net::ssrf::validate_url;
pub use proxy::ProxyProvider;
pub use proxy::StaticProxyProvider;
pub use sink::EventSink;
pub use sink::MultiEventSink;
pub use sink::TracingEventSink;
pub use telemetry::current_traceparent;
pub use telemetry::with_traceparent;

Modules§

budget: Pluggable page budget hook for controlling crawl extent.
http: HTTP fetching with redirect handling, retry logic, and cookie extraction.
interact: Page interaction module for action-based browser automation.
net: Network utilities: SSRF policy, validation, and security.
proxy: Proxy provider trait + baseline impl.
robots: Robots.txt parsing and path-matching logic.
sink: Pluggable event sink for streaming crawl events.
sitemap: Sitemap XML parsing and recursive fetching.
telemetry: OpenTelemetry foundations for crawlberg.
traits: Trait-based extension points for the crawl engine.

Structs§

ActionResult: Result from a single page action execution.
AdaptiveStrategy: Adaptive crawling strategy that stops when content saturation is detected.
ArticleMetadata: Article metadata extracted from article:* Open Graph tags.
AttemptOutcome: Rich context passed to RetryPolicy::decide on each attempt.
BatchCrawlResult: Result from a single URL in a batch crawl operation.
BatchCrawlResults: Aggregate result of a batch crawl, exposing per-URL results plus precomputed counts.
BatchCrawlStreamRequest: Request to begin a multi-URL streaming crawl.
BatchScrapeResult: Result from a single URL in a batch scrape operation.
BatchScrapeResults: Aggregate result of a batch scrape, exposing per-URL results plus precomputed counts.
BestFirstStrategy: A best-first crawl strategy.
BfsStrategy: A breadth-first crawl strategy.
BrowserConfig: Browser fallback configuration.
BrowserExtras: Browser-specific extras populated when the native browser backend was used.
BudgetExhausted: Returned by EscalationBudget::try_consume when no budget remains.
BypassResponse: Response returned by a BypassProvider::fetch call.
CachedPage: Cached page data for HTTP response caching.
CitationReference: A single numbered reference in a citation list — produced by the citation extractor when content uses inline [N]-style markers.
CitationResult: Result of citation conversion.
ContentConfig: Content extraction and conversion configuration.
CookieInfo: Information about an HTTP cookie received from a response.
CrawlConfig: Configuration for crawl, scrape, and map operations.
CrawlConfigBuilder: Fluent builder for CrawlConfig.
CrawlEngine: The main crawl engine, composed of pluggable trait implementations.
CrawlEngineBuilder: Builder for CrawlEngine.
CrawlEngineHandle: Opaque handle to a configured crawl engine.
CrawlPageResult: The result of crawling a single page during a crawl operation.
CrawlResult: The result of a multi-page crawl operation.
CrawlStreamRequest: Request to begin a single-URL streaming crawl.
DefaultAntibotStrategy: Default AntibotStrategy that mirrors the pre-Cluster-5 engine behaviour.
DfsStrategy: A depth-first crawl strategy.
DispatchProfile: Bundle of pluggable dispatch components attached to crate::types::CrawlConfig.
DispatchProfileBuilder: Fluent builder for DispatchProfile.
DomainObservation: Single fetch outcome reported to DomainStatePort::observe. The backend turns these into its own state model (EWMA, rule-based, histogram, etc).
DomainRecommendation: Recommendation returned by DomainStatePort::recommend for the next fetch attempt against a domain. Generic over the backend’s internal model — the only data the engine needs to act on is which tier to start at and how confident the backend is in that choice.
DownloadedAsset: A downloaded asset from a page.
DownloadedDocument: A downloaded non-HTML document (PDF, DOCX, image, code file, etc.).
EwmaDomainState: Process-local domain state backed by an EWMA block-rate model. DashMap-backed, ephemeral — no persistence across restarts. For multi-process / multi-tenant learning, use xberg-enterprise’s PostgresDomainState.
EwmaTracker: Pure-math EWMA with promote/demote thresholds. Stateless — caller supplies the prior and the observation.
ExtractionMeta: Metadata about an LLM extraction pass.
FaviconInfo: Information about a favicon or icon link.
FeedInfo: Information about a feed link found on a page.
FixedBudget: EscalationBudget backed by an atomic counter. Decrements on each try_consume; returns Err(BudgetExhausted) once the remaining budget can’t cover the request. Useful for self-hosters that want per-process spend caps without a database.
HeadingInfo: A heading element extracted from the page.
HreflangEntry: An hreflang alternate link entry.
ImageInfo: Information about an image found on a page.
InMemoryFrontier: A simple in-memory URL frontier with deduplication.
InteractionResult: Result of executing a sequence of page interaction actions.
JsonLdEntry: A JSON-LD structured data entry found on a page.
LearningRetryPolicy: Retry policy that consults a DomainStatePort for the per-domain prior on each decision. Falls back to SimpleRetryPolicy semantics when no state is available for the domain.
LinkInfo: Information about a link found on a page.
MapResult: The result of a map operation, containing discovered URLs.
MarkdownResult: Rich markdown conversion result from HTML processing.
NoopCache: No-op cache that never stores or returns anything.
NoopEmitter: An event emitter that does nothing – all events are silently discarded.
NoopFilter: A content filter that passes everything through without modification.
NoopStore: A store that does nothing – crawl results are discarded.
PageMetadata: Metadata extracted from an HTML page’s <meta> tags and <title> element.
PerDomainThrottle: A per-domain token bucket rate limiter.
ProxyConfig: Proxy configuration for HTTP requests.
ResponseMeta: Response metadata extracted from HTTP headers.
ScrapeResult: The result of a single-page scrape operation.
SimpleRetryPolicy: Per-error mapping with no learning. The simplest possible RetryPolicy — useful as a baseline and as a fallback when no state backend is configured.
SitemapUrl: A URL entry from a sitemap.
TomlClassifier: Default WafClassifier backed by a TOML fingerprint corpus.
UnlimitedBudget: EscalationBudget that always permits escalation. Used by default when no budget is configured on CrawlConfig.
WafRules: Compiled WAF rules: fingerprint list + single Aho-Corasick automaton.
WafSignal: Output of a WAF classifier — a single fingerprint match.
WafWatchHandle: Drop-on-shutdown handle returned by TomlClassifier::watch.

Enums§

AntibotError: Errors from AntibotStrategy hook invocations.
AssetCategory: The category of a downloaded asset.
AuthConfig: Authentication configuration.
BrowserBackend: Browser backend used for JavaScript rendering.
BrowserMode: When to use the headless browser fallback.
BrowserWait: Wait strategy for browser page rendering.
CrawlError: Errors that can occur during crawling, scraping, or mapping operations.
CrawlEvent: An event emitted during a streaming crawl operation.
Decision: What the dispatch loop does after the antibot post_response hook runs.
EscalationReason: Why the dispatcher should escalate to the next tier.
EscalationStrategy: Defines the escalation chain when a tier produces a block signal.
FeedType: The type of a feed (RSS, Atom, or JSON Feed).
ImageSource: The source of an image reference.
LinkType: The classification of a link.
ObservedOutcome: Classification of a single fetch outcome.
RetryDirective: What the dispatcher does next, returned by RetryPolicy::decide.
Tier: Which tier produced the current attempt’s outcome.
WafClassifyError: Errors returned by WafClassifier::classify.
WafRulesError: Error returned when loading or validating a rules file.
WafWatchError: Error returned when setting up a WatchHandle.

Traits§

AntibotStrategy: Pluggable antibot hook pair.
BypassProvider: Caller-supplied bypass backend. Implementations are responsible for vendor authentication, request shaping, response decoding, and mapping vendor errors into CrawlError.
DomainStatePort: Persistent per-domain dispatch state.
EscalationBudget: Pluggable per-job escalation budget.
RetryPolicy: Pluggable per-attempt decision policy.
WafClassifier: Pluggable WAF detection.

Functions§

batch_crawl: Crawl multiple seed URLs concurrently, each following links to configured depth.
batch_crawl_stream: Stream a multi-URL crawl, yielding CrawlEvents across all seeds.
batch_scrape: Scrape multiple URLs concurrently.
crawl: Crawl a website starting from url, following links up to the configured depth.
crawl_stream: Stream a single-URL crawl, yielding CrawlEvents as pages are processed.
create_engine: Create a new crawl engine with the given configuration.
default_retry_policy: Convenience constructor: Arc<dyn RetryPolicy> for the default policy.
generate_citations: Convert markdown links to numbered citations.
in_memory_domain_state: Convenience constructor: Arc<dyn DomainStatePort> backed by an in-memory EWMA map.
interact: Execute browser actions on a single page.
map_urls: Discover all pages on a website by following links and sitemaps.
scrape: Scrape a single URL, returning extracted page data.
unlimited_budget: Convenience constructor: Arc<dyn EscalationBudget> that never blocks.
waf_rules_from_path: Load and compile rules from a TOML file on disk.
waf_rules_from_str: Load and compile rules from a TOML string.

Type Aliases§

DynAntibotStrategy: Convenience alias.
DynBypassProvider: Convenience type alias used on CrawlConfig.bypass.
DynDomainStatePort: Convenience alias for an owned, type-erased domain-state backend.
DynEscalationBudget: Convenience alias for an owned, type-erased budget.
DynRetryPolicy: Convenience alias for an owned, type-erased retry policy on crate::types::CrawlConfig.
DynWafClassifier: Convenience alias for an owned, type-erased WAF classifier.

Crate crawlberg

Crate crawlberg Copy item path

Re-exports§

Modules§

Structs§

Enums§

Traits§

Functions§

Type Aliases§

Crate crawlberg