Crate crawlex

Re-exports

pub use config::Config;
pub use config::ConfigBuilder;
pub use crawler::Crawler;
pub use error::Error;
pub use error::Result;
pub use hooks::HookDecision;
pub use hooks::HookEvent;

Modules

antibot
Phase 1 antibot/stealth module — rich vendor detection, per-session ChallengeState, and pure detect_from_* functions.
cli
config
crawler
discovery
error
Error taxonomy for crawlex.
escalation
Heuristic to decide when a FetchMethod::Auto job should be re-queued as Render because the HTTP spoof response doesn’t have the real content.
events
NDJSON event bus — public contract between the crawler runtime and any external consumer (CLI piping, SDKs, hooks).
extract
Content extraction and link filtering.
frontier
hooks
Lifecycle hooks — pluggable callbacks fired at well-known points in the crawl pipeline.
http
HTTP-layer helpers owned by the crawl-pattern scheduler (wave 1).
identity
IdentityBundle — coherent “who is the browser pretending to be?” record.
impersonate
intel
Target-scoped infrastructure-intel orchestrator (Phase B).
metrics
Performance KPIs — network timings for the HTTP path, Core Web Vitals and runtime counters for the render path. Both populated per-page; stored in page_metrics and exposed to hooks via HookContext.user_data["metrics"].
metrics_server
Minimal Prometheus scrape endpoint using a hand-rolled HTTP/1.0 responder.
policy
Policy Engine — deterministic, explainable crawl decisions.
proxy
queue
render
Render/browser path. Only compiled when cdp-backend is enabled. crawlex-mini builds without this module; callers that need runtime “render not available” errors use Error::RenderDisabled.
robots
scheduler
Render scheduler with per-host / per-origin / per-proxy / per-session inflight budgets. Sits in front of RenderPool so a single noisy origin can’t monopolise the browser or trigger rate limits upstream.
script
ScriptSpec v1 — unified AST for declarative crawl scripts.
storage
url_util
wait_strategy
Page-load wait strategy.
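
The scheduler entry above describes per-host, per-origin, per-proxy, and per-session inflight budgets. As a rough illustration of that idea only, the following is a minimal sketch of a per-key budget using just the standard library. The `InflightBudget` type and its `try_acquire`, `release`, and `in_use` methods are hypothetical names invented for this example; they are not part of the crawlex API, and the real scheduler may be structured quite differently.

```rust
use std::collections::HashMap;

/// Hypothetical per-key inflight budget (illustration only, not a crawlex type).
/// A key could be a host, an origin, a proxy, or a session identifier.
#[derive(Debug)]
struct InflightBudget {
    /// Maximum number of concurrent jobs allowed per key.
    limit: usize,
    /// Current number of inflight jobs for each key that has at least one.
    inflight: HashMap<String, usize>,
}

impl InflightBudget {
    fn new(limit: usize) -> Self {
        Self { limit, inflight: HashMap::new() }
    }

    /// Number of slots currently held for `key`.
    fn in_use(&self, key: &str) -> usize {
        self.inflight.get(key).copied().unwrap_or(0)
    }

    /// Try to reserve one slot for `key`.
    /// Returns `false` without changing state when the key is at its limit.
    fn try_acquire(&mut self, key: &str) -> bool {
        let current = self.in_use(key);
        if current >= self.limit {
            return false;
        }
        self.inflight.insert(key.to_string(), current + 1);
        true
    }

    /// Release one slot previously reserved for `key`.
    /// Releasing a key with no inflight jobs is a no-op.
    fn release(&mut self, key: &str) {
        if let Some(count) = self.inflight.get_mut(key) {
            *count = count.saturating_sub(1);
            if *count == 0 {
                // Drop idle keys so the map does not grow without bound.
                self.inflight.remove(key);
            }
        }
    }
}
```

With a limit of 2, a third `try_acquire("example.com")` is refused while calls for a different key still succeed, which is the property that keeps one noisy origin from taking every slot. A scheduler with several budget dimensions could hold one such structure per dimension and admit a job only when all of them grant a slot.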