Module utils

Application utils.

Modules§

abs: Absolute path domain handling.
adaptive_concurrency (feature: adaptive_concurrency): AIMD-based adaptive concurrency controller.
auto_throttle (feature: auto_throttle): Latency-based auto-throttle for polite, adaptive crawl delay per domain.
backoff: Exponential backoff with jitter for retry logic.
coalesce (feature: request_coalesce): Request coalescing to deduplicate concurrent in-flight requests for the same URL.
connect: Connect layer for reqwest.
css_selectors: Generic CSS selectors.
detect_system (features: balance or disk): CPU and memory detection to balance limitations.
dns_cache (feature: dns_cache): DNS pre-resolution cache with TTL.
etag_cache (feature: etag_cache): ETag / conditional-request cache for bandwidth-efficient re-crawls.
frontier (feature: priority_frontier): Prioritized URL frontier with deduplication and optional domain round-robin.
h2_tracker (feature: h2_multiplex): HTTP/2 multiplexing tracker for per-origin stream management.
header_utils: Utils to modify the HTTP header.
hedge (feature: hedge): Work-stealing (hedged requests) for slow crawl requests.
html_spool (feature: balance): Disk-backed HTML spool for memory-balanced crawling.
interner: String interner.
numa (feature: numa): NUMA-aware thread pinning for multi-socket servers.
parallel_backends (feature: parallel_backends): Parallel crawl backends that race alternative engines alongside the primary crawl.
rate_limiter (feature: rate_limit): Per-domain token bucket rate limiter.
robots_cache (feature: robots_cache): Cross-crawl robots.txt cache with TTL-based expiry.
tab_pool (feature: chrome): Chrome tab pooling for reusing CDP tabs across page visits.
templates: Fragment templates.
trie: A trie struct.
uring_fs: Async file I/O with optional io_uring acceleration.
validation: Validate html false positives.
vitals (feature: balance): Lock-free crawl vitals for intelligent scaling (requests, bytes, pages, tabs).
warc (feature: warc): WARC 1.1 file writer for web archive output.
zero_copy (feature: zero_copy): Zero-copy byte-level parsing for HTTP wire formats and protocol structures.
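
The adaptive_concurrency entry above names AIMD (additive increase, multiplicative decrease) without showing the rule. The following is a minimal sketch of that rule only, not the crate's controller: the type `AimdLimit`, its method names, the step of 1 and the halving factor are all assumptions made for illustration.

```rust
/// Hypothetical AIMD limit tracker (illustrative; not the crate's API).
/// On success the limit grows by a fixed step; on failure it is scaled down.
#[derive(Debug, Clone)]
pub struct AimdLimit {
    limit: f64,
    min: f64,
    max: f64,
    increase: f64, // additive step applied on success (assumed: 1.0)
    decrease: f64, // multiplicative factor applied on failure (assumed: 0.5)
}

impl AimdLimit {
    /// Build a tracker whose limit always stays within `min..=max` (min is at least 1).
    pub fn new(initial: usize, min: usize, max: usize) -> Self {
        let min = min.max(1) as f64;
        let max = (max as f64).max(min);
        Self {
            limit: (initial as f64).clamp(min, max),
            min,
            max,
            increase: 1.0,
            decrease: 0.5,
        }
    }

    /// Additive increase: one more concurrent request, capped at `max`.
    pub fn on_success(&mut self) {
        self.limit = (self.limit + self.increase).min(self.max);
    }

    /// Multiplicative decrease: scale the limit down, floored at `min`.
    pub fn on_failure(&mut self) {
        self.limit = (self.limit * self.decrease).max(self.min);
    }

    /// Current limit, truncated to a whole number of in-flight requests.
    pub fn limit(&self) -> usize {
        self.limit as usize
    }
}
```

A caller would typically report `on_success` after a timely response and `on_failure` after a timeout or throttling response, then size its semaphore or worker pool from `limit()`. What counts as a failure signal is left open here.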

Structs§

APACHE_FORBIDDEN: Apache server forbidden.
AllowedDomainTypes: Allow subdomains or tlds.
CONTROLLER: Control handle for crawls.
ChromeFetchParams (feature: chrome): Borrowed parameter bundle for chrome fetch/page methods.
ChromeHTTPReqRes (feature: chrome): The chrome HTTP response.
CompiledCustomAntibot: Compiled custom antibot patterns for runtime matching. Built from CustomAntibotPatterns at crawl start.
EMPTY_HTML_BASIC: Empty html.
GEMINI_SEM: Semaphore for Gemini rate limiting.
HttpRequestLike (features: cache_chrome_hybrid or cache_chrome_hybrid_mem): An HTTP request type for caching.
HttpResponse: A basic generic type that represents an HTTP response.
HttpResponseLike (features: cache_chrome_hybrid or cache_chrome_hybrid_mem): An HTTP response type for caching.
JsonResponse (feature: openai): The JSON response from OpenAI.
OPEN_RESTY_FORBIDDEN: Open Resty forbidden.
PageResponse: The response of a web page.

Enums§

BasicCachePolicy: Basic cache policy.
CacheOptions: Cache options to use for the request.
FetchPageResult (features: decentralized and headers): Fetch a page with the headers returned.
Handler (feature: control): Determine the action.
HeaderSource: Accepts different header types (for flexibility).
HttpVersion: Represents an HTTP version.

Statics§

IGNORE_CONTENT_TYPES: Ignore the content types.

Functions§

cache_auth_token: Cache auth token.
cache_chrome_response (features: chrome and (cache_chrome_hybrid or cache_chrome_hybrid_mem)): Cache the chrome response.
cache_skip_browser: Check if cache options indicate browser should be skipped when cached.
chrome_fulfill_headers_from_reqwest (feature: chrome): Fulfill the headers.
chunk_idle_timeout: Per-chunk idle timeout for body streaming. If no data arrives within this duration, the stream is terminated and any partial data collected so far is returned. Prevents tarpitting / slow-drip antibot attacks. Set via SPIDER_CHUNK_IDLE_TIMEOUT_SECS (default: 30 seconds, 0 = disabled).
clean_html (feature: openai_slim_fit): Default cleaner used by the engine.
clean_html_base: Clean the html removing css and js (base).
clean_html_full: Clean the most extra properties in the html to fit the context. Removes nav/footer, trims meta, and prunes most attributes except id/class/data-*.
clean_html_raw: Clean the html removing css and js default (raw passthrough).
clean_html_slim: Clean the HTML to slim-fit models. This removes base64 images and heavy nodes.
convert_headers (features: cache_chrome_hybrid or headers or cookies): Convert headers to a header map.
crawl_duration_expired: Check if the crawl duration is expired.
create_cache_key (features: cache or cache_mem): Create the cache key.
create_cache_key_raw (features: cache or cache_mem): Create the cache key from a string.
create_output_path (feature: chrome): Get the output path of a screenshot and create any parent folders if needed.
detect_anti_bot_from_body: Detect the anti-bot technology.
detect_anti_bot_from_headers: Detect from headers (optimized: minimal lookups, no allocations).
detect_anti_bot_tech_response: Detect the anti-bot used from the request.
detect_anti_bot_tech_response_custom: Detect the anti-bot used from the request, including custom user-supplied patterns.
detect_antibot_from_url: Detect the anti-bot from the URL.
detect_hard_forbidden_content: Detect if a page is forbidden and should not retry.
detect_open_resty_forbidden: Detect if openresty hard 403 is forbidden and should not retry.
emit_log (feature: tracing): Emit a log info event.
emit_log_shutdown (feature: tracing): Emit a log info event.
fetch_page (feature: decentralized): Perform a network request to a resource extracting all content as text.
fetch_page_and_headers (features: decentralized and headers): Perform a network request to a resource with the response headers.
fetch_page_html (features: fs and chrome): Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome (feature: chrome): Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome_base (feature: chrome): Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome_seeded (feature: chrome): Perform a network request to a resource extracting all content as text streaming via chrome seeded.
fetch_page_html_raw: Perform a network request to a resource extracting all content streaming.
fetch_page_html_raw_cached: Perform a network request to a resource and return a cached response immediately when available.
fetch_page_html_raw_conditional (feature: etag_cache): Perform a conditional network request using cached ETag / Last-Modified validators.
fetch_page_html_raw_only_html: Perform a network request to a resource extracting all content streaming.
fetch_page_html_seeded (features: fs and chrome): Seeded variant of fetch_page_html for the fs + chrome build.
fetch_page_html_spider_cloud (feature: spider_cloud): Fetch a single page via the spider.cloud REST API.
fetch_page_html_with_fallback (feature: spider_cloud): Fetch with spider.cloud fallback.
flip_http_https: Flip http -> https protocols.
gemini_request (features: gemini and cache_gemini): Perform a request to Gemini Chat with caching.
gemini_request_base (feature: gemini): Perform a request to Gemini Chat.
get_cached_url (features: cache or cache_mem or chrome_remote_cache): Perform a network request to a resource extracting all content as text streaming via chrome.
get_cached_url_base (features: cache or cache_mem): Perform a network request to a resource extracting all content as text streaming via chrome.
get_cookies (feature: cookies): The response cookies mapped. This does nothing without the cookies feature flag enabled.
get_final_redirect (feature: chrome): Get the base redirect for the website.
get_last_redirect (feature: chrome): Check if the URL matches the last item in a redirect chain for chrome CDP.
get_last_segment: Get the last segment path.
get_semaphore (feature: balance): Return the semaphore that should be used. Takes the worse of CPU and process memory pressure to drive throttling.
handle_ai_data (feature: openai): Extract to JsonResponse struct. This does nothing without the ‘openai’ feature flag.
handle_extra_ai_data (feature: openai): Handle extra OpenAI data used. This does nothing without the ‘openai’ feature flag.
handle_gemini_credits (feature: gemini): Handle the Gemini credits used.
handle_openai_credits (feature: openai): Handle the OpenAI credits used. This does nothing without the ‘openai’ feature flag.
handle_response_bytes: Handle the response bytes
handle_response_bytes_writer: Handle the response bytes writing links while crawling
is_cacheable_body_empty: Returns true if the body should NOT be cached (empty, near-empty, or known-bad HTML).
is_html_content_check: Check if the content is HTML.
log: Log to console if configuration verbose.
networking_capable: Determine if networking is capable for a URL. Uses first-byte dispatch to avoid 4 redundant prefix scans.
openai_request (features: openai and cache_openai): Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
openai_request_base (feature: openai): Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
page_wait (feature: chrome): Wait for page events in two phases:
pause (feature: control): Pause a running crawl of a target website. The crawl_id is prepended directly to the domain and required if set. Example: d22323edsd-https://mydomain.com
perform_chrome_http_request (feature: chrome): Perform a chrome HTTP request.
perform_screenshot (feature: chrome): Perform a screenshot shortcut.
perform_smart_mouse_movement (features: real_browser and chrome): Generate random mouse movement. This does nothing without the ‘real_browser’ flag enabled.
prepare_url: Prepare the url for parsing if it fails. Use this method if the url does not start with http or https.
put_hybrid_cache (features: cache_chrome_hybrid or cache_chrome_hybrid_mem): Store the page in the cache to be reused across HTTP requests.
reset (feature: control): Reset a target website crawl. The crawl_id is prepended directly to the domain and required if set. Example: d22323edsd-https://mydomain.com
resume (feature: control): Resume a target website crawl. The crawl_id is prepended directly to the domain and required if set. Example: d22323edsd-https://mydomain.com
run_navigate_or_content_set_core (features: chrome and not chrome_remote_cache): Core logic: either set document content or navigate.
run_openai_request (features: chrome and openai): Use OpenAI to extend the crawl. This does nothing without the ‘openai’ feature flag.
shutdown (feature: control): Shut down a target website crawl. The crawl_id is prepended directly to the domain and required if set. Example: d22323edsd-https://mydomain.com
split_hashset_round_robin: Consumes set and returns (left, right), where left are items matching pred.
strip_www: Strip the www. prefix from a URL’s host, if present. Returns None if the URL has no www. prefix. Example: https://www.docs.github.com/foo → https://docs.github.com/foo
wait_for_dom (feature: chrome): Wait for the DOM to finish updating the target selector.
wait_for_event (feature: chrome): Wait for an event with a timeout.
wait_for_selector (feature: chrome): Wait for a selector.
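
Two of the entries above describe behavior precisely enough to sketch: strip_www (returns None when there is no www. prefix) and chunk_idle_timeout (default 30 seconds, 0 = disabled). The helpers below are hypothetical stand-ins written from those descriptions only. Their names and signatures are assumptions, and they are not the crate's implementations. Falling back to the default on an unparsable value is also an assumption.

```rust
use std::time::Duration;

/// Hypothetical sketch of the documented `strip_www` behavior.
/// Returns `None` when the host does not start with `www.`.
/// Assumes a plain `scheme://host/...` URL with no userinfo part.
fn strip_www_sketch(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let stripped = rest.strip_prefix("www.")?;
    Some(format!("{scheme}://{stripped}"))
}

/// Hypothetical sketch of the documented chunk idle timeout rule.
/// `raw` stands in for the value of SPIDER_CHUNK_IDLE_TIMEOUT_SECS, if set.
/// `None` as a result means the timeout is disabled.
fn parse_idle_timeout_sketch(raw: Option<&str>) -> Option<Duration> {
    const DEFAULT_SECS: u64 = 30;
    // Assumption: a missing or unparsable value falls back to the default.
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_SECS);
    if secs == 0 {
        None // 0 = disabled
    } else {
        Some(Duration::from_secs(secs))
    }
}
```

In a real program the second helper would be fed from `std::env::var("SPIDER_CHUNK_IDLE_TIMEOUT_SECS").ok().as_deref()`. It takes the raw value as a parameter here so the rule can be checked without touching the process environment.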
