Module utils

Application utils.

Modules§

abs
Absolute path domain handling.
adaptive_concurrency (feature: adaptive_concurrency)
AIMD-based adaptive concurrency controller.
auto_throttle (feature: auto_throttle)
Latency-based auto-throttle for polite, adaptive per-domain crawl delay.
backoff
Exponential backoff with jitter for retry logic.
coalesce (feature: request_coalesce)
Request coalescing to deduplicate concurrent in-flight requests for the same URL.
connect
Connect layer for reqwest.
css_selectors
Generic CSS selectors.
detect_system (feature: balance or disk)
CPU and memory detection to balance limitations.
dns_cache (feature: dns_cache)
DNS pre-resolution cache with TTL.
etag_cache (feature: etag_cache)
ETag / conditional-request cache for bandwidth-efficient re-crawls.
frontier (feature: priority_frontier)
Prioritized URL frontier with deduplication and optional domain round-robin.
h2_tracker (feature: h2_multiplex)
HTTP/2 multiplexing tracker for per-origin stream management.
header_utils
Utils to modify the HTTP header.
hedge (feature: hedge)
Work-stealing (hedged requests) for slow crawl requests.
html_spool (feature: balance)
Disk-backed HTML spool for memory-balanced crawling.
interner
String interner.
numa (feature: numa)
NUMA-aware thread pinning for multi-socket servers.
parallel_backends (feature: parallel_backends)
Parallel crawl backends: race alternative engines alongside the primary crawl.
rate_limiter (feature: rate_limit)
Per-domain token bucket rate limiter.
robots_cache (feature: robots_cache)
Cross-crawl robots.txt cache with TTL-based expiry.
tab_pool (feature: chrome)
Chrome tab pooling for reusing CDP tabs across page visits.
templates
Fragment templates.
trie
A trie struct.
uring_fs
Async file I/O with optional io_uring acceleration.
validation
Validate html false positives.
vitals (feature: balance)
Lock-free crawl vitals for intelligent scaling (requests, bytes, pages, tabs).
warc (feature: warc)
WARC 1.1 file writer for web archive output.
zero_copy (feature: zero_copy)
Zero-copy byte-level parsing for HTTP wire formats and protocol structures.
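
The backoff module above is described as exponential backoff with jitter. The sketch below illustrates that general technique using only the standard library; it is not the crate's implementation. The function name `backoff_delay`, its parameters, and the choice of "full jitter" (a delay drawn from zero up to the exponential ceiling) are assumptions made for illustration. The random sample is passed in by the caller so the example stays dependency-free.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of the crate's API): delay before retry
/// number `attempt` (0-based), using exponential growth with full jitter.
/// `jitter` is a caller-supplied random sample, expected in [0.0, 1.0].
fn backoff_delay(attempt: u32, base: Duration, cap: Duration, jitter: f64) -> Duration {
    // Exponential growth: base * 2^attempt, saturating instead of overflowing.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    let ceiling = base.saturating_mul(factor).min(cap);

    // Guard against NaN or out-of-range samples before scaling.
    let sample = if jitter.is_finite() { jitter.clamp(0.0, 1.0) } else { 1.0 };

    // Full jitter: pick a point between zero and the capped ceiling.
    ceiling.mul_f64(sample)
}
```

Calling `backoff_delay(2, Duration::from_millis(250), Duration::from_secs(8), 0.5)` yields half of the one-second ceiling for the third attempt. Spreading retries across the whole interval, rather than retrying at the exact exponential step, keeps many clients from retrying in lockstep.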

Structs§

APACHE_FORBIDDEN
Apache server forbidden.
AllowedDomainTypes
Allow subdomains or tlds.
CONTROLLER
Control handle for crawls.
ChromeFetchParams (feature: chrome)
Borrowed parameter bundle for chrome fetch/page methods.
ChromeHTTPReqRes (feature: chrome)
The chrome HTTP response.
CompiledCustomAntibot
Compiled custom antibot patterns for runtime matching. Built from CustomAntibotPatterns at crawl start.
EMPTY_HTML_BASIC
Empty html.
GEMINI_SEM
Semaphore for Gemini rate limiting
HttpRequestLike (feature: cache_chrome_hybrid or cache_chrome_hybrid_mem)
An HTTP request type for caching.
HttpResponse
A basic generic type that represents an HTTP response.
HttpResponseLike (feature: cache_chrome_hybrid or cache_chrome_hybrid_mem)
An HTTP response type for caching.
JsonResponse (feature: openai)
The JSON response from OpenAI.
OPEN_RESTY_FORBIDDEN
Open Resty forbidden.
PageResponse
The response of a web page.

Enums§

BasicCachePolicy
Basic cache policy.
CacheOptions
Cache options to use for the request.
FetchPageResult (feature: decentralized and headers)
Fetch a page with the headers returned.
Handler (feature: control)
Determine the action.
HeaderSource
Accepts different header types (for flexibility).
HttpVersion
Represents an HTTP version

Statics§

IGNORE_CONTENT_TYPES
Ignore the content types.

Functions§

cache_auth_token
Cache auth token.
cache_chrome_response (feature: chrome and (cache_chrome_hybrid or cache_chrome_hybrid_mem))
Cache the chrome response.
cache_skip_browser
Check if cache options indicate browser should be skipped when cached.
chrome_fulfill_headers_from_reqwest (feature: chrome)
Fulfill the headers.
chunk_idle_timeout
Per-chunk idle timeout for body streaming. If no data arrives within this duration, the stream is terminated and any partial data collected so far is returned. Prevents tarpitting / slow-drip antibot attacks. Set via SPIDER_CHUNK_IDLE_TIMEOUT_SECS (default: 30 seconds, 0 = disabled).
clean_html (feature: openai_slim_fit)
Default cleaner used by the engine.
clean_html_base
Clean the html removing css and js (base).
clean_html_full
Clean the most extra properties in the html to fit the context. Removes nav/footer, trims meta, and prunes most attributes except id/class/data-*.
clean_html_raw
Clean the html removing css and js default (raw passthrough).
clean_html_slim
Clean the HTML to slim-fit models. This removes base64 images and heavy nodes.
convert_headers (feature: cache_chrome_hybrid or headers or cookies)
Convert headers to a header map.
crawl_duration_expired
Check if the crawl duration is expired.
create_cache_key (feature: cache or cache_mem)
Create the cache key.
create_cache_key_raw (feature: cache or cache_mem)
Create the cache key from string.
create_output_path (feature: chrome)
Get the output path of a screenshot and create any parent folders if needed.
detect_anti_bot_from_body
Detect the anti-bot technology.
detect_anti_bot_from_headers
Detect from headers (optimized: minimal lookups, no allocations).
detect_anti_bot_tech_response
Detect the anti-bot used from the request.
detect_anti_bot_tech_response_custom
Detect the anti-bot used from the request, including custom user-supplied patterns.
detect_antibot_from_url
Detect antibot from url
detect_hard_forbidden_content
Detect if a page is forbidden and should not retry.
detect_open_resty_forbidden
Detect if openresty hard 403 is forbidden and should not retry.
emit_log (feature: tracing)
Emit a log info event.
emit_log_shutdown (feature: tracing)
Emit a log info event.
fetch_page (feature: decentralized)
Perform a network request to a resource extracting all content as text.
fetch_page_and_headers (feature: decentralized and headers)
Perform a network request to a resource with the response headers.
fetch_page_html (feature: fs and chrome)
Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome (feature: chrome)
Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome_base (feature: chrome)
Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome_seeded (feature: chrome)
Perform a network request to a resource extracting all content as text streaming via chrome seeded.
fetch_page_html_raw
Perform a network request to a resource extracting all content streaming.
fetch_page_html_raw_cached
Perform a network request to a resource and return a cached response immediately when available.
fetch_page_html_raw_conditional (feature: etag_cache)
Perform a conditional network request using cached ETag / Last-Modified validators.
fetch_page_html_raw_only_html
Perform a network request to a resource extracting all content streaming.
fetch_page_html_seeded (feature: fs and chrome)
Seeded variant of fetch_page_html for the fs + chrome build.
fetch_page_html_spider_cloud (feature: spider_cloud)
Fetch a single page via the spider.cloud REST API.
fetch_page_html_with_fallback (feature: spider_cloud)
Fetch with spider.cloud fallback.
flip_http_https
Flip http -> https protocols.
gemini_request (feature: gemini and cache_gemini)
Perform a request to Gemini Chat with caching.
gemini_request_base (feature: gemini)
Perform a request to Gemini Chat.
get_cached_url (feature: cache or cache_mem or chrome_remote_cache)
Perform a network request to a resource extracting all content as text streaming via chrome.
get_cached_url_base (feature: cache or cache_mem)
Perform a network request to a resource extracting all content as text streaming via chrome.
get_cookies (feature: cookies)
The response cookies mapped. This does nothing without the cookies feature flag enabled.
get_final_redirect (feature: chrome)
Get the base redirect for the website.
get_last_redirect (feature: chrome)
Check if the URL matches the last item in a redirect chain for chrome CDP.
get_last_segment
Get the last segment path.
get_semaphore (feature: balance)
Return the semaphore that should be used. Takes the worse of CPU and process memory pressure to drive throttling.
handle_ai_data (feature: openai)
Extract to JsonResponse struct. This does nothing without the ‘openai’ feature flag.
handle_extra_ai_data (feature: openai)
Handle extra OpenAI data used. This does nothing without the ‘openai’ feature flag.
handle_gemini_credits (feature: gemini)
Handle the Gemini credits used.
handle_openai_credits (feature: openai)
Handle the OpenAI credits used. This does nothing without the ‘openai’ feature flag.
handle_response_bytes
Handle the response bytes
handle_response_bytes_writer
Handle the response bytes writing links while crawling
is_cacheable_body_empty
Returns true if the body should NOT be cached (empty, near-empty, or known-bad HTML).
is_html_content_check
Check if the content is HTML.
log
Log to console if configuration verbose.
networking_capable
Determine if networking is capable for a URL. Uses first-byte dispatch to avoid 4 redundant prefix scans.
openai_request (feature: openai and cache_openai)
Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
openai_request_base (feature: openai)
Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
page_wait (feature: chrome)
Wait for page events in two phases:
pause (feature: control)
Pause a running crawl for a target website. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
perform_chrome_http_request (feature: chrome)
Perform a chrome HTTP request.
perform_screenshot (feature: chrome)
Perform a screenshot shortcut.
perform_smart_mouse_movement (feature: real_browser and chrome)
Generate random mouse movement. This does nothing without the ‘real_browser’ flag enabled.
prepare_url
Prepare the url for parsing if it fails. Use this method if the url does not start with http or https.
put_hybrid_cache (feature: cache_chrome_hybrid or cache_chrome_hybrid_mem)
Store the page in the cache for reuse across HTTP requests.
reset (feature: control)
Reset a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
resume (feature: control)
Resume a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
run_navigate_or_content_set_core (feature: chrome and non-chrome_remote_cache)
Core logic: either set document content or navigate.
run_openai_request (feature: chrome and openai)
Use OpenAI to extend the crawl. This does nothing without the ‘openai’ feature flag.
shutdown (feature: control)
Shutdown a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
split_hashset_round_robin
Consumes set and returns (left, right), where left are items matching pred.
strip_www
Strip the www. prefix from a URL’s host, if present. Returns None if the URL has no www. prefix. Example: https://www.docs.github.com/foo → https://docs.github.com/foo
wait_for_dom (feature: chrome)
Wait for the DOM to finish updating the target selector.
wait_for_event (feature: chrome)
Wait for an event with a timeout.
wait_for_selector (feature: chrome)
Wait for a selector.
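
The behaviour documented for strip_www (drop a leading www. from the host, return None when there is nothing to strip) can be sketched on plain strings as follows. This is an illustrative stand-in, not the crate's function: the name `strip_www_sketch` and its `&str` to `Option<String>` signature are assumptions, since this listing does not show the real signature. It also does not handle URLs with userinfo before the host.

```rust
/// Hypothetical string-based stand-in for the documented behaviour;
/// the crate's own `strip_www` signature is not shown in this listing.
fn strip_www_sketch(url: &str) -> Option<String> {
    // Split into scheme and remainder; no "://" means nothing to do.
    let (scheme, rest) = url.split_once("://")?;

    // Only a host that begins with "www." qualifies; otherwise return None.
    let stripped = rest.strip_prefix("www.")?;

    // Reassemble the URL without the prefix, leaving path and query intact.
    Some(format!("{scheme}://{stripped}"))
}
```

Returning `None` rather than the unchanged URL lets a caller tell "already bare" apart from "rewritten", which is useful when the stripped form is only tried as a fallback.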