Skip to main content

Module utils

Module utils 

Source
Expand description

Application utils.

Modules§

abs
Absolute path domain handling.
adaptive_concurrencyadaptive_concurrency
AIMD-based adaptive concurrency controller. AIMD-based adaptive concurrency controller.
auto_throttleauto_throttle
Latency-based auto-throttle for adaptive crawl delay per domain. Latency-based auto-throttle for polite, adaptive crawling.
backoff
Exponential backoff with jitter for retry logic.
coalescerequest_coalesce
Request coalescing to dedup concurrent in-flight requests. Request coalescing to deduplicate concurrent in-flight requests for the same URL.
connect
Connect layer for reqwest.
css_selectors
Generic CSS selectors.
detect_systembalance or disk
CPU and Memory detection to balance limitations.
dns_cachedns_cache
DNS pre-resolution cache with TTL.
etag_cacheetag_cache
ETag / conditional-request cache for bandwidth-efficient re-crawls. ETag / conditional-request cache for bandwidth-efficient re-crawls.
frontierpriority_frontier
Prioritized URL frontier with dedup and optional domain round-robin. Prioritized URL frontier with deduplication.
h2_trackerh2_multiplex
HTTP/2 multiplexing tracker for per-origin stream management.
header_utils
Utils to modify the HTTP header.
hedgehedge
Work-stealing (hedged requests) for slow crawl requests.
html_spoolbalance
Disk-backed HTML spool for memory-balanced crawling. Disk-backed HTML spool for memory-balanced crawling.
interner
String interner.
numanuma
NUMA-aware thread pinning for multi-socket servers. NUMA-aware thread pinning for multi-socket servers.
parallel_backendsparallel_backends
Parallel crawl backends — race LightPanda / Servo alongside the primary crawl.
rate_limiterrate_limit
Per-domain token bucket rate limiter. Per-domain token bucket rate limiter.
robots_cacherobots_cache
Cross-crawl robots.txt cache with TTL-based expiry.
tab_poolchrome
Chrome tab pooling for reusing CDP tabs across page visits.
templates
Fragment templates.
trie
A trie struct.
uring_fs
Async file I/O with optional io_uring acceleration. Async file I/O with optional io_uring acceleration.
validation
Validate html false positives.
warcwarc
WARC 1.1 file writer for web archive output. WARC 1.1 writer for web archive output.
zero_copyzero_copy
Zero-copy byte-level parsing for HTTP wire formats and protocol structures. Zero-copy byte-level parsing for HTTP wire formats and protocol structures.

Structs§

APACHE_FORBIDDEN
Apache server forbidden.
AllowedDomainTypes
Allow subdomains or tlds.
CONTROLLER
control handle for crawls
ChromeHTTPReqReschrome
The chrome HTTP response.
CompiledCustomAntibot
Compiled custom antibot patterns for runtime matching. Built from CustomAntibotPatterns at crawl start.
EMPTY_HTML_BASIC
Empty html.
GEMINI_SEM
Semaphore for Gemini rate limiting
HttpRequestLikecache_chrome_hybrid or cache_chrome_hybrid_mem
A HTTP request type for caching.
HttpResponse
A basic generic type that represents an HTTP response.
HttpResponseLikecache_chrome_hybrid or cache_chrome_hybrid_mem
A HTTP response type for caching.
JsonResponseopenai
The json response from OpenAI.
OPEN_RESTY_FORBIDDEN
Open Resty forbidden.
PageResponse
The response of a web page.

Enums§

BasicCachePolicy
Basic cache policy.
CacheOptions
Cache options to use for the request.
FetchPageResultdecentralized and headers
Fetch a page with the headers returned.
Handlercontrol
determine action
HeaderSource
Accepts different header types (for flexibility).
HttpVersion
Represents an HTTP version

Statics§

IGNORE_CONTENT_TYPES
Ignore the content types.

Functions§

cache_auth_token
Cache auth token.
cache_chrome_responsechrome and (cache_chrome_hybrid or cache_chrome_hybrid_mem)
Cache the chrome response
cache_skip_browser
Check if cache options indicate browser should be skipped when cached.
chrome_fulfill_headers_from_reqwestchrome
Fullfil the headers.
chunk_idle_timeout
Per-chunk idle timeout for body streaming. If no data arrives within this duration, the stream is terminated and any partial data collected so far is returned. Prevents tarpitting / slow-drip antibot attacks. Set via SPIDER_CHUNK_IDLE_TIMEOUT_SECS (default: 30 seconds, 0 = disabled).
clean_htmlopenai_slim_fit
Default cleaner used by the engine.
clean_html_base
Clean the html removing css and js (base).
clean_html_full
Clean the most extra properties in the html to fit the context. Removes nav/footer, trims meta, and prunes most attributes except id/class/data-*.
clean_html_raw
Clean the html removing css and js default (raw passthrough).
clean_html_slim
Clean the HTML to slim-fit models. This removes base64 images and heavy nodes.
convert_headerscache_chrome_hybrid or headers or cookies
Convert headers to header map
crawl_duration_expired
Check if the crawl duration is expired.
create_cache_keycache or cache_mem
Create the cache key.
create_cache_key_rawcache or cache_mem
Create the cache key from string.
create_output_pathchrome
Get the output path of a screenshot and create any parent folders if needed.
detect_anti_bot_from_body
Detect the anti-bot technology.
detect_anti_bot_from_headers
Detect from headers (optimized: minimal lookups, no allocations).
detect_anti_bot_tech_response
Detect the anti-bot used from the request.
detect_anti_bot_tech_response_custom
Detect the anti-bot used from the request, including custom user-supplied patterns.
detect_antibot_from_url
Detect antibot from url
detect_hard_forbidden_content
Detect if a page is forbidden and should not retry.
detect_open_resty_forbidden
Detect if openresty hard 403 is forbidden and should not retry.
emit_logtracing
Emit a log info event.
emit_log_shutdowntracing
Emit a log info event.
fetch_pagedecentralized
Perform a network request to a resource extracting all content as text.
fetch_page_and_headersdecentralized and headers
Perform a network request to a resource with the response headers..
fetch_page_htmlfs and chrome
Perform a network request to a resource extracting all content as text streaming. Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chromechrome
Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome_basechrome
Perform a network request to a resource extracting all content as text streaming via chrome.
fetch_page_html_chrome_seededchrome
Perform a network request to a resource extracting all content as text streaming via chrome seeded.
fetch_page_html_raw
Perform a network request to a resource extracting all content streaming.
fetch_page_html_raw_cached
Perform a network request to a resource and return a cached response immediately when available.
fetch_page_html_raw_conditionaletag_cache
Perform a conditional network request using cached ETag / Last-Modified validators.
fetch_page_html_raw_only_html
Perform a network request to a resource extracting all content streaming.
fetch_page_html_spider_cloudspider_cloud
Fetch a single page via the spider.cloud REST API.
fetch_page_html_with_fallbackspider_cloud
Fetch with spider.cloud fallback.
flip_http_https
Flip http -> https protocols.
gemini_requestgemini and cache_gemini
Perform a request to Gemini Chat with caching.
gemini_request_basegemini
Perform a request to Gemini Chat.
get_cached_urlcache or cache_mem or chrome_remote_cache
Perform a network request to a resource extracting all content as text streaming via chrome.
get_cached_url_basecache or cache_mem
Perform a network request to a resource extracting all content as text streaming via chrome.
get_cookiescookies
The response cookies mapped. This does nothing without the cookies feature flag enabled.
get_final_redirectchrome
Get the base redirect for the website.
get_last_redirectchrome
Check if url matches the last item in a redirect chain for chrome CDP
get_last_segment
Get the last segment path.
get_semaphorebalance
Return the semaphore that should be used. Takes the worse of CPU and process memory pressure to drive throttling.
handle_ai_dataopenai
Extract to JsonResponse struct. This does nothing without ‘openai’ feature flag.
handle_extra_ai_dataopenai
Handle extra OpenAI data used. This does nothing without ‘openai’ feature flag.
handle_gemini_creditsgemini
Handle the Gemini credits used.
handle_openai_creditsopenai
Handle the OpenAI credits used. This does nothing without ‘openai’ feature flag.
handle_response_bytes
Handle the response bytes
handle_response_bytes_writer
Handle the response bytes writing links while crawling
is_cacheable_body_empty
Returns true if the body should NOT be cached (empty, near-empty, or known-bad HTML).
is_html_content_check
Check if the content is HTML.
log
Log to console if configuration verbose.
networking_capable
Determine if networking is capable for a URL.
openai_requestopenai and cache_openai
Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
openai_request_baseopenai
Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
page_waitchrome
Wait for page events.
pausecontrol
Pause a target website running crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
perform_chrome_http_requestchrome
Perform a chrome http request.
perform_screenshotchrome
Perform a screenshot shortcut.
perform_smart_mouse_movementreal_browser and chrome
Generate random mouse movement. This does nothing without the ‘real_browser’ flag enabled.
prepare_url
Prepare the url for parsing if it fails. Use this method if the url does not start with http or https.
put_hybrid_cachecache_chrome_hybrid or cache_chrome_hybrid_mem
Store the page to cache to be re-used across HTTP request.
resetcontrol
Reset a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
resumecontrol
Resume a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
run_navigate_or_content_set_corechrome and non-chrome_remote_cache
Core logic: either set document content or navigate.
run_openai_requestchrome and openai
Use OpenAI to extend the crawl. This does nothing without ‘openai’ feature flag.
shutdowncontrol
Shutdown a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
split_hashset_round_robin
Consumes set and returns (left, right), where left are items matching pred.
strip_www
Strip the www. prefix from a URL’s host, if present. Returns None if the URL has no www. prefix. Example: https://www.docs.github.com/foohttps://docs.github.com/foo
wait_for_domchrome
wait for dom to finish updating target selector
wait_for_eventchrome
wait for event with timeout
wait_for_selectorchrome
wait for a selector