Expand description
Application utils.
Modules§
- abs
- Absolute path domain handling.
- adaptive_
concurrency adaptive_concurrency - AIMD-based adaptive concurrency controller. AIMD-based adaptive concurrency controller.
- auto_
throttle auto_throttle - Latency-based auto-throttle for adaptive crawl delay per domain. Latency-based auto-throttle for polite, adaptive crawling.
- backoff
- Exponential backoff with jitter for retry logic.
- coalesce
request_coalesce - Request coalescing to dedup concurrent in-flight requests. Request coalescing to deduplicate concurrent in-flight requests for the same URL.
- connect
- Connect layer for reqwest.
- css_
selectors - Generic CSS selectors.
- detect_
system balanceordisk - CPU and Memory detection to balance limitations.
- dns_
cache dns_cache - DNS pre-resolution cache with TTL.
- etag_
cache etag_cache - ETag / conditional-request cache for bandwidth-efficient re-crawls. ETag / conditional-request cache for bandwidth-efficient re-crawls.
- frontier
priority_frontier - Prioritized URL frontier with dedup and optional domain round-robin. Prioritized URL frontier with deduplication.
- h2_
tracker h2_multiplex - HTTP/2 multiplexing tracker for per-origin stream management.
- header_
utils - Utils to modify the HTTP header.
- hedge
hedge - Work-stealing (hedged requests) for slow crawl requests.
- html_
spool balance - Disk-backed HTML spool for memory-balanced crawling. Disk-backed HTML spool for memory-balanced crawling.
- interner
- String interner.
- numa
numa - NUMA-aware thread pinning for multi-socket servers. NUMA-aware thread pinning for multi-socket servers.
- parallel_
backends parallel_backends - Parallel crawl backends — race LightPanda / Servo alongside the primary crawl.
- rate_
limiter rate_limit - Per-domain token bucket rate limiter. Per-domain token bucket rate limiter.
- robots_
cache robots_cache - Cross-crawl robots.txt cache with TTL-based expiry.
- tab_
pool chrome - Chrome tab pooling for reusing CDP tabs across page visits.
- templates
- Fragment templates.
- trie
- A trie struct.
- uring_
fs - Async file I/O with optional io_uring acceleration. Async file I/O with optional io_uring acceleration.
- validation
- Validate html false positives.
- warc
warc - WARC 1.1 file writer for web archive output. WARC 1.1 writer for web archive output.
- zero_
copy zero_copy - Zero-copy byte-level parsing for HTTP wire formats and protocol structures. Zero-copy byte-level parsing for HTTP wire formats and protocol structures.
Structs§
- APACHE_
FORBIDDEN - Apache server forbidden.
- Allowed
Domain Types - Allow subdomains or tlds.
- CONTROLLER
- control handle for crawls
- ChromeHTTP
ReqRes chrome - The chrome HTTP response.
- Compiled
Custom Antibot - Compiled custom antibot patterns for runtime matching.
Built from
CustomAntibotPatternsat crawl start. - EMPTY_
HTML_ BASIC - Empty html.
- GEMINI_
SEM - Semaphore for Gemini rate limiting
- Http
Request Like cache_chrome_hybridorcache_chrome_hybrid_mem - A HTTP request type for caching.
- Http
Response - A basic generic type that represents an HTTP response.
- Http
Response Like cache_chrome_hybridorcache_chrome_hybrid_mem - A HTTP response type for caching.
- Json
Response openai - The json response from OpenAI.
- OPEN_
RESTY_ FORBIDDEN - Open Resty forbidden.
- Page
Response - The response of a web page.
Enums§
- Basic
Cache Policy - Basic cache policy.
- Cache
Options - Cache options to use for the request.
- Fetch
Page Result decentralizedandheaders - Fetch a page with the headers returned.
- Handler
control - determine action
- Header
Source - Accepts different header types (for flexibility).
- Http
Version - Represents an HTTP version
Statics§
- IGNORE_
CONTENT_ TYPES - Ignore the content types.
Functions§
- cache_
auth_ token - Cache auth token.
- cache_
chrome_ response chromeand (cache_chrome_hybridorcache_chrome_hybrid_mem) - Cache the chrome response
- cache_
skip_ browser - Check if cache options indicate browser should be skipped when cached.
- chrome_
fulfill_ headers_ from_ reqwest chrome - Fullfil the headers.
- chunk_
idle_ timeout - Per-chunk idle timeout for body streaming. If no data arrives within this duration, the stream is terminated and any partial data collected so far is returned. Prevents tarpitting / slow-drip antibot attacks. Set via SPIDER_CHUNK_IDLE_TIMEOUT_SECS (default: 30 seconds, 0 = disabled).
- clean_
html openai_slim_fit - Default cleaner used by the engine.
- clean_
html_ base - Clean the html removing css and js (base).
- clean_
html_ full - Clean the most extra properties in the html to fit the context. Removes nav/footer, trims meta, and prunes most attributes except id/class/data-*.
- clean_
html_ raw - Clean the html removing css and js default (raw passthrough).
- clean_
html_ slim - Clean the HTML to slim-fit models. This removes base64 images and heavy nodes.
- convert_
headers cache_chrome_hybridorheadersorcookies - Convert headers to header map
- crawl_
duration_ expired - Check if the crawl duration is expired.
- create_
cache_ key cacheorcache_mem - Create the cache key.
- create_
cache_ key_ raw cacheorcache_mem - Create the cache key from string.
- create_
output_ path chrome - Get the output path of a screenshot and create any parent folders if needed.
- detect_
anti_ bot_ from_ body - Detect the anti-bot technology.
- detect_
anti_ bot_ from_ headers - Detect from headers (optimized: minimal lookups, no allocations).
- detect_
anti_ bot_ tech_ response - Detect the anti-bot used from the request.
- detect_
anti_ bot_ tech_ response_ custom - Detect the anti-bot used from the request, including custom user-supplied patterns.
- detect_
antibot_ from_ url - Detect antibot from url
- detect_
hard_ forbidden_ content - Detect if a page is forbidden and should not retry.
- detect_
open_ resty_ forbidden - Detect if openresty hard 403 is forbidden and should not retry.
- emit_
log tracing - Emit a log info event.
- emit_
log_ shutdown tracing - Emit a log info event.
- fetch_
page decentralized - Perform a network request to a resource extracting all content as text.
- fetch_
page_ and_ headers decentralizedandheaders - Perform a network request to a resource with the response headers..
- fetch_
page_ html fsandchrome - Perform a network request to a resource extracting all content as text streaming. Perform a network request to a resource extracting all content as text streaming via chrome.
- fetch_
page_ html_ chrome chrome - Perform a network request to a resource extracting all content as text streaming via chrome.
- fetch_
page_ html_ chrome_ base chrome - Perform a network request to a resource extracting all content as text streaming via chrome.
- fetch_
page_ html_ chrome_ seeded chrome - Perform a network request to a resource extracting all content as text streaming via chrome seeded.
- fetch_
page_ html_ raw - Perform a network request to a resource extracting all content streaming.
- fetch_
page_ html_ raw_ cached - Perform a network request to a resource and return a cached response immediately when available.
- fetch_
page_ html_ raw_ conditional etag_cache - Perform a conditional network request using cached ETag / Last-Modified validators.
- fetch_
page_ html_ raw_ only_ html - Perform a network request to a resource extracting all content streaming.
- fetch_
page_ html_ spider_ cloud spider_cloud - Fetch a single page via the spider.cloud REST API.
- fetch_
page_ html_ with_ fallback spider_cloud - Fetch with spider.cloud fallback.
- flip_
http_ https - Flip http -> https protocols.
- gemini_
request geminiandcache_gemini - Perform a request to Gemini Chat with caching.
- gemini_
request_ base gemini - Perform a request to Gemini Chat.
- get_
cached_ url cacheorcache_memorchrome_remote_cache - Perform a network request to a resource extracting all content as text streaming via chrome.
- get_
cached_ url_ base cacheorcache_mem - Perform a network request to a resource extracting all content as text streaming via chrome.
- get_
cookies cookies - The response cookies mapped. This does nothing without the cookies feature flag enabled.
- get_
final_ redirect chrome - Get the base redirect for the website.
- get_
last_ redirect chrome - Check if url matches the last item in a redirect chain for chrome CDP
- get_
last_ segment - Get the last segment path.
- get_
semaphore balance - Return the semaphore that should be used. Takes the worse of CPU and process memory pressure to drive throttling.
- handle_
ai_ data openai - Extract to JsonResponse struct. This does nothing without ‘openai’ feature flag.
- handle_
extra_ ai_ data openai - Handle extra OpenAI data used. This does nothing without ‘openai’ feature flag.
- handle_
gemini_ credits gemini - Handle the Gemini credits used.
- handle_
openai_ credits openai - Handle the OpenAI credits used. This does nothing without ‘openai’ feature flag.
- handle_
response_ bytes - Handle the response bytes
- handle_
response_ bytes_ writer - Handle the response bytes writing links while crawling
- is_
cacheable_ body_ empty - Returns true if the body should NOT be cached (empty, near-empty, or known-bad HTML).
- is_
html_ content_ check - Check if the content is HTML.
- log
- Log to console if configuration verbose.
- networking_
capable - Determine if networking is capable for a URL.
- openai_
request openaiandcache_openai - Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
- openai_
request_ base openai - Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
- page_
wait chrome - Wait for page events.
- pause
control - Pause a target website running crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
- perform_
chrome_ http_ request chrome - Perform a chrome http request.
- perform_
screenshot chrome - Perform a screenshot shortcut.
- perform_
smart_ mouse_ movement real_browserandchrome - Generate random mouse movement. This does nothing without the ‘real_browser’ flag enabled.
- prepare_
url - Prepare the url for parsing if it fails. Use this method if the url does not start with http or https.
- put_
hybrid_ cache cache_chrome_hybridorcache_chrome_hybrid_mem - Store the page to cache to be re-used across HTTP request.
- reset
control - Reset a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
- resume
control - Resume a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
- run_
navigate_ or_ content_ set_ core chromeand non-chrome_remote_cache - Core logic: either set document content or navigate.
- run_
openai_ request chromeandopenai - Use OpenAI to extend the crawl. This does nothing without ‘openai’ feature flag.
- shutdown
control - Shutdown a target website crawl. The crawl_id is prepended directly to the domain and required if set. ex: d22323edsd-https://mydomain.com
- split_
hashset_ round_ robin - Consumes
setand returns (left, right), whereleftare items matchingpred. - strip_
www - Strip the
www.prefix from a URL’s host, if present. ReturnsNoneif the URL has nowww.prefix. Example:https://www.docs.github.com/foo→https://docs.github.com/foo - wait_
for_ dom chrome - wait for dom to finish updating target selector
- wait_
for_ event chrome - wait for event with timeout
- wait_
for_ selector chrome - wait for a selector