
Module utils


Application utils.

Modules§

abs
Absolute path domain handling.
connect
Connect layer for reqwest.
css_selectors
Generic CSS selectors.
detect_system
CPU and memory detection to balance resource limits.
header_utils
Utils to modify the HTTP header.
interner
String interner.
templates
Fragment templates.
trie
A trie struct.
validation
Validate html false positives.
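
For readers unfamiliar with the term, a string interner stores each distinct string once and hands back a cheap handle for it. The sketch below is a generic, standard-library-only illustration of that idea. The `Interner` type and its `intern` and `resolve` methods are hypothetical names chosen for this example and are not taken from the `interner` module, whose actual API may differ.

```rust
use std::collections::HashMap;

/// Hypothetical interner: maps each distinct string to a small integer id.
/// This is an illustration only, not the crate's `interner` module.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl Interner {
    /// Return the id for `s`, allocating a new id the first time `s` is seen.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.names.len() as u32;
        self.ids.insert(s.to_owned(), id);
        self.names.push(s.to_owned());
        id
    }

    /// Look up the string behind an id, if that id was ever handed out.
    fn resolve(&self, id: u32) -> Option<&str> {
        self.names.get(id as usize).map(String::as_str)
    }
}
```

Interning the same URL twice yields the same id, so later comparisons can be done on integers instead of full strings.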

Structs§

APACHE_FORBIDDEN
Apache server forbidden.
AllowedDomainTypes
Allow subdomains or TLDs.
EMPTY_HTML_BASIC
Empty html.
GEMINI_SEM
Semaphore for Gemini rate limiting.
HttpResponse
A basic generic type that represents an HTTP response.
OPEN_RESTY_FORBIDDEN
Open Resty forbidden.
PageResponse
The response of a web page.

Enums§

BasicCachePolicy
Basic cache policy.
CacheOptions
Cache options to use for the request.
HeaderSource
Accepts different header types (for flexibility).
HttpVersion
Represents an HTTP version.

Statics§

IGNORE_CONTENT_TYPES
Ignore the content types.

Functions§

cache_auth_token
Cache auth token.
cache_skip_browser
Check if cache options indicate browser should be skipped when cached.
clean_html
Default cleaner used by the engine (non-slim build).
clean_html_base
Clean the html removing css and js (base).
clean_html_full
Strip most non-essential properties from the HTML to fit the context. Removes nav/footer, trims meta, and prunes most attributes except id/class/data-*.
clean_html_raw
Clean the html removing css and js default (raw passthrough).
clean_html_slim
Clean the HTML to slim-fit models. This removes base64 images and heavy nodes.
convert_headers
Convert headers to a header map.
crawl_duration_expired
Check if the crawl duration is expired.
detect_anti_bot_from_body
Detect the anti-bot technology.
detect_anti_bot_from_headers
Detect from headers (optimized: minimal lookups, no allocations).
detect_anti_bot_tech_response
Detect the anti-bot used from the request.
detect_antibot_from_url
Detect anti-bot technology from the URL.
detect_hard_forbidden_content
Detect if a page is forbidden and should not be retried.
detect_open_resty_forbidden
Detect an OpenResty hard 403 (forbidden) response that should not be retried.
emit_log
Emit a log info event.
emit_log_shutdown
Emit a log info event.
fetch_page_html
Perform a network request to a resource, streaming all content as text.
fetch_page_html_raw
Perform a network request to a resource, streaming all content.
fetch_page_html_raw_only_html
Perform a network request to a resource, streaming all content.
flip_http_https
Flip http -> https protocols.
gemini_request
Perform a request to Gemini. This does nothing without the ‘gemini’ flag enabled.
get_cached_url
Perform a network request to a resource via Chrome, streaming all content as text.
get_cookies
The response cookies mapped. This does nothing without the cookies feature flag enabled.
get_last_segment
Get the last segment path.
get_semaphore
Return the semaphore that should be used.
handle_gemini_credits
Handle the Gemini credits used. This does nothing without ‘gemini’ feature flag.
handle_openai_credits
Handle the OpenAI credits used. This does nothing without ‘openai’ feature flag.
handle_response_bytes
Handle the response bytes.
handle_response_bytes_writer
Handle the response bytes, writing links while crawling.
is_html_content_check
Check if the content is HTML.
log
Log to console if configuration verbose.
networking_capable
Determine whether a URL is capable of networking.
openai_request
Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
prepare_url
Prepare the URL for parsing if parsing fails. Use this function if the URL does not start with http or https.
put_hybrid_cache
Store the page in the cache for reuse across HTTP requests.
split_hashset_round_robin
Consumes the set and returns (left, right), where left holds the items matching pred.
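
A few of the helpers listed above describe behavior simple enough to sketch with the standard library alone. The functions below are hypothetical stand-ins written for illustration: `flip_scheme`, `last_segment`, and `split_by` are not the crate's functions, and the real signatures and edge-case handling of `flip_http_https`, `get_last_segment`, and `split_hashset_round_robin` may differ. In particular, flipping in both directions is an assumption of this sketch.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical take on flipping the scheme: swap http and https and
/// leave the rest of the URL untouched. Other schemes yield `None`.
fn flip_scheme(url: &str) -> Option<String> {
    if let Some(rest) = url.strip_prefix("https://") {
        Some(format!("http://{rest}"))
    } else if let Some(rest) = url.strip_prefix("http://") {
        Some(format!("https://{rest}"))
    } else {
        None
    }
}

/// Hypothetical take on "last segment": the text after the final '/'.
fn last_segment(path: &str) -> &str {
    match path.rfind('/') {
        Some(i) => &path[i + 1..],
        None => path,
    }
}

/// Consume a set and return (left, right), where `left` holds the items
/// for which `pred` returns true and `right` holds the remainder.
fn split_by<T, F>(set: HashSet<T>, pred: F) -> (HashSet<T>, HashSet<T>)
where
    T: Eq + Hash,
    F: FnMut(&T) -> bool,
{
    set.into_iter().partition(pred)
}
```

Because `split_by` takes the set by value, no items are cloned. Each element is moved into exactly one of the two result sets.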