Application utils.
Modules§
- abs
- Absolute path domain handling.
- connect
- Connect layer for reqwest.
- css_selectors
- Generic CSS selectors.
- detect_system
- CPU and Memory detection to balance limitations.
- header_utils
- Utils to modify the HTTP header.
- interner
- String interner.
- templates
- Fragment templates.
- trie
- A trie struct.
- validation
- Validate html false positives.
Structs§
- APACHE_FORBIDDEN
- Apache server forbidden.
- AllowedDomainTypes
- Allow subdomains or tlds.
- EMPTY_HTML_BASIC
- Empty html.
- GEMINI_SEM
- Semaphore for Gemini rate limiting.
- HttpResponse
- A basic generic type that represents an HTTP response.
- OPEN_RESTY_FORBIDDEN
- Open Resty forbidden.
- PageResponse
- The response of a web page.
Enums§
- BasicCachePolicy
- Basic cache policy.
- CacheOptions
- Cache options to use for the request.
- HeaderSource
- Accepts different header types (for flexibility).
- HttpVersion
- Represents an HTTP version.
Statics§
- IGNORE_CONTENT_TYPES
- Ignore the content types.
Functions§
- cache_auth_token
- Cache auth token.
- cache_skip_browser
- Check if cache options indicate browser should be skipped when cached.
- clean_html
- Default cleaner used by the engine (non-slim build).
- clean_html_base
- Clean the html removing css and js (base).
- clean_html_full
- Clean the most extra properties in the html to fit the context. Removes nav/footer, trims meta, and prunes most attributes except id/class/data-*.
- clean_html_raw
- Clean the html removing css and js default (raw passthrough).
- clean_html_slim
- Clean the HTML to slim-fit models. This removes base64 images and heavy nodes.
- convert_headers
- Convert headers to header map.
- crawl_duration_expired
- Check if the crawl duration is expired.
- detect_anti_bot_from_body
- Detect the anti-bot technology.
- detect_anti_bot_from_headers
- Detect from headers (optimized: minimal lookups, no allocations).
- detect_anti_bot_tech_response
- Detect the anti-bot used from the request.
- detect_antibot_from_url
- Detect antibot from url.
- detect_hard_forbidden_content
- Detect if a page is forbidden and should not retry.
- detect_open_resty_forbidden
- Detect if openresty hard 403 is forbidden and should not retry.
- emit_log
- Emit a log info event.
- emit_log_shutdown
- Emit a log info event.
- fetch_page_html
- Perform a network request to a resource extracting all content as text streaming.
- fetch_page_html_raw
- Perform a network request to a resource extracting all content streaming.
- fetch_page_html_raw_only_html
- Perform a network request to a resource extracting all content streaming.
- flip_http_https
- Flip http -> https protocols.
- gemini_request
- Perform a request to Gemini. This does nothing without the ‘gemini’ flag enabled.
- get_cached_url
- Perform a network request to a resource extracting all content as text streaming via chrome.
- get_cookies
- The response cookies mapped. This does nothing without the cookies feature flag enabled.
- get_last_segment
- Get the last segment path.
- get_semaphore
- Return the semaphore that should be used.
- handle_gemini_credits
- Handle the Gemini credits used. This does nothing without ‘gemini’ feature flag.
- handle_openai_credits
- Handle the OpenAI credits used. This does nothing without ‘openai’ feature flag.
- handle_response_bytes
- Handle the response bytes.
- handle_response_bytes_writer
- Handle the response bytes writing links while crawling.
- is_html_content_check
- Check if the content is HTML.
- log
- Log to console if configuration verbose.
- networking_capable
- Determine if networking is capable for a URL.
- openai_request
- Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
- prepare_url
- Prepare the url for parsing if it fails. Use this method if the url does not start with http or https.
- put_hybrid_cache
- Store the page to cache to be re-used across HTTP requests.
- split_hashset_round_robin
- Consumes set and returns (left, right), where left are items matching pred.
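
The last entry describes a consuming split of a set into a (left, right) pair by a predicate. A minimal sketch of that behavior follows, using only the standard library. The name `partition_set` and its signature are hypothetical, chosen for illustration; they are not taken from this crate, whose actual function may differ.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper (not this crate's API): consumes `set` and returns
/// `(left, right)`, where `left` holds the items for which `pred` is true.
fn partition_set<T, F>(set: HashSet<T>, pred: F) -> (HashSet<T>, HashSet<T>)
where
    T: Eq + Hash,
    F: Fn(&T) -> bool,
{
    // `Iterator::partition` routes matching items into the first collection
    // and everything else into the second.
    set.into_iter().partition(|item| pred(item))
}

fn main() {
    // Example URLs are placeholders for illustration only.
    let urls: HashSet<&str> = [
        "https://example.com/a",
        "http://example.com/b",
        "https://example.com/c",
    ]
    .into_iter()
    .collect();

    let (secure, other) = partition_set(urls, |u| u.starts_with("https://"));

    assert_eq!(secure.len(), 2);
    assert_eq!(other.len(), 1);
    assert!(other.contains(&"http://example.com/b"));
}
```

Taking the set by value lets both halves be built without cloning any item, which is presumably why the description says the input is consumed.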