Application utils.
Modules§
- abs - Absolute path domain handling.
- connect - Connect layer for reqwest.
- css_selectors - Generic CSS selectors.
- detect_system - CPU and Memory detection to balance limitations.
- header_utils - Utils to modify the HTTP header.
- interner - String interner.
- trie - A trie struct.
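For readers unfamiliar with the data structure behind the trie module, here is a minimal sketch of a character-keyed trie. It is an illustration only: `TrieNode`, `Trie`, `insert`, and `contains` are hypothetical names chosen for this example and are not taken from the crate, whose actual trie type may be organized differently.

```rust
use std::collections::HashMap;

// Hypothetical illustration of a trie; not the crate's own type.
#[derive(Default)]
struct TrieNode {
    // One child node per next character.
    children: HashMap<char, TrieNode>,
    // True when a stored key ends at this node.
    is_end: bool,
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    // Walk the key character by character, creating nodes as needed.
    fn insert(&mut self, key: &str) {
        let mut node = &mut self.root;
        for ch in key.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.is_end = true;
    }

    // A key is present only if the walk succeeds and ends on a marked node.
    fn contains(&self, key: &str) -> bool {
        let mut node = &self.root;
        for ch in key.chars() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.is_end
    }
}
```

In this sketch, a prefix of a stored key is not reported as present unless it was inserted itself, because only the final node of each inserted key is marked.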
Structs§
- APACHE_FORBIDDEN - Apache server forbidden.
- AllowedDomainTypes - Allow subdomains or tlds.
- HttpResponse - A basic generic type that represents an HTTP response.
- OPEN_RESTY_FORBIDDEN - Open Resty forbidden.
- PageResponse - The response of a web page.
Enums§
- HeaderSource - Accepts different header types (for flexibility).
- HttpVersion - Represents an HTTP version.
Statics§
- IGNORE_CONTENT_TYPES - Ignore the content types.
Functions§
- clean_html - Clean the HTML, removing CSS and JS.
- clean_html_raw - Clean the HTML, removing CSS and JS (the default, using the scraper crate).
- clean_html_slim - Clean and remove all base64 images from the prompt.
- convert_headers - Convert headers to a header map.
- crawl_duration_expired - Check if the crawl duration has expired.
- detect_anti_bot_from_body - Detect the anti-bot technology from the body.
- detect_anti_bot_from_headers - Detect the anti-bot technology from the headers.
- detect_anti_bot_tech_response - Detect the anti-bot technology used from the request.
- detect_antibot_from_url - Detect the anti-bot technology from the URL.
- detect_hard_forbidden_content - Detect if a page is forbidden and should not be retried.
- emit_log - Emit a log info event.
- emit_log_shutdown - Emit a log info event.
- fetch_page_html - Perform a network request to a resource, extracting all content as text via streaming.
- fetch_page_html_raw - Perform a network request to a resource, extracting all content via streaming.
- fetch_page_html_raw_only_html - Perform a network request to a resource, extracting all content via streaming.
- get_cookies - The response cookies mapped. This does nothing without the cookies feature flag enabled.
- get_last_segment - Get the last segment of the path.
- get_semaphore - Return the semaphore that should be used.
- handle_openai_credits - Handle the OpenAI credits used. This does nothing without the ‘openai’ feature flag.
- handle_response_bytes - Handle the response bytes.
- handle_response_bytes_writer - Handle the response bytes, writing links while crawling.
- is_html_content_check - Check if the content is HTML.
- log - Log to the console if the configuration is verbose.
- networking_capable - Determine whether a URL is capable of networking.
- openai_request - Perform a request to OpenAI Chat. This does nothing without the ‘openai’ flag enabled.
- prepare_url - Prepare the URL for parsing if it fails. Use this method if the URL does not start with http or https.
- put_hybrid_cache - Store the page in the cache to be reused across HTTP requests.
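The descriptions of prepare_url and get_last_segment suggest simple string handling, which the sketch below illustrates using only the standard library. These are stand-ins, not the crate's functions: the names `prepare_url_sketch` and `last_segment_sketch`, their signatures, and the choice to prepend `https://` are all assumptions made for illustration, and the real functions may behave differently.

```rust
/// Hypothetical stand-in for the idea behind `prepare_url`:
/// if the input lacks an `http://` or `https://` scheme, prepend
/// `https://` (an assumed default) so a URL parser can accept it.
fn prepare_url_sketch(input: &str) -> String {
    let trimmed = input.trim();
    if trimmed.starts_with("http://") || trimmed.starts_with("https://") {
        trimmed.to_string()
    } else {
        format!("https://{}", trimmed)
    }
}

/// Hypothetical stand-in for the idea behind `get_last_segment`:
/// return the text after the final `/`, or the whole input when
/// it contains no `/`.
fn last_segment_sketch(path: &str) -> &str {
    match path.rfind('/') {
        Some(pos) => &path[pos + 1..],
        None => path,
    }
}
```

Under these assumptions, `prepare_url_sketch("example.com")` yields `https://example.com`, and `last_segment_sketch("/blog/posts/intro")` yields `intro`. A path ending in `/` yields an empty segment in this sketch.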