Struct Page 

pub struct Page {
    pub headers: Option<HeaderMap>,
    pub remote_addr: Option<SocketAddr>,
    pub cookies: Option<HeaderMap>,
    pub status_code: StatusCode,
    pub error_status: Option<String>,
    pub links: HashSet<CaseInsensitiveString>,
    pub external_domains_caseless: Arc<HashSet<CaseInsensitiveString>>,
    pub final_redirect_destination: Option<String>,
    pub screenshot_bytes: Option<Vec<u8>>,
    pub openai_credits_used: Option<Vec<OpenAIUsage>>,
    pub extra_ai_data: Option<Vec<AIResults>>,
    pub gemini_credits_used: Option<Vec<GeminiUsage>>,
    pub extra_gemini_data: Option<Vec<AIResults>>,
    pub remote_multimodal_usage: Option<Vec<AutomationUsage>>,
    pub extra_remote_multimodal_data: Option<Vec<AutomationResults>>,
    pub spawn_pages: Option<Vec<String>>,
    pub content_map: Option<HashMap<String, Bytes>>,
    pub page_links: Option<Box<HashSet<CaseInsensitiveString>>>,
    pub should_retry: bool,
    pub waf_check: bool,
    pub bytes_transferred: Option<f64>,
    pub blocked_crawl: bool,
    pub signature: Option<u64>,
    pub response_map: Option<HashMap<String, f64>>,
    pub request_map: Option<HashMap<String, f64>>,
    pub anti_bot_tech: AntiBotTech,
    pub metadata: Option<Box<Metadata>>,
    pub content_truncated: bool,
    pub proxy_configured: bool,
    pub binary_file: bool,
    pub backend_source: Option<CompactString>,
    /* private fields */
}
Available on crate feature decentralized only.

Represent a page visited.

Fields§

§headers: Option<HeaderMap>

The headers of the page request response.

§remote_addr: Option<SocketAddr>
Available on crate feature remote_addr only.

The remote address of the page.

§cookies: Option<HeaderMap>
Available on crate feature cookies only.

The cookies of the page request response.

§status_code: StatusCode

The status code of the page request.

§error_status: Option<String>

The error of the request if any.

§links: HashSet<CaseInsensitiveString>

The current links for the page.

§external_domains_caseless: Arc<HashSet<CaseInsensitiveString>>

The external urls to group with the domain.

§final_redirect_destination: Option<String>

The final destination of the page if redirects were performed [Unused].

§screenshot_bytes: Option<Vec<u8>>
Available on crate feature chrome only.

The screenshot bytes of the page.

§openai_credits_used: Option<Vec<OpenAIUsage>>
Available on crate feature openai only.

The credits used from OpenAI in order.

§extra_ai_data: Option<Vec<AIResults>>
Available on crate feature openai only.

The extra data from the AI, for example extracted data.

§gemini_credits_used: Option<Vec<GeminiUsage>>
Available on crate feature gemini only.

The credits used from Gemini in order.

§extra_gemini_data: Option<Vec<AIResults>>
Available on crate feature gemini only.

The extra data from the Gemini AI.

§remote_multimodal_usage: Option<Vec<AutomationUsage>>

The usage from remote multimodal automation (extraction, etc.). Works with both Chrome and HTTP-only crawls.

§extra_remote_multimodal_data: Option<Vec<AutomationResults>>

The extra data from the remote multimodal automation (extraction results, etc.). Works with both Chrome and HTTP-only crawls.

§spawn_pages: Option<Vec<String>>

URLs requested by automation to spawn as additional pages.

§content_map: Option<HashMap<String, Bytes>>
Available on crate feature spider_cloud only.

Additional content keyed by return format (e.g. "markdown", "text"). Populated when multiple formats are requested via SpiderCloudConfig::with_return_formats.

§page_links: Option<Box<HashSet<CaseInsensitiveString>>>

The links found on the page. Unused until we can structure the buffers to match.

§should_retry: bool

The request should retry.

§waf_check: bool

A WAF was found on the page.

§bytes_transferred: Option<f64>

The total bytes transferred for the page. Mainly used for Chrome events.

§blocked_crawl: bool

The page was blocked from crawling, usually by website::on_should_crawl_callback.

§signature: Option<u64>

The signature of the page to de-duplicate content.
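
As a rough illustration of how a u64 signature can drive de-duplication, the sketch below hashes a page body and tracks the signatures already seen. The hashing scheme (std's DefaultHasher over the raw bytes) and the names content_signature and SeenContent are assumptions made for this example; the crate may compute its signature differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash the page body into a u64 signature.
fn content_signature(body: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    body.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical tracker of the signatures seen so far during a crawl.
#[derive(Default)]
struct SeenContent {
    signatures: HashSet<u64>,
}

impl SeenContent {
    /// Returns true when the signature is new, false when it was seen before.
    fn insert(&mut self, body: &[u8]) -> bool {
        self.signatures.insert(content_signature(body))
    }
}
```

Comparing 8-byte signatures keeps the seen-set small compared with storing whole bodies, at the cost of a very small chance of hash collisions.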

§response_map: Option<HashMap<String, f64>>
Available on crate feature chrome only.

All of the response events mapped with the amount of bytes used.

§request_map: Option<HashMap<String, f64>>
Available on crate feature chrome only.

All of the request events mapped with the time period of the event sent.

§anti_bot_tech: AntiBotTech

The anti-bot tech used.

§metadata: Option<Box<Metadata>>

Page metadata.

§content_truncated: bool

Whether the response content was truncated due to a stream error, chunk idle timeout, or Content-Length mismatch.

§proxy_configured: bool

Whether a proxy was configured for this request. When true, 401 responses are retried (proxy rotation may fix auth).

§binary_file: bool

Whether the content is a binary file (image, PDF, etc.). Set once when HTML bytes are first available so the flag remains accurate after content is spooled to disk.

§backend_source: Option<CompactString>
Available on crate feature parallel_backends only.

Identifies which backend produced this page (e.g. “primary”, “cdp”, “servo”). None when parallel backends are not active.

Implementations§

impl Page

pub fn needs_retry(&self) -> bool

Whether the page needs a retry based on should_retry, a retryable status code, a truncated response (upstream stream ended prematurely), or a proxy-retryable 401 (when proxy_configured is set, proxy rotation may resolve the auth failure).
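
The four conditions above can be read as a single boolean expression. The following is a minimal sketch on a simplified stand-in type, not the crate's implementation: RetryState is hypothetical, status codes are plain u16 values, and the set of "retryable" statuses is an assumption chosen for the example.

```rust
/// Hypothetical, simplified view of the fields that feed the retry decision.
struct RetryState {
    should_retry: bool,
    status_code: u16,
    content_truncated: bool,
    proxy_configured: bool,
}

/// Assumed set of retryable statuses, for this sketch only.
fn is_retryable_status(code: u16) -> bool {
    matches!(code, 408 | 429 | 500..=599)
}

fn needs_retry(state: &RetryState) -> bool {
    state.should_retry
        || is_retryable_status(state.status_code)
        || state.content_truncated
        // A 401 is only retried when a proxy rotation might change the outcome.
        || (state.proxy_configured && state.status_code == 401)
}
```

Note that in this sketch a 401 without a configured proxy is not retried, since repeating the same request would be expected to fail the same way.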

pub async fn new_page(url: &str, client: &Client) -> Self

Instantiate a new page and gather the HTML, reproducing the standard fetch_page_html.

pub async fn new_page_with_cache( url: &str, client: &Client, cache_options: Option<CacheOptions>, cache_policy: &Option<BasicCachePolicy>, cache_namespace: Option<&str>, ) -> Self

Instantiate a new page using cache options when available.

pub fn new_webdriver(url: &str, html: String, status_code: StatusCode) -> Self

Create a new page from WebDriver content.

pub async fn new_page_webdriver( url: &str, driver: &Arc<WebDriver>, timeout: Option<Duration>, ) -> Self

Create a new page from WebDriver with full response.

pub async fn new_page_webdriver_full( url: &str, driver: &Arc<WebDriver>, timeout: Option<Duration>, wait_for: &Option<WaitFor>, execution_scripts: &Option<ExecutionScripts>, automation_scripts: &Option<AutomationScripts>, ) -> Self

Create a new page from WebDriver with full response and automation support.

pub async fn new_page_streaming<A: PartialEq + Eq + Sync + Send + Clone + Default + Hash + From<String> + for<'a> From<&'a str>>( url: &str, client: &Client, only_html: bool, selectors: &mut RelativeSelectors, external_domains_caseless: &Arc<HashSet<CaseInsensitiveString>>, r_settings: &PageLinkBuildSettings, map: &mut HashSet<A>, ssg_map: Option<&mut HashSet<A>>, prior_domain: &Option<Box<Url>>, domain_parsed: &mut Option<Box<Url>>, links_pages: &mut Option<HashSet<A>>, ) -> Self

Create a new page, gathering links with the streaming rewriter.

pub async fn new_page_only_html(url: &str, client: &Client) -> Self

Instantiate a new page and gather the HTML, reproducing the standard fetch_page_html but only gathering resources to crawl.

pub async fn new_page_streaming_from_bytes<A: PartialEq + Eq + Sync + Send + Clone + Default + Hash + From<String> + for<'a> From<&'a str>>( url: &str, input_bytes: &[u8], selectors: &mut RelativeSelectors, external_domains_caseless: &Arc<HashSet<CaseInsensitiveString>>, r_settings: &PageLinkBuildSettings, map: &mut HashSet<A>, ssg_map: Option<&mut HashSet<A>>, prior_domain: &Option<Box<Url>>, domain_parsed: &mut Option<Box<Url>>, links_pages: &mut Option<HashSet<A>>, ) -> Self

Instantiate a new page and gather the links from input bytes.

pub async fn new(url: &str, client: &Client) -> Self

Instantiate a new page and gather the headers and links.

pub async fn screenshot( &self, _full_page: bool, _omit_background: bool, _format: CaptureScreenshotFormat, _quality: Option<i64>, _output_path: Option<impl AsRef<Path>>, _clip: Option<ClipViewport>, ) -> Vec<u8>

Take a screenshot of the page. If the output path is set to None the screenshot will not be saved. The feature flag chrome_store_page is required.

pub fn get_chrome_page(&self) -> Option<&Page>

Available on crate feature chrome only.

Get the chrome page used. The feature flag chrome is required.

pub async fn close_page(&mut self)

Available on crate feature chrome only.

Close the chrome page used. Useful when storing the page for subscription usage. The feature flag chrome_store_page is required.

pub fn is_empty(&self) -> bool

Page request is empty. On chrome an empty page has bare html markup. When the balance feature is active, a page whose HTML has been spooled to disk is not considered empty.

pub fn get_timeout(&self) -> Option<Duration>

Available on crate feature headers only.

Get the timeout required for rate limiting. Delays are respected up to a maximum of 30 seconds. Requires the feature flag headers.
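
The 30-second ceiling can be pictured as a clamp on whatever delay the server asks for. This is a minimal sketch under stated assumptions: the helper capped_delay is hypothetical, and reading the delay from a Retry-After style value given in whole seconds is an assumption, since the docs above do not say which header is consulted.

```rust
use std::time::Duration;

/// Upper bound on how long a server-requested delay is respected.
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Hypothetical helper: turn a delay given in seconds into a capped Duration.
/// Returns None when no value is present or it does not parse as a number.
fn capped_delay(retry_after: Option<&str>) -> Option<Duration> {
    let secs = retry_after?.trim().parse::<u64>().ok()?;
    Some(Duration::from_secs(secs).min(MAX_DELAY))
}
```

Capping the delay keeps a single misbehaving or hostile server from stalling a crawl for an arbitrary amount of time.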

pub fn set_external( &mut self, external_domains_caseless: Arc<HashSet<CaseInsensitiveString>>, )

Set the external domains to treat as one.

pub fn set_html_bytes(&mut self, html: Option<Vec<u8>>)

Set the HTML of the page directly.

pub fn is_html_on_disk(&self) -> bool

Available on non-crate feature balance or crate feature decentralized only.

Whether this page’s HTML currently lives on disk rather than in memory. Always returns false when the balance feature is not enabled or the decentralized feature is active.

pub fn is_binary_spool_aware(&self) -> bool

Check if this page contains binary content, even when the HTML is spooled to disk.

Zero disk I/O: binary_file is snapshotted at spool time (before bytes leave memory). For in-memory pages the magic-number check runs on the existing buffer. Spooled pages rely solely on the pre-cached flag — no disk peek needed.
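
The split between the in-memory check and the cached flag might look roughly like the sketch below. Body and PageSketch are hypothetical stand-ins for the page's private storage, and the short list of file signatures (PNG, JPEG, GIF, PDF) is only an illustrative subset, not the crate's actual detection table.

```rust
/// Hypothetical model of where a page body currently lives.
enum Body {
    InMemory(Vec<u8>),
    /// Bytes were moved to disk; only the cached flag remains in memory.
    Spooled,
}

struct PageSketch {
    body: Body,
    /// Snapshotted before the bytes left memory.
    binary_file: bool,
}

/// Check a buffer against a few well-known file signatures.
fn looks_binary(bytes: &[u8]) -> bool {
    const MAGIC: [&[u8]; 4] = [b"\x89PNG\r\n\x1a\n", b"\xFF\xD8\xFF", b"GIF8", b"%PDF-"];
    MAGIC.iter().any(|m| bytes.starts_with(m))
}

impl PageSketch {
    fn is_binary_spool_aware(&self) -> bool {
        match &self.body {
            // In memory: inspect the existing buffer.
            Body::InMemory(bytes) => looks_binary(bytes),
            // On disk: trust the pre-cached flag, no file read.
            Body::Spooled => self.binary_file,
        }
    }
}
```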

pub fn stream_html_bytes<F>(&self, chunk_size: usize, cb: F) -> usize
where F: FnMut(&[u8]) -> bool,

Available on non-crate feature balance or crate feature decentralized only.

Stream the HTML content in fixed-size chunks to a caller-supplied callback. Works the same as the balance-enabled stream_html_bytes but is available without the balance feature: it simply chunks the in-memory HTML.
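
For the in-memory path, chunked streaming with an early-exit callback can be sketched with std alone. The free function stream_chunks is a hypothetical stand-in, and two points are assumptions not confirmed by the docs above: that the callback returns true to continue, and that the usize result is the number of bytes handed to the callback.

```rust
/// Feed `data` to `cb` in `chunk_size` pieces and stop early when the
/// callback returns false. Returns the number of bytes handed to `cb`.
fn stream_chunks<F>(data: &[u8], chunk_size: usize, mut cb: F) -> usize
where
    F: FnMut(&[u8]) -> bool,
{
    if chunk_size == 0 {
        return 0; // slice::chunks panics on a zero chunk size
    }
    let mut delivered = 0;
    for chunk in data.chunks(chunk_size) {
        delivered += chunk.len();
        if !cb(chunk) {
            break;
        }
    }
    delivered
}
```

The last chunk may be shorter than chunk_size, so callers should rely on the slice length rather than the requested size.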

pub async fn stream_html_bytes_async<F>( &self, chunk_size: usize, cb: F, ) -> usize
where F: FnMut(&[u8]) -> bool,

Available on non-crate feature balance or crate feature decentralized only.

Async version of stream_html_bytes. Without the balance feature this simply chunks the in-memory HTML (no disk path exists).

pub async fn get_html_async(&self) -> String

Available on non-crate feature balance or crate feature decentralized only.

Async version of get_html. Without balance this delegates to the sync version (no disk path).

pub fn set_url_parsed_direct(&mut self)

Set the parsed URL of the page directly.

pub fn set_url_parsed_direct_empty(&mut self)

Set the parsed URL of the page directly. Useful for transforming the content and rewriting the URL.

pub fn get_url_parsed(&self) -> &Option<Url>

Parsed URL getter for page.

pub fn get_url_parsed_ref(&self) -> &Option<Url>

Parsed URL getter for page.

pub fn take_url(&mut self) -> Option<Url>

Take the parsed url.

pub fn get_url(&self) -> &str

URL getter for page.

pub fn get_bytes(&self) -> Option<&[u8]>

Html getter for bytes on the page.

Returns None when HTML is spooled to disk. Use get_html, get_html_async, or stream_html_bytes for disk-aware access.

pub fn get_html(&self) -> String

Html getter for bytes on the page as string.

When the balance feature is active and the HTML was spooled to disk, this transparently reads from the temporary file and returns the content. The spool file is not deleted here (use ensure_html_loaded to reload + delete).

pub fn get_content(&self) -> String

Content getter — returns the page body as a string.

This is an alias for get_html that works with any return format (HTML, markdown, text, etc.) set via SpiderCloudConfig::with_return_format or transformed locally with spider_transformations.

pub fn get_html_cow(&self) -> Cow<'_, str>

Html getter that avoids allocation when the content is already valid UTF-8. Returns Cow::Borrowed for UTF-8 content (common case), Cow::Owned when encoding conversion is needed or content is loaded from a disk spool.
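
The borrow-or-allocate pattern described here can be shown with std alone. In this sketch the hypothetical html_cow takes the raw bytes directly, and a lossy UTF-8 conversion stands in for the crate's encoding conversion, which is an assumption.

```rust
use std::borrow::Cow;

/// Borrow when the bytes are already valid UTF-8; allocate only when a
/// conversion is needed (lossy replacement is used here as a stand-in).
fn html_cow(bytes: &[u8]) -> Cow<'_, str> {
    match std::str::from_utf8(bytes) {
        Ok(s) => Cow::Borrowed(s),
        Err(_) => Cow::Owned(String::from_utf8_lossy(bytes).into_owned()),
    }
}
```

Callers that only read the content can use the Cow like a &str; an allocation is paid only on the uncommon non-UTF-8 path.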

pub fn get_html_bytes_u8(&self) -> &[u8]

Html getter for page to u8.

pub fn get_content_bytes(&self) -> &[u8]

Content getter as raw bytes — alias for get_html_bytes_u8.

Works with any return format (HTML, markdown, text, etc.).

pub fn get_content_for(&self, format: &str) -> Option<String>

Available on crate feature spider_cloud only.

Get content for a specific return format from a multi-format response.

Returns None if multi-format was not requested or the format is not present. Use with_return_formats on SpiderCloudConfig to request multiple formats.

pub fn get_content_bytes_for(&self, format: &str) -> Option<&[u8]>

Available on crate feature spider_cloud only.

Get content for a specific return format as raw bytes.

Returns None if multi-format was not requested or the format is not present.
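
Both per-format getters reduce to a lookup in an optional map keyed by format name. The sketch below models that with a hypothetical MultiFormat type that stores Vec<u8> values instead of Bytes so it stays std-only; the lossy UTF-8 decoding is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a page holding a multi-format response.
struct MultiFormat {
    content_map: Option<HashMap<String, Vec<u8>>>,
}

impl MultiFormat {
    /// Raw bytes for a format, or None when the map or the key is absent.
    fn content_bytes_for(&self, format: &str) -> Option<&[u8]> {
        self.content_map.as_ref()?.get(format).map(Vec::as_slice)
    }

    /// Same lookup, decoded into an owned String.
    fn content_for(&self, format: &str) -> Option<String> {
        self.content_bytes_for(format)
            .map(|b| String::from_utf8_lossy(b).into_owned())
    }
}
```

Returning Option from both getters lets callers distinguish "format not requested" from "format present but empty".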

pub fn has_content_map(&self) -> bool

Available on crate feature spider_cloud only.

Check if this page has multi-format content available.

pub fn quality_score(&self) -> u16

Available on crate feature parallel_backends only.

Compute an HTML quality score (0–100) for this page.

Uses status code, content length, structural HTML checks, and anti-bot detection to score the response.

pub fn get_responses(&self) -> &Option<HashMap<String, f64>>

Available on crate feature chrome only.

Get the response events mapped.

pub fn get_metadata(&self) -> &Option<Box<Metadata>>

Get the metadata found on the page.

pub fn get_request(&self) -> &Option<HashMap<String, f64>>

Available on crate feature chrome only.

Get the request events mapped.

pub fn get_html_encoded(&self, label: &str) -> String

Available on crate feature encoding only.

HTML getter that returns the content with the proper encoding. Pass in a valid encoding label such as SHIFT_JIS. This falls back to get_html when the encoding feature flag is not enabled.

pub fn set_duration_elapsed(&mut self, scraped_at: Option<Instant>)

Available on crate feature time only.

Set the elapsed duration of the page since it was scraped, computed from the scraped_at instant.

pub fn set_duration_elapsed_from_duration(&mut self, elapsed: Option<Duration>)

Available on crate feature time only.

Set the elapsed duration of the page since it was scraped, from an already computed duration.

pub fn get_duration_elapsed(&self) -> Duration

Available on crate feature time only.

Get the elapsed duration of the page since scraped.

Find the links as a stream using string resource validation for XML files.

Find the links as a stream using string resource validation

Find all href links and return them using CSS selectors.

Find all href links and return them using CSS selectors gathering all resources.

Trait Implementations§

impl Clone for Page

fn clone(&self) -> Page

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for Page

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for Page

fn default() -> Page

Returns the “default value” for a type.

impl PageChromeExt for Page

Available on crate feature chrome only.

fn chrome_page(&self) -> Option<&Page>

The underlying Chrome DevTools Protocol page handle, if available.

fn screenshot_bytes(&self) -> Option<&[u8]>

A screenshot of the page as raw bytes, if captured.

impl PageData for Page

fn url(&self) -> &str

The page URL as originally requested.

fn url_final(&self) -> &str

The final URL after any redirects.

fn bytes(&self) -> Option<&[u8]>

The raw response bytes, if available.

fn html(&self) -> String

The page content decoded as a UTF-8 string.

fn html_bytes_u8(&self) -> &[u8]

The raw HTML bytes (empty slice if none).

fn status_code(&self) -> StatusCode

The HTTP status code of the response.

fn headers(&self) -> Option<&HeaderMap>

The HTTP response headers, if available.

fn is_empty(&self) -> bool

Whether the page has no meaningful content.

impl PageTimingExt for Page

Available on crate feature time only.

fn duration_elapsed(&self) -> Duration

How long since the page request started.
Auto Trait Implementations§

impl !Freeze for Page

impl RefUnwindSafe for Page

impl Send for Page

impl Sync for Page

impl Unpin for Page

impl UnsafeUnpin for Page

impl UnwindSafe for Page

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.

impl<T> DynClone for T
where T: Clone,

fn __clone_box(&self, _: Private) -> *mut ()

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow.

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow.

impl<T> Same for T

type Output = T

Should always be Self

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.