pub struct Page {
pub headers: Option<HeaderMap>,
pub cookies: Option<HeaderMap>,
pub status_code: StatusCode,
pub error_status: Option<String>,
pub external_domains_caseless: Box<HashSet<CaseInsensitiveString>>,
pub final_redirect_destination: Option<String>,
pub page_links: Option<Box<HashSet<CaseInsensitiveString>>>,
pub should_retry: bool,
pub waf_check: bool,
pub bytes_transferred: Option<f64>,
pub blocked_crawl: bool,
pub signature: Option<u64>,
pub anti_bot_tech: AntiBotTech,
pub metadata: Option<Box<Metadata>>,
/* private fields */
}
Represents a visited page.
Fields§
headers: Option<HeaderMap>
The headers of the page request response.
cookies: Option<HeaderMap>
The cookies of the page request response.
status_code: StatusCode
The status code of the page request.
error_status: Option<String>
The error of the request, if any.
external_domains_caseless: Box<HashSet<CaseInsensitiveString>>
The external URLs to group with the domain.
final_redirect_destination: Option<String>
The final destination of the page if redirects were performed [not implemented in the chrome feature].
page_links: Option<Box<HashSet<CaseInsensitiveString>>>
The links found on the page. This includes all links that have an href URL.
should_retry: bool
Whether the request should be retried.
waf_check: bool
Whether a WAF was found on the page.
bytes_transferred: Option<f64>
The total bytes transferred for the page. Mainly used for chrome events. Inspect the content for bytes when using HTTP instead.
blocked_crawl: bool
Whether the page was blocked from crawling, usually by website::on_should_crawl_callback.
signature: Option<u64>
The signature of the page, used to de-duplicate content.
anti_bot_tech: AntiBotTech
The anti-bot tech used.
metadata: Option<Box<Metadata>>
Page metadata.
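The signature field is described as a u64 used to de-duplicate content. The following is a minimal sketch of that idea using only the standard library. The helper names content_signature and is_new_content are hypothetical, and DefaultHasher is an assumption; the crate may compute its signature with a different hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash a page body into a u64 signature.
/// DefaultHasher is an assumption, not necessarily what the crate uses.
fn content_signature(body: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    body.hash(&mut hasher);
    hasher.finish()
}

/// Returns true the first time a body is seen and false for a repeat.
/// HashSet::insert already reports whether the value was newly added.
fn is_new_content(seen: &mut HashSet<u64>, body: &[u8]) -> bool {
    seen.insert(content_signature(body))
}
```

Keeping only the u64 in the set, rather than the full body, keeps memory use small when many pages are crawled, at the cost of a very small chance of hash collisions.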
Implementations§
impl Page
pub async fn new_page(url: &str, client: &Client) -> Self
Instantiate a new page and gather the html repro of standard fetch_page_html.
pub async fn new_page_streaming<A: PartialEq + Eq + Sync + Send + Clone + Default + Hash + From<String>>(
    url: &str,
    client: &Client,
    only_html: bool,
    selectors: &mut RelativeSelectors,
    external_domains_caseless: &Box<HashSet<CaseInsensitiveString>>,
    r_settings: &PageLinkBuildSettings,
    map: &mut HashSet<A>,
    ssg_map: Option<&mut HashSet<A>>,
    prior_domain: &Option<Box<Url>>,
    domain_parsed: &mut Option<Box<Url>>,
    links_pages: &mut Option<HashSet<A>>,
) -> Self
Instantiate a new page using the rewriter.
pub async fn new_page_only_html(url: &str, client: &Client) -> Self
Instantiate a new page and gather the html repro of standard fetch_page_html only gathering resources to crawl.
pub async fn new(url: &str, client: &Client) -> Self
Instantiate a new page and gather the html.
pub async fn screenshot(
    &self,
    _full_page: bool,
    _omit_background: bool,
    _format: CaptureScreenshotFormat,
    _quality: Option<i64>,
    _output_path: Option<impl AsRef<Path>>,
    _clip: Option<ClipViewport>,
) -> Vec<u8>
Take a screenshot of the page. If the output path is set to None the screenshot will not be saved.
The feature flag chrome_store_page is required.
pub fn is_empty(&self) -> bool
Check whether the page request is empty. With chrome, an empty page has bare HTML markup.
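As a rough illustration of the two cases described above, the sketch below treats a body as empty when it has no bytes or when it equals a bare document skeleton. The exact markup in EMPTY_HTML and the helper name is_empty_body are assumptions for illustration; the crate's own check may compare against different markup.

```rust
/// Assumed bare markup that a browser produces for a blank document.
const EMPTY_HTML: &[u8] = b"<html><head></head><body></body></html>";

/// Hypothetical check: a body counts as empty when it has no bytes
/// or when it is exactly the bare markup above.
fn is_empty_body(body: &[u8]) -> bool {
    body.is_empty() || body == EMPTY_HTML
}
```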
pub fn get_timeout(&self) -> Option<Duration>
Get the timeout required for rate limiting. The maximum delay respected is 30 seconds. Requires the feature flag headers.
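The 30-second cap can be sketched as a clamp on a server-requested delay. This example assumes the delay arrives as a Retry-After style value in whole seconds, which the description above does not state; clamp_retry_after and MAX_DELAY are hypothetical names.

```rust
use std::time::Duration;

/// Upper bound on any server-requested delay (the 30-second limit).
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Hypothetical helper: parse a delay given in whole seconds and clamp it
/// to MAX_DELAY. Returns None when the value is missing or not a number.
fn clamp_retry_after(header_value: Option<&str>) -> Option<Duration> {
    let secs: u64 = header_value?.trim().parse().ok()?;
    Some(Duration::from_secs(secs).min(MAX_DELAY))
}
```

Returning an Option mirrors the signature of get_timeout: callers can skip the delay entirely when no usable value is present.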
pub fn get_url_final(&self) -> &str
Url getter for page after redirects.
pub fn set_external(&mut self, external_domains_caseless: Box<HashSet<CaseInsensitiveString>>)
Set the external domains to treat as one.
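The external domains are stored in a set keyed by a case-insensitive string. The sketch below shows one way such a key can work, using a hypothetical Caseless type as a stand-in for CaseInsensitiveString; the real type's implementation is not shown in this page and may differ.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a case-insensitive string key.
#[derive(Debug, Clone)]
struct Caseless(String);

impl PartialEq for Caseless {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(&other.0)
    }
}

impl Eq for Caseless {}

impl Hash for Caseless {
    // Hash the ASCII-lowercased bytes so that keys which compare equal
    // also hash equally, as HashSet requires.
    fn hash<H: Hasher>(&self, state: &mut H) {
        for b in self.0.bytes() {
            state.write_u8(b.to_ascii_lowercase());
        }
    }
}

/// Returns true when `host` is one of the external domains grouped with the crawl.
fn is_grouped(external: &HashSet<Caseless>, host: &str) -> bool {
    external.contains(&Caseless(host.to_string()))
}
```

With this key type, a domain inserted as Docs.Example.com matches a later lookup for docs.example.com.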
pub fn set_html_bytes(&mut self, html: Option<Vec<u8>>)
Set the HTML of the page directly.
pub fn set_url(&mut self, url: String)
Set the URL of the page directly. Useful for transforming the content and rewriting the URL.
pub fn set_url_parsed_direct(&mut self)
Set the parsed URL of the page directly. Useful for transforming the content and rewriting the URL.
pub fn set_url_parsed_direct_empty(&mut self)
Set the parsed URL of the page directly. Useful for transforming the content and rewriting the URL.
pub fn set_url_parsed(&mut self, url_parsed: Url)
Set the parsed URL of the page directly. Useful for transforming the content and rewriting the URL.
pub fn get_url_parsed_ref(&self) -> &Option<Url>
Parsed URL getter for page.
pub fn get_url_parsed(&mut self) -> &Option<Url>
Parsed URL getter for page.
pub fn get_html_bytes_u8(&self) -> &[u8]
HTML getter for the page as u8 bytes.
pub fn get_metadata(&self) -> &Option<Box<Metadata>>
Get the metadata found on the page.
pub fn get_html_encoded(&self, label: &str) -> String
HTML getter for the content with proper encoding. Pass in a valid encoding label such as SHIFT_JIS. This falls back to get_html when the encoding feature flag is not enabled.
pub fn set_duration_elapsed(&mut self, scraped_at: Option<Instant>)
Set the elapsed duration of the page since it was scraped.
pub fn set_duration_elapsed_from_duration(&mut self, elapsed: Option<Duration>)
Set the elapsed duration of the page since it was scraped, from a duration.
pub fn get_duration_elapsed(&self) -> Duration
Get the elapsed duration of the page since it was scraped.
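The relationship between the three duration methods can be sketched with a small stand-in type. Timing is hypothetical and holds only the timing state; returning zero when nothing was recorded is an assumption, since the page above does not say what the getter returns in that case.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in holding only the timing state of a page.
#[derive(Default)]
struct Timing {
    elapsed: Option<Duration>,
}

impl Timing {
    /// Record the time elapsed since `scraped_at`, if a start instant is given.
    fn set_duration_elapsed(&mut self, scraped_at: Option<Instant>) {
        self.elapsed = scraped_at.map(|start| start.elapsed());
    }

    /// Store a duration that was already measured elsewhere.
    fn set_duration_elapsed_from_duration(&mut self, elapsed: Option<Duration>) {
        self.elapsed = elapsed;
    }

    /// Return the stored duration, or zero when nothing was recorded (an assumption).
    fn get_duration_elapsed(&self) -> Duration {
        self.elapsed.unwrap_or_default()
    }
}
```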
pub async fn links_stream_xml_links_stream_base<A: PartialEq + Eq + Sync + Send + Clone + Default + ToString + Hash + From<String>>(
    &mut self,
    selectors: &RelativeSelectors,
    xml: &str,
    map: &mut HashSet<A>,
    base: &Option<Box<Url>>,
)
Find the links as a stream using string resource validation for XML files.
pub async fn links_stream_base<A: PartialEq + Eq + Sync + Send + Clone + Default + ToString + Hash + From<String>>(
    &mut self,
    selectors: &RelativeSelectors,
    html: &str,
    base: &Option<Box<Url>>,
) -> HashSet<A>
Find the links as a stream using string resource validation.
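The link methods are generic over the element type A, which must be hashable and constructible from a String. The sketch below shows why those bounds are enough to fill a HashSet<A>. It uses a naive scan for double-quoted href values rather than the selector-based streaming the crate performs, and collect_hrefs is a hypothetical name.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Naive scan for double-quoted href values; a stand-in for real HTML parsing.
/// Each value is converted into the caller's chosen type through From<String>.
fn collect_hrefs<A>(html: &str) -> HashSet<A>
where
    A: Eq + Hash + From<String>,
{
    const NEEDLE: &str = "href=\"";
    let mut links = HashSet::new();
    let mut rest = html;
    while let Some(start) = rest.find(NEEDLE) {
        let after = &rest[start + NEEDLE.len()..];
        match after.find('"') {
            Some(end) => {
                links.insert(A::from(after[..end].to_string()));
                rest = &after[end + 1..];
            }
            // Unterminated attribute: stop scanning.
            None => break,
        }
    }
    links
}
```

The caller picks the set's element type, for example `let links: HashSet<String> = collect_hrefs(html);`. Any type implementing From<String>, such as a case-insensitive wrapper, can be used in the same way, and repeated links collapse into one entry.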
pub async fn links_stream_base_ssg<A: PartialEq + Eq + Sync + Send + Clone + Default + ToString + Hash + From<String>>(
    &mut self,
    selectors: &RelativeSelectors,
    html: &str,
    client: &Client,
    base: &Option<Box<Url>>,
) -> HashSet<A>
Find the links as a stream using string resource validation.
pub async fn links_stream_ssg<A: PartialEq + Eq + Sync + Send + Clone + Default + ToString + Hash + From<String>>(
    &mut self,
    selectors: &RelativeSelectors,
    client: &Client,
    prior_domain: &Option<Box<Url>>,
) -> HashSet<A>
Find the links as a stream using string resource validation and parsing the script for nextjs initial SSG paths.
pub async fn links_ssg(
    &mut self,
    selectors: &RelativeSelectors,
    client: &Client,
    prior_domain: &Option<Box<Url>>,
) -> HashSet<CaseInsensitiveString>
Find all href links and return them using CSS selectors.
pub async fn links_stream<A: PartialEq + Eq + Sync + Send + Clone + Default + ToString + Hash + From<String>>(
    &mut self,
    selectors: &RelativeSelectors,
    base: &Option<Box<Url>>,
) -> HashSet<A>
Find the links as a stream using string resource validation.
pub async fn links_stream_full_resource<A: PartialEq + Eq + Sync + Send + Clone + Default + ToString + Hash + From<String>>(
    &mut self,
    selectors: &RelativeSelectors,
    base: &Option<Box<Url>>,
) -> HashSet<A>
Find the links as a stream using string resource validation.
pub async fn links(
    &mut self,
    selectors: &RelativeSelectors,
    base: &Option<Box<Url>>,
) -> HashSet<CaseInsensitiveString>
Find all href links and return them using CSS selectors.
pub async fn links_full(
    &mut self,
    selectors: &RelativeSelectors,
    base: &Option<Box<Url>>,
) -> HashSet<CaseInsensitiveString>
Find all href links and return them using CSS selectors, gathering all resources.
Trait Implementations§
Auto Trait Implementations§
impl Freeze for Page
impl RefUnwindSafe for Page
impl Send for Page
impl Sync for Page
impl Unpin for Page
impl UnwindSafe for Page
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.