Struct spider::website::Website

source ·
pub struct Website {
    pub configuration: Box<Configuration>,
    pub on_link_find_callback: Option<fn(_: CaseInsensitiveString, _: Option<String>) -> (CaseInsensitiveString, Option<String>)>,
    /* private fields */
}
Expand description

Represents a website to crawl and gather all links or page content.

use spider::website::Website;
let mut website = Website::new("http://example.com");
website.crawl();
// `Website` will be filled with links or pages when crawled. If you need pages with the resource
// call the `website.scrape` method with `website.get_pages` instead.
for link in website.get_links() {
    // do something
}

Fields§

§configuration: Box<Configuration>

Configuration properties for website.

§on_link_find_callback: Option<fn(_: CaseInsensitiveString, _: Option<String>) -> (CaseInsensitiveString, Option<String>)>

The callback when a link is found.

Implementations§

source§

impl Website

source

pub fn new(url: &str) -> Self

Initialize Website object with a start link to crawl.

source

pub fn set_url(&mut self, url: &str) -> &mut Self

Set the url of the website to re-use configuration and data.

source

pub fn is_allowed( &self, link: &CaseInsensitiveString, blacklist_url: &Box<Vec<CompactString>> ) -> bool

return true if URL:

  • is not already crawled
  • is not blacklisted
  • is not forbidden in the robots.txt file (if the parameter is defined)
source

pub fn is_allowed_default( &self, link: &CompactString, blacklist_url: &Box<Vec<CompactString>> ) -> bool

return true if URL:

  • is not blacklisted
  • is not forbidden in the robots.txt file (if the parameter is defined)
source

pub fn is_allowed_robots(&self, link: &str) -> bool

return true if URL:

  • is not forbidden in the robots.txt file (if the parameter is defined)
source

pub fn size(&self) -> usize

Amount of pages crawled.

Drain the links visited.

Drain the extra links used for things like the sitemap.

Set extra links to crawl. This could be used in conjunction with ‘website.persist_links’ to extend the crawl on the next run.

source

pub fn clear(&mut self)

Clear all pages and links stored.

source

pub fn get_client(&self) -> &Option<Client>

Get the HTTP request client. The client is set after the crawl has started.

source

pub fn get_pages(&self) -> Option<&Box<Vec<Page>>>

Page getter.

Links visited getter.

source

pub fn get_url_parsed(&self) -> &Option<Box<Url>>

Domain parsed url getter.

source

pub fn get_url(&self) -> &CaseInsensitiveString

Domain name getter.

source

pub fn get_status(&self) -> &CrawlStatus

Get the active crawl status.

Set the crawl status to persist between the run. Example crawling a sitemap and all links after - website.crawl_sitemap().await.persist_links().crawl().await

source

pub fn get_absolute_path(&self, domain: Option<&str>) -> Option<Url>

Absolute base url of crawl.

source

pub fn stop(&mut self)

Stop all crawls for the website.

source

pub async fn configure_robots_parser(&mut self, client: Client) -> Client

configure the robots parser on initial crawl attempt and run.

source

pub fn set_http_client(&mut self, client: Client) -> &Option<Client>

Set the HTTP client to use directly. This is helpful if you manually call ‘website.configure_http_client’ before the crawl.

source

pub fn configure_http_client(&mut self) -> Client

Configure http client.

source

pub async fn crawl(&mut self)

Start to crawl website with async concurrency.

source

pub async fn crawl_sitemap(&mut self)

Start to crawl website with async concurrency using the sitemap. This does not page forward into the request. This does nothing without the sitemap flag enabled.

source

pub async fn crawl_smart(&mut self)

Start to crawl website with async concurrency smart. Use HTTP first and JavaScript Rendering as needed. This has no effect without the smart flag enabled.

source

pub async fn crawl_raw(&mut self)

Start to crawl website with async concurrency using the base raw functionality. Useful when using the chrome feature and defaulting to the basic implementation.

source

pub async fn scrape(&mut self)

Start to scrape/download website with async concurrency.

source

pub async fn scrape_raw(&mut self)

Start to crawl website with async concurrency using the base raw functionality. Useful when using the “chrome” feature and defaulting to the basic implementation.

source

pub async fn sitemap_crawl( &mut self, _client: &Client, _handle: &Option<Arc<AtomicI8>>, _scrape: bool )

Sitemap crawl entire lists. Note: this method does not re-crawl the links of the pages found on the sitemap. This does nothing without the sitemap flag.

source

pub fn with_respect_robots_txt(&mut self, respect_robots_txt: bool) -> &mut Self

Respect robots.txt file.

source

pub fn with_subdomains(&mut self, subdomains: bool) -> &mut Self

Include subdomains detection.

source

pub fn with_tld(&mut self, tld: bool) -> &mut Self

Include tld detection.

source

pub fn with_http2_prior_knowledge( &mut self, http2_prior_knowledge: bool ) -> &mut Self

Only use HTTP/2.

source

pub fn with_delay(&mut self, delay: u64) -> &mut Self

Delay between requests in milliseconds.

source

pub fn with_request_timeout( &mut self, request_timeout: Option<Duration> ) -> &mut Self

Max time to wait for request.

source

pub fn with_danger_accept_invalid_certs( &mut self, accept_invalid_certs: bool ) -> &mut Self

Dangerously accept invalid certificates - this should be used as a last resort.

source

pub fn with_user_agent(&mut self, user_agent: Option<&str>) -> &mut Self

Add user agent to request.

source

pub fn with_sitemap(&mut self, _sitemap_url: Option<&str>) -> &mut Self

Set the sitemap url to use when crawling. This does nothing without the sitemap flag enabled.

source

pub fn with_proxies(&mut self, proxies: Option<Vec<String>>) -> &mut Self

Use proxies for request.

source

pub fn with_crawl_id(&mut self, _crawl_id: String) -> &mut Self

Set a crawl ID to use for tracking crawls. This does nothing without the control flag enabled.

source

pub fn with_blacklist_url<T>( &mut self, blacklist_url: Option<Vec<T>> ) -> &mut Self
where Vec<CompactString>: From<Vec<T>>,

Add blacklist urls to ignore.

source

pub fn with_headers(&mut self, headers: Option<HeaderMap>) -> &mut Self

Set HTTP headers for request using reqwest::header::HeaderMap.

source

pub fn with_budget(&mut self, budget: Option<HashMap<&str, u32>>) -> &mut Self

Set a crawl budget per path with levels support /a/b/c or for all paths with “*”. This does nothing without the budget flag enabled.

source

pub fn set_crawl_budget( &mut self, _budget: Option<HashMap<CaseInsensitiveString, u32>> )

Set the crawl budget directly. This does nothing without the budget flag enabled.

source

pub fn with_depth(&mut self, depth: usize) -> &mut Self

Set a crawl depth limit. If the value is 0 there is no limit. This does nothing without the feat flag budget enabled.

source

pub fn with_external_domains<'a, 'b>( &mut self, external_domains: Option<impl Iterator<Item = String> + 'a> ) -> &mut Self

Group external domains to treat the crawl as one. If None is passed this will clear all prior domains.

Perform a callback to run on each link find.

source

pub fn with_cookies(&mut self, cookie_str: &str) -> &mut Self

Cookie string to use in request. This does nothing without the cookies flag enabled.

source

pub fn with_cron(&mut self, cron_str: &str, cron_type: CronType) -> &mut Self

Setup cron jobs to run. This does nothing without the cron flag enabled.

source

pub fn with_locale(&mut self, locale: Option<String>) -> &mut Self

Overrides default host system locale with the specified one. This does nothing without the chrome flag enabled.

source

pub fn with_stealth(&mut self, stealth_mode: bool) -> &mut Self

Use stealth mode for the request. This does nothing without the chrome flag enabled.

source

pub fn with_openai(&mut self, openai_configs: Option<GPTConfigs>) -> &mut Self

Use OpenAI to get dynamic javascript to drive the browser. This does nothing without the openai flag enabled.

source

pub fn with_caching(&mut self, cache: bool) -> &mut Self

Cache the page following HTTP rules. This method does nothing if the cache feature is not enabled.

source

pub fn with_fingerprint(&mut self, fingerprint: bool) -> &mut Self

Setup custom fingerprinting for chrome. This method does nothing if the chrome feature is not enabled.

source

pub fn with_viewport(&mut self, viewport: Option<Viewport>) -> &mut Self

Configures the viewport of the browser, which defaults to 800x600. This method does nothing if the chrome feature is not enabled.

source

pub fn with_wait_for_idle_network( &mut self, wait_for_idle_network: Option<WaitForIdleNetwork> ) -> &mut Self

Wait for idle network request. This method does nothing if the chrome feature is not enabled.

source

pub fn with_wait_for_selector( &mut self, wait_for_selector: Option<WaitForSelector> ) -> &mut Self

Wait for a CSS query selector. This method does nothing if the chrome feature is not enabled.

source

pub fn with_wait_for_delay( &mut self, wait_for_delay: Option<WaitForDelay> ) -> &mut Self

Wait for a delay. Should only be used for testing. This method does nothing if the chrome feature is not enabled.

source

pub fn with_redirect_limit(&mut self, redirect_limit: usize) -> &mut Self

Set the max redirects allowed for request.

source

pub fn with_redirect_policy(&mut self, policy: RedirectPolicy) -> &mut Self

Set the redirect policy to use, either Strict or Loose by default.

source

pub fn with_chrome_intercept( &mut self, chrome_intercept: bool, block_images: bool ) -> &mut Self

Use request intercept for the request to only allow content that matches the host. If the content is from a 3rd party it needs to be part of our include list. This method does nothing if the chrome_intercept flag is not enabled.

source

pub fn with_full_resources(&mut self, full_resources: bool) -> &mut Self

Determine whether to collect all the resources found on pages.

source

pub fn with_ignore_sitemap(&mut self, ignore_sitemap: bool) -> &mut Self

Ignore the sitemap when crawling. This method does nothing if the sitemap flag is not enabled.

source

pub fn with_timezone_id(&mut self, timezone_id: Option<String>) -> &mut Self

Overrides default host system timezone with the specified one. This does nothing without the chrome flag enabled.

source

pub fn with_evaluate_on_new_document( &mut self, evaluate_on_new_document: Option<Box<String>> ) -> &mut Self

Set a custom script to evaluate on new document creation. This does nothing without the feat flag chrome enabled.

source

pub fn with_limit(&mut self, limit: u32) -> &mut Self

Set a crawl page limit. If the value is 0 there is no limit. This does nothing without the feat flag budget enabled.

source

pub fn with_screenshot( &mut self, screenshot_config: Option<ScreenShotConfig> ) -> &mut Self

Set the chrome screenshot configuration. This does nothing without the chrome flag enabled.

source

pub fn with_auth_challenge_response( &mut self, auth_challenge_response: Option<AuthChallengeResponse> ) -> &mut Self

Set the authentication challenge response. This does nothing without the feat flag chrome enabled.

source

pub fn with_chrome_connection( &mut self, chrome_connection_url: Option<String> ) -> &mut Self

Set the connection url for the chrome instance. This method does nothing if the chrome feature is not enabled.

source

pub fn with_config(&mut self, config: Configuration) -> &mut Self

Set the configuration for the website directly.

source

pub fn build(&self) -> Result<Self, Error>

Build the website configuration when using with_builder.

source

pub fn subscribe(&mut self, capacity: usize) -> Option<Receiver<Page>>

Sets up a subscription to receive concurrent data. This will panic if it is larger than usize::MAX / 2. Set the value to 0 to use the semaphore permits. If the subscription is going to block or use async methods, make sure to spawn a task to avoid losing messages. This does nothing unless the sync flag is enabled.

§Examples

Subscribe and receive messages using an async tokio environment:

use spider::{tokio, website::Website};

#[tokio::main]
async fn main() {
    let mut website = Website::new("http://example.com");
    let mut rx = website.subscribe(0).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            tokio::spawn(async move {
                // Process the received page.
                // If performing non-blocking tasks or managing a high subscription count, configure accordingly.
            });
        }
    });

    website.crawl().await;
}
source

pub fn queue(&mut self, capacity: usize) -> Option<Sender<String>>

Get a sender for queueing extra links mid crawl. This does nothing unless the sync flag is enabled.

source

pub fn unsubscribe(&mut self)

Remove subscriptions for data. This is useful for auto dropping subscriptions that are running on another thread. This does nothing without the sync flag enabled.

source

pub fn subscribe_guard(&mut self) -> Option<ChannelGuard>

Setup subscription counter to track concurrent operation completions. This helps keep a chrome instance active until all operations are completed from all threads to safely take screenshots and other actions. Make sure to call inc if you take a guard. Without calling inc in the subscription receiver the crawl will stay in an infinite loop. This does nothing without the sync flag enabled. You also need to use the ‘chrome_store_page’ to keep the page alive between requests.

§Example
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("http://example.com");
    let mut rx2 = website.subscribe(18).unwrap();
    let mut rxg = website.subscribe_guard().unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("📸 - {:?}", page.get_url());
            page
                .screenshot(
                    true,
                    true,
                    spider::configuration::CaptureScreenshotFormat::Png,
                    Some(75),
                    None::<std::path::PathBuf>,
                    None,
                )
                .await;
            rxg.inc();
        }
    });
    website.crawl().await;
}

Trait Implementations§

source§

impl Clone for Website

source§

fn clone(&self) -> Website

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for Website

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Default for Website

source§

fn default() -> Website

Returns the “default value” for a type. Read more

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T> Instrument for T

source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
source§

impl<T> WithSubscriber for T

source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more