pub struct Website {
pub configuration: Box<Configuration>,
pub on_link_find_callback: Option<OnLinkFindCallback>,
pub on_should_crawl_callback: Option<OnShouldCrawlCallback>,
pub crawl_id: Box<String>,
pub extra_info: Option<Box<String>>,
pub cookie_jar: Arc<Jar>,
pub pb_quality_validator: Option<QualityValidator>,
/* private fields */
}
Represents a website to crawl and gather all links or page content.
```rust
use spider::website::Website;

let mut website = Website::new("http://example.com");
website.crawl();

// `Website` will be filled with links or pages when crawled. If you need pages with the resource,
// call the `website.scrape` method with `website.get_pages` instead.
for link in website.get_links() {
    // do something
}
```

Fields§
configuration: Box<Configuration>
Configuration properties for the website.
on_link_find_callback: Option<OnLinkFindCallback>
The callback to run when a link is found.
on_should_crawl_callback: Option<OnShouldCrawlCallback>
The callback used to decide if a page should be ignored. Return false to ensure that the discovered links are not crawled.
crawl_id: Box<String>
Set the crawl ID to track. This allows explicit targeting for shutdown, pause, etc.
extra_info: Option<Box<String>>
Available on crate feature extra_information only.
Extra information to store.
cookie_jar: Arc<Jar>
Available on crate feature cookies only.
The cookie jar shared between requests.
pb_quality_validator: Option<QualityValidator>
Available on crate feature parallel_backends only.
Custom quality validator for parallel backend responses. Called after the built-in scorer. Can override, adjust, or reject scores.
Implementations§
impl Website
pub fn new_with_firewall(url: &str, check_firewall: bool) -> Self
Initialize the Website with a starting link to crawl and check the firewall.
pub fn setup_database_handler(&self) -> Box<DatabaseHandler>
Available on crate feature disk only.
Set up a shared database handler.
pub fn setup_sqlite(&mut self)
Available on crate feature disk only.
Set up the SQLite usage.
pub fn set_url(&mut self, url: &str) -> &mut Self
Set the URL of the website to reuse configuration and data.
pub fn set_url_only(&mut self, url: &str) -> &mut Self
Set the direct URL of the website to reuse configuration and data without parsing the domain.
pub fn target_id(&self) -> String
Get the target ID for a crawl. This concatenates the crawl ID and the URL without delimiters.
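The concatenation described above can be sketched in plain Rust (the free function here is a hypothetical stand-in; the real method reads the crawl ID and URL from the Website fields):

```rust
/// Sketch: a target id is the crawl id and the url concatenated
/// with no delimiter between them.
fn target_id(crawl_id: &str, url: &str) -> String {
    format!("{crawl_id}{url}")
}
```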
pub fn single_page(&self) -> bool
Whether the crawl is a single page request.
pub fn setup_disk(&mut self)
Available on crate feature disk only.
Set up SQLite. This does nothing without the disk flag enabled.
pub fn set_disk_persistance(&mut self, persist: bool) -> &mut Self
Available on crate feature disk only.
Set the SQLite disk persistence.
pub fn get_robots_parser(&self) -> &Option<Box<RobotFileParser>>
Get the robots.txt parser.
pub fn get_requires_javascript(&self) -> bool
Whether the website requires JavaScript to run.
pub fn get_website_meta_info(&self) -> &WebsiteMetaInfo
Get the website meta information that can help with retry handling.
pub async fn is_allowed_disk(&self, url_to_check: &str) -> bool
Available on crate feature disk only.
Check if the URL exists (ignoring case). This does nothing without the disk flag enabled.
pub async fn is_allowed_signature_disk(&self, signature_to_check: u64) -> bool
Available on crate feature disk only.
Check if the signature exists (ignoring case). This does nothing without the disk flag enabled.
pub async fn is_signature_allowed(&self, signature: u64) -> bool
Whether the signature is allowed.
pub async fn clear_disk(&self)
Available on crate feature disk only.
Clear the disk. This does nothing without the disk flag enabled.
pub async fn insert_url_disk(&self, new_url: &str)
Available on crate feature disk only.
Insert a new URL to disk if it doesn't exist. This does nothing without the disk flag enabled.
pub async fn insert_signature_disk(&self, signature: u64)
Available on crate feature disk only.
Insert a new signature to disk if it doesn't exist. This does nothing without the disk flag enabled.
pub async fn insert_link(&mut self, new_url: &CaseInsensitiveString)
Available on crate feature disk only.
Insert a new URL if it doesn't exist. This does nothing without the disk flag enabled.
Accepts a reference to avoid cloning at the call site. The URL is only cloned internally when it is actually new and needs to be stored.
pub async fn insert_signature(&mut self, new_signature: u64)
Available on crate feature disk only.
Insert a new signature if it doesn't exist. This does nothing without the disk flag enabled.
pub async fn seed(&mut self) -> Result<(), Error>
Available on crate feature disk only.
Seed the DB and clear the Hashset. This does nothing without the disk flag enabled.
pub fn is_allowed(&mut self, link: &CaseInsensitiveString) -> ProcessLinkStatus
Available on crate feature regex only.
Return true if the URL:
- is not already crawled
- is not over depth
- is not over the crawl budget
- is optionally whitelisted
- is not blacklisted
- is not forbidden in the robots.txt file (if a parameter is defined)
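A self-contained sketch of the allow checks listed above (the enum and function are simplified stand-ins for spider's ProcessLinkStatus and internals, not the crate API; the depth, budget, and robots.txt checks are elided):

```rust
use std::collections::HashSet;

/// Simplified stand-in for spider's ProcessLinkStatus (assumed shape).
#[derive(Debug, PartialEq)]
enum LinkStatus {
    Allowed,
    Blocked,
}

/// Sketch of the visited / whitelist / blacklist checks described above.
fn is_allowed(
    link: &str,
    visited: &HashSet<String>,
    whitelist: &[String],
    blacklist: &[String],
) -> LinkStatus {
    // Already-crawled links are never re-processed (case-insensitive).
    if visited.contains(&link.to_lowercase()) {
        return LinkStatus::Blocked;
    }
    // Optional whitelist: when present, the link must match an entry.
    if !whitelist.is_empty() && !whitelist.iter().any(|w| link.contains(w.as_str())) {
        return LinkStatus::Blocked;
    }
    // Blacklisted links are always rejected.
    if blacklist.iter().any(|b| link.contains(b.as_str())) {
        return LinkStatus::Blocked;
    }
    LinkStatus::Allowed
}
```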
pub fn is_allowed_budgetless(&mut self, link: &CaseInsensitiveString) -> ProcessLinkStatus
Available on crate feature regex only.
Return true if the URL:
- is not already crawled
- is not over depth
- is optionally whitelisted
- is not blacklisted
- is not forbidden in the robots.txt file (if a parameter is defined)
pub fn is_allowed_default(&self, link: &CaseInsensitiveString) -> ProcessLinkStatus
Available on crate feature regex only.
Return true if the URL:
- is optionally whitelisted
- is not blacklisted
- is not forbidden in the robots.txt file (if a parameter is defined)
pub fn is_allowed_robots(&self, link: &str) -> bool
Return true if the URL:
- is not forbidden in the robots.txt file (if a parameter is defined)
pub fn size(&self) -> usize
The number of pages crawled in memory only. Use get_size for the full link count across memory and disk.
pub async fn get_size(&self) -> usize
Available on crate feature disk only.
Get the amount of resources collected.
pub fn drain_extra_links(&mut self) -> Drain<'_, CaseInsensitiveString>
Drain the extra links used for things like the sitemap.
pub fn set_initial_status_code(&mut self, initial_status_code: StatusCode)
Set the initial status code of the request.
pub fn get_initial_status_code(&self) -> &StatusCode
Get the initial status code of the request.
pub fn set_initial_html_length(&mut self, initial_html_length: usize)
Set the initial HTML size of the request.
pub fn get_initial_html_length(&self) -> usize
Get the initial HTML size of the request.
pub fn set_initial_anti_bot_tech(&mut self, initial_anti_bot_tech: AntiBotTech)
Set the anti-bot tech detected on the initial request.
pub fn get_initial_anti_bot_tech(&self) -> &AntiBotTech
Get the anti-bot tech detected on the initial request.
pub fn get_compiled_custom_antibot(&self) -> Option<&CompiledCustomAntibot>
Get the compiled custom antibot patterns.
pub fn set_initial_page_waf_check(&mut self, initial_page_waf_check: bool)
Set whether a WAF was detected on the initial request.
pub fn get_initial_page_waf_check(&self) -> bool
Get whether a WAF was detected on the initial request.
pub fn set_initial_page_should_retry(&mut self, initial_page_should_retry: bool)
Set the should-retry determination for the initial request.
pub fn get_initial_page_should_retry(&self) -> bool
Get the should-retry determination for the initial request.
pub fn drain_links(&mut self) -> Drain<'_, SymbolUsize>
Available on crate features string_interner_bucket_backend, string_interner_string_backend, or string_interner_buffer_backend only.
Drain the links visited.
pub fn drain_signatures(&mut self) -> Drain<'_, u64>
Available on crate features string_interner_bucket_backend, string_interner_string_backend, or string_interner_buffer_backend only.
Drain the signatures visited.
pub fn set_extra_links(&mut self, extra_links: HashSet<CaseInsensitiveString>) -> &HashSet<CaseInsensitiveString>
Set extra links to crawl. This can be used in conjunction with website.persist_links to extend the crawl on the next run.
pub fn get_extra_links(&self) -> &HashSet<CaseInsensitiveString>
Get the extra links.
pub fn get_client(&self) -> &Option<Client>
Get the HTTP request client. The client is set after the crawl has started.
pub async fn get_links_disk(&self) -> HashSet<CaseInsensitiveString>
Available on crate feature disk only.
Links visited getter for disk. This does nothing without the disk flag enabled.
pub async fn get_all_links_visited(&self) -> HashSet<CaseInsensitiveString>
Available on crate feature disk only.
Get all the links visited between memory and disk.
pub fn get_links(&self) -> HashSet<CaseInsensitiveString>
Links visited getter for memory resources.
pub fn get_url_parsed(&self) -> &Option<Box<Url>>
Parsed domain URL getter.
pub fn get_url(&self) -> &CaseInsensitiveString
Domain name getter.
pub fn get_auto_throttle(&self) -> Option<&Arc<AutoThrottle>>
Available on crate feature auto_throttle only.
Get the shared auto-throttle instance, if configured.
pub fn get_etag_cache(&self) -> Option<&Arc<ETagCache>>
Available on crate feature etag_cache only.
Get the shared ETag cache instance, if enabled.
pub fn get_warc_writer(&self) -> Option<&WarcWriter>
Available on crate feature warc only.
Get the shared WARC writer instance, if configured.
pub fn warc_record_count(&self) -> u64
Available on crate feature warc only.
Get the number of WARC records written so far.
pub fn get_status(&self) -> &CrawlStatus
Get the active crawl status.
pub fn set_status(&mut self, status: CrawlStatus) -> &CrawlStatus
Set the active crawl status. This is helpful when chaining crawls concurrently.
pub fn reset_status(&mut self) -> &CrawlStatus
Reset the active crawl status to bypass websites that are blocked.
pub fn persist_links(&mut self) -> &mut Self
Set the crawl status to persist between runs. Example crawling a sitemap and all links after: website.crawl_sitemap().await.persist_links().crawl().await
pub fn get_absolute_path(&self, domain: Option<&str>) -> Option<Url>
Absolute base URL of the crawl.
pub async fn configure_robots_parser(&mut self, client: &Client)
Configure the robots parser on the initial crawl attempt and run.
pub fn setup_strict_policy(&self) -> Policy
Set up a strict redirect policy for requests. All redirects must match the host.
pub fn setup_redirect_policy(&self) -> Policy
Set up the redirect policy for reqwest.
pub fn configure_headers(&mut self)
Configure the headers to use.
pub fn set_http_client(&mut self, client: Client) -> &Option<Client>
Set the HTTP client to use directly. This is helpful if you manually call website.configure_http_client before the crawl.
pub fn configure_http_client(&mut self) -> Client
Available on crate features decentralized and cache_request only.
Configure the HTTP client for decentralization.
pub fn configure_handler(&self) -> Option<(Arc<AtomicI8>, JoinHandle<()>)>
Available on crate feature control only.
Set up the atomic controller. This does nothing without the control feature flag enabled.
pub async fn setup_chrome_interception(&self, page: &Page) -> Option<JoinHandle<()>>
Available on crate features chrome and chrome_intercept only.
Set up interception for chrome requests.
pub fn setup_selectors(&self) -> RelativeSelectors
Set up selectors for handling link targets.
pub fn setup_base(&mut self) -> (Client, Option<(Arc<AtomicI8>, JoinHandle<()>)>)
Base configuration setup.
pub async fn setup(&mut self) -> (Client, Option<(Arc<AtomicI8>, JoinHandle<()>)>)
Set up the config for the crawl.
pub fn setup_crawl(&self) -> (Pin<Box<Interval>>, Pin<Box<Duration>>)
Set up shared concurrent configs.
pub fn get_expanded_links(&self, domain_name: &str) -> Vec<CaseInsensitiveString>
Available on crate feature glob only.
Get all the expanded links.
pub fn set_crawl_initial_status(&mut self, page: &Page, links: &HashSet<CaseInsensitiveString>)
Set the initial crawl status by page output.
pub async fn _crawl_establish_cmd(&mut self, cmd: PathBuf, cmd_args: Vec<String>, base: &mut RelativeSelectors, _ssg_build: bool) -> HashSet<CaseInsensitiveString>
Available on crate feature cmd only.
Expand links for the crawl base establish using a command-based fetch.
pub async fn run_via_cmd(cmd: &Path, fixed_args: &[String], url: &str) -> Result<Vec<u8>>
Available on crate feature cmd only.
Run the cmd and return the stdout bytes.
pub async fn crawl_concurrent_cmd(&mut self, cmd: PathBuf, cmd_args: Vec<String>, handle: &Option<Arc<AtomicI8>>)
Start to crawl the website concurrently using a cmd executable.
cmd is the executable (an absolute path is preferred). cmd_args are fixed args; they can include a “{url}” placeholder, otherwise the URL is appended.
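The “{url}” placeholder behavior described above can be sketched with std only (the function name is illustrative, not part of the crate API; the rule — substitute “{url}” if present, otherwise append the URL — is taken from the description):

```rust
/// Build the final argument list for a cmd-based fetch: every "{url}"
/// placeholder in the fixed args is substituted with the target url;
/// if no placeholder exists, the url is appended as the last argument.
fn build_cmd_args(fixed_args: &[String], url: &str) -> Vec<String> {
    let mut found = false;
    let mut args: Vec<String> = fixed_args
        .iter()
        .map(|a| {
            if a.contains("{url}") {
                found = true;
                a.replace("{url}", url)
            } else {
                a.clone()
            }
        })
        .collect();
    if !found {
        args.push(url.to_string());
    }
    args
}
```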
pub async fn crawl_establish(&mut self, client: &Client, _: &(CompactString, SmallVec<[CompactString; 2]>), http_worker: bool) -> HashSet<CaseInsensitiveString>
Available on crate features glob and decentralized only.
Expand links for the crawl.
pub fn set_crawl_status(&mut self)
Set the crawl status depending on the crawl state. The status only changes if the state is Start or Active.
pub fn setup_semaphore(&self) -> Arc<Semaphore>
Set up the Semaphore for the crawl.
pub async fn crawl_sitemap(&mut self)
Start to crawl the website with async concurrency using the sitemap. This does not page forward into the request. This does nothing without the sitemap flag enabled.
pub async fn configure_setup(&mut self)
Configures the website crawling process for concurrent execution with the ability to send it across threads for subscriptions.
pub fn configure_setup_norobots(&mut self)
Configures the website crawling process for concurrent execution with the ability to send it across threads for subscriptions, without robots protection.
You can manually call website.configure_robots_parser after.
pub async fn crawl_chrome_send(&self, _url: Option<&str>)
Available on crate features chrome and decentralized only.
In decentralized builds, chrome send crawling is not supported and this is a no-op.
pub async fn crawl_smart(&mut self)
Available on crate features decentralized and smart only.
Start to crawl the website with smart async concurrency. Uses HTTP first and JavaScript rendering as needed. This has no effect without the smart flag enabled.
pub async fn crawl_raw(&mut self)
Start to crawl the website with async concurrency using the base raw functionality. Useful when using the chrome feature and defaulting to the basic implementation.
pub async fn scrape_raw(&mut self)
Start to scrape the website with async concurrency using the base raw functionality. Useful when using the chrome feature and defaulting to the basic implementation.
pub async fn scrape_smart(&mut self)
Start to scrape the website with smart async concurrency. Uses HTTP first and JavaScript rendering as needed. This has no effect without the smart flag enabled.
pub async fn scrape_sitemap(&mut self)
Start to scrape the website sitemap with async concurrency. Uses HTTP first and JavaScript rendering as needed. This has no effect without the sitemap flag enabled.
pub async fn crawl_concurrent(&mut self, client: &Client, handle: &Option<Arc<AtomicI8>>)
Start to crawl the website concurrently.
pub async fn warm_up_gemini(&mut self)
Warm up the Gemini model.
pub async fn sitemap_crawl(&mut self, client: &Client, handle: &Option<Arc<AtomicI8>>, scrape: bool)
Available on crate feature sitemap only.
Sitemap crawl entire lists. Note: this method does not re-crawl the links of the pages found on the sitemap. This does nothing without the sitemap flag.
pub async fn sitemap_parse(&mut self, client: &Client, first_request: &mut bool, sitemap_url: &mut Box<CompactString>, attempted_correct: &mut bool) -> bool
Sitemap parse entire lists. Note: this method does not re-crawl the links of the pages found on the sitemap. This does nothing without the sitemap flag.
pub fn get_base_link(&self) -> &CaseInsensitiveString
Available on crate feature regex only.
Get the base link for crawl establishing.
pub async fn subscription_guard(&self)
Guard the channel from closing until all subscription events complete.
pub async fn setup_browser_base(config: &Configuration, url_parsed: &Option<Box<Url>>, jar: Option<&Arc<Jar>>) -> Option<BrowserController>
Available on crate feature chrome only.
Launch or connect to the browser with setup.
pub async fn setup_browser(&self) -> Option<BrowserController>
Available on crate feature chrome only.
Launch or connect to the browser with setup.
pub async fn setup_webdriver(&self) -> Option<WebDriverController>
Available on crate feature webdriver only.
Launch or connect to WebDriver with setup.
pub async fn render_webdriver_page(&self, url: &str, driver: &Arc<WebDriver>) -> Option<String>
Available on crate feature webdriver only.
Render a page using WebDriver.
pub fn with_respect_robots_txt(&mut self, respect_robots_txt: bool) -> &mut Self
Respect the robots.txt file.
pub fn with_subdomains(&mut self, subdomains: bool) -> &mut Self
Include subdomain detection.
pub fn with_csp_bypass(&mut self, enabled: bool) -> &mut Self
Bypass CSP protection detection. This does nothing without the chrome feature flag enabled.
pub fn with_webdriver(&mut self, webdriver_config: WebDriverConfig) -> &mut Self
Available on crate feature webdriver only.
Configure WebDriver for browser automation. This does nothing without the webdriver feature flag enabled.
When configured, the crawl() function will automatically use WebDriver instead of raw HTTP.
pub fn with_sqlite(&mut self, sqlite: bool) -> &mut Self
Available on crate feature disk only.
Use SQLite to store data and track large crawls. This does nothing without the disk flag enabled.
pub fn with_crawl_timeout(&mut self, crawl_timeout: Option<Duration>) -> &mut Self
The max duration for the crawl. This is useful when websites use a robots.txt with long crawl delays that throttle requests and remove the full concurrency.
pub fn with_http2_prior_knowledge(&mut self, http2_prior_knowledge: bool) -> &mut Self
Only use HTTP/2.
pub fn with_delay(&mut self, delay: u64) -> &mut Self
Delay between requests in ms.
pub fn with_request_timeout(&mut self, request_timeout: Option<Duration>) -> &mut Self
Max time to wait for a request.
pub fn with_danger_accept_invalid_certs(&mut self, accept_invalid_certs: bool) -> &mut Self
Dangerously accept invalid certificates. This should be used as a last resort.
pub fn with_user_agent(&mut self, user_agent: Option<&str>) -> &mut Self
Add a user agent to requests.
pub fn with_preserve_host_header(&mut self, preserve: bool) -> &mut Self
Preserve the HOST header.
pub fn with_sitemap(&mut self, sitemap_url: Option<&str>) -> &mut Self
Available on crate feature sitemap only.
Set the sitemap URL to use. This does nothing without the sitemap flag enabled.
pub fn with_proxies(&mut self, proxies: Option<Vec<String>>) -> &mut Self
Use proxies for requests.
pub fn with_proxies_direct(&mut self, proxies: Option<Vec<RequestProxy>>) -> &mut Self
Use proxies for requests with control between chrome and http.
pub fn with_concurrency_limit(&mut self, limit: Option<usize>) -> &mut Self
Set the concurrency limit. Set the value to None to use the default limit of system CPU cores * n.
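The default described above can be sketched as follows (the multiplier n = 4 is an illustrative assumption, not the crate's actual value):

```rust
/// Sketch: resolve the effective concurrency limit. With None the
/// default of CPU cores * n applies (n is assumed to be 4 here).
fn concurrency_limit(limit: Option<usize>, cpu_cores: usize) -> usize {
    const N: usize = 4; // illustrative multiplier, not the crate's real constant
    limit.unwrap_or(cpu_cores * N)
}
```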
pub fn with_crawl_id(&mut self, crawl_id: String) -> &mut Self
Available on crate feature control only.
Set a crawl ID to use for tracking crawls. This does nothing without the control flag enabled.
pub fn with_blacklist_url<T>(&mut self, blacklist_url: Option<Vec<T>>) -> &mut Self
Add blacklist URLs to ignore.
pub fn with_retry(&mut self, retry: u8) -> &mut Self
Set the retry limit for requests. Set the value to 0 for no retries. The default is 0.
pub fn with_no_control_thread(&mut self, no_control_thread: bool) -> &mut Self
Skip setting up a control thread for pause, start, and shutdown programmatic handling. This does nothing without the control flag enabled.
pub fn with_whitelist_url<T>(&mut self, whitelist_url: Option<Vec<T>>) -> &mut Self
Add whitelist URLs to allow.
pub fn with_event_tracker(&mut self, track_events: Option<ChromeEventTracker>) -> &mut Self
Available on crate feature chrome only.
Track the events made via chrome.
pub fn with_headers(&mut self, headers: Option<HeaderMap>) -> &mut Self
Set HTTP headers for requests using reqwest::header::HeaderMap.
pub fn with_modify_headers(&mut self, modify_headers: bool) -> &mut Self
Modify the headers to mimic a real browser.
pub fn with_modify_http_client_headers(&mut self, modify_http_client_headers: bool) -> &mut Self
Modify the HTTP client headers to mimic a real browser.
pub fn with_budget(&mut self, budget: Option<HashMap<&str, u32>>) -> &mut Self
Set a crawl budget per path with levels support, e.g. /a/b/c, or for all paths with "*". This does nothing without the budget flag enabled.
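How such a per-path budget might be consumed can be sketched with std only (the bookkeeping here is an assumption for illustration; only the HashMap<&str, u32> shape comes from the signature above):

```rust
use std::collections::HashMap;

/// Decrement the budget for the most specific matching path prefix,
/// falling back to the "*" wildcard. Returns false once exhausted.
fn consume_budget(budget: &mut HashMap<&str, u32>, path: &str) -> bool {
    // Prefer the longest configured prefix that matches this path.
    let key = budget
        .keys()
        .filter(|k| **k != "*" && path.starts_with(**k))
        .max_by_key(|k| k.len())
        .copied()
        .unwrap_or("*");
    match budget.get_mut(key) {
        Some(remaining) if *remaining > 0 => {
            *remaining -= 1;
            true
        }
        _ => false,
    }
}
```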
pub fn set_crawl_budget(&mut self, budget: Option<HashMap<CaseInsensitiveString, u32>>)
Set the crawl budget directly. This does nothing without the budget flag enabled.
pub fn with_depth(&mut self, depth: usize) -> &mut Self
Set a crawl depth limit. If the value is 0 there is no limit.
pub fn with_external_domains<'a, 'b>(&mut self, external_domains: Option<impl Iterator<Item = String> + 'a>) -> &mut Self
Group external domains to treat the crawl as one. If None is passed this will clear all prior domains.
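The None-clears semantics described above can be sketched with std only (the free function is a hypothetical stand-in for the builder method):

```rust
use std::collections::HashSet;

/// Sketch: extend the group of external domains treated as one crawl,
/// or clear all prior domains when None is passed.
fn set_external_domains(
    group: &mut HashSet<String>,
    domains: Option<impl Iterator<Item = String>>,
) {
    match domains {
        Some(iter) => group.extend(iter),
        None => group.clear(),
    }
}
```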
Sourcepub fn with_on_link_find_callback(
&mut self,
on_link_find_callback: Option<OnLinkFindCallback>,
) -> &mut Self
pub fn with_on_link_find_callback( &mut self, on_link_find_callback: Option<OnLinkFindCallback>, ) -> &mut Self
Perform a callback to run on each link find.
Sourcepub fn set_on_link_find<F>(&mut self, f: F)where
F: Fn(CaseInsensitiveString, Option<String>) -> (CaseInsensitiveString, Option<String>) + Send + Sync + 'static,
pub fn set_on_link_find<F>(&mut self, f: F)where
F: Fn(CaseInsensitiveString, Option<String>) -> (CaseInsensitiveString, Option<String>) + Send + Sync + 'static,
Perform a callback to run on each link find shorthand.
Sourcepub fn with_on_should_crawl_callback(
&mut self,
on_should_crawl_callback: Option<fn(&Page) -> bool>,
) -> &mut Self
pub fn with_on_should_crawl_callback( &mut self, on_should_crawl_callback: Option<fn(&Page) -> bool>, ) -> &mut Self
Use a callback to determine if a page should be ignored. Return false to ensure that the discovered links are not crawled.
Sourcepub fn with_on_should_crawl_callback_closure<F: OnShouldCrawlClosure>(
&mut self,
on_should_crawl_closure: Option<F>,
) -> &mut Self
pub fn with_on_should_crawl_callback_closure<F: OnShouldCrawlClosure>( &mut self, on_should_crawl_closure: Option<F>, ) -> &mut Self
Use an immutable closure to determine if a page should be ignored. Return false to ensure that the discovered links are not crawled.
Slightly slower than Self::with_on_should_crawl_callback.
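The predicate logic itself is ordinary Rust; a sketch (using plain values rather than the real &Page, whose accessors are not shown here) that skips error pages and near-empty bodies:

```rust
// Hypothetical filter: only keep pages with a 2xx status and a non-trivial body.
fn should_crawl(status: u16, body_len: usize) -> bool {
    (200..300).contains(&status) && body_len >= 256
}

// Roughly: website.with_on_should_crawl_callback(Some(|page| {
//     /* apply the same checks to `page`; return false to skip its links */
//     true
// }));
```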
Cookie string to use in requests. This does nothing without the cookies flag enabled.
Sourcepub fn with_cron(&mut self, cron_str: &str, cron_type: CronType) -> &mut Self
pub fn with_cron(&mut self, cron_str: &str, cron_type: CronType) -> &mut Self
Setup cron jobs to run. This does nothing without the cron flag enabled.
Sourcepub fn with_locale(&mut self, locale: Option<String>) -> &mut Self
pub fn with_locale(&mut self, locale: Option<String>) -> &mut Self
Overrides default host system locale with the specified one. This does nothing without the chrome flag enabled.
Sourcepub fn with_stealth(&mut self, stealth_mode: bool) -> &mut Self
pub fn with_stealth(&mut self, stealth_mode: bool) -> &mut Self
Use stealth mode for the request. This does nothing without the chrome flag enabled.
Sourcepub fn with_stealth_advanced(&mut self, stealth_mode: Tier) -> &mut Self
Available on crate feature chrome only.
pub fn with_stealth_advanced(&mut self, stealth_mode: Tier) -> &mut Self
chrome only. Use an advanced stealth mode tier for the request. This does nothing without the chrome flag enabled.
Sourcepub fn with_cache_policy(
&mut self,
cache_policy: Option<BasicCachePolicy>,
) -> &mut Self
pub fn with_cache_policy( &mut self, cache_policy: Option<BasicCachePolicy>, ) -> &mut Self
Set the cache policy.
Sourcepub fn with_openai(&mut self, openai_configs: Option<GPTConfigs>) -> &mut Self
pub fn with_openai(&mut self, openai_configs: Option<GPTConfigs>) -> &mut Self
Use OpenAI to get dynamic JavaScript to drive the browser. This does nothing without the openai flag enabled.
Sourcepub fn with_remote_multimodal(
&mut self,
cfg: Option<RemoteMultimodalConfigs>,
) -> &mut Self
Available on crate feature chrome only.
pub fn with_remote_multimodal( &mut self, cfg: Option<RemoteMultimodalConfigs>, ) -> &mut Self
chrome only. Use a remote multimodal model (vision + HTML + URL) to drive browser automation.
When enabled, Spider can ask an OpenAI-compatible “chat completions” endpoint to
generate a JSON plan (a list of WebAutomation steps), execute those steps against a
live Chrome page, then re-capture state and iterate until the model reports it is done
(or the configured limits are hit). The default system prompt is tuned to handle common web challenges and can be adjusted if required.
See spider::features::automation::DEFAULT_SYSTEM_PROMPT for a baseline.
This is useful for:
- handling captchas,
- dismissing popups / cookie banners,
- navigating to a target page (pricing, docs, etc.),
- clicking through multi-step UI flows,
- recovering from dynamic page state that plain HTML scraping can’t handle.
§Feature gate
This method only has an effect when the crate is built with feature="chrome".
Without chrome, the method is not available.
§Parameters
cfg: The remote multimodal configuration bundle (endpoint, model, prompts, and runtime knobs). Pass None to disable remote multimodal automation.
§Example
use spider::website::Website;
use spider::configuration::Configuration;
use spider::features::automation::{RemoteMultimodalConfigs, RemoteMultimodalConfig};
// Build the engine configs (similar to GPTConfigs::new(...))
let mm_cfgs = RemoteMultimodalConfigs::new(
"http://localhost:11434/v1/chat/completions",
"qwen2.5-vl", // any OpenAI-compatible model id your endpoint understands
)
// .with_api_key("your-api-key-if-needed")
.with_system_prompt_extra("Never log in. Prefer closing popups and continuing.")
.with_user_message_extra("Goal: reach the pricing page, then stop.")
.with_cfg(RemoteMultimodalConfig {
// keep HTML smaller if you want lower token usage
include_html: true,
html_max_bytes: 24_000,
include_url: true,
include_title: true,
// loop controls
max_rounds: 6,
post_plan_wait_ms: 400,
..Default::default()
})
.with_concurrency_limit(8);
// Attach to the crawler configuration
let mut cfg = Configuration::new();
cfg.with_remote_multimodal(Some(mm_cfgs));
// Use the configuration in a Website (example)
let mut site = Website::new("https://example.com");
site.with_config(cfg);
// Start crawling/scraping as you normally would...
// site.crawl().await?;
Ok(())
§Notes
- Remote multimodal automation typically requires feature="serde" if you deserialize model steps into WebAutomation.
- If your endpoint does not support response_format: {"type":"json_object"}, disable that in RemoteMultimodalConfig (request_json_object = false).
Sourcepub fn with_gemini(
&mut self,
gemini_configs: Option<GeminiConfigs>,
) -> &mut Self
pub fn with_gemini( &mut self, gemini_configs: Option<GeminiConfigs>, ) -> &mut Self
Use Gemini to get dynamic JavaScript to drive the browser. This does nothing without the gemini flag enabled.
Sourcepub fn with_caching(&mut self, cache: bool) -> &mut Self
pub fn with_caching(&mut self, cache: bool) -> &mut Self
Cache the page following HTTP rules. This method does nothing if the cache feature is not enabled.
Sourcepub fn with_cache_skip_browser(&mut self, skip: bool) -> &mut Self
pub fn with_cache_skip_browser(&mut self, skip: bool) -> &mut Self
Skip browser rendering entirely if cached content exists.
Sourcepub fn with_cache_namespace<S: Into<String>>(
&mut self,
namespace: Option<S>,
) -> &mut Self
pub fn with_cache_namespace<S: Into<String>>( &mut self, namespace: Option<S>, ) -> &mut Self
Partition the cache by an opaque namespace (e.g. country, proxy pool,
tenant, A/B bucket, device profile, …). Cached bytes are never shared
across namespaces. None uses the default (empty) namespace. This
method does nothing without any of the cache_request, chrome, or
chrome_remote_cache features.
Sourcepub fn with_service_worker_enabled(&mut self, enabled: bool) -> &mut Self
pub fn with_service_worker_enabled(&mut self, enabled: bool) -> &mut Self
Enable or disable Service Workers. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_auto_geolocation(&mut self, enabled: bool) -> &mut Self
pub fn with_auto_geolocation(&mut self, enabled: bool) -> &mut Self
Automatically setup geo-location configurations when using a proxy. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_fingerprint_advanced(
&mut self,
fingerprint: Fingerprint,
) -> &mut Self
Available on crate feature chrome only.
pub fn with_fingerprint_advanced( &mut self, fingerprint: Fingerprint, ) -> &mut Self
chrome only. Set a custom fingerprint ID for requests. This does nothing without the chrome flag enabled.
Sourcepub fn with_fingerprint(&mut self, fingerprint: bool) -> &mut Self
pub fn with_fingerprint(&mut self, fingerprint: bool) -> &mut Self
Setup custom fingerprinting for chrome. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_viewport(&mut self, viewport: Option<Viewport>) -> &mut Self
pub fn with_viewport(&mut self, viewport: Option<Viewport>) -> &mut Self
Configures the viewport of the browser, which defaults to 800x600. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_wait_for_idle_network(
&mut self,
wait_for_idle_network: Option<WaitForIdleNetwork>,
) -> &mut Self
pub fn with_wait_for_idle_network( &mut self, wait_for_idle_network: Option<WaitForIdleNetwork>, ) -> &mut Self
Wait for network requests to be idle within a time frame (500ms with no network connections). This does nothing without the chrome flag enabled.
Sourcepub fn with_wait_for_idle_network0(
&mut self,
wait_for_idle_network: Option<WaitForIdleNetwork>,
) -> &mut Self
pub fn with_wait_for_idle_network0( &mut self, wait_for_idle_network: Option<WaitForIdleNetwork>, ) -> &mut Self
Wait for network requests to be idle, with a max timeout. This does nothing without the chrome flag enabled.
Sourcepub fn with_wait_for_almost_idle_network0(
&mut self,
wait_for_idle_network: Option<WaitForIdleNetwork>,
) -> &mut Self
pub fn with_wait_for_almost_idle_network0( &mut self, wait_for_idle_network: Option<WaitForIdleNetwork>, ) -> &mut Self
Wait for the network to be almost idle, with a max timeout. This does nothing without the chrome flag enabled.
Sourcepub fn with_wait_for_selector(
&mut self,
wait_for_selector: Option<WaitForSelector>,
) -> &mut Self
pub fn with_wait_for_selector( &mut self, wait_for_selector: Option<WaitForSelector>, ) -> &mut Self
Wait for a CSS query selector. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_wait_for_idle_dom(
&mut self,
wait_for_selector: Option<WaitForSelector>,
) -> &mut Self
pub fn with_wait_for_idle_dom( &mut self, wait_for_selector: Option<WaitForSelector>, ) -> &mut Self
Wait for idle DOM mutations on the target element. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_wait_for_delay(
&mut self,
wait_for_delay: Option<WaitForDelay>,
) -> &mut Self
pub fn with_wait_for_delay( &mut self, wait_for_delay: Option<WaitForDelay>, ) -> &mut Self
Wait for a delay. Should only be used for testing. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_default_http_connect_timeout(
&mut self,
default_http_connect_timeout: Option<Duration>,
) -> &mut Self
pub fn with_default_http_connect_timeout( &mut self, default_http_connect_timeout: Option<Duration>, ) -> &mut Self
Set the default HTTP connect timeout.
Sourcepub fn with_default_http_read_timeout(
&mut self,
default_http_read_timeout: Option<Duration>,
) -> &mut Self
pub fn with_default_http_read_timeout( &mut self, default_http_read_timeout: Option<Duration>, ) -> &mut Self
Set the default HTTP read timeout.
Sourcepub fn with_redirect_limit(&mut self, redirect_limit: usize) -> &mut Self
pub fn with_redirect_limit(&mut self, redirect_limit: usize) -> &mut Self
Set the max redirects allowed for a request.
Sourcepub fn with_redirect_policy(&mut self, policy: RedirectPolicy) -> &mut Self
pub fn with_redirect_policy(&mut self, policy: RedirectPolicy) -> &mut Self
Set the redirect policy to use, either Strict or Loose.
Sourcepub fn with_chrome_intercept(
&mut self,
chrome_intercept: RequestInterceptConfiguration,
) -> &mut Self
pub fn with_chrome_intercept( &mut self, chrome_intercept: RequestInterceptConfiguration, ) -> &mut Self
Use request interception to only allow content that matches the host. Content from a third party must be part of the include list. This method does nothing if the chrome_intercept flag is not enabled.
Sourcepub fn with_referer(&mut self, referer: Option<String>) -> &mut Self
pub fn with_referer(&mut self, referer: Option<String>) -> &mut Self
Add a referer to the request.
Sourcepub fn with_referrer(&mut self, referer: Option<String>) -> &mut Self
pub fn with_referrer(&mut self, referer: Option<String>) -> &mut Self
Add a referer to the request.
Sourcepub fn with_full_resources(&mut self, full_resources: bool) -> &mut Self
pub fn with_full_resources(&mut self, full_resources: bool) -> &mut Self
Determine whether to collect all the resources found on pages.
Sourcepub fn with_dismiss_dialogs(&mut self, full_resources: bool) -> &mut Self
pub fn with_dismiss_dialogs(&mut self, full_resources: bool) -> &mut Self
Dismiss all dialogs on the page. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_emulation(&mut self, emulation: Option<Emulation>) -> &mut Self
Available on crate feature wreq only.
pub fn with_emulation(&mut self, emulation: Option<Emulation>) -> &mut Self
wreq only. Set the request emulation. This method does nothing if the wreq flag is not enabled.
Sourcepub fn with_ignore_sitemap(&mut self, ignore_sitemap: bool) -> &mut Self
pub fn with_ignore_sitemap(&mut self, ignore_sitemap: bool) -> &mut Self
Ignore the sitemap when crawling. This method does nothing if the sitemap flag is not enabled.
Sourcepub fn with_timezone_id(&mut self, timezone_id: Option<String>) -> &mut Self
pub fn with_timezone_id(&mut self, timezone_id: Option<String>) -> &mut Self
Overrides default host system timezone with the specified one. This does nothing without the chrome flag enabled.
Sourcepub fn with_evaluate_on_new_document(
&mut self,
evaluate_on_new_document: Option<Box<String>>,
) -> &mut Self
pub fn with_evaluate_on_new_document( &mut self, evaluate_on_new_document: Option<Box<String>>, ) -> &mut Self
Set a custom script to evaluate on new document creation. This does nothing without the chrome feature flag enabled.
Sourcepub fn with_limit(&mut self, limit: u32) -> &mut Self
pub fn with_limit(&mut self, limit: u32) -> &mut Self
Set a crawl page limit. If the value is 0 there is no limit.
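Since these setters return &mut Self, they chain; a sketch combining a page limit with a depth cap and a redirect cap (the values are illustrative):

```rust
let mut website = Website::new("https://example.com");
website
    .with_limit(250)         // stop after 250 pages
    .with_depth(4)           // never go more than 4 levels deep
    .with_redirect_limit(2); // allow at most 2 redirects per request
```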
Sourcepub fn with_quality_validator(
&mut self,
validator: Option<QualityValidator>,
) -> &mut Self
Available on crate feature parallel_backends only.
pub fn with_quality_validator( &mut self, validator: Option<QualityValidator>, ) -> &mut Self
parallel_backends only. Set a custom quality validator for parallel backend responses.
The validator is called for every backend response (including the
primary) after the built-in quality scorer. It can override, adjust,
or reject scores. This does nothing without the parallel_backends
feature.
§Example
use spider::utils::parallel_backends::{QualityValidator, ValidationResult};
website.with_quality_validator(Some(std::sync::Arc::new(|content, status, url, source| {
let mut result = ValidationResult::default();
// Reject any response under 1KB
if content.map_or(true, |b| b.len() < 1024) {
result.reject = true;
}
// Boost CDP backend by 10 points
if source == "cdp" {
result.score_adjust = 10;
}
result
})));
Sourcepub fn with_screenshot(
&mut self,
screenshot_config: Option<ScreenShotConfig>,
) -> &mut Self
pub fn with_screenshot( &mut self, screenshot_config: Option<ScreenShotConfig>, ) -> &mut Self
Set the chrome screenshot configuration. This does nothing without the chrome flag enabled.
Use a shared semaphore to evenly handle workloads. The default is false.
Sourcepub fn with_auth_challenge_response(
&mut self,
auth_challenge_response: Option<AuthChallengeResponse>,
) -> &mut Self
pub fn with_auth_challenge_response( &mut self, auth_challenge_response: Option<AuthChallengeResponse>, ) -> &mut Self
Set the authentication challenge response. This does nothing without the chrome feature flag enabled.
Sourcepub fn with_return_page_links(&mut self, return_page_links: bool) -> &mut Self
pub fn with_return_page_links(&mut self, return_page_links: bool) -> &mut Self
Return the links found on the page in the channel subscriptions. This method does nothing if the decentralized feature is enabled.
Sourcepub fn with_chrome_connection(
&mut self,
chrome_connection_url: Option<String>,
) -> &mut Self
pub fn with_chrome_connection( &mut self, chrome_connection_url: Option<String>, ) -> &mut Self
Set the connection URL for the chrome instance. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_chrome_connections(&mut self, urls: Vec<String>) -> &mut Self
pub fn with_chrome_connections(&mut self, urls: Vec<String>) -> &mut Self
Set multiple remote Chrome connection URLs for failover. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_execution_scripts(
&mut self,
execution_scripts: Option<ExecutionScriptsMap>,
) -> &mut Self
pub fn with_execution_scripts( &mut self, execution_scripts: Option<ExecutionScriptsMap>, ) -> &mut Self
Set JS to run on certain pages. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_automation_scripts(
&mut self,
automation_scripts: Option<AutomationScriptsMap>,
) -> &mut Self
pub fn with_automation_scripts( &mut self, automation_scripts: Option<AutomationScriptsMap>, ) -> &mut Self
Run web automated actions on certain pages. This method does nothing if the chrome feature is not enabled.
Sourcepub fn with_network_interface(
&mut self,
network_interface: Option<String>,
) -> &mut Self
pub fn with_network_interface( &mut self, network_interface: Option<String>, ) -> &mut Self
Bind connections only to the given network interface.
Sourcepub fn with_local_address(&mut self, local_address: Option<IpAddr>) -> &mut Self
pub fn with_local_address(&mut self, local_address: Option<IpAddr>) -> &mut Self
Bind to a local IP address.
Sourcepub fn with_block_assets(&mut self, only_html: bool) -> &mut Self
pub fn with_block_assets(&mut self, only_html: bool) -> &mut Self
Block assets from loading from the network, focusing primarily on HTML documents.
Sourcepub fn with_normalize(&mut self, normalize: bool) -> &mut Self
pub fn with_normalize(&mut self, normalize: bool) -> &mut Self
Normalize the content, de-duplicating trailing-slash pages and other pages that can be duplicated. A duplicate link may initially appear in your links_visited or subscription calls, but subsequent duplicates will not be crawled.
Store all the links found to disk to share the state. This does nothing without the disk flag enabled.
Sourcepub fn with_max_page_bytes(&mut self, max_page_bytes: Option<f64>) -> &mut Self
pub fn with_max_page_bytes(&mut self, max_page_bytes: Option<f64>) -> &mut Self
Set the max amount of bytes to collect per page. Currently only used with chrome.
Sourcepub fn with_max_bytes_allowed(
&mut self,
max_bytes_allowed: Option<u64>,
) -> &mut Self
pub fn with_max_bytes_allowed( &mut self, max_bytes_allowed: Option<u64>, ) -> &mut Self
Set the max amount of bytes to collect for the browser context. Currently only used with chrome.
Sourcepub fn with_config(&mut self, config: Configuration) -> &mut Self
pub fn with_config(&mut self, config: Configuration) -> &mut Self
Set the configuration for the website directly.
Sourcepub fn with_spider_cloud(&mut self, api_key: &str) -> &mut Self
Available on crate feature spider_cloud only.
pub fn with_spider_cloud(&mut self, api_key: &str) -> &mut Self
spider_cloud only. Set a spider.cloud API key (Proxy mode).
Sourcepub fn with_spider_cloud_config(
&mut self,
config: SpiderCloudConfig,
) -> &mut Self
Available on crate feature spider_cloud only.
pub fn with_spider_cloud_config( &mut self, config: SpiderCloudConfig, ) -> &mut Self
spider_cloud only. Set a spider.cloud config.
Sourcepub fn with_spider_browser(&mut self, api_key: &str) -> &mut Self
Available on crate features spider_cloud and chrome only.
pub fn with_spider_browser(&mut self, api_key: &str) -> &mut Self
spider_cloud and chrome only. Connect to Spider Browser Cloud via CDP over WebSocket.
Sourcepub fn with_spider_browser_config(
&mut self,
config: SpiderBrowserConfig,
) -> &mut Self
Available on crate features spider_cloud and chrome only.
pub fn with_spider_browser_config( &mut self, config: SpiderBrowserConfig, ) -> &mut Self
spider_cloud and chrome only. Connect to Spider Browser Cloud with full configuration.
Sourcepub fn with_hedge(&mut self, config: HedgeConfig) -> &mut Self
Available on crate feature hedge only.
pub fn with_hedge(&mut self, config: HedgeConfig) -> &mut Self
hedge only. Set the hedged request (work-stealing) configuration.
Sourcepub fn build(&self) -> Result<Self, Self>
pub fn build(&self) -> Result<Self, Self>
Build the website configuration when using with_builder.
Sourcepub fn clear_headers(&mut self)
pub fn clear_headers(&mut self)
Clear the HTTP headers for the requests.
Sourcepub fn determine_limits(&mut self)
pub fn determine_limits(&mut self)
Determine if the budget has a wildcard path and the depth limit distance. This does nothing without the budget flag enabled.
Sourcepub fn subscribe(&mut self, capacity: usize) -> Receiver<Page>
pub fn subscribe(&mut self, capacity: usize) -> Receiver<Page>
Sets up a subscription to receive concurrent data. This will panic if it is larger than usize::MAX / 2.
Set the value to 0 to use the semaphore permits. If the subscription is going to block or use async methods,
make sure to spawn a task to avoid losing messages.
§Examples
Subscribe and receive messages using an async tokio environment:
use spider::{tokio, website::Website};
#[tokio::main]
async fn main() {
let mut website = Website::new("http://example.com");
let mut rx = website.subscribe(0);
tokio::spawn(async move {
while let Ok(page) = rx.recv().await {
tokio::spawn(async move {
// Process the received page.
// If performing non-blocking tasks or managing a high subscription count, configure accordingly.
});
}
});
website.crawl().await;
}
Sourcepub fn queue(&mut self, capacity: usize) -> Option<Sender<String>>
Available on crate feature sync only.
pub fn queue(&mut self, capacity: usize) -> Option<Sender<String>>
sync only. Get a sender for queueing extra links mid-crawl. This does nothing unless the sync flag is enabled.
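A sketch of feeding extra links discovered out-of-band into a running crawl (assumes the sync feature; the URL is illustrative):

```rust
let mut website = Website::new("https://example.com");

// Grab a sender before the crawl starts; `queue` returns None if unavailable.
if let Some(tx) = website.queue(100) {
    // Seed an extra page discovered out-of-band (e.g. from an API response).
    let _ = tx.send("https://example.com/hidden-page".into());
}
// website.crawl().await;
```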
Sourcepub fn unsubscribe(&mut self)
Available on crate feature sync only.
pub fn unsubscribe(&mut self)
sync only. Remove subscriptions for data. This is useful for auto-dropping subscriptions that are running on another thread. This does nothing without the sync flag enabled.
Sourcepub fn get_channel(&self) -> &Option<(Sender<Page>, Arc<Receiver<Page>>)>
pub fn get_channel(&self) -> &Option<(Sender<Page>, Arc<Receiver<Page>>)>
Get the channel sender to send manual subscriptions.
Sourcepub fn get_channel_guard(&self) -> &Option<ChannelGuard>
pub fn get_channel_guard(&self) -> &Option<ChannelGuard>
Get the channel guard used to keep manual subscriptions from closing.
Sourcepub fn subscribe_guard(&mut self) -> Option<ChannelGuard>
Available on crate feature sync only.
pub fn subscribe_guard(&mut self) -> Option<ChannelGuard>
sync only. Setup a subscription counter to track concurrent operation completions.
This helps keep a chrome instance active until all operations are completed from all threads to safely take screenshots and other actions.
Make sure to call inc if you take a guard. Without calling inc in the subscription receiver, the crawl will stay in an infinite loop.
This does nothing without the sync flag enabled. You also need to use chrome_store_page to keep the page alive between requests.
§Example
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
let mut website: Website = Website::new("http://example.com");
let mut rx2 = website.subscribe(18);
let mut rxg = website.subscribe_guard().unwrap();
tokio::spawn(async move {
while let Ok(page) = rx2.recv().await {
println!("📸 - {:?}", page.get_url());
page
.screenshot(
true,
true,
spider::configuration::CaptureScreenshotFormat::Png,
Some(75),
None::<std::path::PathBuf>,
None,
)
.await;
rxg.inc();
}
});
website.crawl().await;
}
Sourcepub async fn run_cron(&self) -> Runner
Available on crate feature cron only.
pub async fn run_cron(&self) -> Runner
cron only. Start a cron job. If you use subscribe on another thread you need to abort the handle in conjunction with runner.stop.
Sourcepub fn get_crawl_id(&self) -> Option<&Box<String>>
Available on crate feature control only.
pub fn get_crawl_id(&self) -> Option<&Box<String>>
control only. Get the attached crawl id.
Sourcepub fn set_extra_info(&mut self, info: Option<String>)
Available on crate feature extra_information only.
pub fn set_extra_info(&mut self, info: Option<String>)
extra_information only. Set extra useful information.
Sourcepub fn get_extra_info(&self) -> Option<&Box<String>>
Available on crate feature extra_information only.
pub fn get_extra_info(&self) -> Option<&Box<String>>
extra_information only. Get extra information stored.
Sourcepub fn set_seeded_html(&mut self, html: Option<String>)
pub fn set_seeded_html(&mut self, html: Option<String>)
Set the initial HTML page instead of firing a request to the URL.
Sourcepub fn get_seeded_html(&self) -> &Option<String>
pub fn get_seeded_html(&self) -> &Option<String>
Get the initial seeded html.
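A sketch of seeding the first page from HTML you already have, so no request is fired for the start URL (the markup is illustrative):

```rust
let mut website = Website::new("https://example.com");

// Use this pre-fetched HTML for the start URL instead of requesting it.
website.set_seeded_html(Some(
    "<html><body><a href=\"/about\">About</a></body></html>".to_string(),
));
// website.crawl().await; // links like /about are discovered from the seeded HTML
```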
Sourcepub fn apply_prompt_configuration(
&mut self,
config: &PromptConfiguration,
) -> &mut Self
Available on crate feature serde only.
pub fn apply_prompt_configuration( &mut self, config: &PromptConfiguration, ) -> &mut Self
serde only. Apply configuration from a PromptConfiguration generated by an LLM.
This method takes a configuration object produced by
RemoteMultimodalEngine::configure_from_prompt() and applies the
settings to this website.
§Example
use spider::features::automation::{RemoteMultimodalEngine, configure_crawler_from_prompt};
let config = configure_crawler_from_prompt(
"http://localhost:11434/v1/chat/completions",
"llama3",
None,
"Crawl blog posts only, respect robots.txt, max 100 pages, 200ms delay"
).await?;
let mut website = Website::new("https://example.com");
website.apply_prompt_configuration(&config);
Sourcepub async fn configure_from_prompt(
&mut self,
api_url: &str,
model_name: &str,
api_key: Option<&str>,
prompt: &str,
) -> Result<&mut Self, EngineError>
Available on crate features agent and serde only.
pub async fn configure_from_prompt( &mut self, api_url: &str, model_name: &str, api_key: Option<&str>, prompt: &str, ) -> Result<&mut Self, EngineError>
agent and serde only. Configure the website from a natural language prompt using an LLM.
This is a convenience method that calls the LLM to generate configuration and applies it to the website in one step.
§Arguments
- api_url - OpenAI-compatible chat completions endpoint
- model_name - Model identifier (e.g., “gpt-4”, “llama3”, “qwen2.5”)
- api_key - Optional API key for authenticated endpoints
- prompt - Natural language description of crawling requirements
§Example
let mut website = Website::new("https://example.com");
website.configure_from_prompt(
"http://localhost:11434/v1/chat/completions",
"llama3",
None,
"Only crawl product pages, use 100ms delay, max depth 5, respect robots.txt"
).await?;
website.crawl().await;
Requires the agent and serde features.
Trait Implementations§
Source§impl Crawler for Website
impl Crawler for Website
Source§fn status(&self) -> &CrawlStatus
fn status(&self) -> &CrawlStatus
Source§fn links(&self) -> HashSet<CaseInsensitiveString>
fn links(&self) -> HashSet<CaseInsensitiveString>
Source§impl CrawlerSubscription for Website
Available on crate feature sync only.
impl CrawlerSubscription for Website
sync only.Source§impl Error for Website
impl Error for Website
1.30.0 · Source§fn source(&self) -> Option<&(dyn Error + 'static)>
fn source(&self) -> Option<&(dyn Error + 'static)>
1.0.0 · Source§fn description(&self) -> &str
fn description(&self) -> &str
use the Display impl or to_string()
Source§impl Job for Website
impl Job for Website
Source§fn handle<'life0, 'async_trait>(
&'life0 mut self,
) -> Pin<Box<dyn Future<Output = ()> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
fn handle<'life0, 'async_trait>(
&'life0 mut self,
) -> Pin<Box<dyn Future<Output = ()> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
Source§fn is_active(&self) -> bool
fn is_active(&self) -> bool
Source§fn allow_parallel_runs(&self) -> bool
fn allow_parallel_runs(&self) -> bool
Source§fn should_run(&self) -> bool
fn should_run(&self) -> bool
Auto Trait Implementations§
impl Freeze for Website
impl !RefUnwindSafe for Website
impl Send for Website
impl Sync for Website
impl Unpin for Website
impl UnsafeUnpin for Website
impl !UnwindSafe for Website
Blanket Implementations§
Source§impl<T> AsErrorSource for Twhere
T: Error + 'static,
impl<T> AsErrorSource for Twhere
T: Error + 'static,
Source§fn as_error_source(&self) -> &(dyn Error + 'static)
fn as_error_source(&self) -> &(dyn Error + 'static)
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> CloneToUninit for Twhere
T: Clone,
Source§impl<T> Instrument for T
impl<T> Instrument for T
Source§fn instrument(self, span: Span) -> Instrumented<Self>
fn instrument(self, span: Span) -> Instrumented<Self>
Source§fn in_current_span(self) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true.
Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§impl<T> Pointable for T
impl<T> Pointable for T
Source§impl<T> PolicyExt for Twhere
T: ?Sized,
impl<T> PolicyExt for Twhere
T: ?Sized,
Source§impl<T> ToCompactString for Twhere
T: Display,
impl<T> ToCompactString for Twhere
T: Display,
Source§fn try_to_compact_string(&self) -> Result<CompactString, ToCompactStringError>
fn try_to_compact_string(&self) -> Result<CompactString, ToCompactStringError>
ToCompactString::to_compact_string() Read more
Source§fn to_compact_string(&self) -> CompactString
fn to_compact_string(&self) -> CompactString
CompactString. Read more
Source§impl<T> ToStringFallible for Twhere
T: Display,
impl<T> ToStringFallible for Twhere
T: Display,
Source§fn try_to_string(&self) -> Result<String, TryReserveError>
fn try_to_string(&self) -> Result<String, TryReserveError>
ToString::to_string, but without panic on OOM.