pub struct CrawlerBuilder<S, D>
where
    S: Spider,
    D: Downloader,
{ /* private fields */ }
A fluent builder for constructing Crawler instances.
CrawlerBuilder provides a chainable API for configuring all aspects
of a web crawler, including concurrency settings, middleware, pipelines,
and checkpoint options.
§Type Parameters
- S: The Spider implementation type
- D: The Downloader implementation type
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(8)
    .max_parser_workers(4);

§Implementations
impl<S> CrawlerBuilder<S, ReqwestClientDownloader>
where
    S: Spider,

pub fn new(spider: S) -> CrawlerBuilder<S, ReqwestClientDownloader>
Creates a new CrawlerBuilder for a given spider with the default ReqwestClientDownloader.
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .build()
    .await?;

impl<S, D> CrawlerBuilder<S, D>
where
    S: Spider,
    D: Downloader,
pub fn max_concurrent_downloads(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of concurrent downloads.
This controls how many HTTP requests can be in-flight simultaneously. Higher values increase throughput but may overwhelm target servers.
§Default
Defaults to the number of CPU cores, with a minimum of 16.
pub fn max_parser_workers(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of concurrent parser workers.
This controls how many response-parsing tasks can run simultaneously.
pub fn max_concurrent_pipelines(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of concurrent item processing pipelines.
This controls how many items can be processed by pipelines simultaneously.
§Default
Defaults to the number of CPU cores, with a maximum of 8.
pub fn channel_capacity(self, capacity: usize) -> CrawlerBuilder<S, D>
Sets the capacity of internal communication channels.
This controls the buffer size for channels between the downloader, parser, and pipeline components. Higher values can improve throughput at the cost of increased memory usage.
§Default
Defaults to 1000.
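§Example
The concurrency and buffering settings can be combined in a single builder chain. A sketch (the values are illustrative, not recommendations; MySpider stands in for any Spider implementation):

```rust
let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(32)  // more HTTP requests in flight
    .max_parser_workers(8)         // parsing tends to be CPU-bound
    .max_concurrent_pipelines(4)   // e.g. to bound concurrent database writes
    .channel_capacity(2000)        // larger internal buffers, more memory
    .build()
    .await?;
```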
pub fn downloader(self, downloader: D) -> CrawlerBuilder<S, D>
Sets a custom downloader implementation.
Use this method to provide a custom Downloader implementation
instead of the default ReqwestClientDownloader.
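§Example
A sketch, assuming a hypothetical MyDownloader type that implements the Downloader trait:

```rust
// `MyDownloader` is a placeholder; any type implementing `Downloader` works here.
let crawler = CrawlerBuilder::new(MySpider)
    .downloader(MyDownloader::new())
    .build()
    .await?;
```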
pub fn add_middleware<M>(self, middleware: M) -> CrawlerBuilder<S, D>
Adds a middleware to the crawler’s middleware stack.
Middlewares intercept and modify requests before they are sent and responses after they are received. They are executed in the order they are added.
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .build()
    .await?;

pub fn add_pipeline<P>(self, pipeline: P) -> CrawlerBuilder<S, D>
Adds a pipeline to the crawler’s pipeline stack.
Pipelines process scraped items after they are extracted by the spider. They can be used for validation, transformation, deduplication, or storage (e.g., writing to files or databases).
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(ConsolePipeline::new())
    .add_pipeline(JsonPipeline::new("output.json")?)
    .build()
    .await?;

pub fn with_checkpoint_path<P>(self, path: P) -> CrawlerBuilder<S, D>
Sets the path for saving and loading checkpoints.
When enabled, the crawler periodically saves its state to this file, allowing crawls to be resumed after interruption.
Requires the checkpoint feature to be enabled.
pub fn with_checkpoint_interval(self, interval: Duration) -> CrawlerBuilder<S, D>
Sets the interval between automatic checkpoint saves.
When enabled, the crawler saves its state at this interval. Shorter intervals provide more frequent recovery points but may impact performance.
Requires the checkpoint feature to be enabled.
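§Example
A sketch combining both checkpoint settings (the path and interval are illustrative; requires the checkpoint feature, with Duration coming from std::time):

```rust
use std::time::Duration;

let crawler = CrawlerBuilder::new(MySpider)
    .with_checkpoint_path("crawl.checkpoint")
    .with_checkpoint_interval(Duration::from_secs(60))  // save state every minute
    .build()
    .await?;
```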
pub async fn build(self) -> Result<Crawler<S, <D as Downloader>::Client>, SpiderError>
Builds the Crawler instance.
This method finalizes the crawler configuration and initializes all components. It performs validation and sets up default values where necessary.
§Errors
Returns a SpiderError::ConfigurationError if:
- max_concurrent_downloads is 0
- parser_workers is 0
- No spider was provided to the builder
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(10)
    .build()
    .await?;