pub struct CrawlerBuilder<S, D>
where
    S: Spider,
    D: Downloader,
{ /* private fields */ }
A fluent builder for constructing Crawler instances.
§Type Parameters
- S: The Spider implementation type
- D: The Downloader implementation type
§Example
let builder = CrawlerBuilder::new(MySpider)
.max_concurrent_downloads(8)
.max_pending_requests(16)
    .max_parser_workers(4);

§Implementations

impl<S> CrawlerBuilder<S, ReqwestClientDownloader>
where
    S: Spider,

pub fn new(spider: S) -> CrawlerBuilder<S, ReqwestClientDownloader>
Creates a new CrawlerBuilder for a given spider with the default ReqwestClientDownloader.
§Example
let crawler = CrawlerBuilder::new(MySpider)
.build()
    .await?;

impl<S, D> CrawlerBuilder<S, D>
where
    S: Spider,
    D: Downloader,
pub fn max_concurrent_downloads(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of concurrent downloads.
This controls how many HTTP requests can be in-flight simultaneously. Higher values increase throughput but may overwhelm target servers.
§Default
Defaults to twice the number of CPU cores, clamped between 4 and 64.
pub fn max_pending_requests(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of outstanding requests tracked by the scheduler.
This includes queued requests plus requests already handed off for download. Lower values keep the frontier tighter and reduce internal request buildup.
pub fn max_parser_workers(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the maximum number of concurrent parser worker tasks.

pub fn max_concurrent_pipelines(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of concurrent item processing pipelines.
This controls how many items can be processed by pipelines simultaneously.
§Default
Defaults to the number of CPU cores, with a maximum of 8.
pub fn channel_capacity(self, capacity: usize) -> CrawlerBuilder<S, D>
Sets the capacity of internal communication channels.
This controls the buffer size for channels between the downloader, parser, and pipeline components. Higher values can improve throughput at the cost of increased memory usage.
§Default
Defaults to 1000.
pub fn output_batch_size(self, batch_size: usize) -> CrawlerBuilder<S, D>
Sets the parser output batch size.
Larger batches can reduce coordination overhead when pages emit many items or follow-up requests, while smaller batches tend to improve latency and memory locality.
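For instance, a throughput-leaning configuration might combine this with a larger channel capacity. The values below are purely illustrative, not recommended defaults:

```rust
let builder = CrawlerBuilder::new(MySpider)
    .channel_capacity(2000)  // larger internal buffers between components
    .output_batch_size(64);  // amortize coordination cost per parsed page
```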
pub fn response_backpressure_threshold(self, threshold: usize) -> CrawlerBuilder<S, D>
Sets the downloader response-channel backpressure threshold.
When the downloader-to-parser channel reaches this threshold, the runtime starts applying backpressure so downloaded responses do not pile up unboundedly in memory.
pub fn item_backpressure_threshold(self, threshold: usize) -> CrawlerBuilder<S, D>
Sets the parser item-channel backpressure threshold.
This primarily matters when parsing is faster than downstream pipeline processing. Lower thresholds keep memory tighter; higher thresholds let parsers run further ahead.
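As a sketch, a memory-conscious configuration for a workload with slow pipelines might tighten both thresholds (values are illustrative assumptions):

```rust
let builder = CrawlerBuilder::new(MySpider)
    .response_backpressure_threshold(500) // throttle downloads when the parser lags
    .item_backpressure_threshold(200);    // throttle parsers when pipelines lag
```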
pub fn retry_release_permit(self, enabled: bool) -> CrawlerBuilder<S, D>
Controls whether retries release downloader permits before waiting.
Enabling this is usually better for throughput because a sleeping retry does not occupy scarce downloader concurrency. Disabling it can be useful when you want retries to count fully against download capacity.
pub fn live_stats(self, enabled: bool) -> CrawlerBuilder<S, D>
Enables or disables live, in-place statistics updates on terminal stdout.
When enabled, spider-* logs are forced to LevelFilter::Off during build
to avoid interleaving with the live terminal renderer.
pub fn live_stats_interval(self, interval: Duration) -> CrawlerBuilder<S, D>
Sets the refresh interval for live statistics updates.
Shorter intervals make the terminal view feel more responsive, while longer intervals reduce redraw overhead.
pub fn live_stats_preview_fields(self, fields: impl IntoIterator<Item = impl Into<String>>) -> CrawlerBuilder<S, D>
Sets which scraped item fields should be shown in live stats preview.
Field names support dot notation for nested JSON objects such as
title, source_url, or metadata.Japanese.
You can also set aliases with label=path, for example
url=source_url or jp=metadata.Japanese.
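Putting the live-stats options together, a configuration might look like the following sketch (the field paths reuse the examples above; the 250 ms interval is an arbitrary choice):

```rust
use std::time::Duration;

let builder = CrawlerBuilder::new(MySpider)
    .live_stats(true)
    .live_stats_interval(Duration::from_millis(250))
    // Plain paths and label=path aliases can be mixed freely.
    .live_stats_preview_fields(["title", "url=source_url", "jp=metadata.Japanese"]);
```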
pub fn shutdown_grace_period(self, grace_period: Duration) -> CrawlerBuilder<S, D>
Sets the maximum grace period for crawler shutdown before forcing task abort.
This gives pipelines, checkpoint writes, and other in-flight work time to finish cleanly after shutdown begins.
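For example, to allow up to ten seconds for in-flight work to drain (an illustrative value, not a recommendation):

```rust
use std::time::Duration;

let crawler = CrawlerBuilder::new(MySpider)
    .shutdown_grace_period(Duration::from_secs(10)) // let pipelines flush before abort
    .build()
    .await?;
```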
pub fn limit(self, limit: usize) -> CrawlerBuilder<S, D>
Stops the crawl after limit scraped items have been admitted for processing.
This is especially useful for smoke runs, local previews, and documentation examples where you want predictable bounded work.
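For example, a bounded smoke run that stops after 100 items:

```rust
let crawler = CrawlerBuilder::new(MySpider)
    .limit(100) // stop once 100 scraped items have been admitted
    .build()
    .await?;
```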
pub fn downloader(self, downloader: D) -> CrawlerBuilder<S, D>
Sets a custom downloader implementation.
Use this method to provide a custom Downloader implementation
instead of the default ReqwestClientDownloader.
Reach for this when transport behavior itself needs to change, such as request signing, alternate HTTP stacks, downloader-level tracing, or protocol-specific request execution.
pub fn add_middleware<M>(self, middleware: M) -> CrawlerBuilder<S, D>
Adds a middleware to the crawler’s middleware stack.
Middlewares intercept and modify requests before they are sent and responses after they are received. They are executed in the order they are added.
Middleware is the right layer for cross-cutting HTTP behavior such as
retry policy, rate limiting, cookies, user-agent management, cache
lookup, or robots.txt enforcement.
§Example
let crawler = CrawlerBuilder::new(MySpider)
.add_middleware(RateLimitMiddleware::default())
.add_middleware(RetryMiddleware::new())
.build()
    .await?;

pub fn add_pipeline<P>(self, pipeline: P) -> CrawlerBuilder<S, D>
Adds a pipeline to the crawler’s pipeline stack.
Pipelines process scraped items after they are extracted by the spider. They can be used for validation, transformation, deduplication, or storage (e.g., writing to files or databases).
Pipelines are ordered. A common pattern is transform first, validate second, deduplicate next, and write to outputs last.
§Example
let crawler = CrawlerBuilder::new(MySpider)
.add_pipeline(ConsolePipeline::new())
.add_pipeline(JsonPipeline::new("output.json")?)
.build()
    .await?;

pub fn log_level(self, level: LevelFilter) -> CrawlerBuilder<S, D>
Sets the log level for spider-* library crates.
This configures the logging level specifically for the spider-lib ecosystem (spider-core, spider-middleware, spider-pipeline, spider-util, spider-downloader). Logs from other dependencies (e.g., reqwest, tokio) will not be affected.
§Log Levels
- LevelFilter::Error - Only error messages
- LevelFilter::Warn - Warnings and errors
- LevelFilter::Info - Informational messages, warnings, and errors
- LevelFilter::Debug - Debug messages and above
- LevelFilter::Trace - All messages, including trace
§Example
use log::LevelFilter;
let crawler = CrawlerBuilder::new(MySpider)
.log_level(LevelFilter::Debug)
.build()
    .await?;

pub fn with_checkpoint_path<P>(self, path: P) -> CrawlerBuilder<S, D>
Sets the path for saving and loading checkpoints.
When enabled, the crawler periodically saves its state to this file, allowing crawls to be resumed after interruption.
Requires the checkpoint feature to be enabled.
If a checkpoint file already exists at build time, the builder will attempt to restore scheduler and pipeline state from it.
pub fn with_checkpoint_interval(self, interval: Duration) -> CrawlerBuilder<S, D>
Sets the interval between automatic checkpoint saves.
When enabled, the crawler saves its state at this interval. Shorter intervals provide more frequent recovery points but may impact performance.
Requires the checkpoint feature to be enabled.
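A resumable crawl might combine both checkpoint options. This sketch assumes the checkpoint feature is enabled and that a string path is accepted for `P`; the file name and 60-second interval are illustrative:

```rust
use std::time::Duration;

let crawler = CrawlerBuilder::new(MySpider)
    .with_checkpoint_path("crawl.checkpoint")          // restored at build time if present
    .with_checkpoint_interval(Duration::from_secs(60)) // save state every minute
    .build()
    .await?;
```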
pub async fn build(self) -> Result<Crawler<S, <D as Downloader>::Client>, SpiderError>
Builds the Crawler instance.
This method finalizes the crawler configuration and initializes all components. It performs validation and sets up default values where necessary.
Build time is where the runtime:
- validates concurrency and channel settings
- initializes logging and live-stats behavior
- restores checkpoint state if configured
- constructs the scheduler and runtime handles
§Errors
Returns a SpiderError::ConfigurationError if:
- max_concurrent_downloads is 0
- parser_workers is 0
- No spider was provided to the builder
§Example
let crawler = CrawlerBuilder::new(MySpider)
.max_concurrent_downloads(10)
.build()
.await?;