pub struct CrawlerBuilder<S: Spider, D>
where
    D: Downloader,
{ /* private fields */ }
Implementations
impl<S: Spider> CrawlerBuilder<S, ReqwestClientDownloader>
pub fn new(spider: S) -> Self
Creates a new CrawlerBuilder for a given spider with the default ReqwestClientDownloader.
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .build()
    .await?;
impl<S: Spider, D: Downloader> CrawlerBuilder<S, D>
pub fn max_concurrent_downloads(self, limit: usize) -> Self
Sets the maximum number of concurrent downloads.
This controls how many HTTP requests can be in-flight simultaneously. Higher values increase throughput but may overwhelm target servers.
§Default
Defaults to twice the number of CPU cores, clamped between 4 and 64.
pub fn max_pending_requests(self, limit: usize) -> Self
Sets the maximum number of outstanding requests tracked by the scheduler.
This includes queued requests plus requests already handed off for download. Lower values keep the frontier tighter and reduce internal request buildup.
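As a sketch, the two limits above are typically tuned together so the request frontier stays bounded; the numeric values here are illustrative assumptions, not recommended defaults.

```rust
// Illustrative sketch: cap in-flight downloads and the scheduler's
// outstanding-request window together so the frontier stays bounded.
let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(16) // HTTP requests in flight at once
    .max_pending_requests(256)    // queued + handed-off requests tracked
    .build()
    .await?;
```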
pub fn max_parser_workers(self, limit: usize) -> Self
Sets the maximum number of concurrent parser workers.
pub fn max_concurrent_pipelines(self, limit: usize) -> Self
Sets the maximum number of concurrent item processing pipelines.
This controls how many items can be processed by pipelines simultaneously.
§Default
Defaults to the number of CPU cores, with a maximum of 8.
pub fn channel_capacity(self, capacity: usize) -> Self
Sets the capacity of internal communication channels.
This controls the buffer size for channels between the downloader, parser, and pipeline components. Higher values can improve throughput at the cost of increased memory usage.
§Default
Defaults to 1000.
pub fn output_batch_size(self, batch_size: usize) -> Self
Sets the parser output batch size.
Larger batches can reduce coordination overhead when pages emit many items or follow-up requests, while smaller batches tend to improve latency and memory locality.
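The channel and batch settings above trade memory for fewer handoffs between components. A minimal sketch; the numbers are illustrative assumptions, not defaults:

```rust
// Illustrative throughput tuning: larger channels and batches reduce
// coordination overhead at the cost of memory.
let crawler = CrawlerBuilder::new(MySpider)
    .channel_capacity(2000)
    .output_batch_size(32)
    .build()
    .await?;
```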
pub fn response_backpressure_threshold(self, threshold: usize) -> Self
Sets the downloader response-channel backpressure threshold.
When the downloader-to-parser channel reaches this threshold, the runtime starts applying backpressure so downloaded responses do not pile up unboundedly in memory.
pub fn item_backpressure_threshold(self, threshold: usize) -> Self
Sets the parser item-channel backpressure threshold.
This primarily matters when parsing is faster than downstream pipeline processing. Lower thresholds keep memory tighter; higher thresholds let parsers run further ahead.
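The two backpressure thresholds can be set together to bound buffering at both stages; the values in this sketch are illustrative assumptions:

```rust
// Illustrative backpressure sketch: start throttling once 500 responses
// or 1000 items are buffered in the respective channels.
let crawler = CrawlerBuilder::new(MySpider)
    .response_backpressure_threshold(500)
    .item_backpressure_threshold(1000)
    .build()
    .await?;
```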
pub fn retry_release_permit(self, enabled: bool) -> Self
Controls whether retries release downloader permits before waiting.
Enabling this is usually better for throughput because a sleeping retry does not occupy scarce downloader concurrency. Disabling it can be useful when you want retries to count fully against download capacity.
pub fn live_stats(self, enabled: bool) -> Self
Enables or disables live, in-place statistics updates on terminal stdout.
When enabled, spider-* logs are forced to LevelFilter::Off during build
to avoid interleaving with the live terminal renderer.
pub fn live_stats_interval(self, interval: Duration) -> Self
Sets the refresh interval for live statistics updates.
Shorter intervals make the terminal view feel more responsive, while longer intervals reduce redraw overhead.
pub fn live_stats_preview_fields(
    self,
    fields: impl IntoIterator<Item = impl Into<String>>,
) -> Self
Sets which scraped item fields should be shown in live stats preview.
Field names support dot notation for nested JSON objects such as
title, source_url, or metadata.Japanese.
You can also set aliases with label=path, for example
url=source_url or jp=metadata.Japanese.
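A minimal sketch combining the live-stats methods above, using the field names and aliases from the description; the one-second interval is an illustrative choice:

```rust
// Sketch: enable the live terminal view, refresh once per second, and
// preview selected item fields (dot notation and label=path aliases).
let crawler = CrawlerBuilder::new(MySpider)
    .live_stats(true)
    .live_stats_interval(Duration::from_secs(1))
    .live_stats_preview_fields(["title", "url=source_url", "jp=metadata.Japanese"])
    .build()
    .await?;
```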
pub fn shutdown_grace_period(self, grace_period: Duration) -> Self
Sets the maximum grace period for crawler shutdown before forcing task abort.
This gives pipelines, checkpoint writes, and other in-flight work time to finish cleanly after shutdown begins.
pub fn limit(self, limit: usize) -> Self
Stops the crawl after limit scraped items have been admitted for processing.
This is especially useful for smoke runs, local previews, and documentation examples where you want predictable bounded work.
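A smoke-run sketch combining a bounded item limit with a shutdown grace period; both values are illustrative assumptions:

```rust
// Smoke-run sketch: stop after 100 admitted items and allow 5 seconds
// for pipelines and checkpoint writes to drain before aborting tasks.
let crawler = CrawlerBuilder::new(MySpider)
    .limit(100)
    .shutdown_grace_period(Duration::from_secs(5))
    .build()
    .await?;
```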
pub fn downloader(self, downloader: D) -> Self
Sets a custom downloader implementation.
Use this method to provide a custom Downloader implementation
instead of the default ReqwestClientDownloader.
Reach for this when transport behavior itself needs to change, such as request signing, alternate HTTP stacks, downloader-level tracing, or protocol-specific request execution.
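A sketch of swapping in a custom transport; `MyDownloader` is a hypothetical type assumed to implement the Downloader trait, and its constructor is an assumption:

```rust
// Sketch with a hypothetical `MyDownloader` that implements `Downloader`
// (for example, to add request signing or downloader-level tracing).
let crawler = CrawlerBuilder::new(MySpider)
    .downloader(MyDownloader::new())
    .build()
    .await?;
```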
pub fn add_middleware<M>(self, middleware: M) -> Self
Adds a middleware to the crawler’s middleware stack.
Middlewares intercept and modify requests before they are sent and responses after they are received. They are executed in the order they are added.
Middleware is the right layer for cross-cutting HTTP behavior such as
retry policy, rate limiting, cookies, user-agent management, cache
lookup, or robots.txt enforcement.
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .build()
    .await?;
pub fn add_pipeline<P>(self, pipeline: P) -> Self
Adds a pipeline to the crawler’s pipeline stack.
Pipelines process scraped items after they are extracted by the spider. They can be used for validation, transformation, deduplication, or storage (e.g., writing to files or databases).
Pipelines are ordered. A common pattern is transform first, validate second, deduplicate next, and write to outputs last.
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(ConsolePipeline::new())
    .add_pipeline(JsonPipeline::new("output.json")?)
    .build()
    .await?;
pub fn log_level(self, level: LevelFilter) -> Self
Sets the log level for spider-* library crates.
This configures the logging level specifically for the spider-lib ecosystem (spider-core, spider-middleware, spider-pipeline, spider-util, spider-downloader). Logs from other dependencies (e.g., reqwest, tokio) will not be affected.
§Log Levels
LevelFilter::Error - Only error messages
LevelFilter::Warn - Warnings and errors
LevelFilter::Info - Informational messages, warnings, and errors
LevelFilter::Debug - Debug messages and above
LevelFilter::Trace - All messages including trace
§Example
use log::LevelFilter;

let crawler = CrawlerBuilder::new(MySpider)
    .log_level(LevelFilter::Debug)
    .build()
    .await?;
pub fn with_checkpoint_path<P: AsRef<Path>>(self, path: P) -> Self
Sets the path for saving and loading checkpoints.
When enabled, the crawler periodically saves its state to this file, allowing crawls to be resumed after interruption.
Requires the checkpoint feature to be enabled.
If a checkpoint file already exists at build time, the builder will attempt to restore scheduler and pipeline state from it.
pub fn with_checkpoint_interval(self, interval: Duration) -> Self
Sets the interval between automatic checkpoint saves.
When enabled, the crawler saves its state at this interval. Shorter intervals provide more frequent recovery points but may impact performance.
Requires the checkpoint feature to be enabled.
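A resumable-crawl sketch combining the two checkpoint methods above, assuming the checkpoint feature is enabled; the path and interval values are illustrative:

```rust
// Sketch: save crawler state to a checkpoint file every 60 seconds so
// an interrupted crawl can be resumed on the next build.
let crawler = CrawlerBuilder::new(MySpider)
    .with_checkpoint_path("crawl.checkpoint")
    .with_checkpoint_interval(Duration::from_secs(60))
    .build()
    .await?;
```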
pub async fn build(self) -> Result<Crawler<S, D::Client>, SpiderError>
Builds the Crawler instance.
This method finalizes the crawler configuration and initializes all components. It performs validation and sets up default values where necessary.
Build time is where the runtime:
- validates concurrency and channel settings
- initializes logging and live-stats behavior
- restores checkpoint state if configured
- constructs the scheduler and runtime handles
§Errors
Returns a SpiderError::ConfigurationError if:
max_concurrent_downloads is 0
parser_workers is 0
No spider was provided to the builder
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(10)
    .build()
    .await?;