pub struct CrawlerBuilder<S, D>
where
    S: Spider,
    D: Downloader,
{ /* private fields */ }
A fluent builder for constructing Crawler instances.
§Type Parameters
- S: The Spider implementation type
- D: The Downloader implementation type
§Example
let builder = CrawlerBuilder::new(MySpider)
.max_concurrent_downloads(8)
.max_pending_requests(16)
    .max_parser_workers(4);

§Implementations

impl<S> CrawlerBuilder<S, ReqwestClientDownloader>
where
    S: Spider,

pub fn new(spider: S) -> CrawlerBuilder<S, ReqwestClientDownloader>
Creates a new CrawlerBuilder for a given spider with the default ReqwestClientDownloader.
§Example
let crawler = CrawlerBuilder::new(MySpider)
.build()
    .await?;

impl<S, D> CrawlerBuilder<S, D>
where
    S: Spider,
    D: Downloader,
pub fn max_concurrent_downloads(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of concurrent downloads.
This controls how many HTTP requests can be in-flight simultaneously. Higher values increase throughput but may overwhelm target servers.
§Default
Defaults to twice the number of CPU cores, clamped between 4 and 64.
pub fn max_pending_requests(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of outstanding requests tracked by the scheduler.
This includes queued requests plus requests already handed off for download. Lower values keep the frontier tighter and reduce internal request buildup.
pub fn max_parser_workers(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the maximum number of concurrent parser worker tasks.

pub fn max_concurrent_pipelines(self, limit: usize) -> CrawlerBuilder<S, D>
Sets the maximum number of concurrent item processing pipelines.
This controls how many items can be processed by pipelines simultaneously.
§Default
Defaults to the number of CPU cores, with a maximum of 8.
pub fn channel_capacity(self, capacity: usize) -> CrawlerBuilder<S, D>
Sets the capacity of internal communication channels.
This controls the buffer size for channels between the downloader, parser, and pipeline components. Higher values can improve throughput at the cost of increased memory usage.
§Default
Defaults to 1000.
pub fn output_batch_size(self, batch_size: usize) -> CrawlerBuilder<S, D>
Sets the parser output batch size.
Larger batches can reduce coordination overhead when pages emit many items or follow-up requests, while smaller batches tend to improve latency and memory locality.
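For instance, a throughput-leaning configuration might combine this with a larger channel capacity. The values below are purely illustrative, not recommended defaults:

```rust
let builder = CrawlerBuilder::new(MySpider)
    .channel_capacity(2000)  // larger internal buffers between components
    .output_batch_size(64);  // amortize coordination cost per parsed page
```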
pub fn response_backpressure_threshold(self, threshold: usize) -> CrawlerBuilder<S, D>
Sets the downloader response-channel backpressure threshold.
When the downloader-to-parser channel reaches this threshold, the runtime starts applying backpressure so downloaded responses do not pile up unboundedly in memory.
pub fn item_backpressure_threshold(self, threshold: usize) -> CrawlerBuilder<S, D>
Sets the parser item-channel backpressure threshold.
This primarily matters when parsing is faster than downstream pipeline processing. Lower thresholds keep memory tighter; higher thresholds let parsers run further ahead.
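As a sketch, a memory-conscious configuration for a workload with slow pipelines might tighten both thresholds (values are illustrative assumptions):

```rust
let builder = CrawlerBuilder::new(MySpider)
    .response_backpressure_threshold(500) // throttle downloads when the parser lags
    .item_backpressure_threshold(200);    // throttle parsers when pipelines lag
```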
pub fn retry_release_permit(self, enabled: bool) -> CrawlerBuilder<S, D>
Controls whether retries release downloader permits before waiting.
Enabling this is usually better for throughput because a sleeping retry does not occupy scarce downloader concurrency. Disabling it can be useful when you want retries to count fully against download capacity.
pub fn live_stats(self, enabled: bool) -> CrawlerBuilder<S, D>
Enables or disables live, in-place statistics updates on terminal stdout.
When enabled, spider-* logs are forced to LevelFilter::Off during build
to avoid interleaving with the live terminal renderer.
pub fn live_stats_interval(self, interval: Duration) -> CrawlerBuilder<S, D>
Sets the refresh interval for live statistics updates.
Shorter intervals make the terminal view feel more responsive, while longer intervals reduce redraw overhead.
pub fn live_stats_preview_fields(self, fields: impl IntoIterator<Item = impl Into<String>>) -> CrawlerBuilder<S, D>
Sets which scraped item fields should be shown in live stats preview.
Field names support dot notation for nested JSON objects such as
title, source_url, or metadata.Japanese.
You can also set aliases with label=path, for example
url=source_url or jp=metadata.Japanese.
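Putting the live-stats options together, a configuration might look like the following sketch (the field paths reuse the examples above; the 250 ms interval is an arbitrary choice):

```rust
use std::time::Duration;

let builder = CrawlerBuilder::new(MySpider)
    .live_stats(true)
    .live_stats_interval(Duration::from_millis(250))
    // Plain paths and label=path aliases can be mixed freely.
    .live_stats_preview_fields(["title", "url=source_url", "jp=metadata.Japanese"]);
```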
pub fn shutdown_grace_period(self, grace_period: Duration) -> CrawlerBuilder<S, D>
Sets the maximum grace period for crawler shutdown before forcing task abort.
This gives pipelines, checkpoint writes, and other in-flight work time to finish cleanly after shutdown begins.
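For example, to allow up to ten seconds for in-flight work to drain (an illustrative value, not a recommendation):

```rust
use std::time::Duration;

let crawler = CrawlerBuilder::new(MySpider)
    .shutdown_grace_period(Duration::from_secs(10)) // let pipelines flush before abort
    .build()
    .await?;
```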
pub fn limit(self, limit: usize) -> CrawlerBuilder<S, D>
Stops the crawl after limit scraped items have been admitted for processing.
This is especially useful for smoke runs, local previews, and documentation examples where you want predictable bounded work.
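For example, a bounded smoke run that stops after 100 items:

```rust
let crawler = CrawlerBuilder::new(MySpider)
    .limit(100) // stop once 100 scraped items have been admitted
    .build()
    .await?;
```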
pub fn downloader(self, downloader: D) -> CrawlerBuilder<S, D>
Sets a custom downloader implementation.
Use this method to provide a custom Downloader implementation
instead of the default ReqwestClientDownloader.
Reach for this when transport behavior itself needs to change, such as request signing, alternate HTTP stacks, downloader-level tracing, or protocol-specific request execution.
pub fn add_middleware<M>(self, middleware: M) -> CrawlerBuilder<S, D>
Adds a middleware to the crawler’s middleware stack.
Middlewares intercept and modify requests before they are sent and responses after they are received. They are executed in the order they are added.
Middleware is the right layer for cross-cutting HTTP behavior such as
retry policy, rate limiting, cookies, user-agent management, cache
lookup, or robots.txt enforcement.
§Example
let crawler = CrawlerBuilder::new(MySpider)
.add_middleware(RateLimitMiddleware::default())
.add_middleware(RetryMiddleware::new())
.build()
    .await?;

pub fn add_pipeline<P>(self, pipeline: P) -> CrawlerBuilder<S, D>
Adds a pipeline to the crawler’s pipeline stack.
Pipelines process scraped items after they are extracted by the spider. They can be used for validation, transformation, deduplication, or storage (e.g., writing to files or databases).
Pipelines are ordered. A common pattern is transform first, validate second, deduplicate next, and write to outputs last.
§Example
let crawler = CrawlerBuilder::new(MySpider)
.add_pipeline(ConsolePipeline::new())
.add_pipeline(JsonPipeline::new("output.json")?)
.build()
    .await?;

pub fn log_level(self, level: LevelFilter) -> CrawlerBuilder<S, D>
Sets the log level for spider-* library crates.
This configures the logging level specifically for the spider-lib ecosystem (spider-core, spider-middleware, spider-pipeline, spider-util, spider-downloader). Logs from other dependencies (e.g., reqwest, tokio) will not be affected.
§Log Levels
- LevelFilter::Error - Only error messages
- LevelFilter::Warn - Warnings and errors
- LevelFilter::Info - Informational messages, warnings, and errors
- LevelFilter::Debug - Debug messages and above
- LevelFilter::Trace - All messages, including trace
§Example
use log::LevelFilter;
let crawler = CrawlerBuilder::new(MySpider)
.log_level(LevelFilter::Debug)
.build()
    .await?;

pub fn with_checkpoint_path<P>(self, path: P) -> CrawlerBuilder<S, D>
Sets the path for saving and loading checkpoints.
When enabled, the crawler periodically saves its state to this file, allowing crawls to be resumed after interruption.
Requires the checkpoint feature to be enabled.
If a checkpoint file already exists at build time, the builder will attempt to restore scheduler and pipeline state from it.
pub fn with_checkpoint_interval(self, interval: Duration) -> CrawlerBuilder<S, D>
Sets the interval between automatic checkpoint saves.
When enabled, the crawler saves its state at this interval. Shorter intervals provide more frequent recovery points but may impact performance.
Requires the checkpoint feature to be enabled.
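A resumable crawl might combine both checkpoint options. This sketch assumes the checkpoint feature is enabled and that a string path is accepted for `P`; the file name and 60-second interval are illustrative:

```rust
use std::time::Duration;

let crawler = CrawlerBuilder::new(MySpider)
    .with_checkpoint_path("crawl.checkpoint")          // restored at build time if present
    .with_checkpoint_interval(Duration::from_secs(60)) // save state every minute
    .build()
    .await?;
```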
pub async fn build(self) -> Result<Crawler<S, <D as Downloader>::Client>, SpiderError>
Builds the Crawler instance.
This method finalizes the crawler configuration and initializes all components. It performs validation and sets up default values where necessary.
Build time is where the runtime:
- validates concurrency and channel settings
- initializes logging and live-stats behavior
- restores checkpoint state if configured
- constructs the scheduler and runtime handles
§Errors
Returns a SpiderError::ConfigurationError if:
- max_concurrent_downloads is 0
- parser_workers is 0
- No spider was provided to the builder
§Example
let crawler = CrawlerBuilder::new(MySpider)
.max_concurrent_downloads(10)
.build()
.await?;