Struct CrawlerBuilder 

Source
pub struct CrawlerBuilder<S: Spider, D>
where D: Downloader,
{ /* private fields */ }

A fluent builder for constructing Crawler instances.

§Type Parameters
  • S - The spider type; must implement Spider
  • D - The downloader type; must implement Downloader
§Example

let builder = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(8)
    .max_pending_requests(16)
    .max_parser_workers(4);

Implementations§

Source§

impl<S: Spider> CrawlerBuilder<S, ReqwestClientDownloader>

Source

pub fn new(spider: S) -> Self

Creates a new CrawlerBuilder for a given spider with the default ReqwestClientDownloader.

§Example
let crawler = CrawlerBuilder::new(MySpider)
    .build()
    .await?;
Source§

impl<S: Spider, D: Downloader> CrawlerBuilder<S, D>

Source

pub fn max_concurrent_downloads(self, limit: usize) -> Self

Sets the maximum number of concurrent downloads.

This controls how many HTTP requests can be in-flight simultaneously. Higher values increase throughput but may overwhelm target servers.

§Default

Defaults to twice the number of CPU cores, clamped between 4 and 64.

Source

pub fn max_pending_requests(self, limit: usize) -> Self

Sets the maximum number of outstanding requests tracked by the scheduler.

This includes queued requests plus requests already handed off for download. Lower values keep the frontier tighter and reduce internal request buildup.

Source

pub fn max_parser_workers(self, limit: usize) -> Self

Sets the number of worker tasks dedicated to parsing responses.

Parser workers process HTTP responses concurrently, calling the spider’s parse method to extract items and discover new URLs.

§Default

Defaults to the number of CPU cores, clamped between 4 and 16.

Source

pub fn max_concurrent_pipelines(self, limit: usize) -> Self

Sets the maximum number of concurrent item processing pipelines.

This controls how many items can be processed by pipelines simultaneously.

§Default

Defaults to the number of CPU cores, with a maximum of 8.
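
The documented defaults for downloads, parser workers, and pipelines are all derived from the CPU core count. The sketch below reproduces only that arithmetic. It assumes the core count comes from std::thread::available_parallelism (this page does not say how cores are detected), and every function name in it is hypothetical rather than part of the crate.

```rust
use std::thread::available_parallelism;

/// Hypothetical helper: detected core count, falling back to 1 if detection fails.
/// Assumption: the crate may detect cores differently.
fn cpu_cores() -> usize {
    available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Twice the core count, clamped between 4 and 64 (documented download default).
fn default_max_concurrent_downloads(cores: usize) -> usize {
    cores.saturating_mul(2).clamp(4, 64)
}

/// The core count, clamped between 4 and 16 (documented parser-worker default).
fn default_max_parser_workers(cores: usize) -> usize {
    cores.clamp(4, 16)
}

/// The core count, capped at 8 (documented pipeline default; no minimum is stated).
fn default_max_concurrent_pipelines(cores: usize) -> usize {
    cores.min(8)
}
```

Under these rules an 8-core machine would get 16 downloads, 8 parser workers, and 8 pipelines, while a 2-core machine would get 4, 4, and 2.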

Source

pub fn channel_capacity(self, capacity: usize) -> Self

Sets the capacity of internal communication channels.

This controls the buffer size for channels between the downloader, parser, and pipeline components. Higher values can improve throughput at the cost of increased memory usage.

§Default

Defaults to 1000.

Source

pub fn output_batch_size(self, batch_size: usize) -> Self

Sets the parser output batch size.

Larger batches can reduce coordination overhead when pages emit many items or follow-up requests, while smaller batches tend to improve latency and memory locality.

Source

pub fn response_backpressure_threshold(self, threshold: usize) -> Self

Sets the downloader response-channel backpressure threshold.

When the downloader-to-parser channel reaches this threshold, the runtime starts applying backpressure so downloaded responses do not pile up unboundedly in memory.
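
To make the threshold idea concrete, here is a minimal model of a response buffer that reports backpressure once its length reaches a configured threshold. ResponseQueue and its methods are hypothetical; the runtime's actual channel type and the way it slows the downloader are not described on this page.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the downloader-to-parser channel.
struct ResponseQueue {
    buffered: VecDeque<String>,
    threshold: usize,
}

impl ResponseQueue {
    fn new(threshold: usize) -> Self {
        Self { buffered: VecDeque::new(), threshold }
    }

    /// True once the buffered count has reached the threshold. In this model,
    /// a downloader would hold off on new requests while this returns true.
    fn under_backpressure(&self) -> bool {
        self.buffered.len() >= self.threshold
    }

    /// Downloader side: buffer one response body.
    fn push(&mut self, body: &str) {
        self.buffered.push_back(body.to_string());
    }

    /// Parser side: take the oldest response, which relieves pressure.
    fn pop(&mut self) -> Option<String> {
        self.buffered.pop_front()
    }
}
```

With a threshold of 2, the queue reports backpressure after the second push and stops reporting it as soon as a parser pops one response.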

Source

pub fn item_backpressure_threshold(self, threshold: usize) -> Self

Sets the parser item-channel backpressure threshold.

This primarily matters when parsing is faster than downstream pipeline processing. Lower thresholds keep memory tighter; higher thresholds let parsers run further ahead.

Source

pub fn retry_release_permit(self, enabled: bool) -> Self

Controls whether retries release downloader permits before waiting.

Enabling this is usually better for throughput because a sleeping retry does not occupy scarce downloader concurrency. Disabling it can be useful when you want retries to count fully against download capacity.
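
The trade-off can be illustrated with a small counter-based model of download permits. This is only a sketch: DownloadPermits and begin_retry_wait are hypothetical names, and the crate presumably uses an async semaphore rather than a plain counter.

```rust
/// Hypothetical model of downloader concurrency: `limit` permits in total.
struct DownloadPermits {
    limit: usize,
    in_use: usize,
}

impl DownloadPermits {
    fn new(limit: usize) -> Self {
        Self { limit, in_use: 0 }
    }

    /// Take a permit if one is free; returns false when all are in use.
    fn try_acquire(&mut self) -> bool {
        if self.in_use < self.limit {
            self.in_use += 1;
            true
        } else {
            false
        }
    }

    /// Return one permit to the pool.
    fn release(&mut self) {
        self.in_use = self.in_use.saturating_sub(1);
    }
}

/// Called when a request that holds a permit starts waiting before a retry.
/// Returns true if the permit is still held during the wait.
fn begin_retry_wait(permits: &mut DownloadPermits, release_permit: bool) -> bool {
    if release_permit {
        permits.release();
        false
    } else {
        true
    }
}
```

With a single permit and release enabled, another download can start while the retry sleeps. With release disabled, the sleeping retry keeps the only permit and new downloads must wait.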

Source

pub fn live_stats(self, enabled: bool) -> Self

Enables or disables live, in-place statistics updates on terminal stdout.

When enabled, spider-* logs are forced to LevelFilter::Off during build to avoid interleaving with the live terminal renderer.

Source

pub fn live_stats_interval(self, interval: Duration) -> Self

Sets the refresh interval for live statistics updates.

Shorter intervals make the terminal view feel more responsive, while longer intervals reduce redraw overhead.

Source

pub fn live_stats_preview_fields(self, fields: impl IntoIterator<Item = impl Into<String>>) -> Self

Sets which scraped item fields should be shown in live stats preview.

Field names can be top-level keys such as title or source_url, or use dot notation to reach into nested JSON objects, such as metadata.Japanese.

You can also set aliases with label=path, for example url=source_url or jp=metadata.Japanese.
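
The following sketch shows one way such field specs could be interpreted: split an optional label=path alias, then walk the dotted path through nested objects. The Value enum, parse_spec, and lookup are all hypothetical stand-ins written for illustration; the crate's own parsing and item representation may differ.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like value, a stand-in for the crawler's item representation.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Text(String),
    Object(BTreeMap<String, Value>),
}

/// Split a spec into (label, path). `label=path` sets an alias;
/// a bare path is used as its own label.
fn parse_spec(spec: &str) -> (&str, &str) {
    match spec.split_once('=') {
        Some((label, path)) => (label.trim(), path.trim()),
        None => (spec.trim(), spec.trim()),
    }
}

/// Follow a dot-separated path through nested objects.
/// Returns None if any segment is missing or a non-object is reached early.
fn lookup<'a>(item: &'a Value, path: &str) -> Option<&'a Value> {
    path.split('.').try_fold(item, |current, key| match current {
        Value::Object(map) => map.get(key),
        Value::Text(_) => None,
    })
}
```

Under this reading, jp=metadata.Japanese is shown with the label jp and resolved by looking up metadata and then Japanese, while a bare title is both label and path.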

Source

pub fn shutdown_grace_period(self, grace_period: Duration) -> Self

Sets the maximum grace period for crawler shutdown before forcing task abort.

This gives pipelines, checkpoint writes, and other in-flight work time to finish cleanly after shutdown begins.

Source

pub fn limit(self, limit: usize) -> Self

Stops the crawl after limit scraped items have been admitted for processing.

This is especially useful for smoke runs, local previews, and documentation examples where you want predictable bounded work.

Source

pub fn downloader(self, downloader: D) -> Self

Sets a custom downloader implementation.

The supplied downloader replaces the default ReqwestClientDownloader.

Reach for this when transport behavior itself needs to change, such as request signing, alternate HTTP stacks, downloader-level tracing, or protocol-specific request execution.

Source

pub fn add_middleware<M>(self, middleware: M) -> Self
where M: Middleware<D::Client> + Send + Sync + 'static,

Adds a middleware to the crawler’s middleware stack.

Middlewares intercept and modify requests before they are sent and responses after they are received. They are executed in the order they are added.

Middleware is the right layer for cross-cutting HTTP behavior such as retry policy, rate limiting, cookies, user-agent management, cache lookup, or robots.txt enforcement.

§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .build()
    .await?;
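
Because middlewares run in the order they are added, a later middleware sees the changes made by earlier ones. The sketch below models only the request side with a hypothetical RequestMiddleware trait and Request type. The crate's real Middleware trait is not shown on this page and may look quite different.

```rust
/// Hypothetical request type, for illustration only.
#[derive(Debug, Default)]
struct Request {
    headers: Vec<(String, String)>,
}

/// Hypothetical middleware trait (not the crate's `Middleware` trait).
trait RequestMiddleware {
    fn process_request(&self, request: Request) -> Request;
}

/// Adds a user-agent header.
struct UserAgent(&'static str);

/// Records how many headers were already present when it ran.
struct TraceTag;

impl RequestMiddleware for UserAgent {
    fn process_request(&self, mut request: Request) -> Request {
        request.headers.push(("user-agent".to_string(), self.0.to_string()));
        request
    }
}

impl RequestMiddleware for TraceTag {
    fn process_request(&self, mut request: Request) -> Request {
        let note = format!("after {} header(s)", request.headers.len());
        request.headers.push(("x-trace".to_string(), note));
        request
    }
}

/// Run the request through each middleware in insertion order.
fn run_stack(stack: &[Box<dyn RequestMiddleware>], request: Request) -> Request {
    stack.iter().fold(request, |req, mw| mw.process_request(req))
}
```

With UserAgent added before TraceTag, the trace header reports one earlier header. Swapping the two would make it report zero.
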
Source

pub fn add_pipeline<P>(self, pipeline: P) -> Self
where P: Pipeline<S::Item> + 'static,

Adds a pipeline to the crawler’s pipeline stack.

Pipelines process scraped items after they are extracted by the spider. They can be used for validation, transformation, deduplication, or storage (e.g., writing to files or databases).

Pipelines are ordered. A common pattern is transform first, validate second, deduplicate next, and write to outputs last.

§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(ConsolePipeline::new())
    .add_pipeline(JsonPipeline::new("output.json")?)
    .build()
    .await?;
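
The transform, validate, deduplicate ordering can be sketched with plain closures. Item, Stage, run_pipelines, and example_stages are hypothetical names, and the crate's Pipeline trait is not modeled here. A stage returns None to drop an item, and a storage stage would be appended last so that it only sees items that survived the earlier stages.

```rust
use std::collections::HashSet;

/// Hypothetical item type.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    title: String,
}

/// Hypothetical stage: returns None to drop the item.
type Stage = Box<dyn FnMut(Item) -> Option<Item>>;

/// Pass an item through each stage in order, stopping at the first drop.
fn run_pipelines(stages: &mut [Stage], item: Item) -> Option<Item> {
    let mut current = item;
    for stage in stages.iter_mut() {
        current = stage(current)?;
    }
    Some(current)
}

/// Build the stages in the suggested order.
fn example_stages() -> Vec<Stage> {
    let mut seen: HashSet<String> = HashSet::new();

    // 1. Transform: normalize surrounding whitespace.
    let transform: Stage = Box::new(|mut item: Item| {
        item.title = item.title.trim().to_string();
        Some(item)
    });
    // 2. Validate: drop items whose title is empty after trimming.
    let validate: Stage =
        Box::new(|item: Item| if item.title.is_empty() { None } else { Some(item) });
    // 3. Deduplicate: drop titles that have already been seen.
    let dedup: Stage = Box::new(move |item: Item| {
        if seen.insert(item.title.clone()) { Some(item) } else { None }
    });

    vec![transform, validate, dedup]
}
```

Transforming first matters here: "  Rust  " and "Rust" normalize to the same title, so the deduplication stage treats the second one as a repeat.
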
Source

pub fn log_level(self, level: LevelFilter) -> Self

Sets the log level for spider-* library crates.

This configures the logging level specifically for the spider-lib ecosystem (spider-core, spider-middleware, spider-pipeline, spider-util, spider-downloader). Logs from other dependencies (e.g., reqwest, tokio) will not be affected.

§Log Levels
  • LevelFilter::Error - Only error messages
  • LevelFilter::Warn - Warnings and errors
  • LevelFilter::Info - Informational messages, warnings, and errors
  • LevelFilter::Debug - Debug messages and above
  • LevelFilter::Trace - All messages including trace
§Example
use log::LevelFilter;

let crawler = CrawlerBuilder::new(MySpider)
    .log_level(LevelFilter::Debug)
    .build()
    .await?;
Source

pub fn with_checkpoint_path<P: AsRef<Path>>(self, path: P) -> Self

Sets the path for saving and loading checkpoints.

When a checkpoint path is set, the crawler periodically saves its state to this file, allowing crawls to be resumed after interruption.

Requires the checkpoint feature to be enabled.

If a checkpoint file already exists at build time, the builder will attempt to restore scheduler and pipeline state from it.

Source

pub fn with_checkpoint_interval(self, interval: Duration) -> Self

Sets the interval between automatic checkpoint saves.

When checkpointing is enabled, the crawler saves its state at this interval. Shorter intervals provide more frequent recovery points but may impact performance.

Requires the checkpoint feature to be enabled.
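
As a rough model of interval-based saving, the sketch below tracks when the last save happened and reports when the next one is due. CheckpointTimer is a hypothetical type; how the crate actually schedules checkpoint writes is not described on this page.

```rust
use std::time::{Duration, Instant};

/// Hypothetical timer that decides when a checkpoint save is due.
struct CheckpointTimer {
    interval: Duration,
    last_save: Instant,
}

impl CheckpointTimer {
    fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, last_save: now }
    }

    /// True once at least `interval` has elapsed since the last save.
    fn due(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_save) >= self.interval
    }

    /// Record that a save just completed.
    fn mark_saved(&mut self, now: Instant) {
        self.last_save = now;
    }
}
```

With a 30-second interval, a crash loses at most about 30 seconds of progress in this model, and halving the interval doubles the number of writes.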

Source

pub async fn build(self) -> Result<Crawler<S, D::Client>, SpiderError>
where D: Downloader + Send + Sync + 'static, D::Client: Send + Sync + Clone, S::Item: Send + Sync + 'static,

Builds the Crawler instance.

This method finalizes the crawler configuration and initializes all components. It performs validation and sets up default values where necessary.

At build time, the runtime:

  • validates concurrency and channel settings
  • initializes logging and live-stats behavior
  • restores checkpoint state if configured
  • constructs the scheduler and runtime handles
§Errors

Returns a SpiderError::ConfigurationError if:

  • max_concurrent_downloads is 0
  • the parser worker count (set via max_parser_workers) is 0
  • No spider was provided to the builder
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(10)
    .build()
    .await?;
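
The zero-value checks listed under Errors could look roughly like the following. Settings and ConfigError are hypothetical types standing in for the builder's private fields and for SpiderError::ConfigurationError, and the error messages are invented for illustration.

```rust
/// Hypothetical mirror of the two numeric settings validated at build time.
struct Settings {
    max_concurrent_downloads: usize,
    max_parser_workers: usize,
}

/// Hypothetical error type standing in for `SpiderError::ConfigurationError`.
#[derive(Debug, PartialEq)]
enum ConfigError {
    Configuration(String),
}

/// Reject settings that would leave the crawler unable to make progress.
fn validate(settings: &Settings) -> Result<(), ConfigError> {
    if settings.max_concurrent_downloads == 0 {
        return Err(ConfigError::Configuration(
            "max_concurrent_downloads must be greater than 0".to_string(),
        ));
    }
    if settings.max_parser_workers == 0 {
        return Err(ConfigError::Configuration(
            "max_parser_workers must be greater than 0".to_string(),
        ));
    }
    Ok(())
}
```

Checking at build time means a misconfigured crawler fails immediately with a descriptive error instead of stalling later with no download or parser capacity.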

Trait Implementations§

Source§

impl<S: Spider> Default for CrawlerBuilder<S, ReqwestClientDownloader>

Source§

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations§

§

impl<S, D> Freeze for CrawlerBuilder<S, D>
where D: Freeze, S: Freeze,

§

impl<S, D> !RefUnwindSafe for CrawlerBuilder<S, D>

§

impl<S, D> Send for CrawlerBuilder<S, D>

§

impl<S, D> Sync for CrawlerBuilder<S, D>

§

impl<S, D> Unpin for CrawlerBuilder<S, D>
where D: Unpin, S: Unpin,

§

impl<S, D> UnsafeUnpin for CrawlerBuilder<S, D>
where D: UnsafeUnpin, S: UnsafeUnpin,

§

impl<S, D> !UnwindSafe for CrawlerBuilder<S, D>

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of the pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a value with the given initializer.
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.
Source§

impl<T> PolicyExt for T
where T: ?Sized,

Source§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow.
Source§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow.
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.