Struct CrawlerBuilder 

pub struct CrawlerBuilder<S, D>
where S: Spider, D: Downloader,
{ /* private fields */ }

A fluent builder for constructing Crawler instances.

CrawlerBuilder provides a chainable API for configuring all aspects of a web crawler, including concurrency settings, middleware, pipelines, and checkpoint options.

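§Type Parameters

  • S: the Spider implementation the crawler will run.
  • D: the Downloader implementation used to fetch responses. CrawlerBuilder::new selects ReqwestClientDownloader.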

§Example

let builder = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(8)
    .max_parser_workers(4);

Implementations§

impl<S> CrawlerBuilder<S, ReqwestClientDownloader>
where S: Spider,

pub fn new(spider: S) -> CrawlerBuilder<S, ReqwestClientDownloader>

Creates a new CrawlerBuilder for a given spider with the default ReqwestClientDownloader.

§Example
let crawler = CrawlerBuilder::new(MySpider)
    .build()
    .await?;

impl<S, D> CrawlerBuilder<S, D>
where S: Spider, D: Downloader,

pub fn max_concurrent_downloads(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the maximum number of concurrent downloads.

This controls how many HTTP requests can be in-flight simultaneously. Higher values increase throughput but may overwhelm target servers.

§Default

Defaults to the number of CPU cores, with a minimum of 16.
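
To make the idea of an in-flight limit concrete, here is a minimal sketch that caps simulated downloads with a counting semaphore built from the standard library. The `Semaphore` type and `peak_in_flight` function are hypothetical names for illustration; the crate's docs do not say how the limit is enforced internally, so treat this as one plausible mechanism rather than the actual implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (illustrative only, not part of the crate).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(limit: usize) -> Self {
        Semaphore { permits: Mutex::new(limit), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting thread.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `jobs` simulated downloads with at most `limit` in flight and
/// returns the highest number observed running at the same time.
/// A `limit` of 0 would block forever, so callers must pass at least 1.
fn peak_in_flight(jobs: usize, limit: usize) -> usize {
    let semaphore = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let semaphore = Arc::clone(&semaphore);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                semaphore.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                // Stand-in for an HTTP request.
                thread::sleep(Duration::from_millis(2));
                in_flight.fetch_sub(1, Ordering::SeqCst);
                semaphore.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Calling `peak_in_flight(20, 4)` never reports more than 4 simultaneous jobs, which is the property the limit is meant to guarantee.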

pub fn max_parser_workers(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the number of worker tasks dedicated to parsing responses.

Parser workers process HTTP responses concurrently, calling the spider’s parse method to extract items and discover new URLs.

§Default

Defaults to the number of CPU cores, clamped between 4 and 16.

pub fn max_concurrent_pipelines(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the maximum number of concurrent item processing pipelines.

This controls how many items can be processed by pipelines simultaneously.

§Default

Defaults to the number of CPU cores, with a maximum of 8.
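
The default rules documented for the three concurrency settings can be read as simple arithmetic on the core count. The sketch below is an interpretation of that wording, not the crate's source: the function names are hypothetical, and the core count is passed in as a parameter so each rule can be checked with fixed inputs.

```rust
use std::thread;

/// Number of logical CPUs, falling back to 1 if it cannot be determined.
fn cpu_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// "Number of CPU cores, with a minimum of 16."
fn default_max_concurrent_downloads(cores: usize) -> usize {
    cores.max(16)
}

/// "Number of CPU cores, clamped between 4 and 16."
fn default_max_parser_workers(cores: usize) -> usize {
    cores.clamp(4, 16)
}

/// "Number of CPU cores, with a maximum of 8."
fn default_max_concurrent_pipelines(cores: usize) -> usize {
    cores.min(8)
}
```

Under this reading, a 2-core machine would get 16 downloads, 4 parser workers, and 2 pipelines, while a 32-core machine would get 32, 16, and 8.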

pub fn channel_capacity(self, capacity: usize) -> CrawlerBuilder<S, D>

Sets the capacity of internal communication channels.

This controls the buffer size for channels between the downloader, parser, and pipeline components. Higher values can improve throughput at the cost of increased memory usage.

§Default

Defaults to 1000.
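
The effect of a channel capacity is easiest to see with a bounded channel that has no consumer. The sketch below uses the standard library's `sync_channel` as a stand-in; the docs do not state which channel type the crawler uses, and `accepted_without_consumer` is a hypothetical helper.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to queue `attempts` messages into a channel that buffers at most
/// `capacity` messages while nothing is receiving, and returns how many
/// were accepted before the buffer filled up.
fn accepted_without_consumer(capacity: usize, attempts: usize) -> usize {
    // `_receiver` is kept alive so the channel stays connected.
    let (sender, _receiver) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    for message in 0..attempts {
        match sender.try_send(message) {
            Ok(()) => accepted += 1,
            // The buffer is full: a blocking `send` would wait here.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

A larger capacity lets a fast producer run further ahead of a slow consumer, at the price of holding more messages in memory at once.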

pub fn downloader(self, downloader: D) -> CrawlerBuilder<S, D>

Sets a custom downloader implementation.

Use this method to provide a custom Downloader implementation instead of the default ReqwestClientDownloader.

pub fn add_middleware<M>(self, middleware: M) -> CrawlerBuilder<S, D>
where M: Middleware<<D as Downloader>::Client> + Send + Sync + 'static,

Adds a middleware to the crawler’s middleware stack.

Middlewares intercept and modify requests before they are sent and responses after they are received. They are executed in the order they are added.

§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .build()
    .await?;
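
The ordering rule (middlewares run in the order they are added) can be illustrated with a small self-contained model. Everything here is hypothetical: `Request`, the simplified synchronous `Middleware` trait, `UserAgent`, `TraceTag`, and `MiddlewareStack` are stand-ins, and the crate's real Middleware trait is generic over the downloader's client type.

```rust
/// Hypothetical request type for illustration only.
#[derive(Debug, Default)]
struct Request {
    headers: Vec<(String, String)>,
}

/// Hypothetical, simplified middleware trait.
trait Middleware {
    fn before_request(&self, request: &mut Request);
}

struct UserAgent(&'static str);
struct TraceTag(&'static str);

impl Middleware for UserAgent {
    fn before_request(&self, request: &mut Request) {
        request.headers.push(("User-Agent".to_string(), self.0.to_string()));
    }
}

impl Middleware for TraceTag {
    fn before_request(&self, request: &mut Request) {
        request.headers.push(("X-Trace".to_string(), self.0.to_string()));
    }
}

/// Stores middlewares in insertion order and applies them in that order.
struct MiddlewareStack {
    layers: Vec<Box<dyn Middleware>>,
}

impl MiddlewareStack {
    fn new() -> Self {
        MiddlewareStack { layers: Vec::new() }
    }

    /// Consumes and returns the stack so calls can be chained.
    fn add_middleware<M: Middleware + 'static>(mut self, middleware: M) -> Self {
        self.layers.push(Box::new(middleware));
        self
    }

    fn apply(&self, request: &mut Request) {
        for layer in &self.layers {
            layer.before_request(request);
        }
    }
}

/// Builds a two-layer stack and returns the header names in the order
/// the middlewares wrote them.
fn header_order() -> Vec<String> {
    let stack = MiddlewareStack::new()
        .add_middleware(UserAgent("demo-bot"))
        .add_middleware(TraceTag("abc123"));
    let mut request = Request::default();
    stack.apply(&mut request);
    request.headers.into_iter().map(|(name, _)| name).collect()
}
```

Because order matters, a middleware that depends on another's changes should be added after it.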

pub fn add_pipeline<P>(self, pipeline: P) -> CrawlerBuilder<S, D>
where P: Pipeline<<S as Spider>::Item> + 'static,

Adds a pipeline to the crawler’s pipeline stack.

Pipelines process scraped items after they are extracted by the spider. They can be used for validation, transformation, deduplication, or storage (e.g., writing to files or databases).

§Example
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(ConsolePipeline::new())
    .add_pipeline(JsonPipeline::new("output.json")?)
    .build()
    .await?;
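
As a rough model of the validation, transformation, and deduplication roles described above, the sketch below chains three stages over string items. The synchronous `Pipeline` trait, the `Trim`, `NonEmpty`, and `Dedup` stages, and `run_pipelines` are all hypothetical; the crate's own Pipeline trait is parameterized by the spider's item type and may differ in shape.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified pipeline stage. Returning `None` drops the item.
trait Pipeline<I> {
    fn process(&mut self, item: I) -> Option<I>;
}

/// Transformation: strips surrounding whitespace.
struct Trim;
/// Validation: rejects empty items.
struct NonEmpty;
/// Deduplication: rejects items that were already seen.
struct Dedup {
    seen: HashSet<String>,
}

impl Pipeline<String> for Trim {
    fn process(&mut self, item: String) -> Option<String> {
        Some(item.trim().to_string())
    }
}

impl Pipeline<String> for NonEmpty {
    fn process(&mut self, item: String) -> Option<String> {
        if item.is_empty() { None } else { Some(item) }
    }
}

impl Pipeline<String> for Dedup {
    fn process(&mut self, item: String) -> Option<String> {
        // `insert` returns false when the value was already present.
        if self.seen.insert(item.clone()) { Some(item) } else { None }
    }
}

/// Passes every item through each stage in order, keeping the survivors.
fn run_pipelines(stages: &mut [Box<dyn Pipeline<String>>], items: Vec<String>) -> Vec<String> {
    let mut output = Vec::new();
    'items: for item in items {
        let mut current = item;
        for stage in stages.iter_mut() {
            match stage.process(current) {
                Some(next) => current = next,
                None => continue 'items,
            }
        }
        output.push(current);
    }
    output
}

/// Runs four raw items through trim, validation, and deduplication.
fn demo_pipeline() -> Vec<String> {
    let mut stages: Vec<Box<dyn Pipeline<String>>> = vec![
        Box::new(Trim),
        Box::new(NonEmpty),
        Box::new(Dedup { seen: HashSet::new() }),
    ];
    let items: Vec<String> = vec![" a ", "b", "   ", "a"]
        .into_iter()
        .map(String::from)
        .collect();
    run_pipelines(&mut stages, items)
}
```

Here the whitespace-only item is dropped by validation and the second "a" by deduplication, leaving "a" and "b".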

pub fn with_checkpoint_path<P>(self, path: P) -> CrawlerBuilder<S, D>
where P: AsRef<Path>,

Sets the path for saving and loading checkpoints.

When enabled, the crawler periodically saves its state to this file, allowing crawls to be resumed after interruption.

Requires the checkpoint feature to be enabled.

pub fn with_checkpoint_interval(self, interval: Duration) -> CrawlerBuilder<S, D>

Sets the interval between automatic checkpoint saves.

When enabled, the crawler saves its state at this interval. Shorter intervals provide more frequent recovery points but may impact performance.

Requires the checkpoint feature to be enabled.
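
The trade-off between interval length and save frequency can be sketched with a small timer that decides when a save is due. `CheckpointTimer` and `saves_during` are hypothetical helpers; the docs do not describe how the crawler schedules its saves, so this only models the arithmetic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that decides when a periodic save is due.
struct CheckpointTimer {
    interval: Duration,
    last_save: Instant,
}

impl CheckpointTimer {
    fn new(interval: Duration, start: Instant) -> Self {
        CheckpointTimer { interval, last_save: start }
    }

    /// Returns true, and restarts the countdown, once at least `interval`
    /// has passed since the previous save.
    fn due(&mut self, now: Instant) -> bool {
        if now.duration_since(self.last_save) >= self.interval {
            self.last_save = now;
            true
        } else {
            false
        }
    }
}

/// Counts the saves triggered over `total` of simulated crawl time when
/// the timer is polled once per `tick`. No real waiting takes place.
fn saves_during(total: Duration, tick: Duration, interval: Duration) -> usize {
    let start = Instant::now();
    let mut timer = CheckpointTimer::new(interval, start);
    let mut elapsed = Duration::ZERO;
    let mut saves = 0;
    while elapsed < total {
        elapsed += tick;
        if timer.due(start + elapsed) {
            saves += 1;
        }
    }
    saves
}
```

Over a simulated minute polled every second, a 10-second interval yields 6 saves and a 30-second interval yields 2, so a shorter interval means less lost work after an interruption but more time spent writing state.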

pub async fn build(self) -> Result<Crawler<S, <D as Downloader>::Client>, SpiderError>
where D: Downloader + Send + Sync + 'static, <D as Downloader>::Client: Send + Sync + Clone, <S as Spider>::Item: Send + Sync + 'static,

Builds the Crawler instance.

This method finalizes the crawler configuration and initializes all components. It performs validation and sets up default values where necessary.

§Errors

Returns a SpiderError::ConfigurationError if:

  • max_concurrent_downloads is 0
  • max_parser_workers is 0
  • No spider was provided to the builder
§Example
let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(10)
    .build()
    .await?;
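
The combination of chainable setters and validation at build time can be sketched in miniature. The types below are hypothetical: `ConfigBuilder`, `Config`, and `ConfigurationError` stand in for CrawlerBuilder, Crawler, and SpiderError::ConfigurationError, the starting values are arbitrary, and the real build is async and does considerably more.

```rust
/// Hypothetical stand-in for `SpiderError::ConfigurationError`.
#[derive(Debug, PartialEq)]
struct ConfigurationError(String);

/// Hypothetical validated configuration.
#[derive(Debug, PartialEq)]
struct Config {
    max_concurrent_downloads: usize,
    max_parser_workers: usize,
}

struct ConfigBuilder {
    max_concurrent_downloads: usize,
    max_parser_workers: usize,
}

impl ConfigBuilder {
    /// Starting values are arbitrary placeholders for this sketch.
    fn new() -> Self {
        ConfigBuilder { max_concurrent_downloads: 16, max_parser_workers: 4 }
    }

    /// Each setter takes `self` by value and returns it, which is what
    /// makes the calls chainable.
    fn max_concurrent_downloads(mut self, limit: usize) -> Self {
        self.max_concurrent_downloads = limit;
        self
    }

    fn max_parser_workers(mut self, limit: usize) -> Self {
        self.max_parser_workers = limit;
        self
    }

    /// Rejects zero-valued limits before producing the final value.
    fn build(self) -> Result<Config, ConfigurationError> {
        if self.max_concurrent_downloads == 0 {
            return Err(ConfigurationError(
                "max_concurrent_downloads must be greater than 0".to_string(),
            ));
        }
        if self.max_parser_workers == 0 {
            return Err(ConfigurationError(
                "max_parser_workers must be greater than 0".to_string(),
            ));
        }
        Ok(Config {
            max_concurrent_downloads: self.max_concurrent_downloads,
            max_parser_workers: self.max_parser_workers,
        })
    }
}
```

Deferring validation to `build` keeps the setters infallible, so a chain only needs one error check at the end.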

Trait Implementations§

impl<S> Default for CrawlerBuilder<S, ReqwestClientDownloader>
where S: Spider,

fn default() -> CrawlerBuilder<S, ReqwestClientDownloader>

Returns the “default value” for a type.

Auto Trait Implementations§

impl<S, D> Freeze for CrawlerBuilder<S, D>
where D: Freeze, S: Freeze,

impl<S, D> !RefUnwindSafe for CrawlerBuilder<S, D>

impl<S, D> Send for CrawlerBuilder<S, D>

impl<S, D> Sync for CrawlerBuilder<S, D>

impl<S, D> Unpin for CrawlerBuilder<S, D>
where D: Unpin, S: Unpin,

impl<S, D> UnsafeUnpin for CrawlerBuilder<S, D>
where D: UnsafeUnpin, S: UnsafeUnpin,

impl<S, D> !UnwindSafe for CrawlerBuilder<S, D>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow.

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.