Struct CrawlerBuilder

pub struct CrawlerBuilder<S, D>
where
    S: Spider,
    D: Downloader,
{ /* private fields */ }

A fluent builder for constructing Crawler instances.

CrawlerBuilder provides a chainable API for configuring all aspects of a web crawler, including concurrency settings, middleware, pipelines, and checkpoint options.

§Type Parameters

S: The Spider implementation type
D: The Downloader implementation type

§Example

let builder = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(8)
    .max_parser_workers(4);

Implementations§

impl<S> CrawlerBuilder<S, ReqwestClientDownloader>
where S: Spider,

pub fn new(spider: S) -> CrawlerBuilder<S, ReqwestClientDownloader>

Creates a new CrawlerBuilder for a given spider with the default ReqwestClientDownloader.

§Example

let crawler = CrawlerBuilder::new(MySpider)
    .build()
    .await?;

impl<S, D> CrawlerBuilder<S, D>
where S: Spider, D: Downloader,

pub fn max_concurrent_downloads(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the maximum number of concurrent downloads.

This controls how many HTTP requests can be in-flight simultaneously. Higher values increase throughput but may overwhelm target servers.

§Default

Defaults to the number of CPU cores, with a minimum of 16.

pub fn max_parser_workers(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the number of worker tasks dedicated to parsing responses.

Parser workers process HTTP responses concurrently, calling the spider’s parse method to extract items and discover new URLs.

§Default

Defaults to the number of CPU cores, clamped between 4 and 16.

pub fn max_concurrent_pipelines(self, limit: usize) -> CrawlerBuilder<S, D>

Sets the maximum number of concurrent item processing pipelines.

This controls how many items can be processed by pipelines simultaneously.

§Default

Defaults to the number of CPU cores, with a maximum of 8.
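
The core-count-based defaults documented for max_concurrent_downloads, max_parser_workers, and max_concurrent_pipelines can be expressed with standard-library calls. The following is only a sketch: it assumes the core count comes from std::thread::available_parallelism, and the helper names are hypothetical, not part of the crate.

```rust
use std::thread;

/// Number of logical CPUs, falling back to 1 if it cannot be determined.
fn cpu_count() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// "Number of CPU cores, with a minimum of 16" (hypothetical helper).
fn default_max_concurrent_downloads(cpus: usize) -> usize {
    cpus.max(16)
}

/// "Number of CPU cores, clamped between 4 and 16" (hypothetical helper).
fn default_max_parser_workers(cpus: usize) -> usize {
    cpus.clamp(4, 16)
}

/// "Number of CPU cores, with a maximum of 8" (hypothetical helper).
fn default_max_concurrent_pipelines(cpus: usize) -> usize {
    cpus.min(8)
}
```

Under these assumptions, a 4-core machine would get 16 concurrent downloads, 4 parser workers, and 4 concurrent pipelines; a 32-core machine would get 32, 16, and 8.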

pub fn channel_capacity(self, capacity: usize) -> CrawlerBuilder<S, D>

Sets the capacity of internal communication channels.

This controls the buffer size for channels between the downloader, parser, and pipeline components. Higher values can improve throughput at the cost of increased memory usage.

§Default

Defaults to 1000.
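
To illustrate what a channel capacity means in practice, the sketch below uses the standard library's bounded channel as a stand-in. The crate's internal channels are likely asynchronous and may behave differently; the function name is hypothetical.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Fills a bounded channel of the given capacity while nothing is consuming
/// from it, and returns how many sends succeeded before the buffer was full.
fn sends_before_full(capacity: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = 0;
    loop {
        match tx.try_send(accepted as u32) {
            Ok(()) => accepted += 1,
            // Buffer full: a blocking `send` would wait here until the
            // consumer catches up, which is how backpressure arises.
            Err(TrySendError::Full(_)) => return accepted,
            Err(TrySendError::Disconnected(_)) => return accepted,
        }
    }
}
```

A larger capacity lets a fast producer run further ahead of a slow consumer, at the cost of holding more buffered messages in memory.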

pub fn downloader(self, downloader: D) -> CrawlerBuilder<S, D>

Sets a custom downloader implementation.

Use this method to provide a custom Downloader implementation instead of the default ReqwestClientDownloader.

pub fn add_middleware<M>(self, middleware: M) -> CrawlerBuilder<S, D>
where M: Middleware<<D as Downloader>::Client> + Send + Sync + 'static,

Adds a middleware to the crawler’s middleware stack.

Middlewares intercept and modify requests before they are sent and responses after they are received. They are executed in the order they are added.

§Example

let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .build()
    .await?;
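
The ordering rule ("executed in the order they are added") can be sketched with a simplified, self-contained model. The Request type, the RequestMiddleware trait, and AddHeader below are hypothetical stand-ins for illustration and do not reflect the crate's actual Middleware trait.

```rust
/// Hypothetical, simplified request type.
#[derive(Debug, Default)]
struct Request {
    headers: Vec<(String, String)>,
}

/// Hypothetical stand-in for a middleware: it may modify a request before it is sent.
trait RequestMiddleware {
    fn process_request(&self, request: Request) -> Request;
}

/// Example middleware that appends one header.
struct AddHeader(&'static str, &'static str);

impl RequestMiddleware for AddHeader {
    fn process_request(&self, mut request: Request) -> Request {
        request.headers.push((self.0.to_string(), self.1.to_string()));
        request
    }
}

/// Runs every middleware in insertion order, feeding each one the
/// request produced by the previous one.
fn apply_all(stack: &[Box<dyn RequestMiddleware>], request: Request) -> Request {
    stack.iter().fold(request, |req, mw| mw.process_request(req))
}
```

In this model, a middleware added first sees the untouched request, and each later middleware sees the changes made by the earlier ones.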

pub fn add_pipeline<P>(self, pipeline: P) -> CrawlerBuilder<S, D>
where P: Pipeline<<S as Spider>::Item> + 'static,

Adds a pipeline to the crawler’s pipeline stack.

Pipelines process scraped items after they are extracted by the spider. They can be used for validation, transformation, deduplication, or storage (e.g., writing to files or databases).

§Example

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(ConsolePipeline::new())
    .add_pipeline(JsonPipeline::new("output.json")?)
    .build()
    .await?;
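
The validation and deduplication roles mentioned above can be modelled as a chain of stages, each of which either passes an item on or drops it. This is a simplified sketch: the Stage trait and the NonEmpty and Dedup types are hypothetical and are not the crate's Pipeline trait.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a pipeline stage: returns `None` to drop an item.
trait Stage {
    fn process(&mut self, item: String) -> Option<String>;
}

/// Validation and transformation: trims whitespace and drops empty items.
struct NonEmpty;

impl Stage for NonEmpty {
    fn process(&mut self, item: String) -> Option<String> {
        let trimmed = item.trim();
        if trimmed.is_empty() {
            None
        } else {
            Some(trimmed.to_string())
        }
    }
}

/// Deduplication: drops items that were already seen.
#[derive(Default)]
struct Dedup {
    seen: HashSet<String>,
}

impl Stage for Dedup {
    fn process(&mut self, item: String) -> Option<String> {
        // `insert` returns false when the value was already present.
        if self.seen.insert(item.clone()) {
            Some(item)
        } else {
            None
        }
    }
}

/// Passes each item through every stage in order and keeps the survivors.
fn run(stages: &mut [Box<dyn Stage>], items: Vec<String>) -> Vec<String> {
    items
        .into_iter()
        .filter_map(|item| {
            // `try_fold` stops at the first stage that returns `None`.
            stages.iter_mut().try_fold(item, |it, stage| stage.process(it))
        })
        .collect()
}
```

A storage stage would typically sit last in such a chain so that it only receives items that passed every earlier stage.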

pub fn with_checkpoint_path<P>(self, path: P) -> CrawlerBuilder<S, D>
where P: AsRef<Path>,

Sets the path for saving and loading checkpoints.

When enabled, the crawler periodically saves its state to this file, allowing crawls to be resumed after interruption.

Requires the checkpoint feature to be enabled.
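
As a rough illustration of saving and restoring state at a path, the sketch below stores a list of pending URLs as plain text. The crate's real checkpoint format and contents are not documented here, so the state layout and both function names are assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes the pending URLs to `path`, one per line (hypothetical format).
/// The data goes to a temporary sibling file first and is then renamed,
/// so an interrupted save does not leave a half-written checkpoint.
fn save_checkpoint<P: AsRef<Path>>(path: P, pending_urls: &[String]) -> io::Result<()> {
    let path = path.as_ref();
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, pending_urls.join("\n"))?;
    fs::rename(&tmp, path)
}

/// Loads the pending URLs from `path`; a missing file means a fresh crawl.
fn load_checkpoint<P: AsRef<Path>>(path: P) -> io::Result<Vec<String>> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text.lines().map(str::to_string).collect()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(Vec::new()),
        Err(e) => Err(e),
    }
}
```

Accepting P: AsRef<Path>, as with_checkpoint_path does, lets callers pass a &str, String, PathBuf, or &Path without converting first.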

pub fn with_checkpoint_interval(self, interval: Duration) -> CrawlerBuilder<S, D>

Sets the interval between automatic checkpoint saves.

When enabled, the crawler saves its state at this interval. Shorter intervals provide more frequent recovery points but may impact performance.

Requires the checkpoint feature to be enabled.
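
The interval logic can be sketched as a small timer that reports when a save is due. This is an illustrative model only; CheckpointTimer is a hypothetical type, and the crate may schedule saves differently (for example with an async timer).

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that decides when the next checkpoint save is due.
struct CheckpointTimer {
    interval: Duration,
    last_save: Instant,
}

impl CheckpointTimer {
    fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, last_save: now }
    }

    /// Returns true, and restarts the timer, once `interval` has elapsed
    /// since the last save; otherwise returns false.
    fn should_save(&mut self, now: Instant) -> bool {
        if now.duration_since(self.last_save) >= self.interval {
            self.last_save = now;
            true
        } else {
            false
        }
    }
}
```

With a 30-second interval, an interruption loses at most about 30 seconds of progress in this model; halving the interval halves that window but doubles the number of saves.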

pub async fn build(self) -> Result<Crawler<S, <D as Downloader>::Client>, SpiderError>
where D: Downloader + Send + Sync + 'static, <D as Downloader>::Client: Send + Sync + Clone, <S as Spider>::Item: Send + Sync + 'static,

Builds the Crawler instance.

This method finalizes the crawler configuration and initializes all components. It performs validation and sets up default values where necessary.

§Errors

Returns a SpiderError::ConfigurationError if:

max_concurrent_downloads is 0
max_parser_workers is 0
No spider was provided to the builder
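
The zero-value checks listed above can be sketched with a simplified, synchronous builder. SketchBuilder, SketchCrawler, and SketchError are hypothetical names, the starting values in new() are placeholders rather than the crate's defaults, and the real build is async and validates more than this.

```rust
/// Hypothetical error type mirroring the variant named above.
#[derive(Debug, PartialEq)]
enum SketchError {
    ConfigurationError(String),
}

/// Hypothetical result of a successful build.
#[derive(Debug, PartialEq)]
struct SketchCrawler {
    max_concurrent_downloads: usize,
    parser_workers: usize,
}

/// Hypothetical builder holding only the two validated settings.
#[derive(Debug)]
struct SketchBuilder {
    max_concurrent_downloads: usize,
    parser_workers: usize,
}

impl SketchBuilder {
    /// Placeholder starting values, not the crate's real defaults.
    fn new() -> Self {
        Self { max_concurrent_downloads: 16, parser_workers: 4 }
    }

    /// Each setter consumes the builder and returns it, enabling chaining.
    fn max_concurrent_downloads(mut self, limit: usize) -> Self {
        self.max_concurrent_downloads = limit;
        self
    }

    fn max_parser_workers(mut self, limit: usize) -> Self {
        self.parser_workers = limit;
        self
    }

    /// Rejects zero-valued limits before constructing the crawler.
    fn build(self) -> Result<SketchCrawler, SketchError> {
        if self.max_concurrent_downloads == 0 {
            return Err(SketchError::ConfigurationError(
                "max_concurrent_downloads must be greater than 0".to_string(),
            ));
        }
        if self.parser_workers == 0 {
            return Err(SketchError::ConfigurationError(
                "parser_workers must be greater than 0".to_string(),
            ));
        }
        Ok(SketchCrawler {
            max_concurrent_downloads: self.max_concurrent_downloads,
            parser_workers: self.parser_workers,
        })
    }
}
```

Deferring validation to build keeps the setters infallible, so a chain of calls needs only one error check at the end.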

§Example

let crawler = CrawlerBuilder::new(MySpider)
    .max_concurrent_downloads(10)
    .build()
    .await?;

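Trait Implementations§

impl<S> Default for CrawlerBuilder<S, ReqwestClientDownloader>
where S: Spider,

fn default() -> CrawlerBuilder<S, ReqwestClientDownloader>

Returns the “default value” for a type.

Auto Trait Implementations§

impl<S, D> Freeze for CrawlerBuilder<S, D>
where D: Freeze, S: Freeze,

impl<S, D> !RefUnwindSafe for CrawlerBuilder<S, D>

impl<S, D> Send for CrawlerBuilder<S, D>

impl<S, D> Sync for CrawlerBuilder<S, D>

impl<S, D> Unpin for CrawlerBuilder<S, D>
where D: Unpin, S: Unpin,

impl<S, D> UnsafeUnpin for CrawlerBuilder<S, D>
where D: UnsafeUnpin, S: UnsafeUnpin,

impl<S, D> !UnwindSafe for CrawlerBuilder<S, D>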

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow.

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.
