pub struct CrawlerEngine<'a> {
pub paused: bool,
/* private fields */
}
The crawler engine, which orchestrates the entire crawl loop.
CrawlerEngine is the runtime counterpart to the Spider trait. It owns
all the infrastructure (scheduler, sessions, cache, checkpoint, robots.txt)
and drives the fetch-parse-enqueue cycle. Create one with new,
then call crawl to start processing.
The engine supports graceful pause via request_pause:
the first call initiates a graceful wind-down (waiting for in-flight
requests to finish), and a second call triggers an immediate force stop.
Fields
paused: bool
Whether the crawl is currently in a paused state.
Implementations
impl<'a> CrawlerEngine<'a>
pub fn new(
    spider: &'a dyn Spider,
    crawldir: Option<PathBuf>,
    interval_secs: f64,
) -> Result<Self>
Creates a new crawler engine for the given spider with optional checkpoint support.
Pass a crawldir path to enable pause/resume checkpointing, or None
to disable it. interval_secs controls how often auto-checkpoints are
saved during the crawl (0.0 disables periodic saves). The spider’s
configure_sessions method is called immediately to populate the
session manager; an error is returned if no sessions are registered.
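The construction rules above can be sketched with stand-in types. This is a minimal model, not the crate's implementation: `EngineSketch`, `SessionManager`, its `register` method, the reduced `Spider` trait, the `String` error type, and the two example spiders are all hypothetical names introduced for illustration. It only shows the two decisions described here: failing when no sessions are registered, and treating `0.0` as "no periodic checkpoints".

```rust
use std::path::PathBuf;
use std::time::Duration;

/// Hypothetical stand-in for the crate's session manager.
#[derive(Default)]
pub struct SessionManager {
    names: Vec<String>,
}

impl SessionManager {
    pub fn register(&mut self, name: &str) {
        self.names.push(name.to_string());
    }
}

/// Hypothetical, reduced version of the `Spider` trait.
pub trait Spider {
    fn configure_sessions(&self, manager: &mut SessionManager);
}

/// Stand-in for the engine, keeping only what `new` decides up front.
pub struct EngineSketch<'a> {
    pub spider: &'a dyn Spider,
    pub sessions: SessionManager,
    pub crawldir: Option<PathBuf>,
    pub checkpoint_interval: Option<Duration>,
}

impl<'a> EngineSketch<'a> {
    pub fn new(
        spider: &'a dyn Spider,
        crawldir: Option<PathBuf>,
        interval_secs: f64,
    ) -> Result<Self, String> {
        // Let the spider register its sessions right away.
        let mut sessions = SessionManager::default();
        spider.configure_sessions(&mut sessions);
        if sessions.names.is_empty() {
            return Err("spider registered no sessions".to_string());
        }
        // 0.0 (or any non-positive value) means no periodic checkpoints.
        let checkpoint_interval =
            (interval_secs > 0.0).then(|| Duration::from_secs_f64(interval_secs));
        Ok(Self { spider, sessions, crawldir, checkpoint_interval })
    }
}

/// Example spider that registers one session.
pub struct OneSession;
impl Spider for OneSession {
    fn configure_sessions(&self, manager: &mut SessionManager) {
        manager.register("default");
    }
}

/// Example spider that registers nothing, so construction fails.
pub struct NoSessions;
impl Spider for NoSessions {
    fn configure_sessions(&self, _manager: &mut SessionManager) {}
}
```

Under these assumptions, `EngineSketch::new(&OneSession, Some(PathBuf::from("crawl_state")), 30.0)` succeeds with a 30-second checkpoint interval, while `EngineSketch::new(&NoSessions, None, 0.0)` returns an error.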
pub fn request_pause(&mut self)
Requests a graceful pause of the crawl. On the first call, the engine waits for all in-flight requests to finish before saving a checkpoint and exiting the loop. Calling this a second time triggers a force stop that abandons in-flight requests immediately.
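The two-stage behaviour can be pictured as a small state machine. The sketch below is a hypothetical model of those semantics only; `PauseState` and `should_exit` are illustrative names, and the engine's real internal state is private and may be organised differently.

```rust
/// Hypothetical model of the two-stage pause; not the crate's internals.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PauseState {
    /// Normal operation.
    Running,
    /// First request: stop scheduling, let in-flight requests finish.
    Graceful,
    /// Second request: abandon in-flight requests immediately.
    ForceStop,
}

impl PauseState {
    /// Mirrors `request_pause`: each call escalates by one stage.
    pub fn request_pause(self) -> Self {
        match self {
            PauseState::Running => PauseState::Graceful,
            PauseState::Graceful | PauseState::ForceStop => PauseState::ForceStop,
        }
    }

    /// Whether the crawl loop may exit, given the number of in-flight requests.
    pub fn should_exit(self, in_flight: usize) -> bool {
        match self {
            PauseState::Running => false,
            PauseState::Graceful => in_flight == 0,
            PauseState::ForceStop => true,
        }
    }
}
```

In this model a graceful pause with three requests still in flight keeps the loop alive, whereas a second `request_pause` call lets it exit regardless of what is still running.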
pub async fn crawl(&mut self) -> Result<CrawlStats>
Runs the main crawl loop and returns aggregate statistics when finished.
This is the primary entry point for executing a crawl. The method blocks
(asynchronously) until the scheduler is empty and all tasks are done, or
until a pause/force-stop is requested. On success it returns the final
CrawlStats; check self.paused to determine whether the crawl
completed or was interrupted.
pub fn items(&self) -> &ItemList
Returns a reference to the collected scraped items. You can call this during or after the crawl to inspect what has been scraped so far.
pub fn stats(&self) -> &CrawlStats
Returns a reference to the current crawl statistics. Like items(),
this is available both during and after the crawl for monitoring
progress.
pub fn stream(&mut self) -> UnboundedReceiver<Value>
Creates a streaming receiver that yields items as they are scraped.
Call this before crawl() to get an unbounded
receiver. Each item passes through on_scraped_item() and is sent
to both the receiver and the internal ItemList.
This is the Rust equivalent of Python’s async for item in spider.stream().
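Example
let mut engine = CrawlerEngine::new(&spider, None, 0.0)?;
let mut rx = engine.stream();
// CrawlerEngine is !Send and borrows the spider, so it cannot be moved into
// tokio::spawn. Drive the crawl and the receiver on the same task instead.
let stats = {
    let crawl = engine.crawl();
    tokio::pin!(crawl);
    loop {
        tokio::select! {
            // The crawl finished or was paused: stop polling
            result = &mut crawl => break result?,
            // Process items as they arrive
            Some(item) = rx.recv() => println!("Got item: {}", item),
        }
    }
};
// Drain items that were sent before the crawl future completed
while let Ok(item) = rx.try_recv() {
    println!("Got item: {}", item);
}
Auto Trait Implementations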
impl<'a> Freeze for CrawlerEngine<'a>
impl<'a> !RefUnwindSafe for CrawlerEngine<'a>
impl<'a> !Send for CrawlerEngine<'a>
impl<'a> !Sync for CrawlerEngine<'a>
impl<'a> Unpin for CrawlerEngine<'a>
impl<'a> UnsafeUnpin for CrawlerEngine<'a>
impl<'a> !UnwindSafe for CrawlerEngine<'a>
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.