pub struct CrawlerEngine<'a> {
    pub paused: bool,
    /* private fields */
}
The crawler engine – orchestrates the entire crawl loop.
CrawlerEngine is the runtime counterpart to the Spider trait. It owns
all the infrastructure (scheduler, sessions, cache, checkpoint, robots.txt)
and drives the fetch-parse-enqueue cycle. Create one with new,
then call crawl to start processing.
The engine supports graceful pause via request_pause:
the first call initiates a graceful wind-down (waiting for in-flight
requests to finish), and a second call triggers an immediate force stop.
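The fetch-parse-enqueue cycle described above can be illustrated with a minimal, standard-library-only sketch. It is not the engine's actual implementation: `Site` and `toy_crawl` are hypothetical names, the "fetch" step is just a lookup in an in-memory map, and there are no sessions, cache, or robots.txt handling.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for a website: each URL maps to the links found on that page.
pub type Site = HashMap<&'static str, Vec<&'static str>>;

/// Runs a toy fetch-parse-enqueue loop and returns the URLs in visit order.
pub fn toy_crawl(site: &Site, start: &'static str) -> Vec<&'static str> {
    let mut scheduler: VecDeque<&'static str> = VecDeque::new();
    let mut seen: HashSet<&'static str> = HashSet::new();
    let mut visited = Vec::new();

    scheduler.push_back(start);
    seen.insert(start);

    while let Some(url) = scheduler.pop_front() {
        // "Fetch": look the page up in memory; unknown URLs yield no links.
        let links = site.get(url).cloned().unwrap_or_default();
        visited.push(url);

        // "Parse" and "enqueue": schedule every link that has not been seen yet.
        for link in links {
            if seen.insert(link) {
                scheduler.push_back(link);
            }
        }
    }
    visited
}
```

The loop ends when the scheduler is empty, which mirrors the normal completion condition of a crawl.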
Fields
paused: bool
Whether the crawl is currently in a paused state.
Implementations
impl<'a> CrawlerEngine<'a>
pub fn new(
    spider: &'a dyn Spider,
    crawldir: Option<PathBuf>,
    interval_secs: f64,
) -> Result<Self>
Creates a new crawler engine for the given spider with optional checkpoint support.
Pass a crawldir path to enable pause/resume checkpointing, or None
to disable it. interval_secs controls how often auto-checkpoints are
saved during the crawl (0.0 disables periodic saves). The spider’s
configure_sessions method is called immediately to populate the
session manager; an error is returned if no sessions are registered.
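One plausible reading of these argument rules is sketched below with standard-library types only. `CheckpointPlan` and `plan_checkpoints` are hypothetical names, and `session_count` stands in for whatever the spider registers in `configure_sessions`. The sketch also assumes that periodic saves require a `crawldir`, which the text above implies but does not state outright.

```rust
use std::path::PathBuf;
use std::time::Duration;

/// Hypothetical summary of the checkpoint settings derived from the constructor arguments.
#[derive(Debug, PartialEq)]
pub struct CheckpointPlan {
    /// Directory for checkpoints; `None` means checkpointing is disabled.
    pub dir: Option<PathBuf>,
    /// Period between automatic saves; `None` means no periodic saves.
    pub auto_save_every: Option<Duration>,
}

/// Validates the arguments the way the description suggests (an assumption, not the real code).
pub fn plan_checkpoints(
    crawldir: Option<PathBuf>,
    interval_secs: f64,
    session_count: usize,
) -> Result<CheckpointPlan, String> {
    // The constructor is documented to fail when no sessions are registered.
    if session_count == 0 {
        return Err("no sessions registered".to_string());
    }
    // 0.0 disables periodic saves; non-finite values are treated the same way here.
    let periodic = crawldir.is_some() && interval_secs.is_finite() && interval_secs > 0.0;
    let auto_save_every = if periodic {
        Some(Duration::from_secs_f64(interval_secs))
    } else {
        None
    };
    Ok(CheckpointPlan { dir: crawldir, auto_save_every })
}
```

With the real engine, the corresponding call would look like `CrawlerEngine::new(&spider, Some(PathBuf::from("crawl_state")), 30.0)?`, where the directory name is only an example.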
pub fn request_pause(&mut self)
Requests a graceful pause of the crawl. On the first call, the engine waits for all in-flight requests to finish before saving a checkpoint and exiting the loop. Calling this a second time triggers a force stop that abandons in-flight requests immediately.
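The two-stage behaviour can be modelled as a small state machine. This is a sketch of the documented semantics only; `PauseState` and `on_request` are hypothetical names and are not claimed to match the engine's internals.

```rust
/// Hypothetical model of the pause states described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PauseState {
    /// Normal operation: no pause has been requested.
    Running,
    /// First request: let in-flight requests finish, then checkpoint and exit.
    PauseRequested,
    /// Second request: abandon in-flight requests immediately.
    ForceStop,
}

impl PauseState {
    /// Returns the state after one more pause request.
    pub fn on_request(self) -> PauseState {
        match self {
            PauseState::Running => PauseState::PauseRequested,
            // Any request after the first escalates to, or stays at, a force stop.
            PauseState::PauseRequested | PauseState::ForceStop => PauseState::ForceStop,
        }
    }
}
```

A typical trigger for the real method would be a signal handler, where the first Ctrl-C asks for a graceful pause and the second forces the stop.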
pub async fn crawl(&mut self) -> Result<CrawlStats>
Runs the main crawl loop and returns aggregate statistics when finished.
This is the primary entry point for executing a crawl. The returned future
resolves only once the scheduler is empty and all tasks are done, or once a
pause or force stop has been requested. On success it returns the final
CrawlStats; check self.paused to determine whether the crawl
completed or was interrupted.
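Because a successful return covers both outcomes, callers may want to make the distinction explicit. The sketch below shows one way to do that. `ToyStats`, its `pages_fetched` field, `CrawlOutcome`, and `classify` are all hypothetical stand-ins, since the fields of the real `CrawlStats` are not shown here.

```rust
/// Hypothetical stand-in for the engine's statistics type.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyStats {
    pub pages_fetched: u64,
}

/// Makes the "completed vs. interrupted" distinction explicit in the type.
#[derive(Debug, PartialEq)]
pub enum CrawlOutcome {
    Completed(ToyStats),
    Interrupted(ToyStats),
}

/// Combines the returned stats with the engine's `paused` flag.
pub fn classify(paused: bool, stats: ToyStats) -> CrawlOutcome {
    if paused {
        CrawlOutcome::Interrupted(stats)
    } else {
        CrawlOutcome::Completed(stats)
    }
}
```

With the real engine this would be called as `classify(engine.paused, ...)` after `engine.crawl().await?` returns, with the stats converted as appropriate.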
pub fn items(&self) -> &ItemList
Returns a reference to the collected scraped items. You can call this during or after the crawl to inspect what has been scraped so far.
pub fn stats(&self) -> &CrawlStats
Returns a reference to the current crawl statistics. Like items(),
this is available both during and after the crawl for monitoring
progress.
pub fn stream(&mut self) -> UnboundedReceiver<Value>
Creates a streaming receiver that yields items as they are scraped.
Call this before crawl() to get an unbounded
receiver. Each item passes through on_scraped_item() and is sent
to both the receiver and the internal ItemList.
This is the Rust equivalent of Python’s async for item in spider.stream().
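Example
    let mut engine = CrawlerEngine::new(&spider, None, 0.0)?;
    let mut rx = engine.stream();

    // Spawn the crawl in the background. tokio::spawn requires a 'static
    // future, so `spider` must be borrowed for 'static in this pattern.
    let crawl_handle = tokio::spawn(async move {
        engine.crawl().await
    });

    // Process items as they arrive; the loop ends once the sender is dropped.
    while let Some(item) = rx.recv().await {
        println!("Got item: {}", item);
    }

    // The first `?` handles the join error, the second the crawl's own Result.
    let stats = crawl_handle.await??;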
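The fan-out described for stream(), where each item goes to both the receiver and the internal list, can be sketched without an async runtime by using the standard library's `std::sync::mpsc` channel, which is also unbounded. `ToyEngine`, `emit`, and `finish` are hypothetical names; the real engine uses an `UnboundedReceiver<Value>` and calls `on_scraped_item()` first, which this sketch omits.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical miniature of the engine's item handling.
pub struct ToyEngine {
    items: Vec<String>,
    tx: Option<Sender<String>>,
}

impl ToyEngine {
    pub fn new() -> Self {
        ToyEngine { items: Vec::new(), tx: None }
    }

    /// Creates the channel and keeps the sending half inside the engine.
    pub fn stream(&mut self) -> Receiver<String> {
        let (tx, rx) = channel();
        self.tx = Some(tx);
        rx
    }

    /// Records a scraped item in the internal list and forwards a copy to the receiver, if any.
    pub fn emit(&mut self, item: &str) {
        if let Some(tx) = &self.tx {
            // A send error only means the receiver was dropped; the item is still stored.
            let _ = tx.send(item.to_string());
        }
        self.items.push(item.to_string());
    }

    /// Drops the sender so that the receiver's iteration terminates.
    pub fn finish(&mut self) {
        self.tx = None;
    }

    pub fn items(&self) -> &[String] {
        &self.items
    }
}
```

Calling `stream()` before emitting anything matters in this model: items emitted earlier reach only the internal list, which matches the advice to call stream() before crawl().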