pub struct Scheduler { /* private fields */ }Expand description
Manages the request queue and tracks visited URLs to prevent duplicate crawling.
The Scheduler is responsible for:
- Maintaining a queue of pending requests
- Tracking which URLs have been visited using a Bloom Filter and LRU cache
- Providing backpressure when too many requests are pending
- Supporting checkpoint-based state restoration
§Architecture
The scheduler runs as a separate async task and communicates via message passing. This design ensures thread-safe access without requiring explicit locks.
§Duplicate Detection
The scheduler uses a two-tier approach for duplicate detection:
- Bloom Filter: Fast, memory-efficient probabilistic check (may have false positives)
- LRU Cache: Definitive check with TTL-based eviction
Requests are first checked against the Bloom Filter. If it indicates a possible duplicate, the LRU cache is consulted for confirmation.
Implementations§
Source§impl Scheduler
impl Scheduler
Sourcepub fn new(
_initial_state: Option<()>,
) -> (Arc<Scheduler>, AsyncReceiver<Request>)
pub fn new( _initial_state: Option<()>, ) -> (Arc<Scheduler>, AsyncReceiver<Request>)
Creates a new Scheduler and returns a tuple containing the scheduler and a request receiver.
This method initializes the scheduler with optional checkpoint state for resuming interrupted crawls. When checkpoint data is provided, the scheduler restores:
- Pending request queue
- Visited URL cache
- Salvaged requests
The scheduler spawns two background tasks:
- Bloom Filter Flush Task: Periodically flushes the URL fingerprint buffer
- Run Loop Task: Processes incoming messages and dispatches requests
§Parameters
initial_state: Optional checkpoint state to restore from a previous crawl
§Returns
A tuple of (Arc<Scheduler>, AsyncReceiver<Request>):
Arc<Scheduler>: Thread-safe reference to the scheduler for sending commandsAsyncReceiver<Request>: Channel receiver for consuming scheduled requests
§Example
let (scheduler, request_rx) = Scheduler::new(None::<()>);pub async fn snapshot(&self) -> Result<(), SpiderError>
pub async fn enqueue_request(&self, request: Request) -> Result<(), SpiderError>
Sourcepub async fn shutdown(&self) -> Result<(), SpiderError>
pub async fn shutdown(&self) -> Result<(), SpiderError>
Signals the scheduler loop to stop processing new work.
§Errors
Returns an error when the shutdown message cannot be sent.
Sourcepub async fn mark_visited(&self, fingerprint: String) -> Result<(), SpiderError>
pub async fn mark_visited(&self, fingerprint: String) -> Result<(), SpiderError>
Marks a single fingerprint as visited.
§Errors
Returns an error when the message cannot be delivered to the scheduler loop.
Sourcepub async fn mark_visited_batch(
&self,
fingerprints: Vec<String>,
) -> Result<(), SpiderError>
pub async fn mark_visited_batch( &self, fingerprints: Vec<String>, ) -> Result<(), SpiderError>
Marks multiple fingerprints as visited in one message.
If fingerprints is empty, this method returns immediately.
§Errors
Returns an error when the batch message cannot be delivered to the scheduler loop.
Sourcepub fn is_visited(&self, fingerprint: &str) -> bool
pub fn is_visited(&self, fingerprint: &str) -> bool
Returns true if fingerprint has already been visited.
Sourcepub fn should_enqueue(&self, request: &Request) -> bool
pub fn should_enqueue(&self, request: &Request) -> bool
Returns true when request has not been visited yet.