§Engine Module
Implements the core crawling engine that orchestrates the web scraping process.
§Overview
The engine module provides the main Crawler struct and its associated
components that manage the entire scraping workflow. It handles requests,
responses, items, and coordinates the various subsystems including
downloaders, middlewares, parsers, and pipelines.
§Key Components
- Crawler: The central orchestrator that manages the crawling lifecycle
- Downloader Task: Handles HTTP requests and response retrieval
- Parser Task: Processes responses and extracts data according to spider logic
- Item Processor: Handles scraped items through registered pipelines
- Middleware Manager: Coordinates request/response processing through middlewares
§Architecture
The engine uses an asynchronous, task-based model where different operations run concurrently in separate Tokio tasks. Communication between components happens through async channels, allowing for high-throughput processing.
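The staged, channel-connected design described above can be sketched as follows. This is an illustrative analogue only, not the crate's actual implementation: it uses std threads and `std::sync::mpsc` channels in place of Tokio tasks and async channels so the sketch is dependency-free, and the `run_pipeline` function and its stage logic are invented for the example.

```rust
use std::sync::mpsc;
use std::thread;

/// Run a simplified three-stage pipeline (downloader -> parser -> item
/// processor) and collect the processed items. Each stage runs on its own
/// thread and hands work to the next stage over a channel, mirroring the
/// engine's task-per-subsystem layout.
fn run_pipeline(urls: Vec<String>) -> Vec<String> {
    // Channel from the "downloader" stage to the "parser" stage.
    let (dl_tx, dl_rx) = mpsc::channel::<String>();
    // Channel from the "parser" stage to the "item processor" stage.
    let (item_tx, item_rx) = mpsc::channel::<String>();

    // Downloader stage: simulate fetching each URL.
    let downloader = thread::spawn(move || {
        for url in urls {
            dl_tx.send(format!("<html>{url}</html>")).unwrap();
        }
        // dl_tx is dropped here, which closes the channel and lets the
        // parser's loop terminate.
    });

    // Parser stage: turn each response into a scraped item.
    let parser = thread::spawn(move || {
        for response in dl_rx {
            item_tx.send(format!("item from {response}")).unwrap();
        }
    });

    // Item processor stage: collect items (a stand-in for running them
    // through registered pipelines).
    let items: Vec<String> = item_rx.iter().collect();

    downloader.join().unwrap();
    parser.join().unwrap();
    items
}

fn main() {
    let items = run_pipeline(vec!["https://example.com/1".into()]);
    println!("{items:?}");
}
```

Because each stage owns its end of a channel, backpressure and shutdown fall out naturally: when an upstream sender is dropped, the downstream loop ends on its own.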
§Internal Components
These are implementation details and are not typically used directly:
- spawn_downloader_task: Creates the task responsible for downloading web pages
- spawn_parser_task: Creates the task responsible for parsing responses
- spawn_item_processor_task: Creates the task responsible for processing items
- SharedMiddlewareManager: Manages concurrent access to middlewares
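The shared-access pattern behind SharedMiddlewareManager can be sketched with a reader-writer lock. This is a hypothetical stand-in, not the crate's real type: the `Middleware` trait, `AddHeader` middleware, and `SharedMiddlewares` wrapper are all invented for illustration, and std threads stand in for Tokio tasks.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical middleware: rewrites a request URL before download.
trait Middleware: Send + Sync {
    fn process_request(&self, url: &str) -> String;
}

struct AddHeader;
impl Middleware for AddHeader {
    fn process_request(&self, url: &str) -> String {
        format!("{url}?traced=1")
    }
}

/// Stand-in for SharedMiddlewareManager: the middleware chain behind an
/// Arc<RwLock<..>> so every crawler task can apply it concurrently.
#[derive(Clone)]
struct SharedMiddlewares(Arc<RwLock<Vec<Box<dyn Middleware>>>>);

impl SharedMiddlewares {
    /// Run a URL through every middleware in order, under a read lock.
    fn apply(&self, url: &str) -> String {
        let chain = self.0.read().unwrap();
        chain
            .iter()
            .fold(url.to_string(), |u, m| m.process_request(&u))
    }
}

fn main() {
    let shared = SharedMiddlewares(Arc::new(RwLock::new(vec![
        Box::new(AddHeader) as Box<dyn Middleware>,
    ])));

    // Several tasks apply the shared middleware chain concurrently;
    // read locks do not block each other.
    let handles: Vec<_> = (0..3)
        .map(|i| {
            let shared = shared.clone();
            thread::spawn(move || shared.apply(&format!("https://example.com/{i}")))
        })
        .collect();

    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```

A read-mostly lock fits here because middlewares are typically registered once at startup and then only read by the downloader and parser tasks.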
Structs§
- Crawler
- The central orchestrator for the web scraping process, handling requests, responses, items, concurrency, checkpointing, and statistics collection.
- CrawlerContext
- Aggregated context shared across all crawler tasks.