
Module engine


§Engine Module

Implements the core crawling engine that orchestrates the web scraping process.

§Overview

The engine module provides the main Crawler struct and its associated components, which together manage the entire scraping workflow. It handles requests, responses, and items, and coordinates the various subsystems: downloaders, middlewares, parsers, and pipelines.

§Key Components

  • Crawler: The central orchestrator that manages the crawling lifecycle
  • Downloader Task: Handles HTTP requests and response retrieval
  • Parser Task: Processes responses and extracts data according to spider logic
  • Item Processor: Handles scraped items through registered pipelines
  • Middleware Manager: Coordinates request/response processing through middlewares

§Architecture

The engine uses an asynchronous, task-based model where different operations run concurrently in separate Tokio tasks. Communication between components happens through async channels, allowing for high-throughput processing.

§Internal Components

These are implementation details and are not typically used directly:

  • spawn_downloader_task: Creates the task responsible for downloading web pages
  • spawn_parser_task: Creates the task responsible for parsing responses
  • spawn_item_processor_task: Creates the task responsible for processing items
  • SharedMiddlewareManager: Manages concurrent access to middlewares

Structs§

Crawler
The central orchestrator for the web scraping process, handling requests, responses, items, concurrency, checkpointing, and statistics collection.
CrawlerContext
Aggregated context shared across all crawler tasks.