
Module engine


§Engine Module

Implements the core crawling engine that orchestrates the web scraping process.

§Overview

The engine module provides the main Crawler struct and its associated components, which together manage the entire scraping workflow. It handles requests, responses, and items, and coordinates the various subsystems: downloaders, middlewares, parsers, and pipelines.

§Key Components

  • Crawler: The central orchestrator that manages the crawling lifecycle
  • Downloader Task: Handles HTTP requests and response retrieval
  • Parser Task: Processes responses and extracts data according to spider logic
  • Item Processor: Handles scraped items through registered pipelines
  • Middleware Manager: Coordinates request/response processing through middlewares

§Architecture

The engine uses an asynchronous, task-based model where different operations run concurrently in separate Tokio tasks. Communication between components happens through async channels, allowing for high-throughput processing.

§Internal Components

These are implementation details and are not typically used directly:

  • spawn_downloader_task: Creates the task responsible for downloading web pages
  • spawn_parser_task: Creates the task responsible for parsing responses
  • spawn_item_processor_task: Creates the task responsible for processing items
  • SharedMiddlewareManager: Manages concurrent access to middlewares

Structs§

Crawler
The central orchestrator for the web scraping process, handling requests, responses, items, concurrency, checkpointing, and statistics collection.
CrawlerContext
Aggregated context shared across all crawler tasks.