# spider-pipeline
Provides built-in pipeline implementations for the spider-lib framework.
## Overview
The spider-pipeline crate contains a collection of pipeline implementations that process, filter, transform, and store scraped data. Pipelines are the final stage in the crawling process, taking the items extracted by spiders and performing operations like validation, storage, or transformation.
Pipelines are organized using feature flags to prevent bloat. Core pipelines are always available, while advanced features can be enabled as needed.
## Available Pipelines
### Core Pipelines (Always Available)
- Console Pipeline: Simple pipeline for printing items to the console (debugging)
- Deduplication Pipeline: Filters out duplicate items based on configurable keys
### Optional Pipelines (Feature-Gated)
- JSON Pipeline: Collects all items and writes them to a JSON file at the end (feature: `pipeline-json`)
- JSONL Pipeline: Streams items as individual JSON objects to a file (feature: `pipeline-jsonl`)
- CSV Pipeline: Exports items to CSV format with automatic schema inference (feature: `pipeline-csv`)
- SQLite Pipeline: Stores items in a SQLite database with automatic schema creation (feature: `pipeline-sqlite`)
- Stream JSON Pipeline: Efficiently streams items to JSON without accumulating in memory (feature: `pipeline-stream-json`)
## Features
This crate uses feature flags to allow selective inclusion of pipeline components:
- `core` (default): Includes core pipeline functionality
- `pipeline-csv`: Enables CSV export capabilities
- `pipeline-json`: Enables JSON writing functionality
- `pipeline-jsonl`: Enables JSONL writing functionality
- `pipeline-sqlite`: Enables SQLite database functionality
- `pipeline-stream-json`: Enables streaming JSON functionality
### Important Feature Relationships
There are no interdependent features within spider-pipeline. All pipeline features operate independently.
To use only core functionality:

```toml
[dependencies]
spider-pipeline = { version = "...", default-features = false, features = ["core"] }
```
To include specific pipelines:

```toml
[dependencies]
spider-pipeline = { version = "...", features = ["pipeline-csv", "pipeline-json"] }
```
## Architecture
Each pipeline implements the Pipeline trait, allowing for flexible composition and chaining of processing steps. Multiple pipelines can be attached to a single crawler to process items in different ways simultaneously.
## Usage
```rust
use spider_pipeline::{ConsolePipeline, JsonPipeline};

// Add pipelines to your crawler (import paths and the output
// file argument are illustrative; adjust to your setup)
let crawler = Crawler::new()
    .add_pipeline(ConsolePipeline::new())
    .add_pipeline(JsonPipeline::new("output.json")?)
    .build()
    .await?;
```
## Pipeline Types
### Console Pipeline
Prints items to the console for debugging purposes.
Configuration:

```rust
use spider_pipeline::ConsolePipeline;

let console_pipeline = ConsolePipeline::new();
```
### Deduplication Pipeline
Filters out duplicate items based on configurable keys to ensure data quality.
Configuration:

```rust
use spider_pipeline::DeduplicationPipeline;

// Deduplicate based on a single field (field names here are examples)
let dedup_pipeline = DeduplicationPipeline::new(vec!["url"]);

// Deduplicate based on multiple fields
let dedup_pipeline = DeduplicationPipeline::new(vec!["url", "title"]);
```
### JSON Pipeline
Collects all items and writes them to a single JSON file at the end of the crawl.
Configuration:

```rust
use spider_pipeline::JsonPipeline;

let json_pipeline = JsonPipeline::new("output.json")?;
```
### JSONL Pipeline
Streams items as individual JSON objects to a file, one per line, for efficient processing.
Configuration:

```rust
use spider_pipeline::JsonlPipeline;

let jsonl_pipeline = JsonlPipeline::new("output.jsonl")?;
```
### CSV Pipeline
Exports items to CSV format with automatic schema inference from the data structure.
Configuration:

```rust
use spider_pipeline::CsvPipeline;

let csv_pipeline = CsvPipeline::new("output.csv")?;
```
### SQLite Pipeline
Stores items in a SQLite database with automatic schema creation based on the item structure.
Configuration:

```rust
use spider_pipeline::SqlitePipeline;

let sqlite_pipeline = SqlitePipeline::new("items.db")?;
```
### Stream JSON Pipeline
Efficiently streams items to JSON format without accumulating them in memory.
Configuration:

```rust
use spider_pipeline::StreamJsonPipeline;

// With the default batch size (100 items)
let stream_json_pipeline = StreamJsonPipeline::new("output.json")?;

// With a custom batch size (value illustrative)
let stream_json_pipeline = StreamJsonPipeline::with_batch_size("output.json", 500)?;
```
## License
This project is licensed under the MIT License - see the LICENSE file for details.