# spider-pipeline
Provides built-in pipeline implementations for the spider-lib framework.
## Overview
The spider-pipeline crate contains a collection of pipeline implementations that process, filter, transform, and store scraped data. Pipelines are the final stage in the crawling process, taking the items extracted by spiders and performing operations like validation, storage, or transformation.
## Available Pipelines
- Console Writer: Simple pipeline for printing items to the console (debugging)
- Deduplication: Filters out duplicate items based on configurable keys
- JSON Writer: Collects all items and writes them to a JSON file at the end
- JSONL Writer: Streams items as individual JSON objects to a file
- CSV Exporter: Exports items to CSV format with automatic schema inference
- SQLite Writer: Stores items in a SQLite database with automatic schema creation
- Streaming JSON Writer: Efficiently streams items to JSON without accumulating in memory
## Architecture
Each pipeline implements the Pipeline trait, allowing for flexible composition and chaining of processing steps. Multiple pipelines can be attached to a single crawler to process items in different ways simultaneously.
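This README does not show the trait itself, but the composition model can be sketched as follows. All names and signatures here are assumptions for illustration (the real spider-lib trait is likely async and richer), not the crate's actual API:

```rust
use std::collections::BTreeMap;

// Hypothetical item type: a map of field names to string values.
type Item = BTreeMap<String, String>;

// Hypothetical Pipeline trait: each stage receives an item and either
// returns it (possibly transformed) or drops it by returning None.
trait Pipeline {
    fn process(&mut self, item: Item) -> Option<Item>;
}

// A chain runs each item through every attached pipeline in order;
// any stage may drop the item, short-circuiting the rest.
struct PipelineChain {
    stages: Vec<Box<dyn Pipeline>>,
}

impl PipelineChain {
    fn run(&mut self, item: Item) -> Option<Item> {
        let mut current = Some(item);
        for stage in self.stages.iter_mut() {
            current = match current {
                Some(it) => stage.process(it),
                None => return None,
            };
        }
        current
    }
}

// Example stage: drop items missing a "title" field.
struct RequireTitle;

impl Pipeline for RequireTitle {
    fn process(&mut self, item: Item) -> Option<Item> {
        if item.contains_key("title") { Some(item) } else { None }
    }
}

fn main() {
    let mut chain = PipelineChain { stages: vec![Box::new(RequireTitle)] };
    let mut item = Item::new();
    item.insert("title".to_string(), "Example".to_string());
    assert!(chain.run(item).is_some());     // item with a title passes
    assert!(chain.run(Item::new()).is_none()); // item without one is dropped
    println!("pipeline chain ok");
}
```

Because each stage is an independent trait object, attaching multiple pipelines to one crawler is just a matter of pushing more boxed stages onto the chain.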
## Usage
```rust
use spider_pipeline::{ConsoleWriterPipeline, JsonWriterPipeline};

// Add pipelines to your crawler. The builder name and constructor
// arguments below are illustrative; see the crate docs for exact signatures.
let crawler = Crawler::new()
    .add_pipeline(JsonWriterPipeline::new("output.json")?)
    .add_pipeline(ConsoleWriterPipeline::new())
    .build()
    .await?;
```
## Pipeline Types
### Console Writer
Prints items to the console for debugging purposes.
Configuration:

```rust
use spider_pipeline::ConsoleWriterPipeline;

let console_writer = ConsoleWriterPipeline::new();
```
### Deduplication
Filters out duplicate items based on configurable keys to ensure data quality.
Configuration:

```rust
use spider_pipeline::DeduplicationPipeline;

// Field names and argument shape are illustrative; check the crate docs
// for the exact constructor signature.

// Deduplicate based on a single field
let dedup_pipeline = DeduplicationPipeline::new(vec!["url"]);

// Deduplicate based on multiple fields
let dedup_pipeline = DeduplicationPipeline::new(vec!["url", "title"]);
```
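Conceptually, key-based deduplication amounts to tracking a set of already-seen key tuples and passing only the first item for each tuple. A self-contained sketch of that idea (types and field names are illustrative, not the crate's internals):

```rust
use std::collections::{BTreeMap, HashSet};

type Item = BTreeMap<String, String>;

/// Returns true the first time a given key combination is seen,
/// false for any later item with the same values in `key_fields`.
fn is_new(seen: &mut HashSet<Vec<String>>, item: &Item, key_fields: &[&str]) -> bool {
    let key: Vec<String> = key_fields
        .iter()
        .map(|f| item.get(*f).cloned().unwrap_or_default())
        .collect();
    seen.insert(key) // HashSet::insert returns false if the key was present
}

fn main() {
    let mut seen = HashSet::new();
    let mut a = Item::new();
    a.insert("url".to_string(), "https://example.com/1".to_string());
    let b = a.clone();
    assert!(is_new(&mut seen, &a, &["url"]));  // first occurrence passes
    assert!(!is_new(&mut seen, &b, &["url"])); // duplicate is filtered out
    println!("dedup ok");
}
```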
### JSON Writer
Collects all items and writes them to a single JSON file at the end of the crawl.
Configuration:

```rust
use spider_pipeline::JsonWriterPipeline;

// Output path is illustrative
let json_writer = JsonWriterPipeline::new("output.json")?;
```
### JSONL Writer
Streams items as individual JSON objects to a file, one per line, for efficient processing.
Configuration:

```rust
use spider_pipeline::JsonlWriterPipeline;

// Output path is illustrative
let jsonl_writer = JsonlWriterPipeline::new("output.jsonl")?;
```
### CSV Exporter
Exports items to CSV format with automatic schema inference from the data structure.
Configuration:

```rust
use spider_pipeline::CsvExporterPipeline;

// Output path is illustrative
let csv_exporter = CsvExporterPipeline::new("output.csv")?;
```
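To illustrate what "schema inference" means here: the column set can be derived from an item's field names, with every item rendered as one row. This sketch is a simplification of the idea (a real exporter, e.g. one built on the `csv` crate, would also handle quoting and escaping):

```rust
use std::collections::BTreeMap;

type Item = BTreeMap<String, String>;

/// Infers a CSV header from the first item's field names and renders
/// every item as one row, leaving missing fields empty.
/// Note: no quoting/escaping is done in this sketch.
fn to_csv(items: &[Item]) -> String {
    let header: Vec<&String> = match items.first() {
        Some(first) => first.keys().collect(),
        None => return String::new(),
    };
    let mut out = header.iter().map(|h| h.as_str()).collect::<Vec<_>>().join(",");
    out.push('\n');
    for item in items {
        let row: Vec<&str> = header
            .iter()
            .map(|h| item.get(*h).map(String::as_str).unwrap_or(""))
            .collect();
        out.push_str(&row.join(","));
        out.push('\n');
    }
    out
}

fn main() {
    let mut item = Item::new();
    item.insert("title".to_string(), "Hello".to_string());
    item.insert("url".to_string(), "https://example.com".to_string());
    let csv = to_csv(&[item]);
    assert_eq!(csv, "title,url\nHello,https://example.com\n");
    println!("{csv}");
}
```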
### SQLite Writer
Stores items in a SQLite database with automatic schema creation based on the item structure.
Configuration:

```rust
use spider_pipeline::SqliteWriterPipeline;

// Database path is illustrative
let sqlite_writer = SqliteWriterPipeline::new("output.db")?;
```
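"Automatic schema creation" generally means deriving a `CREATE TABLE` statement from an item's fields. A minimal sketch of that step, kept dependency-free for illustration (a real writer would infer column types, escape identifiers, and execute the SQL via something like `rusqlite`):

```rust
use std::collections::BTreeMap;

type Item = BTreeMap<String, String>;

/// Builds a CREATE TABLE statement from an item's field names,
/// mapping every field to TEXT for simplicity.
fn create_table_sql(table: &str, item: &Item) -> String {
    let cols: Vec<String> = item.keys().map(|k| format!("{k} TEXT")).collect();
    format!("CREATE TABLE IF NOT EXISTS {table} ({});", cols.join(", "))
}

fn main() {
    let mut item = Item::new();
    item.insert("title".to_string(), "Hello".to_string());
    item.insert("url".to_string(), "https://example.com".to_string());
    let sql = create_table_sql("items", &item);
    assert_eq!(sql, "CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT);");
    println!("{sql}");
}
```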
### Streaming JSON Writer
Efficiently streams items to JSON format without accumulating them in memory.
Configuration:

```rust
use spider_pipeline::StreamingJsonWriterPipeline;

// Output paths and batch size are illustrative

// With default batch size (100 items)
let streaming_json_writer = StreamingJsonWriterPipeline::new("output.json")?;

// With custom batch size
let streaming_json_writer = StreamingJsonWriterPipeline::with_batch_size("output.json", 500)?;
```
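The batching behavior can be pictured with a small stand-alone sketch: items accumulate in a fixed-size buffer that is flushed whenever it fills, so the full result set is never held in memory at once. Everything here is illustrative, not the crate's implementation (a `Vec` stands in for the file writer):

```rust
/// Buffers items and flushes them in fixed-size batches.
struct BatchWriter {
    batch_size: usize,
    buffer: Vec<String>,
    flushed: Vec<String>, // stands in for data already written to disk
}

impl BatchWriter {
    fn new(batch_size: usize) -> Self {
        Self { batch_size, buffer: Vec::new(), flushed: Vec::new() }
    }

    fn push(&mut self, item: String) {
        self.buffer.push(item);
        if self.buffer.len() >= self.batch_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        self.flushed.append(&mut self.buffer);
    }
}

fn main() {
    let mut w = BatchWriter::new(2);
    w.push("a".to_string());
    assert_eq!(w.flushed.len(), 0); // below batch size, nothing written yet
    w.push("b".to_string());
    assert_eq!(w.flushed.len(), 2); // batch boundary reached, buffer flushed
    w.push("c".to_string());
    w.flush();                      // final flush at the end of the crawl
    assert_eq!(w.flushed.len(), 3);
    println!("batching ok");
}
```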
## Dependencies
This crate depends on:
- spider-util: For basic data structures and utilities
- Various external crates for specific output formats (`csv`, `rusqlite` for SQLite, etc.)
## License
This project is licensed under the MIT License - see the LICENSE file for details.