spider-pipeline
Item pipelines for processing, filtering, and exporting scraped data in spider-lib.
Depend on this crate directly when you need pipeline functionality without pulling in the full spider-lib facade crate.
Installation
[dependencies]
spider-pipeline = "0.3.4"
Pipeline Catalog
Core (always available)
- TransformPipeline: normalize/transform item fields.
- ValidationPipeline: enforce field and type rules.
- DeduplicationPipeline: drop duplicate items by key fields.
- ConsolePipeline: print processed items for visibility/debugging.
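To make "drop duplicate items by key fields" concrete, here is a minimal, std-only sketch of the idea. It is not the crate's implementation: the `Item` alias and the `dedup_by_keys` and `item` helpers are hypothetical names used only for illustration, and the real DeduplicationPipeline may store and compare keys differently.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical item type: a flat map of field name to string value.
type Item = BTreeMap<String, String>;

/// Keeps the first item seen for each combination of `key_fields`.
/// Items missing a key field are treated as having an empty value there.
fn dedup_by_keys(items: Vec<Item>, key_fields: &[&str]) -> Vec<Item> {
    let mut seen: HashSet<Vec<String>> = HashSet::new();
    items
        .into_iter()
        .filter(|item| {
            // Build the composite key from the selected fields.
            let key: Vec<String> = key_fields
                .iter()
                .map(|f| item.get(*f).cloned().unwrap_or_default())
                .collect();
            // `insert` returns false when the key was already present.
            seen.insert(key)
        })
        .collect()
}

/// Small helper to build an item from (field, value) pairs.
fn item(pairs: &[(&str, &str)]) -> Item {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

With `&["url"]` as the key, two items sharing a URL collapse to the first one; with `&["url", "title"]` they are kept unless both fields match.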
Optional output pipelines (feature-gated)
- pipeline-json -> JsonPipeline
- pipeline-jsonl -> JsonlPipeline
- pipeline-csv -> CsvPipeline
- pipeline-sqlite -> SqlitePipeline
- pipeline-stream-json -> StreamJsonPipeline
Core Composition Example
use ;
let crawler = new
.add_pipeline
.add_pipeline
.add_pipeline
.add_pipeline
.build
.await?;
Build a Custom Pipeline
Use a custom pipeline when your processing logic is domain-specific (custom scoring, external API enrichment, bespoke filtering, etc.).
use async_trait;
use Pipeline;
use ;
;
Runtime integration:
let crawler = new
.add_pipeline
.build
.await?;
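As a rough illustration of the kind of domain-specific stage described in this section, the sketch below implements a toy scoring filter. Everything in it is an assumption made for the example: the `Pipeline` trait here is a local, synchronous stand-in (the crate's own trait is used with async_trait and may have a different shape), and `Item`, `ScorePipeline`, `min_score`, and `dropped` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical item type for this sketch.
type Item = BTreeMap<String, String>;

/// Local stand-in for a pipeline contract: return `Some(item)` to pass the
/// item on, or `None` to drop it. The real trait in spider-pipeline may differ.
trait Pipeline {
    fn process_item(&mut self, item: Item) -> Option<Item>;
}

/// Domain-specific stage: score an item and drop it below a threshold.
struct ScorePipeline {
    min_score: usize,
    dropped: usize,
}

impl Pipeline for ScorePipeline {
    fn process_item(&mut self, mut item: Item) -> Option<Item> {
        // Toy scoring rule: the number of fields with non-blank values.
        let score = item.values().filter(|v| !v.trim().is_empty()).count();
        if score < self.min_score {
            self.dropped += 1;
            return None;
        }
        // Attach the score so later stages can use it.
        item.insert("score".to_string(), score.to_string());
        Some(item)
    }
}
```

Returning an `Option` keeps the "transform or drop" decision in one place, which is one simple way to model filtering stages.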
Optional Output Pipelines (One by One)
pipeline-json (JsonPipeline)
[dependencies]
= { version = "3.0.0", features = ["pipeline-json"] }
use *;
let crawler = new
.add_pipeline
.build
.await?;
pipeline-jsonl (JsonlPipeline)
[dependencies]
= { version = "3.0.0", features = ["pipeline-jsonl"] }
use *;
let crawler = new
.add_pipeline
.build
.await?;
pipeline-csv (CsvPipeline)
[dependencies]
= { version = "3.0.0", features = ["pipeline-csv"] }
use *;
let crawler = new
.add_pipeline
.build
.await?;
pipeline-sqlite (SqlitePipeline)
[dependencies]
= { version = "3.0.0", features = ["pipeline-sqlite"] }
use *;
let crawler = new
.add_pipeline
.build
.await?;
pipeline-stream-json (StreamJsonPipeline)
[dependencies]
= { version = "3.0.0", features = ["pipeline-stream-json"] }
use *;
let crawler = new
.add_pipeline
.build
.await?;
Pipeline Strategy
A common production sequence:
- TransformPipeline for cleanup.
- ValidationPipeline for schema checks.
- DeduplicationPipeline to control duplicates.
- One or more output pipelines (JsonlPipeline, CsvPipeline, etc.).
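The ordering matters: cleaning up fields first lets validation and deduplication see normalized values. The sketch below models the sequence above with plain closures rather than the crate's pipeline types. The `Item` struct, the `Stage` alias, and `run_sequence` are hypothetical names, and the "output" step simply collects tab-separated lines instead of writing a file.

```rust
use std::collections::HashSet;

/// Hypothetical item with two fields.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    url: String,
    title: String,
}

/// A stage either passes an item on (possibly modified) or drops it.
type Stage = Box<dyn FnMut(Item) -> Option<Item>>;

/// Runs items through transform, validate, and dedup stages in that order,
/// then "outputs" survivors as tab-separated lines.
fn run_sequence(items: Vec<Item>) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut stages: Vec<Stage> = Vec::new();

    // 1. Transform: trim whitespace and lowercase the URL.
    stages.push(Box::new(|mut it: Item| {
        it.url = it.url.trim().to_lowercase();
        it.title = it.title.trim().to_string();
        Some(it)
    }));
    // 2. Validate: require a non-empty title.
    stages.push(Box::new(|it: Item| {
        if it.title.is_empty() { None } else { Some(it) }
    }));
    // 3. Deduplicate: keep only the first item per normalized URL.
    stages.push(Box::new(move |it: Item| {
        if seen.insert(it.url.clone()) { Some(it) } else { None }
    }));

    let mut lines = Vec::new();
    for item in items {
        // `try_fold` stops at the first stage that returns `None`.
        let result = stages.iter_mut().try_fold(item, |it, stage| stage(it));
        // 4. Output: collect a line for every item that survived.
        if let Some(it) = result {
            lines.push(format!("{}\t{}", it.url, it.title));
        }
    }
    lines
}
```

If deduplication ran before the transform step in this sketch, `" HTTPS://Example.com/a "` and `"https://example.com/a"` would be treated as different URLs and both would reach the output.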
Feature Flags
- core (default)
- pipeline-csv
- pipeline-json
- pipeline-jsonl
- pipeline-sqlite
- pipeline-stream-json

[dependencies]
spider-pipeline = { version = "0.3.4", features = ["pipeline-jsonl", "pipeline-csv"] }
When using these pipelines through spider-lib, enable the features with the same names on the spider-lib crate.
Related Crates
License
MIT. See LICENSE.