spider-pipeline 0.3.7

Pipeline implementations for the spider-lib web scraping framework.

spider-pipeline is where scraped items get cleaned up, validated, filtered, and written out. Pipelines sit after parsing, so this crate is a good fit for data-shaping work that should stay separate from page extraction logic.

For normal app code you can enable pipelines through spider-lib. This crate is more useful when you want to compose pipelines directly or ship reusable item-processing stages.

When to use it directly

Use spider-pipeline if you want to:

  • build or publish custom pipelines
  • compose processing stages without the root facade crate
  • depend on output backends directly from lower-level runtime code

Installation

[dependencies]
spider-pipeline = "0.3.7"

Built-in pipelines

Always available:

Type                   Purpose
TransformPipeline      Apply field-level transformations.
ValidationPipeline     Enforce field presence, type, and value rules.
DeduplicationPipeline  Drop duplicate items by selected fields.
ConsolePipeline        Log items for visibility and debugging.

Feature-gated output pipelines:

Feature               Type                Output
pipeline-json         JsonPipeline        JSON array file
pipeline-jsonl        JsonlPipeline       One JSON object per line
pipeline-csv          CsvPipeline         CSV file
pipeline-sqlite       SqlitePipeline      SQLite database
pipeline-stream-json  StreamJsonPipeline  Streaming JSON output
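
The practical difference between JsonPipeline and JsonlPipeline is only the on-disk layout: a single JSON array versus one self-contained object per line. A quick illustration of the JSONL layout itself (standard library only; the item fields are made up for the example):

```rust
/// Renders items as JSON Lines: one self-contained JSON object per line.
fn to_jsonl(items: &[(&str, u32)]) -> String {
    items
        .iter()
        .map(|(title, price)| format!(r#"{{"title":"{}","price":{}}}"#, title, price))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let out = to_jsonl(&[("Dune", 9), ("Hyperion", 12)]);
    // Two items produce two lines; each line parses as standalone JSON,
    // which is what makes JSONL convenient for streaming and appending.
    assert_eq!(out.lines().count(), 2);
}
```

This append-friendly property is also why JSONL pairs well with long-running crawls: each item can be flushed as soon as it clears the pipeline chain.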

Composition example

use spider_pipeline::{
    console::ConsolePipeline,
    dedup::DeduplicationPipeline,
    transform::{TransformOperation, TransformPipeline},
    validation::{ValidationPipeline, ValidationRule},
};

let crawler = spider_core::CrawlerBuilder::new(MySpider)
    .add_pipeline(
        TransformPipeline::new()
            .with_operation(TransformOperation::Trim { field: "title".into() }),
    )
    .add_pipeline(
        ValidationPipeline::new()
            .with_rule("title", ValidationRule::Required)
            .with_rule("title", ValidationRule::NonEmptyString),
    )
    .add_pipeline(DeduplicationPipeline::new(&["url"]))
    .add_pipeline(ConsolePipeline::new())
    .build()
    .await?;

That ordering is a good default: clean first, validate second, deduplicate next, then export or log.

Pipeline contract in one sentence

Each pipeline receives an item, optionally mutates it, either forwards it with Ok(Some(item)) or drops it with Ok(None), and may persist side effects along the way.
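
That contract can be sketched with a simplified, synchronous stand-in (the real Pipeline trait is async and uses PipelineError; here the item is a plain map and the error type is a String, purely for illustration):

```rust
use std::collections::HashMap;

/// Simplified stand-in for the pipeline contract:
/// Ok(Some(item)) forwards the (possibly mutated) item,
/// Ok(None) drops it, Err signals a processing failure.
fn drop_untitled(
    mut item: HashMap<String, String>,
) -> Result<Option<HashMap<String, String>>, String> {
    match item.get("title") {
        Some(t) if !t.trim().is_empty() => {
            // Optional mutation before forwarding: normalize the field.
            let trimmed = t.trim().to_string();
            item.insert("title".to_string(), trimmed);
            Ok(Some(item))
        }
        // Items without a usable title are dropped, not errored.
        _ => Ok(None),
    }
}

fn main() {
    let mut ok = HashMap::new();
    ok.insert("title".to_string(), "  Dune ".to_string());
    assert!(drop_untitled(ok).unwrap().is_some());

    let empty: HashMap<String, String> = HashMap::new();
    assert!(drop_untitled(empty).unwrap().is_none());
}
```

Note the distinction the contract draws: dropping an item is a normal Ok(None) outcome, while Err is reserved for genuine failures.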

Custom pipeline example

use async_trait::async_trait;
use spider_pipeline::pipeline::Pipeline;
use spider_util::{error::PipelineError, item::ScrapedItem};

struct EnrichPipeline;

#[async_trait]
impl<I: ScrapedItem> Pipeline<I> for EnrichPipeline {
    fn name(&self) -> &str {
        "enrich_pipeline"
    }

    async fn process_item(&self, item: I) -> Result<Option<I>, PipelineError> {
        Ok(Some(item))
    }
}

Runnable example in this repo

If you want to see a real pipeline setup instead of a skeleton, run:

cargo run --example books_live --features "live-stats pipeline-csv"

That example uses CsvPipeline and writes output to output/books_live.csv.

When to choose a pipeline instead of middleware

Choose a pipeline when the concern is about scraped items after parsing. Choose middleware when the concern is about requests, responses, retries, or other HTTP lifecycle behavior.

Feature flags

[dependencies]
spider-pipeline = { version = "0.3.7", features = ["pipeline-jsonl", "pipeline-csv"] }

When used through the root crate, enable the same feature names on spider-lib.

Related crates

  • spider-lib — root facade crate; enables these pipelines for application code
  • spider-core — crawler runtime that pipelines attach to via CrawlerBuilder
  • spider-util — shared ScrapedItem and PipelineError types used by the Pipeline trait

License

MIT. See LICENSE.