spider-pipeline 0.3.5

Pipeline implementations for the spider-lib web scraping framework.

Item pipelines for processing, filtering, and exporting scraped data in spider-lib.

Use this crate directly when you need pipeline functionality without pulling in the full spider-lib facade crate.

Installation

[dependencies]
spider-pipeline = "0.3.5"

Pipeline Catalog

Core (always available)

  • TransformPipeline: normalize/transform item fields.
  • ValidationPipeline: enforce field and type rules.
  • DeduplicationPipeline: drop duplicate items by key fields.
  • ConsolePipeline: print processed items for visibility/debugging.

Optional output pipelines (feature-gated)

  • pipeline-json -> JsonPipeline
  • pipeline-jsonl -> JsonlPipeline
  • pipeline-csv -> CsvPipeline
  • pipeline-sqlite -> SqlitePipeline
  • pipeline-stream-json -> StreamJsonPipeline
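
The output pipelines differ mainly in framing: JsonPipeline writes one JSON array, while the JSONL and streaming variants write one object per line so output can be appended and consumed incrementally. A toy, standard-library-only illustration of the two framings (independent of the actual pipeline implementations, which would use a real JSON serializer):

```rust
/// Frame pre-serialized JSON objects as a single JSON array
/// (the shape a JsonPipeline-style writer produces).
fn as_json_array(objects: &[String]) -> String {
    format!("[{}]", objects.join(","))
}

/// Frame pre-serialized JSON objects as JSON Lines: one object per line,
/// so the file can be appended to and read back incrementally
/// (the shape JsonlPipeline / StreamJsonPipeline-style writers favor).
fn as_jsonl(objects: &[String]) -> String {
    objects.join("\n")
}

fn main() {
    let items = vec![
        r#"{"url":"https://example.com/a"}"#.to_string(),
        r#"{"url":"https://example.com/b"}"#.to_string(),
    ];
    println!("{}", as_json_array(&items));
    println!("{}", as_jsonl(&items));
}
```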

Core Composition Example

use spider_pipeline::{
    console::ConsolePipeline,
    dedup::DeduplicationPipeline,
    transform::{TransformOperation, TransformPipeline},
    validation::{ValidationPipeline, ValidationRule},
};

let crawler = spider_core::CrawlerBuilder::new(MySpider)
    .add_pipeline(
        TransformPipeline::new()
            .with_operation(TransformOperation::Trim { field: "title".into() }),
    )
    .add_pipeline(
        ValidationPipeline::new()
            .with_rule("title", ValidationRule::Required)
            .with_rule("title", ValidationRule::NonEmptyString),
    )
    .add_pipeline(DeduplicationPipeline::new(&["url"]))
    .add_pipeline(ConsolePipeline::new())
    .build()
    .await?;

Build a Custom Pipeline

Use a custom pipeline when your processing logic is domain-specific (custom scoring, external API enrichment, bespoke filtering, etc.).

use async_trait::async_trait;
use spider_pipeline::pipeline::Pipeline;
use spider_util::{error::PipelineError, item::ScrapedItem};

struct EnrichPipeline;

#[async_trait]
impl<I: ScrapedItem> Pipeline<I> for EnrichPipeline {
    fn name(&self) -> &str {
        "enrich_pipeline"
    }

    async fn process_item(&self, item: I) -> Result<Option<I>, PipelineError> {
        // Enrich, validate, or drop item by returning Ok(None).
        Ok(Some(item))
    }
}
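
The Ok(Some(item)) / Ok(None) return contract is what lets a pipeline enrich, pass through, or drop an item. A standard-library-only sketch of that filter-or-transform semantics, with a plain function standing in for process_item (the Item struct and field here are illustrative, not part of the crate):

```rust
#[derive(Debug, PartialEq)]
struct Item {
    title: String,
}

/// Stand-in for `process_item`: trim the title, and drop the item
/// entirely when nothing is left after trimming.
fn process(mut item: Item) -> Option<Item> {
    item.title = item.title.trim().to_string();
    if item.title.is_empty() {
        None // analogous to returning Ok(None) from a Pipeline
    } else {
        Some(item) // analogous to Ok(Some(item))
    }
}

fn main() {
    let kept = process(Item { title: "  Hello  ".into() });
    let dropped = process(Item { title: "   ".into() });
    println!("{:?} {:?}", kept, dropped);
}
```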

Runtime integration:

let crawler = spider_core::CrawlerBuilder::new(MySpider)
    .add_pipeline(EnrichPipeline)
    .build()
    .await?;

Optional Output Pipelines (One by One)

pipeline-json (JsonPipeline)

[dependencies]
spider-lib = { version = "3.0.0", features = ["pipeline-json"] }

use spider_lib::prelude::*;

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(JsonPipeline::new("output/items.json")?)
    .build()
    .await?;

pipeline-jsonl (JsonlPipeline)

[dependencies]
spider-lib = { version = "3.0.0", features = ["pipeline-jsonl"] }

use spider_lib::prelude::*;

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(JsonlPipeline::new("output/items.jsonl")?)
    .build()
    .await?;

pipeline-csv (CsvPipeline)

[dependencies]
spider-lib = { version = "3.0.0", features = ["pipeline-csv"] }

use spider_lib::prelude::*;

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(CsvPipeline::new("output/items.csv")?)
    .build()
    .await?;

pipeline-sqlite (SqlitePipeline)

[dependencies]
spider-lib = { version = "3.0.0", features = ["pipeline-sqlite"] }

use spider_lib::prelude::*;

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(SqlitePipeline::new("output/items.db", "items")?)
    .build()
    .await?;

pipeline-stream-json (StreamJsonPipeline)

[dependencies]
spider-lib = { version = "3.0.0", features = ["pipeline-stream-json"] }

use spider_lib::prelude::*;

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(StreamJsonPipeline::new("output/items-stream.json")?)
    .build()
    .await?;

Pipeline Strategy

A common production sequence:

  1. TransformPipeline for cleanup.
  2. ValidationPipeline for schema checks.
  3. DeduplicationPipeline to control duplicates.
  4. One or more output pipelines (JsonlPipeline, CsvPipeline, etc.).
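
The dedup-by-key idea in step 3 can be sketched without the crate: build a key from the chosen fields and keep only the first occurrence of each key. A standard-library-only sketch, simplified relative to whatever DeduplicationPipeline actually does internally (items are modeled as (url, title) pairs, mirroring DeduplicationPipeline::new(&["url"])):

```rust
use std::collections::HashSet;

/// Keep only the first item seen for each key, where the key is the
/// `url` component of a (url, title) pair.
fn dedup_by_url(items: Vec<(String, String)>) -> Vec<(String, String)> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        // HashSet::insert returns false for keys already present,
        // so later duplicates are filtered out.
        .filter(|(url, _)| seen.insert(url.clone()))
        .collect()
}

fn main() {
    let items = vec![
        ("https://example.com/a".to_string(), "A".to_string()),
        ("https://example.com/a".to_string(), "A again".to_string()),
        ("https://example.com/b".to_string(), "B".to_string()),
    ];
    println!("{:?}", dedup_by_url(items));
}
```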

Feature Flags

  • core (default)
  • pipeline-csv
  • pipeline-json
  • pipeline-jsonl
  • pipeline-sqlite
  • pipeline-stream-json

[dependencies]
spider-pipeline = { version = "0.3.5", features = ["pipeline-jsonl", "pipeline-csv"] }

When using these pipelines through spider-lib, enable the identically named features on spider-lib itself.

Related Crates

  • spider-lib: the facade crate that re-exports these pipelines behind the same feature names.
  • spider-core: the crawler runtime (CrawlerBuilder) used in the examples above.
  • spider-util: shared ScrapedItem and PipelineError types.

License

MIT. See LICENSE.