kumo 0.4.0

An async web crawling framework for Rust - Scrapy for Rust

kumo

kumo means spider/cloud in Japanese. It gives you a trait-based, async-first API for writing spiders that scrape, follow links, and store data.

Why kumo?

| Feature | kumo | Scrapy (Python) | Colly (Go) |
|---|---|---|---|
| Language | Rust | Python | Go |
| Type safety | Compile-time | Runtime | Partial |
| Async model | Tokio (true async) | Twisted (event loop) | goroutines |
| Memory safety | Guaranteed | GC | GC |
| CSS / XPath / Regex / JSONPath | Yes | Yes | CSS only |
| #[derive(Extract)] macro | Yes | No | No |
| LLM extraction (Claude / OpenAI / Gemini / Ollama) | Yes | No | No |
| Browser / JS rendering | Yes (chromiumoxide) | Yes (Playwright) | No |
| Stealth mode (TLS/HTTP2 fingerprint spoofing) | Yes | No | No |
| Distributed frontier (Redis) | Yes | Yes (scrapy-redis) | No |
| Item stream API | Yes | No | No |
| Typed crawl events | Yes | Signals | No |
| OpenTelemetry export | Yes | No | No |
| Pluggable stores (JSONL, CSV, Postgres, SQLite, MySQL) | Yes | Yes (pipelines) | No |
| Single binary deploy | Yes | No | Yes |
| Binary size / startup | Small / instant | Large / slow | Small / fast |

Benchmark results - 1,000 books, concurrency 16, median of 3 runs:

| Metric | kumo | Colly (Go) | Scrapy (Python) |
|---|---|---|---|
| Real site - Items/s | 76.7 | 73.5 | 53.3 |
| Local server - Items/s | 12,346 | 4,098 | 180 |
| Peak RSS | 12.5 MB | 31.4 MB | 77.2 MB |

On raw parsing throughput (local server, no network), kumo is about 3.0x faster than Colly and 69x faster than Scrapy. See the benchmark folder for full methodology and reproduction steps.
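The speedup figures follow directly from the items/s numbers in the benchmark results. As a quick check, the sketch below recomputes them; the `speedup` helper is a hypothetical name used only for this illustration and is not part of kumo.

```rust
/// Ratio of two throughput figures, both in items per second.
/// Hypothetical helper for this README, not a kumo API.
fn speedup(ours: f64, theirs: f64) -> f64 {
    ours / theirs
}

fn main() {
    // Local-server throughput (items/s), copied from the benchmark results.
    let (kumo, colly, scrapy) = (12_346.0, 4_098.0, 180.0);
    println!("local vs Colly:  {:.1}x", speedup(kumo, colly));
    println!("local vs Scrapy: {:.0}x", speedup(kumo, scrapy));

    // Real-site throughput (items/s), same source.
    let (kumo, colly, scrapy) = (76.7, 73.5, 53.3);
    println!("real vs Colly:   {:.2}x", speedup(kumo, colly));
    println!("real vs Scrapy:  {:.2}x", speedup(kumo, scrapy));
}
```

The real-site ratios come out much smaller than the local-server ones, which fits the framing above: the large multiples describe raw parsing throughput, not crawls where waiting on the remote site takes most of the time.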

Features

  • Async-first - Tokio-based with bounded concurrency via JoinSet
  • CSS selectors - res.css(".selector") backed by scraper
  • XPath selectors - res.xpath("//h1/text()") for XML/HTML documents (feature: xpath)
  • Regex selectors - res.re(r"\d+"), el.re_first(r"..."), works on Response, Element, and ElementList
  • JSONPath selectors - res.jsonpath("$.store.books[*].title") for JSON responses (feature: jsonpath)
  • #[derive(Extract)] - generate CSS extraction boilerplate from field annotations (feature: derive)
  • Rate limiting - token-bucket RateLimiter via governor
  • Auto-throttle - adaptive delay based on EWMA latency and 429/503 back-off
  • Retry with backoff - exponential backoff, status filtering, jitter, and Retry-After support
  • Item stream - CrawlEngine::stream() returns an async Stream for real-time item consumption
  • Typed crawl events - CrawlEvent lifecycle hooks for dashboards, progress bars, alerts, and embedded runners
  • robots.txt - per-domain fetch + cache, enabled by default, with Crawl-delay support
  • Bloom filter dedup - O(1) URL deduplication, 1M URLs at 0.1% false-positive rate
  • Request scheduling - CrawlRequest supports priority, headers, method/body, metadata, and dont_filter
  • HTTP cache - disk-backed response cache via .http_cache(dir), optional TTL
  • Link extractor - LinkExtractor with allow/deny regex, allow_domains, canonicalize, restrict_css
  • Pluggable storage - JsonlStore, JsonStore, CsvStore, StdoutStore, PostgreSQL, SQLite, MySQL
  • Middleware chain - proxy rotation, custom headers, status retry, rate limiting, auto-throttle
  • Domain + depth filtering - allowed_domains() and max_depth() on the Spider trait
  • Multi-spider engine - run multiple spiders concurrently via .add_spider() / .run_all()
  • LLM extraction - extract structured data without selectors using Claude, OpenAI, Gemini, or Ollama
  • Browser fetcher - headless Chromium via chromiumoxide for JS-rendered pages (feature: browser)
  • Distributed frontier - Redis-backed URL frontier for multi-process crawls (feature: redis-frontier)
  • Persistent frontier - file-backed URL frontier that survives restarts and exposes recovered state (feature: persistence)
  • Sitemap spider - SitemapSpider reads sitemap.xml and sitemap index files
  • Metrics - periodic stats snapshots via tracing::info! using .metrics_interval()
  • OpenTelemetry - OTLP/gRPC export of all spans to Jaeger, Grafana Tempo, Datadog, etc. (feature: otel)
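To give a feel for the "1M URLs at 0.1% false-positive rate" figure in the Bloom filter bullet above, here is a minimal, std-only sketch of Bloom-filter URL deduplication. It is not kumo's implementation: the `Bloom` type, its methods, and the hashing scheme are assumptions made for illustration. It uses the standard sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, which for n = 1,000,000 and p = 0.001 give about 14.4 million bits (roughly 1.8 MB) and 10 hash functions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative Bloom filter; a hypothetical type, not kumo's internal one.
struct Bloom {
    bits: Vec<u64>, // bit array packed into 64-bit words
    m: u64,         // number of bits
    k: u32,         // number of hash probes per URL
}

impl Bloom {
    /// Size the filter for `n` expected URLs and false-positive rate `p`.
    fn with_capacity(n: usize, p: f64) -> Self {
        let ln2 = std::f64::consts::LN_2;
        let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil().max(64.0) as u64;
        let k = ((m as f64 / n as f64) * ln2).round().max(1.0) as u32;
        Bloom { bits: vec![0; ((m + 63) / 64) as usize], m, k }
    }

    /// Two base hashes; the k probe positions are derived from them
    /// by double hashing (Kirsch-Mitzenmacher): h1 + i * h2.
    fn base_hashes(url: &str) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        url.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (url, 0x9e37_79b9u32).hash(&mut h2);
        (h1.finish(), h2.finish() | 1) // force h2 odd so it is never zero
    }

    /// Records the URL. Returns true if it was definitely not seen before,
    /// false if it was probably seen (false positives are possible).
    fn insert(&mut self, url: &str) -> bool {
        let (a, b) = Self::base_hashes(url);
        let mut fresh = false;
        for i in 0..self.k as u64 {
            let idx = a.wrapping_add(i.wrapping_mul(b)) % self.m;
            let (word, bit) = ((idx / 64) as usize, idx % 64);
            if self.bits[word] & (1u64 << bit) == 0 {
                fresh = true;
                self.bits[word] |= 1u64 << bit;
            }
        }
        fresh
    }
}

fn main() {
    let mut seen = Bloom::with_capacity(1_000_000, 0.001);
    println!("{} bits (~{:.1} MB), {} hashes", seen.m, seen.m as f64 / 8.0 / 1e6, seen.k);

    let urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"];
    for url in urls {
        let verdict = if seen.insert(url) { "crawl" } else { "skip (already seen)" };
        println!("{}: {}", url, verdict);
    }
}
```

Each lookup touches a fixed number of bits regardless of how many URLs are stored, which is where the O(1) cost comes from. The trade-off is that a Bloom filter can report a new URL as already seen, so at the target rate a small fraction of pages may be skipped. It never reports a seen URL as new.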

Installation

[dependencies]
kumo = "0.4"
async-trait = "0.1"
serde = { version = "1", features = ["derive"] }
tokio = { version = "1", features = ["full"] }

For #[derive(Extract)]:

kumo = { version = "0.4", features = ["derive"] }

Quick Start

use kumo::prelude::*;
use serde::Serialize;

#[derive(Debug, Serialize)]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

#[async_trait::async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;

    fn name(&self) -> &str { "quotes" }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://quotes.toscrape.com".into()]
    }

    async fn parse(&self, res: &Response) -> Result<Output<Self::Item>, KumoError> {
        // Build one Quote per `.quote` element; a missing field becomes an empty string.
        let quotes: Vec<Quote> = res.css(".quote").iter().map(|el| Quote {
            text:   el.css(".text").first().map(|e| e.text()).unwrap_or_default(),
            author: el.css(".author").first().map(|e| e.text()).unwrap_or_default(),
        }).collect();

        // Look for a "next page" link and turn its href into a URL to follow.
        let next = res.css("li.next a").first()
            .and_then(|el| el.attr("href"))
            .map(|href| res.urljoin(&href));

        // Emit the scraped items and, if there is a next page, queue it.
        let mut output = Output::new().items(quotes);
        if let Some(url) = next { output = output.follow(url); }
        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), KumoError> {
    CrawlEngine::builder()
        .concurrency(5)
        .middleware(DefaultHeaders::new().user_agent("kumo/0.4"))
        .store(JsonlStore::new("quotes.jsonl")?)
        .run(QuotesSpider)
        .await?;
    Ok(())
}

For more examples - production crawl controls, rate limiting, database stores, LLM extraction, browser mode, and all selector types - see the examples/ folder.

Documentation

Full documentation is available at kumo.wihlarkop.com.

Contributing

# Install lefthook (one-time setup)
# macOS
brew install lefthook

# Windows
scoop install lefthook
# or: winget install lefthook

# Linux
curl -1sLf 'https://dl.lefthook.dev/setup.sh' | sh

# Activate the hooks
lefthook install

After lefthook install, every git commit will automatically run cargo fmt (auto-fix) and cargo clippy before the commit goes through.

License

MIT