kumo

An async web crawling framework for Rust - Scrapy for Rust.

kumo means both "spider" and "cloud" in Japanese. It gives you a trait-based, async-first API for writing spiders that scrape pages, follow links, and store data.

Why kumo?

Feature	kumo	Scrapy (Python)	Colly (Go)
Language	Rust	Python	Go
Type safety	Compile-time	Runtime	Partial
Async model	Tokio (true async)	Twisted (event loop)	goroutines
Memory safety	Guaranteed	GC	GC
CSS / XPath / Regex / JSONPath	Yes	Yes	CSS only
`#[derive(Extract)]` macro	Yes	No	No
LLM extraction (Claude / OpenAI / Gemini / Ollama)	Yes	No	No
Browser / JS rendering	Yes (chromiumoxide)	Yes (Playwright)	No
Stealth mode (TLS/HTTP2 fingerprint spoofing)	Yes	No	No
Distributed frontier (Redis)	Yes	Yes (scrapy-redis)	No
Item stream API	Yes	No	No
Typed crawl events	Yes	Signals	No
OpenTelemetry export	Yes	No	No
Pluggable stores (JSONL, CSV, Postgres, SQLite, MySQL)	Yes	Yes (pipelines)	No
Single binary deploy	Yes	No	Yes
Binary size / startup	Small / instant	Large / slow	Small / fast

Benchmark results - 1,000 books, concurrency 16, median of 3 runs:

Metric	kumo	Colly (Go)	Scrapy (Python)
Real site - Items/s	76.7	73.5	53.3
Local server - Items/s	12,346	4,098	180
Peak RSS	12.5 MB	31.4 MB	77.2 MB

On raw parsing throughput (local server, no external network), kumo is 3.0x faster than Colly and 69x faster than Scrapy. See the benchmark folder for full methodology and reproduction steps.
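The ratios quoted above can be re-derived from the table. The following is a minimal sketch of that arithmetic only; the helper names `ratio` and `round1` are hypothetical and not part of kumo, and the constants are copied from the benchmark table.

```rust
/// Items/s on the local server, copied from the benchmark table.
const KUMO_LOCAL: f64 = 12_346.0;
const COLLY_LOCAL: f64 = 4_098.0;
const SCRAPY_LOCAL: f64 = 180.0;

/// Peak RSS in MB, copied from the benchmark table.
const KUMO_RSS_MB: f64 = 12.5;
const COLLY_RSS_MB: f64 = 31.4;
const SCRAPY_RSS_MB: f64 = 77.2;

/// How many times larger `a` is than `b`.
fn ratio(a: f64, b: f64) -> f64 {
    a / b
}

/// Round to one decimal place for reporting.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

With these helpers, `round1(ratio(KUMO_LOCAL, COLLY_LOCAL))` gives 3.0 and `ratio(KUMO_LOCAL, SCRAPY_LOCAL).round()` gives 69. The same calculation on the memory row shows Colly using about 2.5x and Scrapy about 6.2x the peak RSS of kumo.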

Features

Async-first - Tokio-based with bounded concurrency via JoinSet
CSS selectors - res.css(".selector") backed by scraper
XPath selectors - res.xpath("//h1/text()") for XML/HTML documents (feature: xpath)
Regex selectors - res.re(r"\d+"), el.re_first(r"..."), works on Response, Element, and ElementList
JSONPath selectors - res.jsonpath("$.store.books[*].title") for JSON responses (feature: jsonpath)
#[derive(Extract)] - generate CSS extraction boilerplate from field annotations (feature: derive)
Rate limiting - token-bucket RateLimiter via governor
Auto-throttle - adaptive delay based on EWMA latency and 429/503 back-off
Retry with backoff - exponential backoff, status filtering, jitter, and Retry-After support
Item stream - CrawlEngine::stream() returns an async Stream for real-time item consumption
Typed crawl events - CrawlEvent lifecycle hooks for dashboards, progress bars, alerts, and embedded runners
robots.txt - per-domain fetch + cache, enabled by default, with Crawl-delay support
Bloom filter dedup - O(1) URL deduplication, 1M URLs at 0.1% false-positive rate
Request scheduling - CrawlRequest supports priority, headers, method/body, metadata, and dont_filter
HTTP cache - disk-backed response cache via .http_cache(dir), optional TTL
Link extractor - LinkExtractor with allow/deny regex, allow_domains, canonicalize, restrict_css
Pluggable storage - JsonlStore, JsonStore, CsvStore, StdoutStore, PostgreSQL, SQLite, MySQL
Middleware chain - proxy rotation, custom headers, status retry, rate limiting, auto-throttle
Domain + depth filtering - allowed_domains() and max_depth() on the Spider trait
Multi-spider engine - run multiple spiders concurrently via .add_spider() / .run_all()
LLM extraction - extract structured data without selectors using Claude, OpenAI, Gemini, or Ollama
Browser fetcher - headless Chromium via chromiumoxide for JS-rendered pages (feature: browser)
Distributed frontier - Redis-backed URL frontier for multi-process crawls (feature: redis-frontier)
Persistent frontier - file-backed URL frontier that survives restarts and exposes recovered state (feature: persistence)
Sitemap spider - SitemapSpider reads sitemap.xml and sitemap index files
Metrics - periodic stats snapshots logged via tracing::info! at the interval set with .metrics_interval()
OpenTelemetry - OTLP/gRPC export of all spans to Jaeger, Grafana Tempo, Datadog, etc. (feature: otel)
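To make the Bloom filter dedup entry above concrete, here is a minimal, standard-library-only sketch of how a URL Bloom filter can be sized and queried. It is not kumo's implementation, which this README does not show: `bloom_params`, `UrlBloom`, and `check_and_insert` are hypothetical names, and the use of `DefaultHasher` with double hashing is an assumption made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Standard Bloom filter sizing: bit count m and hash count k
/// for `n` expected items at false-positive rate `p`.
fn bloom_params(n: u64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil();
    let k = ((m / n as f64) * ln2).round().max(1.0);
    (m as u64, k as u32)
}

/// Hypothetical URL dedup filter (illustration only, not kumo's type).
struct UrlBloom {
    bits: Vec<u64>, // bit array packed into 64-bit words
    m: u64,         // number of bits
    k: u32,         // number of hash probes per URL
}

impl UrlBloom {
    fn new(expected_urls: u64, fp_rate: f64) -> Self {
        let (m, k) = bloom_params(expected_urls, fp_rate);
        let words = ((m + 63) / 64) as usize;
        Self { bits: vec![0; words], m, k }
    }

    /// Two base hashes; the k probe positions are derived as h1 + i * h2.
    fn hash_pair(url: &str) -> (u64, u64) {
        let mut a = DefaultHasher::new();
        url.hash(&mut a);
        let mut b = DefaultHasher::new();
        (url, 0x9E37_79B9_u32).hash(&mut b); // salted second hash
        (a.finish(), b.finish() | 1) // force h2 odd so probes do not collapse
    }

    /// Returns true if the URL was possibly seen before, and marks it as seen.
    /// Each call does k bit probes, independent of how many URLs are stored.
    fn check_and_insert(&mut self, url: &str) -> bool {
        let (h1, h2) = Self::hash_pair(url);
        let mut seen = true;
        for i in 0..self.k as u64 {
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) % self.m;
            let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                seen = false;
                self.bits[word] |= mask;
            }
        }
        seen
    }
}
```

Under these formulas, 1M URLs at a 0.1% false-positive rate need about 14.4 million bits (roughly 1.8 MB) and 10 hash probes per lookup. A false positive means a new URL is occasionally skipped as a duplicate; a URL that was already inserted is never reported as new.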

Installation

[dependencies]
kumo = "0.4"
async-trait = "0.1"
serde = { version = "1", features = ["derive"] }
tokio = { version = "1", features = ["full"] }

To use #[derive(Extract)], enable the derive feature:

kumo = { version = "0.4", features = ["derive"] }

Quick Start

use kumo::prelude::*;
use serde::Serialize;

#[derive(Debug, Serialize)]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

#[async_trait::async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;

    fn name(&self) -> &str { "quotes" }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://quotes.toscrape.com".into()]
    }

    async fn parse(&self, res: &Response) -> Result<Output<Self::Item>, KumoError> {
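        // Build one Quote per `.quote` element; missing fields fall back to an empty string.
        let quotes: Vec<Quote> = res.css(".quote").iter().map(|el| Quote {
            text:   el.css(".text").first().map(|e| e.text()).unwrap_or_default(),
            author: el.css(".author").first().map(|e| e.text()).unwrap_or_default(),
        }).collect();

        // Look for the "next page" link and turn its href into a full URL.
        let next = res.css("li.next a").first()
            .and_then(|el| el.attr("href"))
            .map(|href| res.urljoin(&href));

        // Emit the scraped items and, if there is a next page, queue it for crawling.
        let mut output = Output::new().items(quotes);
        if let Some(url) = next { output = output.follow(url); }
        Ok(output)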
    }
}

#[tokio::main]
async fn main() -> Result<(), KumoError> {
    CrawlEngine::builder()
        .concurrency(5)
        .middleware(DefaultHeaders::new().user_agent("kumo/0.4"))
        .store(JsonlStore::new("quotes.jsonl")?)
        .run(QuotesSpider)
        .await?;
    Ok(())
}

For more examples - production crawl controls, rate limiting, database stores, LLM extraction, browser mode, and all selector types - see the examples/ folder.

Documentation

Full documentation is available at kumo.wihlarkop.com

Getting Started
Spiders
Extractors
derive(Extract)
Middleware
Stores
Advanced topics - item stream, OpenTelemetry, stealth, browser, and more
Examples
Feature Flags

Contributing

# Install lefthook (one-time setup)
# macOS
brew install lefthook

# Windows
scoop install lefthook
# or: winget install lefthook

# Linux
curl -1sLf 'https://dl.lefthook.dev/setup.sh' | sh

# Activate the hooks
lefthook install

Once lefthook install has been run, every git commit automatically runs cargo fmt (auto-fix) and cargo clippy before the commit is created.

License

MIT

kumo 0.4.0