kumo
An async web crawling framework for Rust — Scrapy for Rust.
kumo (蜘蛛/雲 — spider/cloud) gives you a trait-based, async-first API for writing spiders that scrape, follow links, and store data.
Why Kumo?
| kumo | Scrapy (Python) | Colly (Go) | |
|---|---|---|---|
| Language | Rust | Python | Go |
| Type safety | Compile-time | Runtime | Partial |
| Async model | Tokio (true async) | Twisted (event loop) | goroutines |
| Memory safety | Guaranteed | GC | GC |
| CSS / XPath / Regex / JSONPath | ✓ | ✓ | CSS only |
#[derive(Extract)] macro |
✓ | ✗ | ✗ |
| LLM extraction (Claude / OpenAI / Gemini / Ollama) | ✓ | ✗ | ✗ |
| Browser / JS rendering | ✓ (chromiumoxide) | ✓ (Playwright) | ✗ |
| Stealth mode (TLS/HTTP2 fingerprint spoofing) | ✓ | ✗ | ✗ |
| Distributed frontier (Redis) | ✓ | ✓ (scrapy-redis) | ✗ |
| Item stream API | ✓ | ✗ | ✗ |
| OpenTelemetry export | ✓ | ✗ | ✗ |
| Pluggable stores (JSONL, CSV, Postgres, SQLite, MySQL) | ✓ | ✓ (pipelines) | ✗ |
| Single binary deploy | ✓ | ✗ | ✓ |
| Binary size / startup | Small / instant | Large / slow | Small / fast |
Benchmark results — 1 000 books, concurrency 16, median of 3 runs:
| kumo | Colly (Go) | Scrapy (Python) | |
|---|---|---|---|
| Real site — Items/s | 76.7 | 73.5 | 53.3 |
| Local server — Items/s | 12 346 | 4 098 | 180 |
| Peak RSS | 12.5 MB | 31.4 MB | 77.2 MB |
On raw parsing throughput (local server, no network): 3.0× faster than Colly, 69× faster than Scrapy. See the benchmark folder for full methodology and reproduction steps.
Features
- Async-first — Tokio-based with bounded concurrency via
JoinSet - CSS selectors —
res.css(".selector")backed byscraper - XPath selectors —
res.xpath("//h1/text()")for XML/HTML documents (feature:xpath) - Regex selectors —
res.re(r"\d+"),el.re_first(r"..."), works onResponse,Element, andElementList - JSONPath selectors —
res.jsonpath("$.store.books[*].title")for JSON responses (feature:jsonpath) #[derive(Extract)]— generate CSS extraction boilerplate from field annotations (feature:derive)- Rate limiting — token-bucket
RateLimiterviagovernor - Auto-throttle — adaptive delay based on EWMA latency and 429/503 back-off
- Retry with backoff — exponential backoff via
.retry(max, base_delay) - Item stream —
CrawlEngine::stream()returns an asyncStreamfor real-time item consumption - robots.txt — per-domain fetch + cache, enabled by default
- Bloom filter dedup — O(1) URL deduplication, 1M URLs at 0.1% false-positive rate
- Request scheduling —
CrawlRequestsupports priority, headers, method/body, metadata, anddont_filter - HTTP cache — disk-backed response cache via
.http_cache(dir), optional TTL - Link extractor —
LinkExtractorwith allow/deny regex,allow_domains,canonicalize,restrict_css - Pluggable storage —
JsonlStore,JsonStore,CsvStore,StdoutStore, PostgreSQL, SQLite, MySQL - Middleware chain — proxy rotation, custom headers, rate limiting, auto-throttle
- Domain + depth filtering —
allowed_domains()andmax_depth()on theSpidertrait - Multi-spider engine — run multiple spiders concurrently via
.add_spider()/.run_all() - LLM extraction — extract structured data without selectors using Claude, OpenAI, Gemini, or Ollama
- Browser fetcher — headless Chromium via chromiumoxide for JS-rendered pages (feature:
browser) - Distributed frontier — Redis-backed URL frontier for multi-process crawls (feature:
redis-frontier) - Persistent frontier — file-backed URL frontier that survives restarts (feature:
persistence) - Sitemap spider —
SitemapSpiderreadssitemap.xmland sitemap index files - Metrics — periodic stats snapshots via
tracing::info!using.metrics_interval() - OpenTelemetry — OTLP/gRPC export of all spans to Jaeger, Grafana Tempo, Datadog, etc. (feature:
otel)
Installation
[]
= "0.2"
= "0.1"
= { = "1", = ["derive"] }
= { = "1", = ["full"] }
For #[derive(Extract)]:
= { = "0.2", = ["derive"] }
Quick Start
use *;
use Serialize;
;
async
For more examples — rate limiting, database stores, LLM extraction, browser mode, and all selector types — see the examples/ folder.
Documentation
Full documentation at kumo.wihlarkop.com
- Getting Started
- Spiders
- Extractors
- derive(Extract)
- Middleware
- Stores
- Advanced topics — item stream, OpenTelemetry, stealth, browser, and more
- Examples
- Feature Flags
Contributing
# Install lefthook (one-time setup)
# macOS
# Windows
# or: winget install lefthook
# Linux
|
# Activate the hooks
After lefthook install, every git commit will automatically run cargo fmt (auto-fix) and cargo clippy before the commit goes through.
License
MIT