kumo
An async web crawling framework for Rust - Scrapy for Rust.
kumo means spider/cloud in Japanese. It gives you a trait-based, async-first API for writing spiders that scrape, follow links, and store data.
Why Kumo?
| kumo | Scrapy (Python) | Colly (Go) | |
|---|---|---|---|
| Language | Rust | Python | Go |
| Type safety | Compile-time | Runtime | Partial |
| Async model | Tokio (true async) | Twisted (event loop) | goroutines |
| Memory safety | Guaranteed | GC | GC |
| CSS / XPath / Regex / JSONPath | Yes | Yes | CSS only |
#[derive(Extract)] macro |
Yes | No | No |
| LLM extraction (Claude / OpenAI / Gemini / Ollama) | Yes | No | No |
| Browser / JS rendering | Yes (chromiumoxide) | Yes (Playwright) | No |
| Stealth mode (TLS/HTTP2 fingerprint spoofing) | Yes | No | No |
| Distributed frontier (Redis) | Yes | Yes (scrapy-redis) | No |
| Item stream API | Yes | No | No |
| Typed crawl events | Yes | Signals | No |
| Async crawl hooks | Yes | Signals / extensions | No |
| OpenTelemetry export | Yes | No | No |
| Pluggable stores (JSONL, CSV, Postgres, SQLite, MySQL) | Yes | Yes (pipelines) | No |
| Single binary deploy | Yes | No | Yes |
| Binary size / startup | Small / instant | Large / slow | Small / fast |
Benchmark results - 1,000 books, concurrency 16, median of 3 runs:
| kumo | Colly (Go) | Scrapy (Python) | |
|---|---|---|---|
| Real site - Items/s | 76.7 | 73.5 | 53.3 |
| Local server - Items/s | 12,346 | 4,098 | 180 |
| Peak RSS | 12.5 MB | 31.4 MB | 77.2 MB |
On raw parsing throughput (local server, no network): 3.0x faster than Colly, 69x faster than Scrapy. See the benchmark folder for full methodology and reproduction steps.
Features
- Async-first - Tokio-based with bounded concurrency via
JoinSet - CSS selectors -
res.css(".selector")backed byscraper - XPath selectors -
res.xpath("//h1/text()")for XML/HTML documents (feature:xpath) - Regex selectors -
res.re(r"\d+"),el.re_first(r"..."), works onResponse,Element, andElementList - JSONPath selectors -
res.jsonpath("$.store.books[*].title")for JSON responses (feature:jsonpath) #[derive(Extract)]- generate CSS extraction boilerplate from field annotations (feature:derive)- Rate limiting - token-bucket
RateLimiterviagovernor - Auto-throttle - adaptive delay based on EWMA latency and 429/503 back-off
- Retry with backoff - exponential backoff, status filtering, jitter, and
Retry-Aftersupport - Item stream -
CrawlEngine::stream()returns an asyncStreamfor real-time item consumption - Typed crawl events -
CrawlEventsignals for dashboards, progress bars, alerts, and embedded runners - Async crawl hooks -
CrawlHookextension points for metrics, audit logging, persistence, and policy checks - robots.txt - per-domain fetch + cache, enabled by default, with
Crawl-delaysupport - Bloom filter dedup - O(1) URL deduplication, 1M URLs at 0.1% false-positive rate
- Request scheduling -
CrawlRequestsupports priority, headers, method/body, metadata, anddont_filter - HTTP cache - disk-backed response cache via
.http_cache(dir), optional TTL - Link extractor -
LinkExtractorwith allow/deny regex,allow_domains,canonicalize,restrict_css - Pluggable storage -
JsonlStore,JsonStore,CsvStore,StdoutStore, PostgreSQL, SQLite, MySQL - Middleware chain - proxy rotation, custom headers, status retry, rate limiting, auto-throttle
- Domain + depth filtering -
allowed_domains()andmax_depth()on theSpidertrait - Multi-spider engine - run multiple spiders concurrently via
.add_spider()/.run_all() - LLM extraction - extract structured data without selectors using Claude, OpenAI, Gemini, or Ollama
- Browser fetcher - headless Chromium via chromiumoxide for JS-rendered pages (feature:
browser) - Distributed frontier - Redis-backed URL frontier for multi-process crawls (feature:
redis-frontier) - Persistent frontier - file-backed URL frontier that survives restarts and exposes recovered state (feature:
persistence) - Sitemap spider -
SitemapSpiderreadssitemap.xmland sitemap index files - Metrics - periodic stats snapshots via
tracing::info!using.metrics_interval() - OpenTelemetry - OTLP/gRPC export of all spans to Jaeger, Grafana Tempo, Datadog, etc. (feature:
otel)
Installation
[]
= "0.5"
= "0.1"
= { = "1", = ["derive"] }
= { = "1", = ["full"] }
For #[derive(Extract)]:
= { = "0.5", = ["derive"] }
Quick Start
use *;
use Serialize;
;
async
For more examples - production crawl controls, rate limiting, database stores, LLM extraction, browser mode, and all selector types - see the examples/ folder.
Documentation
Full documentation at kumo.wihlarkop.com
- Getting Started
- Spiders
- Extractors
- derive(Extract)
- Middleware
- Stores
- Advanced topics - item stream, OpenTelemetry, stealth, browser, and more
- Examples
- Feature Flags
Contributing
# Install lefthook (one-time setup)
# macOS
# Windows
# or: winget install lefthook
# Linux
|
# Activate the hooks
After lefthook install, every git commit will automatically run cargo fmt (auto-fix) and cargo clippy before the commit goes through.
License
MIT