# kumo
[CI](https://github.com/wihlarkop/kumo/actions/workflows/ci.yml)
[Documentation](https://kumo.wihlarkop.com)
<p align="center">
<img src="assets/logo.png" alt="kumo logo" width="200">
</p>
An async web crawling framework for Rust - Scrapy for Rust.
**kumo** means spider/cloud in Japanese. It gives you a trait-based, async-first API for writing spiders that scrape, follow links, and store data.
## Why Kumo?
| Feature | kumo | Scrapy | Colly |
|---|---|---|---|
| Language | Rust | Python | Go |
| Type safety | Compile-time | Runtime | Partial |
| Async model | Tokio (true async) | Twisted (event loop) | goroutines |
| Memory safety | Guaranteed | GC | GC |
| CSS / XPath / Regex / JSONPath | Yes | Yes | CSS only |
| `#[derive(Extract)]` macro | Yes | No | No |
| LLM extraction (Claude / OpenAI / Gemini / Ollama) | Yes | No | No |
| Browser / JS rendering | Yes (chromiumoxide) | Yes (Playwright) | No |
| Stealth mode (TLS/HTTP2 fingerprint spoofing) | Yes | No | No |
| Distributed frontier (Redis) | Yes | Yes (scrapy-redis) | No |
| Item stream API | Yes | No | No |
| Typed crawl events | Yes | Signals | No |
| OpenTelemetry export | Yes | No | No |
| Pluggable stores (JSONL, CSV, Postgres, SQLite, MySQL) | Yes | Yes (pipelines) | No |
| Single binary deploy | Yes | No | Yes |
| Binary size / startup | Small / instant | Large / slow | Small / fast |
**Benchmark results** - 1,000 books, concurrency 16, median of 3 runs:

| Metric | kumo | Colly | Scrapy |
|---|---|---|---|
| Real site - Items/s | **76.7** | 73.5 | 53.3 |
| Local server - Items/s | **12,346** | 4,098 | 180 |
| Peak RSS | **12.5 MB** | 31.4 MB | 77.2 MB |
On raw parsing throughput (local server, no network): **3.0x faster than Colly, 69x faster than Scrapy**. See the [benchmark folder](https://github.com/wihlarkop/kumo/tree/main/benchmark) for full methodology and reproduction steps.
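
As a quick check of those multipliers, the sketch below recomputes them from the local-server row. It assumes the second and third figures in that row (4,098 and 180 items/s) are Colly's and Scrapy's respectively, which is the reading consistent with the quoted ratios. The `speedup` helper is illustrative and not part of kumo.

```rust
/// Ratio of two throughput figures (items per second).
fn speedup(kumo: f64, other: f64) -> f64 {
    kumo / other
}

fn main() {
    // Local-server items/s taken from the benchmark results above.
    let (kumo, colly, scrapy) = (12_346.0, 4_098.0, 180.0);
    println!("vs Colly:  {:.1}x", speedup(kumo, colly)); // prints "vs Colly:  3.0x"
    println!("vs Scrapy: {:.0}x", speedup(kumo, scrapy)); // prints "vs Scrapy: 69x"
}
```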
## Features
- **Async-first** - Tokio-based with bounded concurrency via `JoinSet`
- **CSS selectors** - `res.css(".selector")` backed by `scraper`
- **XPath selectors** - `res.xpath("//h1/text()")` for XML/HTML documents (feature: `xpath`)
- **Regex selectors** - `res.re(r"\d+")`, `el.re_first(r"...")`, works on `Response`, `Element`, and `ElementList`
- **JSONPath selectors** - `res.jsonpath("$.store.books[*].title")` for JSON responses (feature: `jsonpath`)
- **`#[derive(Extract)]`** - generate CSS extraction boilerplate from field annotations (feature: `derive`)
- **Rate limiting** - token-bucket `RateLimiter` via `governor`
- **Auto-throttle** - adaptive delay based on EWMA latency and 429/503 back-off
- **Retry with backoff** - exponential backoff, status filtering, jitter, and `Retry-After` support
- **Item stream** - `CrawlEngine::stream()` returns an async `Stream` for real-time item consumption
- **Typed crawl events** - `CrawlEvent` lifecycle hooks for dashboards, progress bars, alerts, and embedded runners
- **robots.txt** - per-domain fetch + cache, enabled by default, with `Crawl-delay` support
- **Bloom filter dedup** - O(1) URL deduplication, 1M URLs at 0.1% false-positive rate
- **Request scheduling** - `CrawlRequest` supports priority, headers, method/body, metadata, and `dont_filter`
- **HTTP cache** - disk-backed response cache via `.http_cache(dir)`, optional TTL
- **Link extractor** - `LinkExtractor` with allow/deny regex, `allow_domains`, `canonicalize`, `restrict_css`
- **Pluggable storage** - `JsonlStore`, `JsonStore`, `CsvStore`, `StdoutStore`, PostgreSQL, SQLite, MySQL
- **Middleware chain** - proxy rotation, custom headers, status retry, rate limiting, auto-throttle
- **Domain + depth filtering** - `allowed_domains()` and `max_depth()` on the `Spider` trait
- **Multi-spider engine** - run multiple spiders concurrently via `.add_spider()` / `.run_all()`
- **LLM extraction** - extract structured data without selectors using Claude, OpenAI, Gemini, or Ollama
- **Browser fetcher** - headless Chromium via chromiumoxide for JS-rendered pages (feature: `browser`)
- **Distributed frontier** - Redis-backed URL frontier for multi-process crawls (feature: `redis-frontier`)
- **Persistent frontier** - file-backed URL frontier that survives restarts and exposes recovered state (feature: `persistence`)
- **Sitemap spider** - `SitemapSpider` reads `sitemap.xml` and sitemap index files
- **Metrics** - periodic stats snapshots via `tracing::info!` using `.metrics_interval()`
- **OpenTelemetry** - OTLP/gRPC export of all spans to Jaeger, Grafana Tempo, Datadog, etc. (feature: `otel`)
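
To give a sense of what the Bloom filter figures in the list above imply, here is a minimal standard-library sketch. It is not kumo's implementation. It applies the textbook sizing formulas (`m = -n ln p / (ln 2)^2` bits, `k = (m / n) ln 2` hash functions) and uses `DefaultHasher` with double hashing as a stand-in for whatever hashing the crate actually uses. The names `bloom_params`, `base_hashes`, and `Bloom` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Optimal bit count `m` and hash count `k` for `n` items at false-positive rate `p`.
fn bloom_params(n: u64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil();
    let k = (m / n as f64 * ln2).round().max(1.0);
    (m as u64, k as u32)
}

/// Two base hashes for double hashing; the second is forced odd so it is never zero.
fn base_hashes(url: &str) -> (u64, u64) {
    let mut a = DefaultHasher::new();
    url.hash(&mut a);
    let mut b = DefaultHasher::new();
    (url, "salt").hash(&mut b);
    (a.finish(), b.finish() | 1)
}

struct Bloom {
    bits: Vec<u64>,
    m: u64,
    k: u32,
}

impl Bloom {
    fn new(n: u64, p: f64) -> Self {
        let (m, k) = bloom_params(n, p);
        Bloom { bits: vec![0; ((m + 63) / 64) as usize], m, k }
    }

    /// Sets the URL's `k` bits; returns true if at least one was previously unset,
    /// meaning the URL was definitely not seen before.
    fn insert(&mut self, url: &str) -> bool {
        let (h1, h2) = base_hashes(url);
        let mut is_new = false;
        for i in 0..self.k as u64 {
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) % self.m;
            let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                is_new = true;
                self.bits[word] |= mask;
            }
        }
        is_new
    }
}

fn main() {
    // 1M URLs at a 0.1% false-positive rate: roughly 14.4 million bits (~1.8 MB), 10 hashes.
    let (m, k) = bloom_params(1_000_000, 0.001);
    println!("{m} bits (~{:.1} MB), {k} hash functions", m as f64 / 8.0 / 1e6);

    let mut seen = Bloom::new(1_000_000, 0.001);
    assert!(seen.insert("https://example.com/a")); // first visit
    assert!(!seen.insert("https://example.com/a")); // duplicate is filtered
}
```

Each lookup touches a fixed number of bits regardless of how many URLs have been stored, which is where the O(1) cost comes from. The trade-off is that a small fraction of unseen URLs may be reported as duplicates.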
## Installation
```toml
[dependencies]
kumo = "0.4"
async-trait = "0.1"
serde = { version = "1", features = ["derive"] }
tokio = { version = "1", features = ["full"] }
```
For `#[derive(Extract)]`:
```toml
kumo = { version = "0.4", features = ["derive"] }
```
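Other optional features can presumably be enabled the same way. The fragment below is a sketch that assumes the flag names match those quoted in the Features list (`xpath`, `jsonpath`, `browser`); the Feature Flags page linked under Documentation is the authoritative reference.

```toml
[dependencies]
# Assumed flag names, copied from the Features list in this README.
kumo = { version = "0.4", features = ["derive", "xpath", "jsonpath", "browser"] }
```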
## Quick Start
```rust
use kumo::prelude::*;
use serde::Serialize;

#[derive(Debug, Serialize)]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

#[async_trait::async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;

    fn name(&self) -> &str { "quotes" }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://quotes.toscrape.com".into()]
    }

    async fn parse(&self, res: &Response) -> Result<Output<Self::Item>, KumoError> {
        let quotes: Vec<Quote> = res.css(".quote").iter().map(|el| Quote {
            text: el.css(".text").first().map(|e| e.text()).unwrap_or_default(),
            author: el.css(".author").first().map(|e| e.text()).unwrap_or_default(),
        }).collect();

        let next = res.css("li.next a").first()
            .and_then(|el| el.attr("href"))
            .map(|href| res.urljoin(&href));

        let mut output = Output::new().items(quotes);
        if let Some(url) = next { output = output.follow(url); }
        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), KumoError> {
    CrawlEngine::builder()
        .concurrency(5)
        .middleware(DefaultHeaders::new().user_agent("kumo/0.4"))
        .store(JsonlStore::new("quotes.jsonl")?)
        .run(QuotesSpider)
        .await?;
    Ok(())
}
```
For more examples - production crawl controls, rate limiting, database stores, LLM extraction, browser mode, and all selector types - see the [`examples/`](examples/) folder.
## Documentation
Full documentation at **[kumo.wihlarkop.com](https://kumo.wihlarkop.com)**
- [Getting Started](https://kumo.wihlarkop.com/getting-started/)
- [Spiders](https://kumo.wihlarkop.com/spiders/)
- [Extractors](https://kumo.wihlarkop.com/extractors/)
- [derive(Extract)](https://kumo.wihlarkop.com/derive/)
- [Middleware](https://kumo.wihlarkop.com/middleware/)
- [Stores](https://kumo.wihlarkop.com/stores/)
- [Advanced topics](https://kumo.wihlarkop.com/advanced/stream/) - item stream, OpenTelemetry, stealth, browser, and more
- [Examples](https://kumo.wihlarkop.com/examples/)
- [Feature Flags](https://kumo.wihlarkop.com/feature-flags/)
## Contributing
```bash
# Install lefthook (one-time setup)
# macOS
brew install lefthook
# Windows
scoop install lefthook
# or: winget install lefthook
# Linux: see the lefthook installation guide for your distribution
# Activate the hooks
lefthook install
```
After `lefthook install`, every `git commit` will automatically run `cargo fmt` (auto-fix) and `cargo clippy` before the commit goes through.
## License
MIT