mocra
A distributed, event-driven crawling and data collection framework for Rust.
English | 中文
mocra is a Rust framework for building scalable data collection pipelines. It models crawling as queue-driven processing stages orchestrated by a DAG execution engine, with automatic scaling from single-process to distributed clusters.
Features
- DAG-based module system — define linear chains or custom fan-out/fan-in graphs
- Queue-driven pipeline — task → request → download → parse → store, fully decoupled
- Auto-scaling runtime — single-node (in-memory) or distributed (Redis/Kafka) with zero code changes
- Bounded concurrency — semaphore-controlled worker pools with pause/resume/shutdown
- Middleware pipeline — download, data transformation, and storage middleware with weight-based ordering
- Built-in control plane — HTTP API for health, metrics, pause/resume, task injection, and DLQ inspection
- Prometheus metrics — unified mocra_* metric families for throughput, latency, errors, and backlog
- Cron scheduling — periodic task execution with cron expressions
- Error recovery — policy-driven retry, fallback gates, circuit breakers, and dead-letter queues
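The retry and dead-letter behavior in the last bullet can be sketched in plain Rust. This is a minimal illustration of the pattern only; `RetryPolicy`, `process`, and the in-memory `Vec` standing in for the DLQ are hypothetical names, not mocra's API:

```rust
// Sketch: policy-driven retry with a dead-letter queue (hypothetical names).
struct RetryPolicy {
    max_attempts: u32,
}

/// Try a task up to `max_attempts` times; on exhaustion, park it in the DLQ.
/// `fail_times` simulates a task that fails that many times before succeeding.
fn process(task: &str, fail_times: u32, policy: &RetryPolicy, dlq: &mut Vec<String>) -> bool {
    for attempt in 1..=policy.max_attempts {
        if attempt > fail_times {
            return true; // attempt succeeded
        }
    }
    dlq.push(task.to_string()); // all retries exhausted: dead-letter it
    false
}

fn main() {
    let policy = RetryPolicy { max_attempts: 3 };
    let mut dlq = Vec::new();
    assert!(process("flaky", 2, &policy, &mut dlq)); // succeeds on attempt 3
    assert!(!process("broken", 99, &policy, &mut dlq)); // exhausts retries
    assert_eq!(dlq, vec!["broken"]);
}
```

mocra layers circuit breakers and fallback gates on top of this basic loop, but the retry-then-dead-letter decision is the core of the policy.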
Quick Start
Add to your Cargo.toml:
[dependencies]
mocra = "0.2"
async-trait = "0.1"
tokio = { version = "1", features = ["full"] }
serde = "1"
futures = "0.3"
Write a module:
use std::sync::Arc;
use async_trait::async_trait;
use futures::stream;
use mocra::prelude::*;
use mocra::LoginInfo;
use mocra::State;
use mocra::Engine;

// ... module definition elided; see the Module Development guide for a
// complete ModuleTrait implementation and Engine registration.
Create config.toml:
name = "my_crawler"

[database]
url = "sqlite://data/crawler.db?mode=rwc"

# ... remaining sections elided; see the Configuration reference for the
# full TOML schema.
Run:
cargo run
Architecture
┌─────────────────────────────────────────────────────────┐
│ Engine │
│ │
│ TaskEvent ──▶ generate() ──▶ download() ──▶ parser() │
│ │ │ │ │ │
│ [Task Q] [Request Q] [Response Q] [Parser Q] │
│ │ │
│ ┌────────────┬──────┘ │
│ ▼ ▼ │
│ [Data Store] [Next Node] │
│ [Error Q → DLQ] │
└─────────────────────────────────────────────────────────┘
Each stage is decoupled by a message queue. Queues are local Tokio channels in single-node mode, or Redis Streams / Kafka in distributed mode — same code, zero changes.
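That decoupling can be illustrated with plain `std::sync::mpsc` channels standing in for the queues. The stage names mirror the diagram above, but nothing here is mocra's actual API; it is a sketch of the idea only:

```rust
// Sketch: pipeline stages decoupled by channels (task -> request -> response).
use std::sync::mpsc;
use std::thread;

/// Push tasks through the decoupled stages and collect the parsed output.
fn run_pipeline(tasks: Vec<String>) -> Vec<String> {
    let (task_tx, task_rx) = mpsc::channel::<String>();
    let (req_tx, req_rx) = mpsc::channel::<String>();
    let (resp_tx, resp_rx) = mpsc::channel::<String>();

    // generate(): task -> request, reading only from the task queue.
    let generate = thread::spawn(move || {
        for task in task_rx {
            req_tx.send(format!("GET {task}")).unwrap();
        }
    });
    // download(): request -> response, reading only from the request queue.
    let download = thread::spawn(move || {
        for req in req_rx {
            resp_tx.send(format!("200 OK for {req}")).unwrap();
        }
    });

    for t in tasks {
        task_tx.send(t).unwrap();
    }
    drop(task_tx); // closing the first queue lets each stage drain and exit

    let parsed: Vec<String> = resp_rx.iter().collect(); // parser()
    generate.join().unwrap();
    download.join().unwrap();
    parsed
}

fn main() {
    let out = run_pipeline(vec!["https://example.com".into()]);
    assert_eq!(out, vec!["200 OK for GET https://example.com"]);
}
```

Because each stage only touches its input and output queues, swapping the channel for Redis Streams or Kafka changes the transport, not the stage code — which is exactly the property mocra exploits.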
DAG Execution
Define complex pipelines with fan-out and fan-in:
┌── branch_a ──┐
start ─┤ ├── merge
└── branch_b ──┘
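As a minimal stand-in for this shape, the following uses plain threads and channels rather than mocra's declarative DAG definition (all names here are hypothetical):

```rust
// Sketch: fan-out to two branches, fan-in at a merge point.
use std::sync::mpsc;
use std::thread;

/// `start` fans the input out to two branches; `merge` waits for both.
fn fan_out_fan_in(input: i32) -> Vec<i32> {
    let (merge_tx, merge_rx) = mpsc::channel();

    let tx_a = merge_tx.clone();
    let branch_a = thread::spawn(move || tx_a.send(input * 2).unwrap());
    let branch_b = thread::spawn(move || merge_tx.send(input + 10).unwrap());

    branch_a.join().unwrap();
    branch_b.join().unwrap();
    // Both senders are now dropped, so the merge drains and completes.
    let mut results: Vec<i32> = merge_rx.iter().collect();
    results.sort();
    results
}

fn main() {
    assert_eq!(fan_out_fan_in(5), vec![10, 15]);
}
```

In mocra the same topology is declared once and the engine handles the queueing and synchronization at the merge node.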
Single-Node vs Distributed
| | Single-Node | Distributed |
|---|---|---|
| Config | No cache.redis | Add cache.redis |
| Queues | Tokio mpsc (in-memory) | Redis Streams or Kafka |
| Cache | In-memory | Redis |
| Locks | Local mutex | Redis distributed locks |
| Workers | 1 process | N processes, same binary |
| Code changes | None | None |
Switch to distributed by adding Redis to your config:
[cache.redis]
url = "redis://localhost:6379"
Documentation
| Document | Description |
|---|---|
| Getting Started | Installation and first module |
| Architecture | System design and pipeline internals |
| Module Development | ModuleTrait, ModuleNodeTrait, data passing |
| DAG Guide | DAG definition, fan-out/fan-in, advance gates |
| Middleware | Download, data, and storage middleware |
| Configuration | Full TOML configuration reference |
| API Reference | HTTP control plane endpoints |
| Deployment | Single-node, distributed, monitoring |
Examples
See the simple/ directory for runnable examples:
simple/module_node_trait_dag.rs — Fan-out/fan-in DAG module
Monitoring
# Start Prometheus + Grafana
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000
# Metrics: http://localhost:8080/metrics
License
Licensed under either of:
- MIT license
- Apache License, Version 2.0
at your option.