stygian-graph
High-performance, graph-based web scraping engine treating pipelines as DAGs with pluggable service modules.
Features
| Feature | Description |
|---|---|
| Hexagonal architecture | Domain core isolated from infrastructure concerns |
| Graph execution | DAG-based pipeline with topological sort, wave-by-wave execution |
| Pluggable adapters | HTTP, browser, AI providers, storage — add custom services easily |
| AI extraction | Claude, GPT, Gemini, GitHub Copilot, Ollama — structured data from HTML |
| Multi-modal | Images, PDFs, videos via vision APIs |
| Distributed execution | Redis/Valkey work queues for horizontal scaling |
| Circuit breaker | Graceful degradation when services fail (browser → HTTP fallback) |
| Idempotency | Safe retries with deduplication keys |
| Observability | Prometheus metrics, structured tracing |
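The circuit breaker row above can be illustrated with a small, standard-library-only sketch. It is not the crate's implementation: the `CircuitBreaker` type, its threshold and cooldown fields, and the `call` method are hypothetical names chosen for this example, and the "browser" and "http" paths are stand-in closures.

```rust
use std::time::{Duration, Instant};

/// Hypothetical breaker: opens after `threshold` consecutive failures and
/// allows the primary path again once `cooldown` has elapsed.
struct CircuitBreaker {
    threshold: u32,
    cooldown: Duration,
    failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Self { threshold, cooldown, failures: 0, opened_at: None }
    }

    /// True while the breaker is open and the cooldown has not yet elapsed.
    fn is_open(&self) -> bool {
        matches!(self.opened_at, Some(t) if t.elapsed() < self.cooldown)
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.opened_at = None;
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.opened_at = Some(Instant::now());
        }
    }

    /// Runs `primary` unless the breaker is open; uses `fallback` when the
    /// breaker is open or when `primary` fails.
    fn call<T, E>(
        &mut self,
        primary: impl FnOnce() -> Result<T, E>,
        fallback: impl FnOnce() -> T,
    ) -> T {
        if self.is_open() {
            return fallback();
        }
        match primary() {
            Ok(value) => {
                self.record_success();
                value
            }
            Err(_) => {
                self.record_failure();
                fallback()
            }
        }
    }
}

fn main() {
    let mut breaker = CircuitBreaker::new(2, Duration::from_secs(30));
    for attempt in 1..=3 {
        // Stand-in for a failing browser adapter with an HTTP fallback.
        let source = breaker.call(|| Err::<&str, &str>("browser unavailable"), || "http");
        println!("attempt {attempt}: served by {source}, open = {}", breaker.is_open());
    }
}
```

Once the breaker is open, the primary closure is not invoked at all until the cooldown passes, which is what keeps a failing service from being hammered.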
Installation
[]
= "*"
= { = "1", = ["full"] }
= "1"
Enable optional features:
= { = "*", = ["browser", "redis", "extract"] }
Feature Reference
| Feature | Dependency | Purpose |
|---|---|---|
| browser | stygian-browser | Browser automation adapter |
| extract | stygian-browser (extract feature) | Structured data extraction via #[derive(Extract)] |
| api | - | REST API server (Axum routes) |
| redis | redis + deadpool-redis | Redis/Valkey cache & work queue |
| postgres | sqlx | PostgreSQL storage adapter |
| object-storage | rust-s3 | S3-compatible object storage adapter |
| scrape-exchange | - | Scrape Exchange crawler/sink integrations |
| cloudflare-crawl | - | Cloudflare Browser Rendering adapter |
| wasm-plugins | wasmtime | WASM plugin system |
| escalation | - | Tiered escalation policy adapter |
| mcp | - | MCP (Model Context Protocol) tools |
| acquisition-runner | browser | Optional bridge that lets browser pipeline nodes opt into the stygian-browser acquisition runner |
| full | all of the above | All features enabled |
Usage
Basic Scraping Pipeline
use ;
use ;
use json;
async
With Browser Rendering
use ;
use Duration;
let config = BrowserAdapterConfig ;
let browser_adapter = with_config;
With AI Extraction
use ;
let config = ClaudeConfig ;
let ai = with_config;
Architecture
Hexagonal (Ports & Adapters)
┌─────────────────────────────────────────────┐
│ Application Layer │
│ (DagExecutor, ServiceRegistry, Metrics) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Port Traits │
│ (ScrapingService, AiProvider, WorkQueue) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Adapters │
│ HTTP │ Browser │ Claude │ Redis │ ... │
└─────────────────────────────────────────────┘
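The layering in the diagram can be sketched as follows. Only the name `ScrapingService` comes from the diagram. The method signatures, the `StaticAdapter` type, and `run_node` are assumptions made for illustration, and the real port is most likely async.

```rust
/// Hypothetical shape of a port; the crate's actual `ScrapingService`
/// trait may look different.
trait ScrapingService {
    fn name(&self) -> &str;
    fn fetch(&self, url: &str) -> Result<String, String>;
}

/// An adapter sits at the infrastructure edge. This one returns canned
/// content so the example needs no network access.
struct StaticAdapter {
    body: String,
}

impl ScrapingService for StaticAdapter {
    fn name(&self) -> &str {
        "static"
    }

    fn fetch(&self, _url: &str) -> Result<String, String> {
        Ok(self.body.clone())
    }
}

/// Application-layer code depends only on the port trait, never on a
/// concrete adapter. The generic parameter is monomorphised, so no
/// dynamic dispatch is involved.
fn run_node<S: ScrapingService>(service: &S, url: &str) -> Result<usize, String> {
    let body = service.fetch(url)?;
    Ok(body.len())
}

fn main() {
    let adapter = StaticAdapter { body: "<html></html>".to_string() };
    let bytes = run_node(&adapter, "https://example.com").expect("fetch failed");
    println!("{} returned {} bytes", adapter.name(), bytes);
}
```

Swapping in a different adapter means writing another `impl ScrapingService`; `run_node` stays unchanged.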
Domain Rules
- Zero I/O in domain — all external interactions through ports
- Dependency inversion — adapters depend on ports, never vice versa
- Typestate pattern — compile-time pipeline validation
- Zero-cost abstractions — generics over Arc/Box where possible
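As a rough illustration of the typestate rule, the sketch below makes `execute` available only after `validate` has succeeded. The `Pipeline`, `Unvalidated`, and `Validated` names are hypothetical and are not taken from the crate.

```rust
use std::marker::PhantomData;

/// Marker types for the two states (hypothetical names).
struct Unvalidated;
struct Validated;

struct Pipeline<State> {
    nodes: Vec<String>,
    _state: PhantomData<State>,
}

impl Pipeline<Unvalidated> {
    fn new(nodes: Vec<String>) -> Self {
        Pipeline { nodes, _state: PhantomData }
    }

    /// Consumes the unvalidated pipeline; only a successful check yields
    /// a `Pipeline<Validated>`.
    fn validate(self) -> Result<Pipeline<Validated>, String> {
        if self.nodes.is_empty() {
            return Err("pipeline has no nodes".to_string());
        }
        Ok(Pipeline { nodes: self.nodes, _state: PhantomData })
    }
}

impl Pipeline<Validated> {
    /// `execute` is defined only for the validated state.
    fn execute(&self) -> usize {
        self.nodes.len()
    }
}

fn main() {
    let pipeline = Pipeline::new(vec!["fetch".to_string(), "extract".to_string()]);
    // pipeline.execute(); // does not compile: no `execute` on Pipeline<Unvalidated>
    let ready = pipeline.validate().expect("validation failed");
    println!("executed {} nodes", ready.execute());
}
```

Because the state lives in a zero-sized type parameter, the check costs nothing at runtime; skipping validation is a compile error rather than a runtime failure.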
Pipeline Configuration
Define scraping flows as JSON:
Validation
Pipelines are validated before execution:
- Node integrity — IDs unique, services registered
- Edge validity — all edges connect existing nodes
- Cycle detection — Kahn's topological sort
- Reachability — all nodes connected in single DAG
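The checks listed above can be sketched with Kahn's algorithm, which also yields the execution waves as a by-product. This is a simplified stand-alone version, not the crate's validator: `plan_waves` is a hypothetical function, nodes are plain string IDs, and the service-registration and reachability checks are left out.

```rust
use std::collections::HashMap;

/// Groups nodes into execution waves using Kahn's algorithm.
/// Fails on duplicate IDs, edges to unknown nodes, or cycles.
fn plan_waves(nodes: &[&str], edges: &[(&str, &str)]) -> Result<Vec<Vec<String>>, String> {
    // Node integrity: every ID must be unique.
    let mut indegree: HashMap<&str, usize> = HashMap::new();
    for &n in nodes {
        if indegree.insert(n, 0).is_some() {
            return Err(format!("duplicate node id: {n}"));
        }
    }

    // Edge validity: both endpoints must exist.
    let mut children: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(from, to) in edges {
        if !indegree.contains_key(from) || !indegree.contains_key(to) {
            return Err(format!("edge {from} -> {to} references an unknown node"));
        }
        children.entry(from).or_default().push(to);
        *indegree.get_mut(to).unwrap() += 1;
    }

    // The first wave holds every node without incoming edges.
    let mut wave: Vec<&str> = nodes.iter().copied().filter(|n| indegree[*n] == 0).collect();
    let mut waves: Vec<Vec<String>> = Vec::new();
    let mut seen = 0;
    while !wave.is_empty() {
        seen += wave.len();
        let mut next = Vec::new();
        for &n in &wave {
            for &child in children.get(n).map(Vec::as_slice).unwrap_or(&[]) {
                let remaining = indegree.get_mut(child).unwrap();
                *remaining -= 1;
                if *remaining == 0 {
                    next.push(child);
                }
            }
        }
        waves.push(wave.iter().map(|s| s.to_string()).collect());
        wave = next;
    }

    // Cycle detection: nodes on a cycle never reach in-degree zero.
    if seen != nodes.len() {
        return Err("cycle detected".to_string());
    }
    Ok(waves)
}

fn main() {
    let nodes = ["fetch", "render", "extract", "store"];
    let edges = [("fetch", "extract"), ("render", "extract"), ("extract", "store")];
    match plan_waves(&nodes, &edges) {
        Ok(waves) => println!("{waves:?}"),
        Err(e) => eprintln!("invalid pipeline: {e}"),
    }
}
```

Nodes in the same wave have no dependencies on each other, so an executor is free to run them concurrently.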
Adapters
HTTP Adapter
use ;
let config = HttpConfig ;
let adapter = with_config;
Browser Adapter
Requires the browser feature and the stygian-browser crate:
use ;
let adapter = with_config;
AI Adapters
Claude (Anthropic):
use ClaudeAdapter;
let adapter = new;
OpenAI:
use OpenAiAdapter;
let adapter = new;
Gemini (Google):
use GeminiAdapter;
let adapter = new;
Distributed Execution
Use Redis/Valkey as the work queue backend:
use ;
let queue = new.await?;
let executor = new; // 10 workers
let results = executor.execute_wave.await?;
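To make the worker model concrete without a Redis dependency, here is an in-process stand-in: a channel plays the role of the work queue and a shared set plays the role of the deduplication keys. Everything in it (`run_wave`, the `(key, url)` job shape, the "fetched" result strings) is a hypothetical simplification, not the crate's queue API.

```rust
use std::collections::HashSet;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Runs `(dedup_key, url)` jobs on `workers` threads. A job whose key has
/// already been claimed is skipped, so a retried job is processed once.
fn run_wave(jobs: Vec<(String, String)>, workers: usize) -> Vec<String> {
    let (job_tx, job_rx) = mpsc::channel::<(String, String)>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (done_tx, done_rx) = mpsc::channel::<String>();
    let seen = Arc::new(Mutex::new(HashSet::new()));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let done_tx = done_tx.clone();
            let seen = Arc::clone(&seen);
            thread::spawn(move || loop {
                // The queue lock is released at the end of this statement.
                let job = job_rx.lock().unwrap().recv();
                // `recv` fails once the queue is closed and drained.
                let Ok((key, url)) = job else { break };
                if seen.lock().unwrap().insert(key) {
                    done_tx.send(format!("fetched {url}")).unwrap();
                }
            })
        })
        .collect();

    for job in jobs {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // close the queue so workers exit
    drop(done_tx); // keep only the workers' sender clones alive
    for handle in handles {
        handle.join().unwrap();
    }

    let mut results: Vec<String> = done_rx.iter().collect();
    results.sort(); // completion order is nondeterministic
    results
}

fn main() {
    let jobs = vec![
        ("page-1".to_string(), "https://example.com/1".to_string()),
        ("page-1".to_string(), "https://example.com/1".to_string()), // retry
        ("page-2".to_string(), "https://example.com/2".to_string()),
    ];
    for line in run_wave(jobs, 4) {
        println!("{line}");
    }
}
```

In a distributed deployment the queue and the key set would live in Redis/Valkey so that workers on different machines share them; the control flow per worker stays the same.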
Observability
Prometheus Metrics
use MetricsCollector;
let metrics = new;
let prometheus_handler = metrics.prometheus_handler;
// Expose on /metrics endpoint
new
.route
Structured Tracing
use ;
registry
.with
.with
.init;
Testing
# Unit tests
# Integration tests
# All features (browser integration tests require Chrome)
# Benchmarks
# Measure coverage (requires cargo-tarpaulin)
Coverage: ~72% line coverage across 1639 workspace tests. Key modules at or near 100%:
config, executor, idempotency, service_registry, and all AI adapter unit tests.
Adapters requiring live external services (HTTP, browser) are tested with mock ports.
Performance
- Concurrency: Tokio for I/O, Rayon for CPU-bound
- Zero-copy: Arc<str> for shared strings
- Lock-free: DashMap for concurrent access
- Pool reuse: HTTP clients, browser instances
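The Arc<str> point can be shown in a few lines of standard-library Rust. The `share` helper is a hypothetical name for this example; the string is copied once into the shared allocation, and every later handle is only a reference-count increment.

```rust
use std::sync::Arc;

/// Hands out `n` handles to a single shared string allocation.
fn share(value: &str, n: usize) -> Vec<Arc<str>> {
    let shared: Arc<str> = Arc::from(value); // the only copy of the bytes
    (0..n).map(|_| Arc::clone(&shared)).collect() // refcount bumps only
}

fn main() {
    let handles = share("https://example.com/catalog", 3);
    // All handles point at the same allocation.
    assert!(Arc::ptr_eq(&handles[0], &handles[2]));
    println!("{} owners of {}", Arc::strong_count(&handles[0]), handles[0]);
}
```

Compared with `Arc<String>`, `Arc<str>` avoids one level of indirection, and compared with cloning a `String` per node it avoids a fresh allocation for each consumer.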
Benchmarks (Apple M4 Pro):
- DAG executor: ~50µs overhead per wave
Optional Acquisition Runner Bridge (Opt-In)
The stygian-graph bridge to the browser acquisition runner is optional and disabled unless you explicitly opt in.
Opt-in requirements:
- Build with feature acquisition-runner.
- Add a node-level acquisition table on browser nodes.
Without that node-level acquisition table, browser nodes keep legacy behavior in graph_pipeline_run and are reported as skipped.
Example (pipeline_run TOML):
[[]]
= "browser"
= "browser"
[[]]
= "target"
= "browser"
= "https://example.com"
[]
= "resilient"
= "main"
= 45
Supported acquisition.mode values are fast, resilient, hostile, and investigate.
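The opt-in rule and the four supported modes can be modelled roughly as below. This is a sketch under assumptions, not the bridge's code: `AcquisitionMode`, `BrowserNodePlan`, and `plan_browser_node` are hypothetical names, and `None` stands for a browser node without an acquisition table.

```rust
use std::str::FromStr;

/// The four documented `acquisition.mode` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AcquisitionMode {
    Fast,
    Resilient,
    Hostile,
    Investigate,
}

impl FromStr for AcquisitionMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fast" => Ok(Self::Fast),
            "resilient" => Ok(Self::Resilient),
            "hostile" => Ok(Self::Hostile),
            "investigate" => Ok(Self::Investigate),
            other => Err(format!("unsupported acquisition.mode: {other}")),
        }
    }
}

/// Hypothetical outcome for a browser node.
#[derive(Debug, PartialEq, Eq)]
enum BrowserNodePlan {
    /// No acquisition table: legacy behavior, node reported as skipped.
    Skipped,
    /// Acquisition table present: node goes through the runner.
    Runner(AcquisitionMode),
}

/// `mode` is `None` when the node has no acquisition table.
fn plan_browser_node(mode: Option<&str>) -> Result<BrowserNodePlan, String> {
    match mode {
        None => Ok(BrowserNodePlan::Skipped),
        Some(m) => m.parse::<AcquisitionMode>().map(BrowserNodePlan::Runner),
    }
}

fn main() {
    for mode in [None, Some("resilient"), Some("turbo")] {
        println!("{mode:?} => {:?}", plan_browser_node(mode));
    }
}
```

Rejecting unknown mode strings at parse time keeps a typo in the pipeline file from silently falling back to legacy behavior.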
Migration note (old low-level path vs runner path):
- Old path: browser node behavior relied on existing low-level execution/skip flow only.
- New path: add [nodes.params.acquisition] to opt into runner execution for that node.
- No migration is required for existing pipelines unless you want runner behavior.
Downstream Compatibility Checklist
- Confirm pipelines without [nodes.params.acquisition] still produce expected skipped browser nodes.
- Confirm pipelines with [nodes.params.acquisition] return acquisition metadata (acquisition_runner, diagnostics) as expected.
- Validate both feature sets in CI to prevent accidental behavior changes.
Suggested CI matrix guidance:
# Legacy behavior surface
# Opt-in bridge surface
Benchmarks, continued (Apple M4 Pro):
- HTTP adapter: ~2ms per request (cached DNS)
- Browser adapter: <100ms acquisition (warm pool)
Examples
See examples/ for complete pipelines:
- basic-scrape.toml: Simple HTTP → parse flow
- javascript-rendering.toml: Browser-based extraction
- multi-provider.toml: AI fallback chain
- distributed.toml: Redis work queue setup
License
Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0-only).