stygian-graph
High-performance, graph-based web scraping engine treating pipelines as DAGs with pluggable service modules.
Features
| Feature | Description |
|---|---|
| Hexagonal architecture | Domain core isolated from infrastructure concerns |
| Graph execution | DAG-based pipeline with topological sort, wave-by-wave execution |
| Pluggable adapters | HTTP, browser, AI providers, storage — add custom services easily |
| AI extraction | Claude, GPT, Gemini, GitHub Copilot, Ollama — structured data from HTML |
| Multi-modal | Images, PDFs, videos via vision APIs |
| Distributed execution | Redis/Valkey work queues for horizontal scaling |
| Circuit breaker | Graceful degradation when services fail (for example, browser → HTTP fallback) |
| Idempotency | Safe retries with deduplication keys |
| Observability | Prometheus metrics, structured tracing |
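The circuit breaker row can be made concrete with a minimal sketch. It assumes a simple consecutive-failure counter with no half-open state or reset timer. `CircuitBreaker`, `call`, and `is_open` are hypothetical names for illustration and are not taken from the stygian-graph API.

```rust
/// Hypothetical breaker: counts consecutive failures of the primary service
/// and, once the threshold is reached, routes calls straight to the fallback.
struct CircuitBreaker {
    failures: u32,
    threshold: u32,
}

impl CircuitBreaker {
    fn new(threshold: u32) -> Self {
        Self { failures: 0, threshold }
    }

    /// The breaker is open once `threshold` consecutive failures were seen.
    fn is_open(&self) -> bool {
        self.failures >= self.threshold
    }

    /// Tries `primary` while the breaker is closed; on failure (or when the
    /// breaker is already open) the result of `fallback` is returned instead.
    fn call<T, E>(
        &mut self,
        primary: impl FnOnce() -> Result<T, E>,
        fallback: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        if self.is_open() {
            return fallback();
        }
        match primary() {
            Ok(value) => {
                // A success resets the consecutive-failure counter.
                self.failures = 0;
                Ok(value)
            }
            Err(_) => {
                self.failures += 1;
                fallback()
            }
        }
    }
}
```

In the browser → HTTP case, `primary` would wrap the browser call and `fallback` the plain HTTP call. A production breaker would normally also reopen the primary path after a cool-down period, which this sketch omits.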
Installation
[]
= "0.1"
= { = "1", = ["full"] }
= "1"
Enable optional features:
= { = "0.1", = ["browser", "ai-claude", "distributed"] }
Quick Start
Basic Scraping Pipeline
use ;
use ;
use json;
async
With Browser Rendering
use ;
use Duration;
let config = BrowserAdapterConfig ;
let browser_adapter = with_config;
With AI Extraction
use ;
let config = ClaudeConfig ;
let ai = with_config;
Architecture
Hexagonal (Ports & Adapters)
┌─────────────────────────────────────────────┐
│ Application Layer │
│ (DagExecutor, ServiceRegistry, Metrics) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Port Traits │
│ (ScrapingService, AiProvider, WorkQueue) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Adapters │
│ HTTP │ Browser │ Claude │ Redis │ ... │
└─────────────────────────────────────────────┘
Domain Rules
- Zero I/O in domain — all external interactions through ports
- Dependency inversion — adapters depend on ports, never vice versa
- Typestate pattern — compile-time pipeline validation
- Zero-cost abstractions — generics over Arc/Box where possible
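The rules above can be illustrated with a small sketch. Every name in it (`FetchPort`, `StaticAdapter`, `PipelineBuilder`, `Empty`, `HasSource`) is hypothetical and chosen for illustration only; none is taken from the stygian-graph API. The adapter returns canned HTML so the example stays free of I/O.

```rust
use std::marker::PhantomData;

/// Port: the domain only sees this trait, never a concrete client.
trait FetchPort {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

/// Adapter: depends on the port, not the other way round.
/// A real adapter would perform I/O; this one returns canned HTML.
struct StaticAdapter;

impl FetchPort for StaticAdapter {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("<html><!-- fetched {url} --></html>"))
    }
}

/// Typestate markers: a pipeline without a source cannot be run.
struct Empty;
struct HasSource;

struct PipelineBuilder<State> {
    urls: Vec<String>,
    _state: PhantomData<State>,
}

impl PipelineBuilder<Empty> {
    fn new() -> Self {
        Self { urls: Vec::new(), _state: PhantomData }
    }

    /// Adding a source moves the builder into the `HasSource` state.
    fn source(self, url: &str) -> PipelineBuilder<HasSource> {
        PipelineBuilder { urls: vec![url.to_string()], _state: PhantomData }
    }
}

impl PipelineBuilder<HasSource> {
    /// Generic over the port, so dispatch is static (no Box or Arc needed).
    fn run<P: FetchPort>(&self, port: &P) -> Result<Vec<String>, String> {
        self.urls.iter().map(|url| port.fetch(url)).collect()
    }
}
```

With these definitions, `PipelineBuilder::<Empty>::new().source("https://example.com").run(&StaticAdapter)` compiles, while calling `run` directly on `PipelineBuilder::<Empty>::new()` is rejected at compile time because `run` only exists in the `HasSource` state.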
Pipeline Configuration
Define scraping flows as JSON:
Validation
Pipelines are validated before execution:
- Node integrity — IDs unique, services registered
- Edge validity — all edges connect existing nodes
- Cycle detection — Kahn's topological sort
- Reachability — all nodes connected in a single DAG
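As a rough illustration of the checks listed above, the following sketch uses Kahn's algorithm to detect cycles and, at the same time, group nodes into execution waves. The function name `plan_waves` and its signature are assumptions made for this example, not the crate's API. It covers duplicate IDs, dangling edges, and cycles; the service-registration and reachability checks are left out.

```rust
use std::collections::HashMap;

/// Groups node IDs into execution waves using Kahn's algorithm.
/// Returns an error for duplicate IDs, edges to unknown nodes, or a cycle.
fn plan_waves(nodes: &[&str], edges: &[(&str, &str)]) -> Result<Vec<Vec<String>>, String> {
    // Node integrity: every ID must be unique.
    let mut indegree: HashMap<&str, usize> = HashMap::new();
    for &id in nodes {
        if indegree.insert(id, 0).is_some() {
            return Err(format!("duplicate node id: {id}"));
        }
    }

    // Edge validity: both endpoints must be known nodes.
    let mut children: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(from, to) in edges {
        if !indegree.contains_key(from) || !indegree.contains_key(to) {
            return Err(format!("edge {from} -> {to} references an unknown node"));
        }
        children.entry(from).or_default().push(to);
        *indegree.get_mut(to).unwrap() += 1;
    }

    // The first wave holds every node without incoming edges.
    let mut wave: Vec<&str> = nodes
        .iter()
        .copied()
        .filter(|id| indegree[*id] == 0)
        .collect();
    let mut waves: Vec<Vec<String>> = Vec::new();
    let mut visited = 0;

    while !wave.is_empty() {
        visited += wave.len();
        let mut next = Vec::new();
        for id in &wave {
            if let Some(kids) = children.get(id) {
                for &child in kids {
                    let degree = indegree.get_mut(child).unwrap();
                    *degree -= 1;
                    // A node is ready once all of its parents have run.
                    if *degree == 0 {
                        next.push(child);
                    }
                }
            }
        }
        wave.sort(); // deterministic order inside a wave
        waves.push(wave.iter().map(|id| id.to_string()).collect());
        wave = next;
    }

    // Nodes that were never released sit on a cycle.
    if visited != nodes.len() {
        return Err("cycle detected".to_string());
    }
    Ok(waves)
}
```

Nodes inside one wave have no dependencies on each other, so an executor is free to run them concurrently before moving on to the next wave.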
Adapters
HTTP Adapter
use ;
let config = HttpConfig ;
let adapter = with_config;
Browser Adapter
Requires the browser feature and the stygian-browser crate:
use ;
let adapter = with_config;
AI Adapters
Claude (Anthropic):
use ClaudeAdapter;
let adapter = new;
OpenAI:
use OpenAiAdapter;
let adapter = new;
Gemini (Google):
use GeminiAdapter;
let adapter = new;
Distributed Execution
Use Redis or Valkey as the work queue backend:
use ;
let queue = new.await?;
let executor = new; // 10 workers
let results = executor.execute_wave.await?;
Observability
Prometheus Metrics
use MetricsCollector;
let metrics = new;
let prometheus_handler = metrics.prometheus_handler;
// Expose on /metrics endpoint
new
.route
Structured Tracing
use ;
registry
.with
.with
.init;
Testing
# Unit tests
# Integration tests
# All features (browser integration tests require Chrome)
# Benchmarks
# Measure coverage (requires cargo-tarpaulin)
Coverage: ~72% line coverage across 209 workspace tests. Key modules at or near 100%:
config, executor, idempotency, service_registry, and all AI adapter unit tests.
Adapters requiring live external services (HTTP, browser) are tested with mock ports.
Performance
- Concurrency: Tokio for I/O, Rayon for CPU-bound
- Zero-copy: Arc<str> for shared strings
- Lock-free: DashMap for concurrent access
- Pool reuse: HTTP clients, browser instances
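To show what sharing strings through `Arc<str>` means in practice, here is a minimal standard-library sketch. The helper `share_url` is hypothetical and only demonstrates the general technique; it does not reflect how stygian-graph stores strings internally.

```rust
use std::sync::Arc;

/// Hands one handle to each worker. Cloning an `Arc<str>` only bumps a
/// reference count; the string bytes are allocated once and never copied.
fn share_url(url: &str, workers: usize) -> Vec<Arc<str>> {
    let shared: Arc<str> = Arc::from(url);
    (0..workers).map(|_| Arc::clone(&shared)).collect()
}
```

All handles returned by `share_url` point at the same allocation, which can be confirmed with `Arc::ptr_eq`.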
Benchmarks (Apple M4 Pro):
- DAG executor: ~50µs overhead per wave
- HTTP adapter: ~2ms per request (cached DNS)
- Browser adapter: <100ms acquisition (warm pool)
Examples
See examples/ for complete pipelines:
- basic-scrape.toml: Simple HTTP → parse flow
- javascript-rendering.toml: Browser-based extraction
- multi-provider.toml: AI fallback chain
- distributed.toml: Redis work queue setup
License
Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0-only).