stygian-plugin 0.13.4

Visual data extraction fallback subsystem with CSS/XPath selectors, idempotent request handling, and composable transformation pipelines.
# stygian-plugin

A Chrome browser plugin scraper for Stygian. It provides flexible, interactive visual data extraction as a fallback when stygian-graph and stygian-browser cannot scrape a page.

## Features

- **Template-based extraction**: Define a schema once, apply to multiple elements
- **Recording-based**: User clicks and highlights generate an extraction pattern
- **Query-driven**: CSS and XPath selectors with fallback support
- **Region-based**: Multiple independent zones, each with custom rules
- **Multi-instance extraction**: Iterate over matching elements on a page
- **Transformation pipeline**: Trim, normalize, regex, type coercion, HTML stripping, etc.
- **Idempotent operations**: ULID-based deduplication for safe retries
- **Integrated with stygian-graph**: Implements `ScrapingService` trait for pipeline integration

## Architecture

Following Stygian's hexagonal architecture:

- **Domain** (`src/domain/`): Pure Rust, zero I/O dependencies
- **Ports** (`src/ports.rs`): Trait definitions (PluginTemplateStore, PluginExtractionPort, IdempotencyKeyStore)
- **Adapters** (`src/adapters/`): Concrete implementations
  - ExtractionEngine: CSS selector-based DOM extraction
  - PluginExtractionAdapter: Bridges to stygian-graph's ScrapingService
- **Storage** (`src/storage/`): Persistence adapters
  - FileTemplateStore: JSON file-based template storage
  - MemoryIdempotencyStore: In-memory result caching
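
The port/adapter split above can be illustrated with a minimal, std-only sketch. It is not the crate's code: `TemplateStore`, `MemoryTemplateStore`, and `template_exists` are hypothetical names, and the trait is synchronous and string-based for brevity, whereas the real ports are async.

```rust
use std::collections::HashMap;

/// Port (hypothetical): what the domain needs from template storage.
trait TemplateStore {
    fn save(&mut self, id: &str, template_json: &str);
    fn get(&self, id: &str) -> Option<String>;
}

/// Adapter (hypothetical): one concrete implementation backed by a HashMap.
struct MemoryTemplateStore {
    templates: HashMap<String, String>,
}

impl MemoryTemplateStore {
    fn new() -> Self {
        Self { templates: HashMap::new() }
    }
}

impl TemplateStore for MemoryTemplateStore {
    fn save(&mut self, id: &str, template_json: &str) {
        self.templates.insert(id.to_string(), template_json.to_string());
    }

    fn get(&self, id: &str) -> Option<String> {
        self.templates.get(id).cloned()
    }
}

/// Domain-side code depends only on the trait, so adapters can be swapped.
fn template_exists(store: &dyn TemplateStore, id: &str) -> bool {
    store.get(id).is_some()
}
```

Because `template_exists` accepts `&dyn TemplateStore`, a file-backed adapter could replace the in-memory one without changing domain code.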

## Quick Start

### Creating a Template

```rust
use stygian_plugin::{
    domain::{ExtractionTemplate, Region, Selector, Transformation},
    adapters::ExtractionEngine,
    ports::PluginExtractionPort,
};
use serde_json::json;

// Define a template
let template = ExtractionTemplate::new("Product")
    .with_description("Extract product info from a listing")
    .with_region(
        Region::new(
            "name",
            Selector::css("h2.product-title"),
            json!({"type": "string"}),
        )
        .with_transformation(Transformation::Trim)
    )
    .with_region(
        Region::new(
            "price",
            Selector::css(".product-price"),
            json!({"type": "string"}),
        )
        .with_transformation(Transformation::Regex {
            pattern: r"\$([0-9.]+)".to_string(),
            replacement: "$1".to_string(),
        })
    );
```

### Executing Extraction

```rust
use stygian_plugin::adapters::ExtractionEngine;
use stygian_plugin::domain::ExtractionRequest;

let html = r#"<html><h2 class="product-title">Widget</h2><span class="product-price">$99.99</span></html>"#;

let request = ExtractionRequest::new(template, "https://example.com", html);
let result = ExtractionEngine::execute(&request)?;

println!("Extracted: {:?}", result.data);
```

### Using with stygian-graph

Register the adapter in your service registry:

```rust
use stygian_plugin::adapters::{ExtractionEngine, PluginExtractionAdapter};
use stygian_plugin::storage::{FileTemplateStore, MemoryIdempotencyStore};
use std::sync::Arc;

let adapter = Arc::new(PluginExtractionAdapter::new(
    Arc::new(FileTemplateStore::new("./templates".into())),
    Arc::new(ExtractionEngine),
    Arc::new(MemoryIdempotencyStore::new()),
));

registry.register("plugin", adapter).await?;
```

Then use in a pipeline:

```toml
[[nodes]]
name = "extract-products"
kind = "plugin"
params = { template_id = "uuid-of-template", timeout_ms = 30000 }
```

## Selectors

### CSS Selectors

```rust
Selector::css(".product-card")
```

### XPath Selectors

```rust
Selector::xpath("//div[@class='product']")
```

### Dual Selectors (Recommended)

```rust
Selector::dual(".product", "//div[@class='product']")
```

The engine tries the CSS selector first (faster) and falls back to XPath only if CSS returns no matches.
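
The fallback order can be sketched without the crate itself. This is a minimal illustration, not the engine's actual code: `select_with_fallback` and the toy matcher `tokens_with_prefix` are hypothetical stand-ins for real CSS and XPath evaluation.

```rust
/// Hypothetical helper: run the primary matcher and use the secondary one
/// only when the primary returns no matches.
fn select_with_fallback<P, S>(input: &str, primary: P, secondary: S) -> Vec<String>
where
    P: Fn(&str) -> Vec<String>,
    S: Fn(&str) -> Vec<String>,
{
    let hits = primary(input);
    if !hits.is_empty() {
        return hits;
    }
    secondary(input)
}

/// Toy matcher standing in for a selector engine: returns every
/// whitespace-separated token that starts with `prefix`.
fn tokens_with_prefix(input: &str, prefix: &str) -> Vec<String> {
    input
        .split_whitespace()
        .filter(|token| token.starts_with(prefix))
        .map(str::to_string)
        .collect()
}
```

Calling `select_with_fallback(doc, |d| tokens_with_prefix(d, "css:"), |d| tokens_with_prefix(d, "xpath:"))` returns the `css:` tokens when any exist and the `xpath:` tokens otherwise. The secondary matcher never runs when the primary one succeeds, which is where the speed benefit of trying the cheaper selector first comes from.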

## Transformations

Transformations are applied in order:

- `Trim`: Remove leading/trailing whitespace
- `Lowercase` / `Uppercase`: Case conversion
- `RemoveWhitespace`: Strip all whitespace
- `NormalizeWhitespace`: Collapse multiple spaces into a single space
- `StripHtml`: Remove HTML tags
- `DecodeHtml`: Decode HTML entities
- `Regex { pattern, replacement }`: Regex find-and-replace
- `RegexExtract { pattern, group }`: Extract specific capture group
- `Coerce { target_type }`: Convert to "string", "number", "boolean", or "date"
- `Filter { pattern }`: Keep the value only if it matches the regex
- `ParseJson`: Parse as JSON

Example:

```rust
Region::new("price", selector, schema)
    .with_transformation(Transformation::StripHtml)
    .with_transformation(Transformation::Trim)
    .with_transformation(Transformation::Regex {
        pattern: r"\$(\d+\.\d{2})".to_string(),
        replacement: "$1".to_string(),
    })
    .with_transformation(Transformation::Coerce {
        target_type: "number".to_string(),
    })
```
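
To make "applied in order" concrete, here is a std-only sketch of an ordered pipeline. The `Step` enum and the `apply`/`apply_all` functions are hypothetical and mirror only a few of the variants listed above; they are not the crate's `Transformation` type.

```rust
/// Hypothetical subset of the transformations listed above.
enum Step {
    Trim,
    Lowercase,
    NormalizeWhitespace,
    RemoveWhitespace,
}

/// Apply a single step to a value.
fn apply(step: &Step, value: &str) -> String {
    match step {
        Step::Trim => value.trim().to_string(),
        Step::Lowercase => value.to_lowercase(),
        // Collapse each run of whitespace to one space. In this sketch that
        // also drops leading and trailing whitespace.
        Step::NormalizeWhitespace => value.split_whitespace().collect::<Vec<_>>().join(" "),
        Step::RemoveWhitespace => value.chars().filter(|c| !c.is_whitespace()).collect(),
    }
}

/// Fold the steps over the input left to right, so each step receives the
/// output of the step before it.
fn apply_all(steps: &[Step], input: &str) -> String {
    steps
        .iter()
        .fold(input.to_string(), |acc, step| apply(step, &acc))
}
```

For example, `apply_all(&[Step::Trim, Step::Lowercase, Step::NormalizeWhitespace], "  Blue   WIDGET \n")` yields `"blue widget"`.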

## Idempotency

Each extraction request can include an idempotency key:

```rust
let request = ExtractionRequest::new(template, url, html)
    .with_idempotency_key(idempotency_key);
```

If a request arrives with a key that has already been used, the cached result is returned instead of re-running the extraction, which makes retries safe.
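
The retry behaviour can be sketched with a plain `HashMap`. This is an illustration under simplifying assumptions, not the crate's implementation: `IdempotencyCache` and `get_or_run` are hypothetical names, and results are plain strings rather than extraction results.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory cache keyed by idempotency key.
struct IdempotencyCache {
    results: HashMap<String, String>,
}

impl IdempotencyCache {
    fn new() -> Self {
        Self { results: HashMap::new() }
    }

    /// Return the cached result for `key`, or run `extract` once and cache
    /// its output under that key.
    fn get_or_run<F>(&mut self, key: &str, extract: F) -> String
    where
        F: FnOnce() -> String,
    {
        if let Some(cached) = self.results.get(key) {
            return cached.clone();
        }
        let result = extract();
        self.results.insert(key.to_string(), result.clone());
        result
    }
}
```

A second call to `get_or_run` with the same key returns the first call's result and never invokes the closure, so a retried request does no duplicate work.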

## Storage

### Templates

```rust
let store = FileTemplateStore::new("./templates".into());
store.save(&template).await?;
let retrieved = store.get(&template.id).await?;
let all = store.list().await?;
store.delete(&template.id).await?;
```

### Idempotency

```rust
let store = MemoryIdempotencyStore::new();
store.store_result(&key, &result).await?;
if let Some(cached) = store.get_result(&key).await? {
    // Use cached result
}
```

## Testing

Run tests:

```bash
cargo test -p stygian-plugin
```

Run examples:

```bash
cargo run --example basic_extraction -p stygian-plugin
```

## Next Steps

- **Phase 3**: MCP tool integration (plugin_apply_template, plugin_record_*, etc.)
- **Phase 4**: Chrome extension (TypeScript, content script, service worker, UI)
- **Phase 5**: CircuitBreaker fallback routing from stygian-graph
- **Phase 6**: Full integration tests, CI/CD, documentation

## License

AGPL-3.0-only OR LicenseRef-Commercial