schemaorg-validate 0.2.0

Parse and validate Schema.org structured data (JSON-LD, Microdata, RDFa) against the official vocabulary and Google Rich Results profiles.
# schemaorg-rs

[![Crates.io](https://img.shields.io/crates/v/schemaorg-rs.svg)](https://crates.io/crates/schemaorg-rs)
[![docs.rs](https://docs.rs/schemaorg-rs/badge.svg)](https://docs.rs/schemaorg-rs)
[![CI](https://github.com/mitrovicsinisaa/schemaorg-rs/actions/workflows/ci.yml/badge.svg)](https://github.com/mitrovicsinisaa/schemaorg-rs/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![npm](https://img.shields.io/npm/v/@schemaorg-rs/wasm.svg)](https://www.npmjs.com/package/@schemaorg-rs/wasm)

**Extract, validate, and profile Schema.org structured data from HTML.**

Parse JSON-LD, Microdata, and RDFa Lite into a unified data model. Validate
against the official Schema.org vocabulary. Check Google Rich Results
eligibility. All offline, all embeddable, all from a single Rust library.

---

## Quick Start

### As a Library

```rust
use schemaorg_rs::{extract_all, validation};
use schemaorg_rs::profiles::{ProfileRegistry, Eligibility};

let html = r#"<script type="application/ld+json">{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget",
  "offers": { "@type": "Offer", "price": "29.99", "priceCurrency": "EUR" }
}</script>"#;

// Extract -> Validate -> Profile
// (needs the `validation` and `profiles` features, e.g. via `full`)
let graph = extract_all(html).unwrap();
let result = validation::validate(&graph);
let registry = ProfileRegistry::with_google();
let profile = registry.evaluate("google", &graph, &result.diagnostics).unwrap();

match profile.eligibility {
    Eligibility::Eligible => println!("Rich result eligible!"),
    Eligibility::WarningsOnly => println!("Eligible with warnings"),
    Eligibility::NotEligible => println!("Not eligible"),
    Eligibility::Restricted => println!("Restricted"),
}
```

### As a CLI Tool

```bash
# Install
cargo install schemaorg-validate

# Validate a file
schemaorg-validate --file page.html --profile google

# Validate a URL
schemaorg-validate --url https://example.com --profile google

# JSON output for CI
schemaorg-validate --file page.html --format json

# SARIF for GitHub Code Scanning
schemaorg-validate --file page.html --format sarif > results.sarif
```

### As an npm Package (WASM)

```bash
npm install @schemaorg-rs/wasm
```

```javascript
import { validateHtml } from '@schemaorg-rs/wasm';
const result = JSON.parse(validateHtml(htmlString));
```

---

## Features

| Feature | Description |
|---------|-------------|
| **Extraction** | JSON-LD, Microdata, RDFa Lite into unified `SchemaNode` model |
| **Validation** | Type/property/value checking against Schema.org v30.0 |
| **Profiles** | Google Rich Results eligibility for 7 schema types |
| **CLI** | `schemaorg-validate` with text, JSON, and SARIF output |
| **WASM** | Browser/Node.js via WebAssembly |
| **Offline** | Vocabulary vendored at compile time; validation needs no network calls |

### Extraction

- **JSON-LD** -- `@graph` arrays, `@id` cross-references, nested objects, source locations
- **Microdata** -- `itemscope`/`itemprop`, `itemref`, value extraction by element type
- **RDFa Lite** -- `vocab`/`typeof`/`property`, `prefix` namespaces, `resource` identifiers
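
"Value extraction by element type" refers to the HTML Microdata rule that the element name decides where an `itemprop` value is read from. The sketch below illustrates that rule with a hypothetical `microdata_value_source` helper and `ValueSource` enum; these names are assumptions made for illustration and are not taken from this crate's API.

```rust
/// Where a Microdata property value is read from.
/// (Illustrative type, not part of schemaorg-rs.)
#[derive(Debug, PartialEq, Eq)]
enum ValueSource {
    /// Read the named attribute of the element.
    Attribute(&'static str),
    /// Use the element's text content.
    TextContent,
}

/// Pick the value source for an element carrying `itemprop`.
/// Elements that also carry `itemscope` produce a nested item instead,
/// which this sketch does not model.
fn microdata_value_source(tag: &str) -> ValueSource {
    match tag.to_ascii_lowercase().as_str() {
        "meta" => ValueSource::Attribute("content"),
        "audio" | "embed" | "iframe" | "img" | "source" | "track" | "video" => {
            ValueSource::Attribute("src")
        }
        "a" | "area" | "link" => ValueSource::Attribute("href"),
        "object" => ValueSource::Attribute("data"),
        "data" | "meter" => ValueSource::Attribute("value"),
        // The spec falls back to text content when `datetime` is absent;
        // a caller would need to handle that case.
        "time" => ValueSource::Attribute("datetime"),
        _ => ValueSource::TextContent,
    }
}
```

For example, `<meta itemprop="price" content="29.99">` yields its `content` attribute, while `<span itemprop="name">Widget</span>` yields its text.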

### Validation

- Unknown/deprecated/pending types and properties
- Property domain checking (a property used on a type that does not define it)
- Value type mismatches (e.g. a Number where a URL is expected)
- Enum validation, plus warnings when booleans or numbers are given as strings and need coercion
- "Did you mean?" suggestions via Levenshtein distance
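
To make the "Did you mean?" idea concrete, here is a minimal sketch of suggestion lookup by Levenshtein distance. The `levenshtein` and `suggest` functions and the `max_distance` cutoff are illustrative assumptions; the crate's actual implementation and thresholds may differ.

```rust
/// Two-row dynamic-programming Levenshtein distance, counted in `char`s.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    let mut curr = vec![0; b_chars.len() + 1];
    for (i, ca) in a.chars().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b_chars.iter().enumerate() {
            let cost = if ca == *cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1) // deletion
                .min(curr[j] + 1); // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b_chars.len()]
}

/// Return the closest known name within `max_distance` edits, if any.
/// The comparison is case-sensitive.
fn suggest<'a>(unknown: &str, known: &[&'a str], max_distance: usize) -> Option<&'a str> {
    known
        .iter()
        .copied()
        .map(|name| (levenshtein(unknown, name), name))
        .filter(|(distance, _)| *distance <= max_distance)
        .min_by_key(|(distance, _)| *distance)
        .map(|(_, name)| name)
}
```

With a cutoff of 2, the misspelling `priceCurency` resolves to `priceCurrency`, while an unrelated string such as `zzz` produces no suggestion.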

### Google Rich Results Profiles

| Type | Subtypes |
|------|----------|
| Product | -- |
| Article | NewsArticle, BlogPosting |
| FAQPage | -- (restricted since August 2023 to government and health sites) |
| BreadcrumbList | -- |
| LocalBusiness | Restaurant, Store, all subtypes |
| Event | -- |
| Recipe | -- |
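
One way to read the Subtypes column: a node's type is walked up the Schema.org hierarchy until a profiled type is reached. The sketch below assumes a single-parent map and hypothetical `resolve_profile` and `sample_parents` helpers. Real Schema.org types can have several parents, and the crate's own lookup may work differently.

```rust
use std::collections::HashMap;

/// The seven profiled types from the table above.
const PROFILED: [&str; 7] = [
    "Product", "Article", "FAQPage", "BreadcrumbList",
    "LocalBusiness", "Event", "Recipe",
];

/// A small hand-written excerpt of the Schema.org hierarchy (child -> parent).
fn sample_parents() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("Restaurant", "FoodEstablishment"),
        ("FoodEstablishment", "LocalBusiness"),
        ("Store", "LocalBusiness"),
        ("NewsArticle", "Article"),
        ("BlogPosting", "SocialMediaPosting"),
        ("SocialMediaPosting", "Article"),
    ])
}

/// Walk up from `schema_type` until a profiled type is found.
fn resolve_profile<'a>(
    schema_type: &'a str,
    parents: &HashMap<&'a str, &'a str>,
) -> Option<&'a str> {
    let mut current = schema_type;
    // Bound the walk so a malformed (cyclic) map cannot loop forever.
    for _ in 0..=parents.len() {
        if PROFILED.contains(&current) {
            return Some(current);
        }
        current = *parents.get(current)?;
    }
    None
}
```

Under these assumptions `Restaurant` resolves to the `LocalBusiness` profile and `BlogPosting` to `Article`, while a type outside the table, such as `Person`, resolves to no profile.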

### CLI Output Formats

- **Text** -- colored, human-readable with eligibility summary
- **JSON** -- structured output for programmatic consumption
- **SARIF 2.1.0** -- GitHub Code Scanning compatible

---

## Installation

### Library

```toml
[dependencies]
# Default features enable extraction only; see Feature Flags for validation and profiles.
schemaorg-rs = "0.1"
```

### CLI

```bash
cargo install schemaorg-validate
```

### Feature Flags

| Flag | Default | Enables |
|------|---------|---------|
| `extraction` | Yes | HTML parsing, all 3 extractors |
| `validation` | No | Schema.org vocabulary validation |
| `profiles` | No | Rich Results profiles |
| `wasm` | No | WASM bindings |
| `cli` | No | CLI binary (`schemaorg-validate`) |
| `full` | No | extraction + validation + profiles |

```toml
# Full library (no CLI/WASM)
schemaorg-rs = { version = "0.1", features = ["full"] }

# Core types only (no HTML parsing)
schemaorg-rs = { version = "0.1", default-features = false }
```

---

## Usage

### Extract all formats

```rust
use schemaorg_rs::extract_all;

// `html` is the page source as a `&str`; `?` assumes an enclosing function that returns `Result`.
let graph = extract_all(html)?;
for node in &graph.nodes {
    println!("{:?}: {:?}", node.source_format, node.types);
}
```

### Validate against vocabulary

```rust
use schemaorg_rs::{extract_all, validation};

let graph = extract_all(html)?;
let result = validation::validate(&graph);

for diag in &result.diagnostics {
    println!("[{}] {} -- {}", diag.severity, diag.path, diag.message);
}
```
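
A common follow-up is to fail a build when diagnostics reach a certain severity. The sketch below uses stand-in `Severity` and `Diagnostic` types that mirror only the three fields printed above. The variant names, their ordering, and the helper functions are assumptions, not the crate's definitions.

```rust
/// Stand-in for the crate's severity type; variant names are assumed.
/// Deriving `Ord` orders variants by declaration: Info < Warning < Error.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Warning,
    Error,
}

/// Stand-in mirroring the `severity`, `path`, and `message` fields.
struct Diagnostic {
    severity: Severity,
    path: String,
    message: String,
}

/// Render one diagnostic in the same shape as the loop above.
fn format_line(d: &Diagnostic) -> String {
    format!("[{:?}] {} -- {}", d.severity, d.path, d.message)
}

/// Count diagnostics at or above `threshold`.
fn count_at_least(diagnostics: &[Diagnostic], threshold: Severity) -> usize {
    diagnostics.iter().filter(|d| d.severity >= threshold).count()
}

/// Map diagnostics to a process exit code: 1 if any meets the threshold.
fn exit_code(diagnostics: &[Diagnostic], threshold: Severity) -> i32 {
    if count_at_least(diagnostics, threshold) > 0 { 1 } else { 0 }
}
```

A CI wrapper could pass `exit_code(&diags, Severity::Error)` to `std::process::exit` so that warnings are reported without failing the job.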

### Check Rich Results eligibility

```rust
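use schemaorg_rs::{extract_all, validation};
use schemaorg_rs::profiles::ProfileRegistry;

// Profile evaluation consumes both the extracted graph and the
// validation diagnostics, so run those two steps first.
let graph = extract_all(html)?;
let validated = validation::validate(&graph);

let registry = ProfileRegistry::with_google();
let result = registry.evaluate("google", &graph, &validated.diagnostics)?;

// One entry per profiled type found in the graph.
for tr in &result.type_results {
    println!("{}: eligible={}, missing={:?}",
        tr.schema_type, tr.eligible, tr.required_missing);
}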
```

---

## GitHub Action

```yaml
- uses: mitrovicsinisaa/schemaorg-rs/.github/actions/schemaorg-validate@main
  with:
    files: 'dist/**/*.html'
    profile: google
    upload-sarif: 'true'
```

See [Action README](.github/actions/schemaorg-validate/README.md) for full options.

---

## Documentation

- [User Guide](docs/guide.md) -- getting started, all features
- [CLI Reference](docs/cli.md) -- all options, output formats, SARIF rule IDs
- [Architecture](docs/architecture.md) -- internal design, data flow, codegen
- [Profile Docs](docs/profiles/) -- required/recommended fields per type
- [API Reference](https://docs.rs/schemaorg-rs) -- full Rust docs
- [Contributing](CONTRIBUTING.md) -- dev setup, code standards, adding profiles
- [Changelog](CHANGELOG.md) -- version history

---

## Why This Exists

Schema.org structured data is embedded in hundreds of millions of web pages.
When it's broken -- a missing `name` on a `Product`, a wrong value type on
`offers.price` -- search engines silently ignore it. No rich results. No AI
citations. No visibility.

The only validators that understand Schema.org semantically are closed-source,
hosted by Google, and require sending your URLs to their servers.

`schemaorg-rs` is the first open-source, offline, embeddable Schema.org
validator. It runs in Rust, WASM, and CLI. It validates vocabulary correctness
*and* Rich Results eligibility in one pass.

---

## License

MIT -- see [LICENSE](LICENSE)