schemaorg-validate 0.3.0

Parse and validate Schema.org structured data (JSON-LD, Microdata, RDFa) against the official vocabulary and Google Rich Results profiles.
# schemaorg-rs

[![Crates.io](https://img.shields.io/crates/v/schemaorg-validate.svg)](https://crates.io/crates/schemaorg-validate)
[![docs.rs](https://docs.rs/schemaorg-validate/badge.svg)](https://docs.rs/schemaorg-validate)
[![CI](https://github.com/mitrovicsinisaa/schemaorg-rs/actions/workflows/ci.yml/badge.svg)](https://github.com/mitrovicsinisaa/schemaorg-rs/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![npm](https://img.shields.io/npm/v/@schemaorg-rs/wasm.svg)](https://www.npmjs.com/package/@schemaorg-rs/wasm)

**The first open-source, offline, embeddable Schema.org structured data validator.**

Parse JSON-LD, Microdata, and RDFa into a unified data model. Validate against
the official Schema.org vocabulary (v30.0). Check Google Rich Results eligibility.
All offline, all embeddable, all from a single Rust library.

---

## Current Status

| Component | Status | Details |
|-----------|--------|---------|
| **Extraction Engine** | ✅ Stable | JSON-LD, Microdata, RDFa Lite — 3 formats, unified output |
| **Vocabulary Validation** | ✅ Stable | 800+ types, 1400+ properties, Schema.org v30.0 |
| **Rich Results Profiles** | ✅ Stable | 7 Google profiles + baseline |
| **CLI** | ✅ Stable | `schemaorg-validate` — text, JSON, SARIF output |
| **WASM / npm** | ✅ Stable | [`@schemaorg-rs/wasm`](https://www.npmjs.com/package/@schemaorg-rs/wasm) |
| **Test Suite** | ✅ 290 tests | All passing, all features |

Published on [crates.io](https://crates.io/crates/schemaorg-validate) and [npm](https://www.npmjs.com/package/@schemaorg-rs/wasm).

---

## Quick Start

### As a Library

```rust
use schemaorg_rs::{extract_all, validation};
use schemaorg_rs::profiles::{ProfileRegistry, Eligibility};

let html = r#"<script type="application/ld+json">{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget",
  "offers": { "@type": "Offer", "price": "29.99", "priceCurrency": "EUR" }
}</script>"#;

// Extract -> Validate -> Profile
let graph = extract_all(html).unwrap();
let result = validation::validate(&graph);
let registry = ProfileRegistry::with_google();
let profile = registry.evaluate("google", &graph, &result.diagnostics).unwrap();

match profile.eligibility {
    Eligibility::Eligible => println!("Rich result eligible!"),
    Eligibility::WarningsOnly => println!("Eligible with warnings"),
    Eligibility::NotEligible => println!("Not eligible"),
    Eligibility::Restricted => println!("Restricted"),
}
```

### As a CLI Tool

```bash
# Install
cargo install schemaorg-validate

# Validate a file
schemaorg-validate --file page.html --profile google

# Validate a URL
schemaorg-validate --url https://example.com --profile google

# JSON output for CI
schemaorg-validate --file page.html --format json

# SARIF for GitHub Code Scanning
schemaorg-validate --file page.html --format sarif > results.sarif
```

### As an npm Package (WASM)

```bash
npm install @schemaorg-rs/wasm
```

```javascript
import { validateHtml } from '@schemaorg-rs/wasm';
const result = JSON.parse(validateHtml(htmlString));
```

---

## Features

| Feature | Description |
|---------|-------------|
| **Extraction** | JSON-LD, Microdata, RDFa Lite into unified `SchemaNode` model |
| **Validation** | Type/property/value checking against Schema.org v30.0 |
| **Profiles** | Google Rich Results eligibility for 7 schema types |
| **CLI** | `schemaorg-validate` with text, JSON, and SARIF output |
| **WASM** | Browser/Node.js via WebAssembly |
| **Offline** | Vocabulary vendored at compile time, zero network calls |

### Extraction

- **JSON-LD**: `@graph` arrays, `@id` cross-references, nested objects, source locations
- **Microdata**: `itemscope`/`itemprop`, `itemref`, value extraction by element type
- **RDFa Lite**: `vocab`/`typeof`/`property`, `prefix` namespaces, `resource` identifiers
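
To illustrate what "value extraction by element type" means for Microdata, here is a minimal standalone sketch of the lookup described in the HTML Microdata rules for `itemprop` elements without `itemscope`. The names `ValueSource` and `microdata_value_source` are hypothetical and are not part of this crate's API; the library's own extractor may be organized differently.

```rust
/// Where a Microdata property value is read from, per the HTML Microdata
/// rules for `itemprop` elements that do not carry `itemscope`.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum ValueSource {
    /// Read the named attribute.
    Attribute(&'static str),
    /// Read the named attribute if present, otherwise the text content.
    AttributeOrText(&'static str),
    /// Read the element's text content.
    TextContent,
}

/// Maps an HTML tag name (case-insensitive) to the place its value lives.
fn microdata_value_source(tag: &str) -> ValueSource {
    match tag.to_ascii_lowercase().as_str() {
        "meta" => ValueSource::Attribute("content"),
        "audio" | "embed" | "iframe" | "img" | "source" | "track" | "video" => {
            ValueSource::Attribute("src")
        }
        "a" | "area" | "link" => ValueSource::Attribute("href"),
        "object" => ValueSource::Attribute("data"),
        "data" | "meter" => ValueSource::Attribute("value"),
        "time" => ValueSource::AttributeOrText("datetime"),
        _ => ValueSource::TextContent,
    }
}
```

For example, `<meta itemprop="price" content="29.99">` yields its value from `content`, while `<span itemprop="name">Widget</span>` yields its text.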

### Validation

- Unknown/deprecated/pending types and properties
- Property domain checking (wrong type for property)
- Value type mismatches (Number where URL expected)
- Enum validation, boolean/number coercion warnings
- "Did you mean?" suggestions via Levenshtein distance
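
The "Did you mean?" idea can be sketched in a few lines of standalone Rust. This is an illustrative version, not the crate's implementation: the function names, the candidate list, and the `max_distance` threshold are all assumptions made for the example.

```rust
/// Levenshtein edit distance using the classic two-row dynamic-programming
/// table, computed over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and first j of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)      // deletion
                .min(curr[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Returns the closest known name within `max_distance` edits, if any.
/// (Hypothetical helper; the threshold is an assumption of this sketch.)
fn suggest<'a>(input: &str, known: &[&'a str], max_distance: usize) -> Option<&'a str> {
    known
        .iter()
        .map(|&name| (levenshtein(input, name), name))
        .filter(|&(distance, _)| distance <= max_distance)
        .min_by_key(|&(distance, _)| distance)
        .map(|(_, name)| name)
}
```

With this sketch, `suggest("priceCurency", &["price", "priceCurrency"], 2)` returns `Some("priceCurrency")`, which is the kind of hint a diagnostic for an unknown property could carry.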

### Google Rich Results Profiles

| Type | Subtypes |
|------|----------|
| Product | none |
| Article | NewsArticle, BlogPosting |
| FAQPage | none (restricted since 2024) |
| BreadcrumbList | none |
| LocalBusiness | Restaurant, Store, all subtypes |
| Event | none |
| Recipe | none |

### CLI Output Formats

- **Text** — colored, human-readable with eligibility summary
- **JSON** — structured output for programmatic consumption
- **SARIF 2.1.0** — GitHub Code Scanning compatible

---

## Installation

### Library

```toml
[dependencies]
schemaorg-validate = "0.3"
```

### CLI

```bash
cargo install schemaorg-validate
```

### Feature Flags

| Flag | Default | Enables |
|------|---------|---------|
| `extraction` | Yes | HTML parsing, all 3 extractors |
| `validation` | No | Schema.org vocabulary validation |
| `profiles` | No | Rich Results profiles |
| `wasm` | No | WASM bindings |
| `cli` | No | CLI binary (`schemaorg-validate`) |
| `full` | No | extraction + validation + profiles |

```toml
# Full library (no CLI/WASM); enables the validation and profiles APIs used in the examples
schemaorg-validate = { version = "0.3", features = ["full"] }

# Core types only (no HTML parsing)
schemaorg-validate = { version = "0.3", default-features = false }
```

---

## Usage

### Extract all formats

```rust
use schemaorg_rs::{extract_all, SourceFormat};

let graph = extract_all(html)?;
for node in &graph.nodes {
    println!("{:?}: {:?}", node.source_format, node.types);
}
```

### Validate against vocabulary

```rust
use schemaorg_rs::{extract_all, validation};

let graph = extract_all(html)?;
let result = validation::validate(&graph);

for diag in &result.diagnostics {
    println!("[{}] {} -- {}", diag.severity, diag.path, diag.message);
}
```

### Check Rich Results eligibility

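```rust
use schemaorg_rs::{extract_all, validation};
use schemaorg_rs::profiles::ProfileRegistry;

// `html` is the page markup as a string slice.
let graph = extract_all(html)?;
// The profile engine takes the vocabulary diagnostics as input.
let validation_result = validation::validate(&graph);

let registry = ProfileRegistry::with_google();
let result = registry.evaluate("google", &graph, &validation_result.diagnostics)?;

for tr in &result.type_results {
    println!("{}: eligible={}, missing={:?}",
        tr.schema_type, tr.eligible, tr.required_missing);
}
```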
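
As a rough mental model of the four eligibility outcomes, the standalone sketch below shows one plausible way they could be derived from missing fields. Everything here is an assumption made for illustration: the `SketchEligibility` and `TypeCheck` types, the `restricted` flag, and the order in which the conditions are checked are not taken from the crate, whose actual rules are documented in the profile docs.

```rust
/// Mirror of the four outcomes named in the Quick Start.
/// (Hypothetical type; the crate's own enum is `Eligibility`.)
#[derive(Debug, PartialEq)]
enum SketchEligibility {
    Eligible,
    WarningsOnly,
    NotEligible,
    Restricted,
}

/// Hypothetical per-type check result.
struct TypeCheck {
    /// Required properties that were not found.
    required_missing: Vec<&'static str>,
    /// Recommended properties that were not found.
    recommended_missing: Vec<&'static str>,
    /// True when the rich result is only shown for a limited set of sites.
    restricted: bool,
}

/// Assumed precedence: missing required fields, then restriction,
/// then missing recommended fields.
fn classify(check: &TypeCheck) -> SketchEligibility {
    if !check.required_missing.is_empty() {
        SketchEligibility::NotEligible
    } else if check.restricted {
        SketchEligibility::Restricted
    } else if !check.recommended_missing.is_empty() {
        SketchEligibility::WarningsOnly
    } else {
        SketchEligibility::Eligible
    }
}
```

Under these assumptions, a `Product` with every required property but no recommended ones would be classified as `WarningsOnly`.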

---

## GitHub Action

```yaml
- uses: mitrovicsinisaa/schemaorg-rs/.github/actions/schemaorg-validate@main
  with:
    files: 'dist/**/*.html'
    profile: google
    upload-sarif: 'true'
```

See [Action README](.github/actions/schemaorg-validate/README.md) for full options.

---

## Architecture

```
HTML input
  │
  ├─ JSON-LD extractor ──────┐
  ├─ Microdata extractor ────┤──▶ StructuredDataGraph
  └─ RDFa Lite extractor ────┘         │
                                        ├──▶ Vocabulary Validator (Schema.org v30.0)
                                        │         │
                                        │         ▼
                                        │    ValidationResult (diagnostics)
                                        │         │
                                        └──▶ Profile Engine ──▶ ProfileResult (eligibility)
                                                  │
                                              ┌───┴────┐
                                          Google    Baseline
                                        (7 types)  (generic)
```

The Schema.org vocabulary (800+ types, 1400+ properties) is resolved entirely at
**compile time** via `build.rs` codegen. Runtime validation uses static `match`
trees — zero heap allocation, zero parsing, ideal for WASM.
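
To make the static `match` idea concrete, here is a small hand-written sketch of the kind of code a `build.rs` step could generate. The function names and the handful of types are illustrative assumptions; the generated code in this crate covers the full vocabulary and may be shaped differently.

```rust
/// Hypothetical generated lookup: the parent of a Schema.org type.
/// String literals live in static data, so a lookup allocates nothing.
fn parent_type(name: &str) -> Option<&'static str> {
    match name {
        "Product" | "Event" | "CreativeWork" | "Intangible" => Some("Thing"),
        "Offer" => Some("Intangible"),
        "Article" => Some("CreativeWork"),
        "NewsArticle" => Some("Article"),
        // "Thing" is the root; unknown names also have no parent here.
        _ => None,
    }
}

/// Walks up the hierarchy to test whether `name` is `ancestor` or one of
/// its descendants.
fn is_subtype_of(name: &str, ancestor: &str) -> bool {
    let mut current = name;
    loop {
        if current == ancestor {
            return true;
        }
        match parent_type(current) {
            Some(parent) => current = parent,
            None => return false,
        }
    }
}
```

A check such as `is_subtype_of("NewsArticle", "CreativeWork")` then resolves through two `match` lookups with no vocabulary file loaded at runtime.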

---

## Documentation

- [User Guide](docs/guide.md) — getting started, all features
- [CLI Reference](docs/cli.md) — all options, output formats, SARIF rule IDs
- [Architecture](docs/architecture.md) — internal design, data flow, codegen
- [Profile Docs](docs/profiles/) — required/recommended fields per type
- [API Reference](https://docs.rs/schemaorg-validate) — full Rust docs
- [Contributing](CONTRIBUTING.md) — dev setup, code standards, adding profiles
- [Changelog](CHANGELOG.md) — version history

---

## Why This Exists

Schema.org structured data is embedded in hundreds of millions of web pages.
When it's broken — a missing `name` on a `Product`, a wrong value type on
`offers.price` — search engines silently ignore it. No rich results. No AI
citations. No visibility.

The only validators that understand Schema.org semantically are closed-source,
hosted by Google, and require sending your URLs to their servers.

`schemaorg-rs` is the first open-source, offline, embeddable Schema.org
validator. It runs in Rust, WASM, and CLI. It validates vocabulary correctness
*and* Rich Results eligibility in one pass.

---

## Roadmap

The core library is stable and shipping. Future work focuses on ecosystem
integration and expanded coverage:

- **CMS Integrations** — Shopware 6 plugin, TYPO3 extension, WordPress plugin
- **Language Bindings** — Python (PyO3), PHP extension, native Node.js (napi)
- **Additional Profiles** — VideoObject, JobPosting, HowTo, Course, Review, Dataset
- **Hosted API** — Self-hostable HTTP API + Docker image (open-source Rich Results Test alternative)
- **Auto-fix Engine** — Not just "this is broken" but "here's the corrected JSON-LD"
- **Schema.org W3C Engagement** — `schema:pending` support, upstream test fixtures

See [CHANGELOG.md](CHANGELOG.md) for version history.

---

## License

MIT — see [LICENSE](LICENSE)