schemaorg-validate 0.3.0

Parse and validate Schema.org structured data (JSON-LD, Microdata, RDFa) against the official vocabulary and Google Rich Results profiles.
# User Guide

`schemaorg-rs` is a Rust library for extracting, validating, and profiling
[Schema.org](https://schema.org) structured data from HTML documents.

## Table of Contents

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Extraction](#extraction)
- [Validation](#validation)
- [Rich Results Profiles](#rich-results-profiles)
- [CLI Usage](#cli-usage)
- [WASM / npm](#wasm--npm)
- [Feature Flags](#feature-flags)

---

## Installation

### As a Rust library

```toml
[dependencies]
schemaorg-rs = "0.1"
```

### As a CLI tool

```bash
cargo install schemaorg-validate
```

### As an npm package (WASM)

```bash
npm install @schemaorg-rs/wasm
```

---

## Quick Start

This example uses the `validation` and `profiles` modules, which are not enabled
by default. Enable the `full` feature (see [Feature Flags](#feature-flags)) to run it.

```rust
use schemaorg_rs::{extract_all, validation};
use schemaorg_rs::profiles::{ProfileRegistry, Eligibility};

let html = r#"<script type="application/ld+json">{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget",
  "offers": { "@type": "Offer", "price": "29.99", "priceCurrency": "EUR" }
}</script>"#;

// 1. Extract structured data
let graph = extract_all(html).unwrap();

// 2. Validate against Schema.org vocabulary
let result = validation::validate(&graph);
if result.has_errors() {
    for diag in result.errors() {
        eprintln!("{}: {}", diag.path, diag.message);
    }
}

// 3. Check Rich Results eligibility
let registry = ProfileRegistry::with_google();
let profile_result = registry.evaluate("google", &graph, &result.diagnostics).unwrap();
match profile_result.eligibility {
    Eligibility::Eligible => println!("Rich result eligible!"),
    Eligibility::WarningsOnly => println!("Eligible with warnings"),
    Eligibility::NotEligible => println!("Not eligible"),
    Eligibility::Restricted => println!("Restricted eligibility"),
}
```

---

## Extraction

The extraction engine parses HTML and produces a unified `StructuredDataGraph`
containing all structured data found in the three supported formats.

### Supported formats

| Format | HTML Pattern | Extractor |
|--------|-------------|-----------|
| JSON-LD | `<script type="application/ld+json">` | `JsonLdExtractor` |
| Microdata | `itemscope` / `itemprop` attributes | `MicrodataExtractor` |
| RDFa Lite | `vocab` / `typeof` / `property` attributes | `RdfaLiteExtractor` |

### Extract all formats at once

```rust
use schemaorg_rs::extract_all;

let graph = extract_all(html)?;
for node in &graph.nodes {
    println!("{:?}: {:?}", node.source_format, node.types);
}
```

### Use a specific extractor

```rust
use schemaorg_rs::{Extractor, JsonLdExtractor};

let output = JsonLdExtractor.extract(html)?;
```

### Pre-parse HTML for performance

When running multiple extractors on the same document, parse the HTML once and
pass the parsed document to each extractor:

```rust
use schemaorg_rs::{Html, JsonLdExtractor, MicrodataExtractor};

let document = Html::parse_document(html);
let jsonld = JsonLdExtractor.extract_from_document(&document, html)?;
let microdata = MicrodataExtractor.extract_from_document(&document)?;
```

### Features

- `@id` cross-reference resolution (JSON-LD)
- `@graph` array support (JSON-LD)
- `itemref` attribute support (Microdata)
- `prefix` namespace mappings (RDFa)
- Source location tracking (line, column, byte offset)
- Depth-limited recursion for DoS protection
- Automatic Schema.org URL prefix stripping
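
As a rough illustration of the last item, the sketch below shows one way prefix
stripping could work. The helper name `strip_schema_prefix` and the list of
accepted prefixes are assumptions made for this example; the library's own
normalization may accept different URL forms.

```rust
/// Hypothetical helper (not part of the library's documented API):
/// reduce a full Schema.org IRI to its bare term name.
fn strip_schema_prefix(term: &str) -> &str {
    // Assumed set of prefixes; both schemes appear in real-world markup.
    const PREFIXES: [&str; 4] = [
        "https://schema.org/",
        "http://schema.org/",
        "https://www.schema.org/",
        "http://www.schema.org/",
    ];
    for prefix in PREFIXES {
        if let Some(rest) = term.strip_prefix(prefix) {
            return rest;
        }
    }
    // Already a bare term, or a term from another vocabulary: leave untouched.
    term
}

fn main() {
    for raw in ["https://schema.org/Product", "http://schema.org/Offer", "Person"] {
        println!("{raw} -> {}", strip_schema_prefix(raw));
    }
}
```

Normalizing terms this way means `"@type": "https://schema.org/Product"` in
JSON-LD and `itemtype="http://schema.org/Product"` in Microdata can be compared
under the same name.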

---

## Validation

The validation engine checks extracted data against the official Schema.org
vocabulary definitions (compiled at build time from the vendored JSON-LD file).

### What it checks

| Check | Example |
|-------|---------|
| Unknown types | `"@type": "Produc"` -- did you mean `Product`? |
| Unknown properties | `"namee": "Widget"` -- did you mean `name`? |
| Wrong property domain | `price` on `Person` (should be on `Offer`) |
| Value type mismatches | Number where URL expected |
| Deprecated types/properties | Retired to `attic.schema.org` |
| Pending types/properties | In `pending.schema.org` |
| Boolean as string | `"true"` instead of `true` |
| Invalid enum values | Anything other than a defined member such as `"InStock"` |
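
The "did you mean" hints in the first two rows can be pictured as a
nearest-match search over the known vocabulary. This guide does not say which
matching algorithm the library uses, so the sketch below substitutes plain
Levenshtein edit distance. The names `edit_distance` and `suggest`, the
four-entry vocabulary, and the cutoff of 2 edits are all assumptions for
illustration.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Return the closest known term within `max_distance` edits, if any.
fn suggest<'a>(unknown: &str, known: &[&'a str], max_distance: usize) -> Option<&'a str> {
    known
        .iter()
        .map(|&candidate| (edit_distance(unknown, candidate), candidate))
        .filter(|&(distance, _)| distance <= max_distance)
        .min_by_key(|&(distance, _)| distance)
        .map(|(_, candidate)| candidate)
}

fn main() {
    // A tiny stand-in vocabulary; the real one is compiled from Schema.org.
    let types = ["Product", "Person", "Place", "Offer"];
    println!("{:?}", suggest("Produc", &types, 2));
    println!("{:?}", suggest("Zzzzzz", &types, 2));
}
```

A distance cutoff keeps the validator from suggesting an unrelated term when
nothing in the vocabulary is close to the misspelled one.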

### Usage

```rust
use schemaorg_rs::{extract_all, validation};

let graph = extract_all(html)?;
let result = validation::validate(&graph);

for diag in &result.diagnostics {
    println!("[{}] {} -- {}", diag.severity, diag.path, diag.message);
}
```

### Severity levels

- **Error** -- invalid per Schema.org specification
- **Warning** -- deprecated, likely unintended, or potentially incorrect
- **Info** -- informational (e.g. pending types)
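
One way to picture these levels is as an ordered enum that a caller can filter
on, similar in spirit to the CLI's `--severity error` option. The `Severity` and
`Diagnostic` types below are simplified stand-ins written for this sketch, not
the library's real definitions, and the paths and messages are made up.

```rust
/// Stand-in severity type (assumption). Variants are declared from least to
/// most severe so that the derived `Ord` ranks them in that order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Warning,
    Error,
}

/// Stand-in diagnostic (assumption); the fields mirror those used in this guide.
#[derive(Debug)]
struct Diagnostic {
    severity: Severity,
    path: &'static str,
    message: &'static str,
}

/// Keep only diagnostics at or above `threshold`.
fn at_least(diagnostics: &[Diagnostic], threshold: Severity) -> Vec<&Diagnostic> {
    diagnostics
        .iter()
        .filter(|d| d.severity >= threshold)
        .collect()
}

fn main() {
    // Illustrative data only.
    let diagnostics = [
        Diagnostic { severity: Severity::Info, path: "node[0].@type", message: "type is pending" },
        Diagnostic { severity: Severity::Warning, path: "node[0].award", message: "property is deprecated" },
        Diagnostic { severity: Severity::Error, path: "node[0].namee", message: "unknown property" },
    ];
    for d in at_least(&diagnostics, Severity::Warning) {
        println!("[{:?}] {} -- {}", d.severity, d.path, d.message);
    }
}
```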

---

## Rich Results Profiles

Profiles add platform-specific rules on top of vocabulary validation.
They answer: "Will Google actually show a rich result for this markup?"

### Supported Google profiles

| Type | Required Fields | Key Features |
|------|----------------|--------------|
| Product | `name` | Offer validation, review/rating checks |
| Article | `headline`, `image`, `datePublished`, `author` | Publisher validation |
| FAQPage | `mainEntity` with Q&A pairs | Restricted eligibility (2024+) |
| BreadcrumbList | `itemListElement` | Position/URL validation |
| LocalBusiness | `name`, `address` | Address component checks |
| Event | `name`, `startDate`, `location` | Place/address validation |
| Recipe | `name`, `image` | Instruction/ingredient checks |
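
The "Required Fields" column can be read as a lookup from type to property
names. The sketch below encodes the table and reports which required properties
a node lacks. It is not the library's implementation: `required_fields` and
`missing_required` are hypothetical names, and FAQPage is left out because its
requirement (Q&A pairs inside `mainEntity`) is structural rather than a flat
list.

```rust
/// Required properties per type, copied from the table above.
fn required_fields(schema_type: &str) -> Option<&'static [&'static str]> {
    let fields: &'static [&'static str] = match schema_type {
        "Product" => &["name"],
        "Article" => &["headline", "image", "datePublished", "author"],
        "BreadcrumbList" => &["itemListElement"],
        "LocalBusiness" => &["name", "address"],
        "Event" => &["name", "startDate", "location"],
        "Recipe" => &["name", "image"],
        // Types not covered by this sketch.
        _ => return None,
    };
    Some(fields)
}

/// Return the required properties absent from `present`,
/// or `None` when the type is not covered by the table.
fn missing_required(schema_type: &str, present: &[&str]) -> Option<Vec<&'static str>> {
    let required = required_fields(schema_type)?;
    Some(
        required
            .iter()
            .copied()
            .filter(|field| !present.contains(field))
            .collect(),
    )
}

fn main() {
    // Property names found on a hypothetical Article node.
    let present = ["headline", "author"];
    println!("{:?}", missing_required("Article", &present));
}
```

The real profiles go further than presence checks; the "Key Features" column
lists the nested validation (offers, addresses, positions) they perform.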

### Usage

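```rust
use schemaorg_rs::{extract_all, validation};
use schemaorg_rs::profiles::ProfileRegistry;

// Profiles build on vocabulary validation, so extract and validate first.
let graph = extract_all(html)?;
let vocab_diagnostics = validation::validate(&graph).diagnostics;

let registry = ProfileRegistry::with_google();
let result = registry.evaluate("google", &graph, &vocab_diagnostics)?;

// One entry per evaluated Schema.org type.
for tr in &result.type_results {
    println!("{}: eligible={}, missing={:?}",
        tr.schema_type, tr.eligible, tr.required_missing);
}
```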

### Baseline profile

The baseline profile checks generic Schema.org best practices (name, description,
image, URL scheme) without platform-specific rules:

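```rust
use schemaorg_rs::profiles::ProfileRegistry;
// `graph`: from `extract_all`; `vocab_diagnostics`: `validation::validate(&graph).diagnostics`.
let registry = ProfileRegistry::with_baseline();
let result = registry.evaluate("baseline", &graph, &vocab_diagnostics)?;
```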

---

## CLI Usage

See [CLI Reference](cli.md) for the full command reference.

### Quick examples

```bash
# Validate a local file
schemaorg-validate --file page.html --profile google

# Validate a URL
schemaorg-validate --url https://example.com --profile google

# JSON output for CI
schemaorg-validate --file page.html --format json --profile google

# SARIF output for GitHub Code Scanning
schemaorg-validate --file page.html --format sarif --profile google

# Only errors, no warnings
schemaorg-validate --file page.html --severity error

# Exit code only
schemaorg-validate --file page.html --quiet
```

---

## WASM / npm

The library compiles to WebAssembly for use in browsers and Node.js.

```bash
npm install @schemaorg-rs/wasm
```

```javascript
import { validateHtml } from '@schemaorg-rs/wasm';

const result = JSON.parse(validateHtml(htmlString));
console.log(result.eligibility);
```

See `wasm/README.md` for detailed API documentation.

---

## Feature Flags

| Flag | Default | Enables |
|------|---------|---------|
| `extraction` | Yes | HTML parsing, all 3 extractors |
| `validation` | No | Schema.org vocabulary validation |
| `profiles` | No | Rich Results profiles (requires `validation`) |
| `wasm` | No | WASM bindings (requires `profiles`) |
| `cli` | No | CLI binary (requires `profiles`) |
| `full` | No | `extraction` + `validation` + `profiles` |

### Minimal usage (types only, no HTML parsing)

```toml
[dependencies]
schemaorg-rs = { version = "0.1", default-features = false }
```

### Full library

```toml
[dependencies]
schemaorg-rs = { version = "0.1", features = ["full"] }
```