# User Guide
`schemaorg-rs` is a Rust library for extracting, validating, and profiling
[Schema.org](https://schema.org) structured data from HTML documents.
## Table of Contents
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Extraction](#extraction)
- [Validation](#validation)
- [Rich Results Profiles](#rich-results-profiles)
- [CLI Usage](#cli-usage)
- [WASM / npm](#wasm--npm)
- [Feature Flags](#feature-flags)
---
## Installation
### As a Rust library
```toml
[dependencies]
schemaorg-rs = "0.1"
```
### As a CLI tool
```bash
cargo install schemaorg-validate
```
### As an npm package (WASM)
```bash
npm install @schemaorg-rs/wasm
```
---
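## Quick Start
This example uses the `validation` and `profiles` modules, so enable those features (for example via `full`, described under [Feature Flags](#feature-flags)).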
```rust
use schemaorg_rs::{extract_all, validation};
use schemaorg_rs::profiles::{ProfileRegistry, Eligibility};

let html = r#"<script type="application/ld+json">{
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Widget",
    "offers": { "@type": "Offer", "price": "29.99", "priceCurrency": "EUR" }
}</script>"#;

// 1. Extract structured data
let graph = extract_all(html).unwrap();

// 2. Validate against Schema.org vocabulary
let result = validation::validate(&graph);
if result.has_errors() {
    for diag in result.errors() {
        eprintln!("{}: {}", diag.path, diag.message);
    }
}

// 3. Check Rich Results eligibility
let registry = ProfileRegistry::with_google();
let profile_result = registry.evaluate("google", &graph, &result.diagnostics).unwrap();
match profile_result.eligibility {
    Eligibility::Eligible => println!("Rich result eligible!"),
    Eligibility::WarningsOnly => println!("Eligible with warnings"),
    Eligibility::NotEligible => println!("Not eligible"),
    Eligibility::Restricted => println!("Restricted eligibility"),
}
```
---
## Extraction
The extraction engine parses HTML and produces a unified `StructuredDataGraph`
containing all structured data found in the document, across three formats.
### Supported formats
| Format | Markup | Extractor |
|--------|--------|-----------|
| JSON-LD | `<script type="application/ld+json">` | `JsonLdExtractor` |
| Microdata | `itemscope` / `itemprop` attributes | `MicrodataExtractor` |
| RDFa Lite | `vocab` / `typeof` / `property` attributes | `RdfaLiteExtractor` |
### Extract all formats at once
```rust
use schemaorg_rs::extract_all;
let graph = extract_all(html)?;
for node in &graph.nodes {
    println!("{:?}: {:?}", node.source_format, node.types);
}
```
### Use a specific extractor
```rust
use schemaorg_rs::{Extractor, JsonLdExtractor};
let output = JsonLdExtractor.extract(html)?;
```
### Pre-parse HTML for performance
When running multiple extractors on the same document, parse once:
```rust
use schemaorg_rs::{Html, JsonLdExtractor, MicrodataExtractor};
let document = Html::parse_document(html);
let jsonld = JsonLdExtractor.extract_from_document(&document, html)?;
let microdata = MicrodataExtractor.extract_from_document(&document)?;
```
### Features
- `@id` cross-reference resolution (JSON-LD)
- `@graph` array support (JSON-LD)
- `itemref` attribute support (Microdata)
- `prefix` namespace mappings (RDFa)
- Source location tracking (line, column, byte offset)
- Depth-limited recursion for DoS protection
- Automatic Schema.org URL prefix stripping
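
To illustrate the depth-limited recursion item in the list above, here is a minimal, self-contained sketch of the general technique. It is not the library's implementation: the `Value` type, the `MAX_DEPTH` value of 32, and the `count_leaves` and `nested` helpers are all hypothetical names chosen for this example.

```rust
/// Hypothetical nested value, standing in for parsed structured data.
enum Value {
    Leaf(String),
    Node(Vec<Value>),
}

/// Assumed limit for illustration only; the library's real limit may differ.
const MAX_DEPTH: usize = 32;

/// Count leaves, refusing to descend deeper than `MAX_DEPTH` levels.
fn count_leaves(value: &Value, depth: usize) -> Result<usize, String> {
    if depth > MAX_DEPTH {
        return Err(format!("nesting exceeds depth limit of {}", MAX_DEPTH));
    }
    match value {
        Value::Leaf(_) => Ok(1),
        Value::Node(children) => {
            let mut total = 0;
            for child in children {
                // Propagate the error as soon as any branch is too deep.
                total += count_leaves(child, depth + 1)?;
            }
            Ok(total)
        }
    }
}

/// Hypothetical helper: wrap a single leaf in `levels` nested nodes.
fn nested(levels: usize) -> Value {
    let mut value = Value::Leaf("x".to_string());
    for _ in 0..levels {
        value = Value::Node(vec![value]);
    }
    value
}
```

Calling `count_leaves(&nested(3), 0)` succeeds, while `count_leaves(&nested(100), 0)` returns an error instead of recursing further. Rejecting hostile input early like this keeps stack usage bounded regardless of how deeply a page nests its markup.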
---
## Validation
The validation engine checks extracted data against the official Schema.org
vocabulary definitions (compiled at build time from the vendored JSON-LD file).
### What it checks
| Check | Example |
|-------|---------|
| Unknown types | `"@type": "Produc"` -- did you mean `Product`? |
| Unknown properties | `"namee": "Widget"` -- did you mean `name`? |
| Wrong property domain | `price` on `Person` (should be on `Offer`) |
| Value type mismatches | Number where URL expected |
| Deprecated types/properties | Retired to `attic.schema.org` |
| Pending types/properties | In `pending.schema.org` |
| Boolean as string | `"true"` instead of `true` |
| Enum validation | `"InStock"` vs invalid enum value |
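
"Did you mean" suggestions like those for `Produc` and `namee` are commonly produced by comparing the unknown term to known vocabulary terms by edit distance. The sketch below shows that general approach using Levenshtein distance. Whether the library uses this algorithm is an assumption, and `edit_distance` and `suggest` are hypothetical names, not part of its API.

```rust
/// Levenshtein edit distance between two strings, by Unicode scalar value.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // `prev[j]` holds the distance between the processed prefix of `a` and `b[..j]`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical helper: closest known term within `max_distance` edits, if any.
fn suggest<'a>(unknown: &str, known: &[&'a str], max_distance: usize) -> Option<&'a str> {
    known
        .iter()
        .map(|term| (edit_distance(unknown, term), *term))
        .filter(|(distance, _)| *distance <= max_distance)
        .min_by_key(|(distance, _)| *distance)
        .map(|(_, term)| term)
}
```

With a candidate list of `["Person", "Product", "Offer"]` and a maximum distance of 2, `suggest("Produc", ...)` yields `Some("Product")`. The distance cap keeps the helper from offering suggestions that are unrelated to what the author typed.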
### Usage
```rust
use schemaorg_rs::{extract_all, validation};
let graph = extract_all(html)?;
let result = validation::validate(&graph);
for diag in &result.diagnostics {
    println!("[{}] {} -- {}", diag.severity, diag.path, diag.message);
}
```
### Severity levels
- **Error** -- invalid per Schema.org specification
- **Warning** -- deprecated, likely unintended, or potentially incorrect
- **Info** -- informational (e.g. pending types)
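
Because the three levels form an ordering, a minimum-severity filter reduces to a single comparison. The following sketch shows one way to model that. The `Severity` and `Diagnostic` types here are simplified, hypothetical stand-ins, and the library's own types may be shaped differently.

```rust
/// Hypothetical severity type. Variants are declared from least to most
/// severe so that the derived `Ord` gives Info < Warning < Error.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Warning,
    Error,
}

/// Hypothetical diagnostic carrying only the fields this sketch needs.
#[derive(Debug)]
struct Diagnostic {
    severity: Severity,
    message: &'static str,
}

/// Keep only diagnostics at or above the `min` severity.
fn at_least(diagnostics: &[Diagnostic], min: Severity) -> Vec<&Diagnostic> {
    diagnostics.iter().filter(|d| d.severity >= min).collect()
}

/// Made-up sample data, one diagnostic per level.
fn sample_diagnostics() -> Vec<Diagnostic> {
    vec![
        Diagnostic { severity: Severity::Info, message: "type is pending" },
        Diagnostic { severity: Severity::Warning, message: "property is deprecated" },
        Diagnostic { severity: Severity::Error, message: "unknown type" },
    ]
}
```

Filtering the sample data with `Severity::Warning` keeps the warning and the error, and filtering with `Severity::Error` keeps only the error.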
---
## Rich Results Profiles
Profiles add platform-specific rules on top of vocabulary validation.
They answer: "Will Google actually show a rich result for this markup?"
### Supported Google profiles
| Type | Required properties | Additional checks |
|------|---------------------|-------------------|
| Product | `name` | Offer validation, review/rating checks |
| Article | `headline`, `image`, `datePublished`, `author` | Publisher validation |
| FAQPage | `mainEntity` with Q&A pairs | Restricted eligibility (2024+) |
| BreadcrumbList | `itemListElement` | Position/URL validation |
| LocalBusiness | `name`, `address` | Address component checks |
| Event | `name`, `startDate`, `location` | Place/address validation |
| Recipe | `name`, `image` | Instruction/ingredient checks |
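
Conceptually, the required-property part of a profile is a lookup from type to property names followed by a set difference. The sketch below uses the required properties from the table above. The functions `required_properties` and `missing_required` are hypothetical and only illustrate the idea, and the real profiles also run the additional checks listed in the table.

```rust
/// Required properties per type, copied from the table above.
/// Hypothetical lookup; unknown types have no requirements in this sketch.
fn required_properties(schema_type: &str) -> &'static [&'static str] {
    match schema_type {
        "Product" => &["name"],
        "Article" => &["headline", "image", "datePublished", "author"],
        "FAQPage" => &["mainEntity"],
        "BreadcrumbList" => &["itemListElement"],
        "LocalBusiness" => &["name", "address"],
        "Event" => &["name", "startDate", "location"],
        "Recipe" => &["name", "image"],
        _ => &[],
    }
}

/// Return the required properties that are absent from `present`,
/// in the order they are listed for the type.
fn missing_required(schema_type: &str, present: &[&str]) -> Vec<&'static str> {
    required_properties(schema_type)
        .iter()
        .copied()
        .filter(|property| !present.contains(property))
        .collect()
}
```

For an `Article` node that only carries `headline` and `author`, `missing_required` returns `image` and `datePublished`, which is the kind of information reported per type in `required_missing`.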
### Usage
```rust
use schemaorg_rs::profiles::{ProfileRegistry, Eligibility};
let registry = ProfileRegistry::with_google();
let result = registry.evaluate("google", &graph, &vocab_diagnostics)?;
for tr in &result.type_results {
    println!("{}: eligible={}, missing={:?}",
        tr.schema_type, tr.eligible, tr.required_missing);
}
```
### Baseline profile
The baseline profile checks generic Schema.org best practices (name, description,
image, URL scheme) without platform-specific rules:
```rust
let registry = ProfileRegistry::with_baseline();
let result = registry.evaluate("baseline", &graph, &vocab_diagnostics)?;
```
---
## CLI Usage
See [CLI Reference](cli.md) for the full command reference.
### Quick examples
```bash
# Validate a local file
schemaorg-validate --file page.html --profile google
# Validate a URL
schemaorg-validate --url https://example.com --profile google
# JSON output for CI
schemaorg-validate --file page.html --format json --profile google
# SARIF output for GitHub Code Scanning
schemaorg-validate --file page.html --format sarif --profile google
# Only errors, no warnings
schemaorg-validate --file page.html --severity error
# Exit code only
schemaorg-validate --file page.html --quiet
```
---
## WASM / npm
The library compiles to WebAssembly for use in browsers and Node.js.
```bash
npm install @schemaorg-rs/wasm
```
```javascript
import { validateHtml } from '@schemaorg-rs/wasm';
const result = JSON.parse(validateHtml(htmlString));
console.log(result.eligibility);
```
See `wasm/README.md` for detailed API documentation.
---
## Feature Flags
| Feature | Default | Description |
|---------|---------|-------------|
| `extraction` | Yes | HTML parsing, all 3 extractors |
| `validation` | No | Schema.org vocabulary validation |
| `profiles` | No | Rich Results profiles (requires `validation`) |
| `wasm` | No | WASM bindings (requires `profiles`) |
| `cli` | No | CLI binary (requires `profiles`) |
| `full` | No | `extraction` + `validation` + `profiles` |
### Minimal usage (types only, no HTML parsing)
```toml
[dependencies]
schemaorg-rs = { version = "0.1", default-features = false }
```
### Full library
```toml
[dependencies]
schemaorg-rs = { version = "0.1", features = ["full"] }
```