schemaorg-validate 0.3.0

Parse and validate Schema.org structured data (JSON-LD, Microdata, RDFa) against the official vocabulary and Google Rich Results profiles.
# Architecture

Internal design documentation for `schemaorg-rs`.

## Crate Layout

```
schemaorg-rs/
  src/
    lib.rs              # Public API surface, feature gates
    types.rs            # SchemaNode, SchemaValue, SourceFormat
    error.rs            # ExtractionError, ExtractionWarning
    graph.rs            # extract_all(), StructuredDataGraph
    extraction/
      mod.rs            # Extractor trait, shared helpers
      jsonld.rs         # JSON-LD extractor (~42KB)
      microdata.rs      # Microdata extractor (~34KB)
      rdfa.rs           # RDFa Lite extractor (~28KB)
    validation/
      mod.rs            # validate(), ValidationResult
      diagnostics.rs    # Severity, DiagnosticCode, ValidationDiagnostic
      type_checker.rs   # Unknown/deprecated/pending type checks
      property_checker.rs # Property domain + superseded checks
      value_checker.rs  # Value type matching + coercion
    vocabulary/
      mod.rs            # Public API: lookup_type/property/enum
      types.rs          # TypeDef, PropertyDef, EnumMemberDef
      (generated.rs)    # build.rs output in $OUT_DIR
    profiles/
      mod.rs            # Profile trait, ProfileRegistry, Eligibility
      engine.rs         # Graph evaluation + eligibility aggregation
      baseline.rs       # Generic best-practice profile
      google/
        mod.rs          # register_all()
        common.rs       # Shared helpers (has_property, check_nested)
        article.rs      # Google Article profile
        breadcrumb.rs   # Google BreadcrumbList profile
        event.rs        # Google Event profile
        faqpage.rs      # Google FAQPage profile
        local_business.rs # Google LocalBusiness profile
        product.rs      # Google Product profile
        recipe.rs       # Google Recipe profile
    sarif.rs            # SARIF 2.1.0 output (cli feature)
    wasm.rs             # WASM bindings (wasm feature)
    bin/
      validate.rs       # CLI binary (cli feature)
  build.rs              # Codegen pipeline
  schema-data/          # Vendored Schema.org vocabulary
  wasm/                 # npm package wrapper
  scripts/
    build-wasm.sh       # WASM build + optimization script
```

## Data Flow

```
HTML input
    |
    v
scraper::Html::parse_document()
    |
    +---> JsonLdExtractor   ---> ExtractionOutput { nodes, warnings }
    |                                   |
    +---> MicrodataExtractor ---> ExtractionOutput { nodes, warnings }
    |                                   |
    +---> RdfaLiteExtractor  ---> ExtractionOutput { nodes, warnings }
    |
    v
StructuredDataGraph { nodes: Vec<SchemaNode>, warnings: Vec<ExtractionWarning> }
    |
    v
validation::validate(&graph)
    |
    +---> type_checker::check_type()       -- unknown, deprecated, pending
    +---> property_checker::check_property() -- domain, superseded
    +---> value_checker::check_value()     -- type mismatch, coercion
    |
    v
ValidationResult { diagnostics: Vec<ValidationDiagnostic> }
    |
    v
ProfileRegistry::evaluate("google", &graph, &diagnostics)
    |
    +---> engine::evaluate_graph() for each registered profile
    |     +---> Profile::evaluate_node() for matching nodes
    |     +---> aggregate_eligibility()
    |
    v
ProfileResult { eligibility, type_results, diagnostics }
```
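The first half of the flow (several extractors feeding one graph) can be sketched as follows. This is a minimal illustration, not the crate's code: the `Extractor` trait shape, the `Result<_, String>` return type, the simplified structs, the two toy extractors, and the name `extract_all_sketch` are all assumptions made for the example.

```rust
// Simplified stand-ins for the crate's types (field layout is assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct SchemaNode {
    pub type_name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ExtractionWarning {
    pub format: &'static str,
    pub message: String,
}

#[derive(Debug, Default)]
pub struct ExtractionOutput {
    pub nodes: Vec<SchemaNode>,
    pub warnings: Vec<ExtractionWarning>,
}

#[derive(Debug, Default)]
pub struct StructuredDataGraph {
    pub nodes: Vec<SchemaNode>,
    pub warnings: Vec<ExtractionWarning>,
}

/// Hypothetical trait shape; the real `Extractor` trait may differ.
pub trait Extractor {
    fn format(&self) -> &'static str;
    fn extract(&self, html: &str) -> Result<ExtractionOutput, String>;
}

/// Runs every extractor and merges the outputs into one graph.
/// A failing extractor contributes a warning instead of aborting the run.
pub fn extract_all_sketch(html: &str, extractors: &[&dyn Extractor]) -> StructuredDataGraph {
    let mut graph = StructuredDataGraph::default();
    for extractor in extractors {
        match extractor.extract(html) {
            Ok(output) => {
                graph.nodes.extend(output.nodes);
                graph.warnings.extend(output.warnings);
            }
            Err(message) => graph.warnings.push(ExtractionWarning {
                format: extractor.format(),
                message,
            }),
        }
    }
    graph
}

/// Toy extractor that always yields one node.
pub struct AlwaysOk;

impl Extractor for AlwaysOk {
    fn format(&self) -> &'static str {
        "json-ld"
    }
    fn extract(&self, _html: &str) -> Result<ExtractionOutput, String> {
        Ok(ExtractionOutput {
            nodes: vec![SchemaNode { type_name: "Product".to_string() }],
            warnings: Vec::new(),
        })
    }
}

/// Toy extractor that always fails.
pub struct AlwaysFails;

impl Extractor for AlwaysFails {
    fn format(&self) -> &'static str {
        "microdata"
    }
    fn extract(&self, _html: &str) -> Result<ExtractionOutput, String> {
        Err("malformed itemscope".to_string())
    }
}
```

Calling `extract_all_sketch(html, &[&AlwaysOk, &AlwaysFails])` yields a graph with one node and one warning tagged `microdata`.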

## Codegen Pipeline

The vocabulary data is compiled into Rust code at build time:

```
schema-data/schemaorg-all-https.jsonld
    |
    v
build.rs
    |
    +---> Parse JSON-LD vocabulary
    +---> Resolve type inheritance (flatten parent chains)
    +---> Collect all_properties = own + inherited (sorted)
    +---> Generate static match statements
    |
    v
$OUT_DIR/generated.rs
    |
    +---> fn lookup_type("Product") -> Option<&'static TypeDef>
    +---> fn lookup_property("name") -> Option<&'static PropertyDef>
    +---> fn lookup_enum_member("InStock") -> Option<&'static EnumMemberDef>
    +---> fn schema_version() -> &'static str
    +---> const ALL_TYPE_NAMES: &[&str]
    +---> const ALL_PROPERTY_NAMES: &[&str]
```

**Key design decisions:**

1. **Static match statements** instead of HashMaps -- no heap allocation and no
   map construction at startup; the compiler lowers each `match` to static
   comparison code
2. **Inheritance resolved at compile time** -- `Product.all_properties` includes
   everything from `Thing`, no runtime traversal needed for property lookup
3. **Binary search** on sorted `all_properties` slices -- O(log n) property
   membership checks via `TypeDef::has_property()`
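
A rough sketch of what the three decisions look like together. The `TypeDef` fields and the two hand-written entries are simplified assumptions for illustration; the actual generated code covers the full vocabulary and may be laid out differently.

```rust
/// Simplified stand-in for the crate's `TypeDef` (fields are assumed).
pub struct TypeDef {
    pub name: &'static str,
    pub parent: Option<&'static str>,
    /// Own + inherited property names, kept sorted so binary search is valid.
    pub all_properties: &'static [&'static str],
}

impl TypeDef {
    /// O(log n) membership test on the sorted slice.
    pub fn has_property(&self, name: &str) -> bool {
        self.all_properties.binary_search(&name).is_ok()
    }
}

static THING: TypeDef = TypeDef {
    name: "Thing",
    parent: None,
    all_properties: &["description", "name", "url"],
};

static PRODUCT: TypeDef = TypeDef {
    name: "Product",
    parent: Some("Thing"),
    // The properties of `Thing` are already flattened in, so no parent
    // traversal is needed at runtime.
    all_properties: &["brand", "description", "name", "offers", "sku", "url"],
};

/// The kind of function a build script could emit: a plain `match`
/// over string literals returning references to statics.
pub fn lookup_type(name: &str) -> Option<&'static TypeDef> {
    match name {
        "Thing" => Some(&THING),
        "Product" => Some(&PRODUCT),
        _ => None,
    }
}
```

With these definitions, `lookup_type("Product")` returns a definition for which `has_property("name")` is true even though `name` is declared on `Thing`.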

## Feature Flag Architecture

```
extraction ---> validation ---> profiles --+--> wasm
                                           |
                                           +--> cli

full = extraction + validation + profiles
```

Each layer depends on the previous:
- `validation` requires `extraction` (needs `StructuredDataGraph`)
- `profiles` requires `validation` (needs `ValidationDiagnostic`)
- `wasm` and `cli` require `profiles` (full pipeline)

**Core types** (`SchemaNode`, `SchemaValue`, `SourceFormat`, error types)
are always available even with `default-features = false`.

## WASM Strategy

- Target: `wasm32-unknown-unknown` via `wasm-pack`
- Interface: functions accept/return `String` (JSON serialized)
- No `serde-wasm-bindgen` -- keeps the boundary simple and debuggable
- Size budget: < 500KB for the `.wasm` binary
- Optimizations: `opt-level = 'z'`, LTO, single codegen unit, `wasm-opt -Oz`

## Profile Engine Design

Profiles implement the `Profile` trait:

```rust
trait Profile: Send + Sync {
    fn name(&self) -> &'static str;
    fn supported_types(&self) -> &[&str];
    fn evaluate_node(&self, node: &SchemaNode, vocab_diagnostics: &[ValidationDiagnostic])
        -> NodeProfileResult;
}
```
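
As an illustration of implementing the trait, here is a toy profile. The trait is copied from above; `SchemaNode`, `ValidationDiagnostic`, and `NodeProfileResult` are reduced to hypothetical stand-ins, and the "required property" rule is invented for the example rather than taken from any shipped profile.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins; the crate's real types carry more information.
pub struct SchemaNode {
    pub types: Vec<String>,
    pub properties: BTreeMap<String, String>,
}

pub struct ValidationDiagnostic {
    pub message: String,
}

#[derive(Debug, Default, PartialEq)]
pub struct NodeProfileResult {
    pub missing_required: Vec<&'static str>,
}

pub trait Profile: Send + Sync {
    fn name(&self) -> &'static str;
    fn supported_types(&self) -> &[&str];
    fn evaluate_node(&self, node: &SchemaNode, vocab_diagnostics: &[ValidationDiagnostic])
        -> NodeProfileResult;
}

/// Toy profile: reports which of two properties are absent from a node.
pub struct ToyRecipeProfile;

const TOY_REQUIRED: &[&str] = &["image", "name"];

impl Profile for ToyRecipeProfile {
    fn name(&self) -> &'static str {
        "toy"
    }

    fn supported_types(&self) -> &[&str] {
        &["Recipe"]
    }

    fn evaluate_node(&self, node: &SchemaNode, _vocab_diagnostics: &[ValidationDiagnostic])
        -> NodeProfileResult {
        // Collect every required property the node does not carry.
        let missing_required = TOY_REQUIRED
            .iter()
            .copied()
            .filter(|property| !node.properties.contains_key(*property))
            .collect();
        NodeProfileResult { missing_required }
    }
}
```

A unit struct is enough here because the profile holds no state, which also makes the `Send + Sync` bound trivially satisfied.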

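The engine matches a node to a profile when the node's type is one of the
profile's `supported_types()` or a subtype of one (subtype relationships are
resolved with a breadth-first search over the type hierarchy). Multiple
profiles registered under the same name run independently, and their results
are merged.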
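
The subtype match can be sketched as a breadth-first walk from the node's type up through its parents. The direction of the walk, the function names, and the hard-coded `parents_of` table are assumptions for this example; the engine would read parent links from the generated vocabulary.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical parent lookup covering a small slice of the hierarchy.
fn parents_of(type_name: &str) -> &'static [&'static str] {
    match type_name {
        "Restaurant" => &["FoodEstablishment"],
        "FoodEstablishment" => &["LocalBusiness"],
        "LocalBusiness" => &["Organization", "Place"],
        "Organization" | "Place" => &["Thing"],
        _ => &[],
    }
}

/// Returns true if `node_type` or any of its ancestors is in `supported_types`.
pub fn matches_profile(node_type: &str, supported_types: &[&str]) -> bool {
    let mut queue: VecDeque<&str> = VecDeque::from([node_type]);
    let mut seen: HashSet<&str> = HashSet::new();

    while let Some(current) = queue.pop_front() {
        // Types with several parents can be reached twice; visit each once.
        if !seen.insert(current) {
            continue;
        }
        if supported_types.contains(&current) {
            return true;
        }
        queue.extend(parents_of(current).iter().copied());
    }
    false
}
```

Under this table, a `Restaurant` node matches a profile that supports `LocalBusiness`, while a profile that supports only `Product` does not match it.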

**Eligibility aggregation:**
- All types eligible + no diagnostics = `Eligible`
- All types eligible + warnings only = `WarningsOnly`
- Any type not eligible = `NotEligible`
- `EligibilityRestricted` diagnostic present = `Restricted`
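
One way to express these rules in code. The list above does not say which rule wins when several apply, so the precedence used here (`NotEligible`, then `Restricted`, then `WarningsOnly`, then `Eligible`) is an assumption, as are the `TypeResult` fields and the function name `aggregate`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Eligibility {
    Eligible,
    WarningsOnly,
    Restricted,
    NotEligible,
}

/// Hypothetical per-type summary fed into aggregation.
#[derive(Debug, Clone, Copy)]
pub struct TypeResult {
    pub eligible: bool,
    pub warning_count: usize,
    /// True if an `EligibilityRestricted` diagnostic was raised for this type.
    pub restricted: bool,
}

/// Folds per-type results into one overall eligibility.
/// Precedence is assumed: NotEligible > Restricted > WarningsOnly > Eligible.
pub fn aggregate(results: &[TypeResult]) -> Eligibility {
    if results.iter().any(|r| !r.eligible) {
        return Eligibility::NotEligible;
    }
    if results.iter().any(|r| r.restricted) {
        return Eligibility::Restricted;
    }
    if results.iter().any(|r| r.warning_count > 0) {
        return Eligibility::WarningsOnly;
    }
    // Also reached for an empty slice, which this sketch treats as eligible.
    Eligibility::Eligible
}
```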

## Error Handling

- **Library code:** `thiserror` for all error types
- **CLI binary:** `thiserror` with a `CliError` enum (no `anyhow`)
- **Extraction:** lenient -- a failure in one format becomes a warning, and
  the other formats still produce results
- **Validation:** infallible -- always returns `ValidationResult`
- **Profiles:** `ProfileError` for registry misuse (unknown profile name)
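
The registry-misuse case might look like the sketch below. The crate derives its error types with `thiserror`; this example writes the equivalent impls by hand so that it needs only the standard library. The `UnknownProfile` variant, the `Registry` struct, and its methods are hypothetical names for illustration.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical error variant for looking up a profile that was never registered.
#[derive(Debug, PartialEq, Eq)]
pub enum ProfileError {
    UnknownProfile(String),
}

// Hand-written equivalent of what a `thiserror` derive would generate.
impl fmt::Display for ProfileError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProfileError::UnknownProfile(name) => write!(f, "unknown profile: {}", name),
        }
    }
}

impl std::error::Error for ProfileError {}

/// Toy registry mapping a profile name to the types it supports.
#[derive(Default)]
pub struct Registry {
    profiles: BTreeMap<String, Vec<&'static str>>,
}

impl Registry {
    pub fn register(&mut self, name: &str, supported_types: Vec<&'static str>) {
        self.profiles.insert(name.to_string(), supported_types);
    }

    /// Fails with `UnknownProfile` instead of panicking or returning an empty list.
    pub fn supported_types(&self, name: &str) -> Result<&[&'static str], ProfileError> {
        self.profiles
            .get(name)
            .map(|types| types.as_slice())
            .ok_or_else(|| ProfileError::UnknownProfile(name.to_string()))
    }
}
```

Returning a typed error keeps the misuse visible to callers, which fits the split above: validation never fails, while asking for a profile that does not exist does.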