# Architecture
Internal design documentation for `schemaorg-rs`.
## Crate Layout
```
schemaorg-rs/
  src/
    lib.rs                    # Public API surface, feature gates
    types.rs                  # SchemaNode, SchemaValue, SourceFormat
    error.rs                  # ExtractionError, ExtractionWarning
    graph.rs                  # extract_all(), StructuredDataGraph
    extraction/
      mod.rs                  # Extractor trait, shared helpers
      jsonld.rs               # JSON-LD extractor (~42KB)
      microdata.rs            # Microdata extractor (~34KB)
      rdfa.rs                 # RDFa Lite extractor (~28KB)
    validation/
      mod.rs                  # validate(), ValidationResult
      diagnostics.rs          # Severity, DiagnosticCode, ValidationDiagnostic
      type_checker.rs         # Unknown/deprecated/pending type checks
      property_checker.rs     # Property domain + superseded checks
      value_checker.rs        # Value type matching + coercion
    vocabulary/
      mod.rs                  # Public API: lookup_type/property/enum
      types.rs                # TypeDef, PropertyDef, EnumMemberDef
      (generated.rs)          # build.rs output in $OUT_DIR
    profiles/
      mod.rs                  # Profile trait, ProfileRegistry, Eligibility
      engine.rs               # Graph evaluation + eligibility aggregation
      baseline.rs             # Generic best-practice profile
      google/
        mod.rs                # register_all()
        common.rs             # Shared helpers (has_property, check_nested)
        article.rs            # Google Article profile
        breadcrumb.rs         # Google BreadcrumbList profile
        event.rs              # Google Event profile
        faqpage.rs            # Google FAQPage profile
        local_business.rs     # Google LocalBusiness profile
        product.rs            # Google Product profile
        recipe.rs             # Google Recipe profile
    sarif.rs                  # SARIF 2.1.0 output (cli feature)
    wasm.rs                   # WASM bindings (wasm feature)
    bin/
      validate.rs             # CLI binary (cli feature)
  build.rs                    # Codegen pipeline
  schema-data/                # Vendored Schema.org vocabulary
  wasm/                       # npm package wrapper
  scripts/
    build-wasm.sh             # WASM build + optimization script
```
## Data Flow
```
HTML input
    |
    v
scraper::Html::parse_document()
    |
    +---> JsonLdExtractor    ---> ExtractionOutput { nodes, warnings }
    |                                   |
    +---> MicrodataExtractor ---> ExtractionOutput { nodes, warnings }
    |                                   |
    +---> RdfaLiteExtractor  ---> ExtractionOutput { nodes, warnings }
    |
    v
StructuredDataGraph { nodes: Vec<SchemaNode>, warnings: Vec<ExtractionWarning> }
    |
    v
validation::validate(&graph)
    |
    +---> type_checker::check_type()         -- unknown, deprecated, pending
    +---> property_checker::check_property() -- domain, superseded
    +---> value_checker::check_value()       -- type mismatch, coercion
    |
    v
ValidationResult { diagnostics: Vec<ValidationDiagnostic> }
    |
    v
ProfileRegistry::evaluate("google", &graph, &diagnostics)
    |
    +---> engine::evaluate_graph() for each registered profile
    |       +---> Profile::evaluate_node() for matching nodes
    |       +---> aggregate_eligibility()
    |
    v
ProfileResult { eligibility, type_results, diagnostics }
```
## Codegen Pipeline
The vocabulary data is compiled into Rust code at build time:
```
schema-data/schemaorg-all-https.jsonld
    |
    v
build.rs
    |
    +---> Parse JSON-LD vocabulary
    +---> Resolve type inheritance (flatten parent chains)
    +---> Collect all_properties = own + inherited (sorted)
    +---> Generate static match statements
    |
    v
$OUT_DIR/generated.rs
    |
    +---> fn lookup_type("Product") -> Option<&'static TypeDef>
    +---> fn lookup_property("name") -> Option<&'static PropertyDef>
    +---> fn lookup_enum_member("InStock") -> Option<&'static EnumMemberDef>
    +---> fn schema_version() -> &'static str
    +---> const ALL_TYPE_NAMES: &[&str]
    +---> const ALL_PROPERTY_NAMES: &[&str]
```
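The inheritance-flattening step can be sketched roughly as follows. This is an illustration only, not the actual `build.rs` code: `RawType`, `flatten_properties`, and `sample_vocabulary` are hypothetical names, and the two-type vocabulary is a stand-in for the parsed JSON-LD data.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical intermediate record, roughly what a build script might
/// hold after parsing the vocabulary (not the crate's real type).
pub struct RawType {
    pub parents: Vec<&'static str>,
    pub own_properties: Vec<&'static str>,
}

/// Collect own + inherited property names for `name`.
/// A BTreeSet keeps the result sorted and deduplicated, which is what a
/// later binary search over `all_properties` would rely on.
pub fn flatten_properties(name: &str, types: &HashMap<&str, RawType>) -> Vec<&'static str> {
    let mut props = BTreeSet::new();
    let mut visited = BTreeSet::new();
    let mut stack = vec![name];
    while let Some(current) = stack.pop() {
        // Skip types already seen (multiple inheritance can revisit a parent).
        if !visited.insert(current) {
            continue;
        }
        if let Some(raw) = types.get(current) {
            props.extend(raw.own_properties.iter().copied());
            stack.extend(raw.parents.iter().copied());
        }
    }
    props.into_iter().collect()
}

/// Tiny stand-in vocabulary, used only for illustration.
pub fn sample_vocabulary() -> HashMap<&'static str, RawType> {
    let mut types = HashMap::new();
    types.insert(
        "Thing",
        RawType { parents: vec![], own_properties: vec!["name", "url"] },
    );
    types.insert(
        "Product",
        RawType { parents: vec!["Thing"], own_properties: vec!["sku", "brand"] },
    );
    types
}
```

With the sample data, flattening `Product` yields its own properties plus those of `Thing`, in sorted order.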
**Key design decisions:**
1. **Static match statements** instead of HashMaps -- zero heap allocation and
no runtime initialization; lookups compile down to static comparison code
2. **Inheritance resolved at compile time** -- `Product.all_properties` includes
everything from `Thing`, no runtime traversal needed for property lookup
3. **Binary search** on sorted `all_properties` slices -- O(log n) property
membership checks via `TypeDef::has_property()`
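A minimal sketch of what the generated lookup and the membership check might look like, assuming a simplified `TypeDef` with only two fields. The real struct likely carries more, and the two entries here are placeholders for the generated data.

```rust
/// Simplified, hypothetical shape of a generated type definition.
pub struct TypeDef {
    pub name: &'static str,
    /// Own + inherited property names, sorted so binary search is valid.
    pub all_properties: &'static [&'static str],
}

impl TypeDef {
    /// O(log n) membership check on the sorted slice.
    pub fn has_property(&self, property: &str) -> bool {
        self.all_properties.binary_search(&property).is_ok()
    }
}

// Static data as a build script could emit it (placeholder entries).
static THING: TypeDef = TypeDef {
    name: "Thing",
    all_properties: &["name", "url"],
};
static PRODUCT: TypeDef = TypeDef {
    name: "Product",
    all_properties: &["brand", "name", "sku", "url"],
};

/// Static match instead of a HashMap: no allocation, no runtime setup.
pub fn lookup_type(name: &str) -> Option<&'static TypeDef> {
    match name {
        "Thing" => Some(&THING),
        "Product" => Some(&PRODUCT),
        _ => None,
    }
}
```

Because `Product.all_properties` already contains the properties of `Thing`, `lookup_type("Product")` followed by `has_property("name")` needs no walk up the hierarchy.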
## Feature Flag Architecture
```
extraction -------+
                  |
validation -------+---> profiles --+--> wasm
                                   |
                                   +--> cli

full = extraction + validation + profiles
```
Each layer depends on the previous:
- `validation` requires `extraction` (needs `StructuredDataGraph`)
- `profiles` requires `validation` (needs `ValidationDiagnostic`)
- `wasm` and `cli` require `profiles` (full pipeline)
**Core types** (`SchemaNode`, `SchemaValue`, `SourceFormat`, error types)
are always available even with `default-features = false`.
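The dependency chain above could be expressed in `Cargo.toml` along these lines. This is a sketch built only from the feature names in this section. Optional dependencies and the default feature set are left out because they are not specified here.

```toml
# Sketch only: optional dependencies and `default` are intentionally omitted.
[features]
extraction = []
validation = ["extraction"]
profiles = ["validation"]
wasm = ["profiles"]
cli = ["profiles"]
full = ["extraction", "validation", "profiles"]
```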
## WASM Strategy
- Target: `wasm32-unknown-unknown` via `wasm-pack`
- Interface: functions accept/return `String` (JSON serialized)
- No `serde-wasm-bindgen` -- keeps the boundary simple and debuggable
- Size budget: < 500KB for the `.wasm` binary
- Optimizations: `opt-level = 'z'`, LTO, single codegen unit, `wasm-opt -Oz`
## Profile Engine Design
Profiles implement the `Profile` trait:
```rust
trait Profile: Send + Sync {
    fn name(&self) -> &'static str;
    fn supported_types(&self) -> &[&str];
    fn evaluate_node(
        &self,
        node: &SchemaNode,
        vocab_diagnostics: &[ValidationDiagnostic],
    ) -> NodeProfileResult;
}
```
The engine matches a node to a profile when the node's type, or one of
its ancestors found by a breadth-first walk up the type hierarchy, is
listed in the profile's `supported_types()`. Multiple profiles with the
same name run independently, and their results are merged.
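The subtype-aware matching can be sketched as a breadth-first search over parent types. Both `parents_of` and `matches_supported_type` are hypothetical names; the real engine would presumably read parents from the generated vocabulary rather than from a hard-coded table.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical parent lookup with a few hard-coded Schema.org types.
fn parents_of(type_name: &str) -> &'static [&'static str] {
    match type_name {
        "Restaurant" => &["FoodEstablishment"],
        "FoodEstablishment" => &["LocalBusiness"],
        "LocalBusiness" => &["Organization", "Place"],
        "Organization" | "Place" => &["Thing"],
        _ => &[],
    }
}

/// Returns true if `node_type` or any of its ancestors is in `supported`.
pub fn matches_supported_type(node_type: &str, supported: &[&str]) -> bool {
    let mut queue = VecDeque::from([node_type]);
    let mut seen = HashSet::new();
    while let Some(current) = queue.pop_front() {
        // A type reachable through two parents is only visited once.
        if !seen.insert(current) {
            continue;
        }
        if supported.contains(&current) {
            return true;
        }
        queue.extend(parents_of(current).iter().copied());
    }
    false
}
```

Under this sketch a `Restaurant` node matches a profile that supports `LocalBusiness`, while a plain `Place` node does not.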
**Eligibility aggregation:**
- All types eligible + no diagnostics = `Eligible`
- All types eligible + warnings only = `WarningsOnly`
- Any type not eligible = `NotEligible`
- `EligibilityRestricted` diagnostic present = `Restricted`
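One way to read these rules is as a precedence chain. The sketch below assumes the order `Restricted` > `NotEligible` > `WarningsOnly` > `Eligible`, which the list above does not state explicitly. `TypeSummary` and this signature of `aggregate_eligibility` are hypothetical simplifications.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Eligibility {
    Eligible,
    WarningsOnly,
    NotEligible,
    Restricted,
}

/// Hypothetical per-type summary fed into the aggregation.
pub struct TypeSummary {
    pub eligible: bool,
    pub warning_count: usize,
    /// True if an `EligibilityRestricted` diagnostic was raised.
    pub restricted: bool,
}

/// Assumed precedence: Restricted > NotEligible > WarningsOnly > Eligible.
/// An empty input is treated as Eligible in this sketch.
pub fn aggregate_eligibility(types: &[TypeSummary]) -> Eligibility {
    if types.iter().any(|t| t.restricted) {
        return Eligibility::Restricted;
    }
    if types.iter().any(|t| !t.eligible) {
        return Eligibility::NotEligible;
    }
    if types.iter().any(|t| t.warning_count > 0) {
        return Eligibility::WarningsOnly;
    }
    Eligibility::Eligible
}
```

If the real engine resolves conflicts differently, for example by letting `NotEligible` win over `Restricted`, only the order of the checks changes.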
## Error Handling
- **Library code:** `thiserror` for all error types
- **CLI binary:** `thiserror` with a `CliError` enum (no `anyhow`)
- **Extraction:** lenient -- individual format failures become warnings,
other formats still produce results
- **Validation:** infallible -- always returns `ValidationResult`
- **Profiles:** `ProfileError` for registry misuse (unknown profile name)
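The lenient extraction policy can be illustrated with a small std-only sketch. Everything here is a stand-in: the error type is written by hand where the crate would derive it with `thiserror`, nodes and warnings are plain strings, and `extract_all_lenient` and the two stub extractors are hypothetical.

```rust
use std::fmt;

/// Hand-written stand-in for a `thiserror`-derived error type.
#[derive(Debug)]
pub struct ExtractionError {
    pub format: &'static str,
    pub reason: String,
}

impl fmt::Display for ExtractionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} extraction failed: {}", self.format, self.reason)
    }
}

impl std::error::Error for ExtractionError {}

/// Simplified graph: nodes and warnings are plain strings in this sketch.
pub struct Graph {
    pub nodes: Vec<String>,
    pub warnings: Vec<String>,
}

pub type ExtractFn = fn(&str) -> Result<Vec<String>, ExtractionError>;

/// Run every extractor; a failure in one format is downgraded to a warning
/// so the other formats still contribute their nodes.
pub fn extract_all_lenient(html: &str, extractors: &[ExtractFn]) -> Graph {
    let mut graph = Graph { nodes: Vec::new(), warnings: Vec::new() };
    for extract in extractors {
        match extract(html) {
            Ok(nodes) => graph.nodes.extend(nodes),
            Err(err) => graph.warnings.push(err.to_string()),
        }
    }
    graph
}

/// Stub extractor that "finds" one node when a JSON-LD script tag is present.
pub fn jsonld_stub(html: &str) -> Result<Vec<String>, ExtractionError> {
    if html.contains("application/ld+json") {
        Ok(vec!["jsonld-node".to_string()])
    } else {
        Ok(Vec::new())
    }
}

/// Stub extractor that always fails, to exercise the warning path.
pub fn microdata_stub(_html: &str) -> Result<Vec<String>, ExtractionError> {
    Err(ExtractionError { format: "microdata", reason: "malformed itemscope".to_string() })
}
```

Running both stubs over a page with a JSON-LD script tag produces one node and one warning; the caller never sees an `Err`.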