# MetaOxide Feature Expansion Plan: RDFa and Web App Manifest
**Document Version**: 1.0.0
**Created**: 2025-11-25
**Author**: Architecture Team
**Status**: Ready for Implementation
---
## Executive Summary
This document provides a comprehensive implementation plan for adding **RDFa (Resource Description Framework in Attributes)** and **Web App Manifest** support to MetaOxide. These features address:
1. **RDFa**: W3C standard with 62% desktop adoption, enabling semantic web markup extraction
2. **Web App Manifest**: Growing PWA standard for extracting application metadata
The implementation follows MetaOxide's established patterns:
- Rust extractors returning the crate's `Result<T>` alias, with errors reported through `errors::MicroformatError`
- PyO3 Python bindings using `to_py_dict()` pattern
- Integration into `extract_all()` for unified extraction
- Comprehensive test coverage (targeting 1.5-2.0 test/code ratio)
**Estimated Total Effort**: 10-14 engineering days
**Test Target**: 150+ new tests across both features
---
## Table of Contents
1. [Current Architecture Analysis](#current-architecture-analysis)
2. [RDFa Implementation Design](#rdfa-implementation-design)
3. [Web App Manifest Implementation Design](#web-app-manifest-implementation-design)
4. [Technical Debt Assessment](#technical-debt-assessment)
5. [File Structure](#file-structure)
6. [Phase-by-Phase Implementation](#phase-by-phase-implementation)
7. [Testing Strategy](#testing-strategy)
8. [Risk Assessment](#risk-assessment)
9. [Detailed Todo List](#detailed-todo-list)
---
## Current Architecture Analysis
### Established Patterns
The codebase follows consistent patterns that new features MUST adhere to:
#### 1. Extractor Module Pattern
```
src/extractors/{format}/
  mod.rs      # Main extraction logic with extract() function
  tests.rs    # Comprehensive test suite
```
#### 2. Type Definition Pattern
```rust
// src/types/{format}.rs
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct TypeName {
    pub field: Option<String>,
    // ...
}

impl TypeName {
    pub fn to_py_dict(&self, py: Python) -> Py<PyDict> {
        // Convert to Python dict
    }
}
```
#### 3. Extractor Function Signature
```rust
pub fn extract(html: &str, base_url: Option<&str>) -> Result<TypeName>
```
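The signature above relies on a crate-level `Result` alias. The following is a minimal sketch of how such an alias and its error type could fit together. The variant names, the `errors` module layout, and the stand-in `extract()` body are assumptions made for illustration; the real `errors::MicroformatError` may look different.

```rust
pub mod errors {
    use std::fmt;

    /// Hypothetical variants; the real `MicroformatError` may differ.
    #[derive(Debug, Clone, PartialEq)]
    pub enum MicroformatError {
        /// The input could not be parsed.
        ParseError(String),
        /// A URL could not be interpreted.
        InvalidUrl(String),
    }

    impl fmt::Display for MicroformatError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                MicroformatError::ParseError(msg) => write!(f, "parse error: {}", msg),
                MicroformatError::InvalidUrl(url) => write!(f, "invalid URL: {}", url),
            }
        }
    }

    impl std::error::Error for MicroformatError {}

    /// Alias that lets extractors write `Result<T>`.
    pub type Result<T> = std::result::Result<T, MicroformatError>;
}

/// Stand-in extractor with the same signature shape as `extract()`.
/// It only counts `property=` occurrences; it is not a real extractor.
pub fn extract(html: &str, base_url: Option<&str>) -> errors::Result<usize> {
    if let Some(base) = base_url {
        // Assumed rule for the sketch: a base URL must contain a scheme separator.
        if !base.contains("://") {
            return Err(errors::MicroformatError::InvalidUrl(base.to_string()));
        }
    }
    Ok(html.matches("property=").count())
}
```

Because the error type implements `Display`, the `e.to_string()` call used in the Python binding pattern works without extra glue.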
#### 4. Python Binding Pattern
```rust
#[pyfunction]
#[pyo3(signature = (html, base_url=None))]
fn extract_{format}(py: Python, html: &str, base_url: Option<&str>) -> PyResult<Py<PyDict>> {
    let result = extractors::{format}::extract(html, base_url)
        .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(e.to_string()))?;
    Ok(result.to_py_dict(py))
}
```
### SOLID Compliance Review
| Principle | Status | Notes |
|-----------|--------|-------|
| **Single Responsibility** | PASS | Each extractor handles one format |
| **Open/Closed** | PASS | New formats added via new modules |
| **Liskov Substitution** | PASS | All extractors follow same interface |
| **Interface Segregation** | PARTIAL | `Extractor` trait in common.rs unused |
| **Dependency Inversion** | PASS | Depends on abstractions (Result, html_utils) |
### DRY Analysis
| Concern | Status | Location |
|---------|--------|----------|
| URL resolution | Centralized | `common::url_utils::resolve_url` |
| HTML parsing | Centralized | `common::html_utils::parse_html` |
| Selector creation | Centralized | `common::html_utils::create_selector` |
| Text extraction | Centralized | `common::html_utils::extract_text` |
| Attribute extraction | Centralized | `common::html_utils::get_attr` |
| Error types | Centralized | `errors::MicroformatError` |
---
## RDFa Implementation Design
### Overview
RDFa (Resource Description Framework in Attributes) is a W3C standard for embedding structured data in HTML using attributes:
- `vocab` - Vocabulary namespace (e.g., "https://schema.org/")
- `typeof` - Type of the resource (equivalent to `@type`)
- `property` - Property name
- `resource` - URI reference
- `about` - Subject URI
- `prefix` - Custom prefix definitions
- `content` - Override content value
- `datatype` - Data type specification
### Architecture Design
```
src/
  types/
    rdfa.rs                 # RDFa type definitions
  extractors/
    rdfa/
      mod.rs                # Main extraction logic
      value_resolver.rs     # Property value resolution
      prefix_handler.rs     # Prefix/CURIE handling
      tests.rs              # Test suite
```
### Type Definitions
```rust
// src/types/rdfa.rs
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

/// RDFa item representing a resource with properties
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct RdfaItem {
    /// The resource type(s) from typeof attribute
    #[serde(rename = "type")]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub type_of: Option<Vec<String>>,
    /// The subject URI from about attribute
    #[serde(skip_serializing_if = "Option::is_none")]
    pub about: Option<String>,
    /// The vocabulary namespace
    #[serde(skip_serializing_if = "Option::is_none")]
    pub vocab: Option<String>,
    /// Properties extracted from property attributes
    /// Key: property name, Value: array of property values
    #[serde(flatten)]
    pub properties: HashMap<String, Vec<RdfaValue>>,
}

/// Value of an RDFa property
///
/// NOTE: with `untagged`, `Literal` and `Resource` both serialize as plain
/// strings, so a serialized `Resource` deserializes back as `Literal`.
/// Round-trip tests need to account for this.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(untagged)]
pub enum RdfaValue {
    /// Literal text value
    Literal(String),
    /// URI reference
    Resource(String),
    /// Nested RDFa item
    Item(Box<RdfaItem>),
    /// Typed literal with datatype
    TypedLiteral {
        value: String,
        datatype: String,
    },
}
```
### Extraction Algorithm
1. **Parse HTML document**
2. **Collect prefix definitions** from `prefix` attributes
3. **Find root RDFa elements** (elements with `typeof` or `vocab`)
4. **For each root element**:
   a. Extract `vocab`, `typeof`, `about` attributes
   b. Resolve CURIE prefixes
   c. Traverse descendants for `property` attributes
   d. For each property (attribute priority as in Appendix A):
      - Check for nested `typeof` (nested item)
      - Check for `content` attribute (override)
      - Check for `resource`/`href`/`src` (URI value)
      - Fall back to text content
5. **Return list of RdfaItem**
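The traversal in steps 3 to 5 can be sketched over an in-memory tree. This is an illustration under stated assumptions, not the planned implementation: `Node`, `Value`, `Item`, and the `el` builder are hypothetical stand-ins for the parsed DOM and for `RdfaValue`/`RdfaItem`. CURIE expansion (step 4b) and `datatype` are left out, value resolution follows the attribute priority from Appendix A (`content` before URI attributes), and a `vocab`-only element is treated as context for its descendants rather than as an item of its own, which is one possible reading of step 3.

```rust
use std::collections::HashMap;

/// Hypothetical pre-parsed element; real code would walk the parsed DOM instead.
#[derive(Debug, Clone, Default)]
pub struct Node {
    pub attrs: HashMap<String, String>,
    pub text: String,
    pub children: Vec<Node>,
}

/// Simplified stand-ins for `RdfaValue` and `RdfaItem`.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Literal(String),
    Resource(String),
    Item(Item),
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct Item {
    pub types: Vec<String>,
    pub about: Option<String>,
    pub vocab: Option<String>,
    pub properties: HashMap<String, Vec<Value>>,
}

/// Small builder used only to keep examples short.
pub fn el(attrs: &[(&str, &str)], text: &str, children: Vec<Node>) -> Node {
    Node {
        attrs: attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
        text: text.to_string(),
        children,
    }
}

/// Steps 3 and 5: collect top-level items in document order.
pub fn extract_items(node: &Node, inherited_vocab: Option<&str>, out: &mut Vec<Item>) {
    let vocab = node.attrs.get("vocab").map(String::as_str).or(inherited_vocab);
    if node.attrs.contains_key("typeof") {
        out.push(extract_item(node, inherited_vocab));
        return; // descendants belong to this item
    }
    for child in &node.children {
        extract_items(child, vocab, out);
    }
}

/// Step 4a: read item-level attributes, inheriting `vocab` when absent.
fn extract_item(node: &Node, inherited_vocab: Option<&str>) -> Item {
    let vocab = node.attrs.get("vocab").map(String::as_str).or(inherited_vocab);
    let mut item = Item {
        types: node
            .attrs
            .get("typeof")
            .map(|t| t.split_whitespace().map(str::to_string).collect())
            .unwrap_or_default(),
        about: node.attrs.get("about").cloned(),
        vocab: vocab.map(str::to_string),
        properties: HashMap::new(),
    };
    for child in &node.children {
        collect_properties(child, vocab, &mut item);
    }
    item
}

/// Steps 4c and 4d: visit descendants, stopping at nested items.
fn collect_properties(node: &Node, vocab: Option<&str>, item: &mut Item) {
    if let Some(name) = node.attrs.get("property") {
        let uri_attr = ["resource", "href", "src"].iter().find_map(|a| node.attrs.get(*a));
        let value = if node.attrs.contains_key("typeof") {
            Value::Item(extract_item(node, vocab))
        } else if let Some(content) = node.attrs.get("content") {
            Value::Literal(content.clone())
        } else if let Some(uri) = uri_attr {
            Value::Resource(uri.clone())
        } else {
            Value::Literal(node.text.trim().to_string())
        };
        let nested = matches!(value, Value::Item(_));
        item.properties.entry(name.clone()).or_default().push(value);
        if nested {
            return; // the nested item owns its own descendants
        }
    }
    for child in &node.children {
        collect_properties(child, vocab, item);
    }
}
```

Building the Person markup from Appendix A with `el(...)` and calling `extract_items(&root, None, &mut items)` yields one item whose `address` property holds a nested `PostalAddress` item. Elements that carry `typeof` without `property` inside another item are not handled by this sketch.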
### SOLID Compliance
| Principle | Implementation |
|-----------|----------------|
| **SRP** | Separate modules for value resolution and prefix handling |
| **OCP** | Extensible property handlers via match patterns |
| **LSP** | RdfaValue enum provides substitutable value types |
| **ISP** | Clean public API with internal helpers |
| **DIP** | Uses common::html_utils abstractions |
---
## Web App Manifest Implementation Design
### Overview
Web App Manifest is a JSON file providing metadata for Progressive Web Apps:
- Discovered via `<link rel="manifest" href="...">`
- Contains: name, short_name, icons, display, start_url, theme_color, etc.
### Architecture Design
```
src/
  types/
    manifest.rs   # Manifest type definitions
  extractors/
    manifest/
      mod.rs      # Link discovery + JSON parsing
      tests.rs    # Test suite
```
### Type Definitions
```rust
// src/types/manifest.rs
use serde::{Deserialize, Serialize};

/// Web App Manifest metadata
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct WebAppManifest {
    /// Application name
    #[serde(skip_serializing_if = "Option::is_none")]
    pub name: Option<String>,
    /// Short name for home screen
    #[serde(skip_serializing_if = "Option::is_none")]
    pub short_name: Option<String>,
    /// Application description
    #[serde(skip_serializing_if = "Option::is_none")]
    pub description: Option<String>,
    /// Start URL when launched
    #[serde(skip_serializing_if = "Option::is_none")]
    pub start_url: Option<String>,
    /// Display mode (fullscreen, standalone, minimal-ui, browser)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub display: Option<String>,
    /// Screen orientation preference
    #[serde(skip_serializing_if = "Option::is_none")]
    pub orientation: Option<String>,
    /// Theme color for browser chrome
    #[serde(skip_serializing_if = "Option::is_none")]
    pub theme_color: Option<String>,
    /// Background color for splash screen
    #[serde(skip_serializing_if = "Option::is_none")]
    pub background_color: Option<String>,
    /// Application scope
    #[serde(skip_serializing_if = "Option::is_none")]
    pub scope: Option<String>,
    /// Primary language tag (e.g., "en-US")
    #[serde(skip_serializing_if = "Option::is_none")]
    pub lang: Option<String>,
    /// Text direction (ltr, rtl, auto)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub dir: Option<String>,
    /// Application icons
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub icons: Vec<ManifestIcon>,
    /// Related applications
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub related_applications: Vec<RelatedApplication>,
    /// Prefer related applications over web app
    #[serde(skip_serializing_if = "Option::is_none")]
    pub prefer_related_applications: Option<bool>,
    /// Application categories
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub categories: Vec<String>,
    /// Screenshots for app stores
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub screenshots: Vec<ManifestImage>,
    /// Shortcuts for quick actions
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub shortcuts: Vec<ManifestShortcut>,
    /// Unique identifier
    #[serde(skip_serializing_if = "Option::is_none")]
    pub id: Option<String>,
}

/// Manifest icon definition
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct ManifestIcon {
    pub src: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub sizes: Option<String>,
    #[serde(rename = "type", skip_serializing_if = "Option::is_none")]
    pub mime_type: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub purpose: Option<String>,
}

/// Related native application
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct RelatedApplication {
    pub platform: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub url: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub id: Option<String>,
}

/// Generic manifest image (for screenshots)
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct ManifestImage {
    pub src: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub sizes: Option<String>,
    #[serde(rename = "type", skip_serializing_if = "Option::is_none")]
    pub mime_type: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub label: Option<String>,
}

/// Shortcut definition
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct ManifestShortcut {
    pub name: String,
    pub url: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub short_name: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub description: Option<String>,
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub icons: Vec<ManifestIcon>,
}

/// Manifest link discovery result
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, Default)]
pub struct ManifestDiscovery {
    /// The manifest link URL (resolved)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub href: Option<String>,
    /// Parsed manifest content (if provided)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub manifest: Option<WebAppManifest>,
}
```
### Implementation Approach
The manifest extractor has two modes:
1. **Discovery Only** (from HTML): Extracts `<link rel="manifest">` URL
2. **Full Parse** (with manifest content): Parses manifest JSON
```rust
/// Discover manifest link from HTML
pub fn extract_link(html: &str, base_url: Option<&str>) -> Result<ManifestDiscovery>
/// Parse manifest JSON content
pub fn parse_manifest(json: &str, base_url: Option<&str>) -> Result<WebAppManifest>
/// Combined: discover link from HTML (manifest content provided separately)
pub fn extract(html: &str, base_url: Option<&str>) -> Result<ManifestDiscovery>
```
---
## Technical Debt Assessment
### Existing Technical Debt
| ID | Category | Severity | Description | Recommendation |
|----|----------|----------|-------------|----------------|
| TD-001 | Architecture | MEDIUM | `Extractor` trait in common.rs is defined but unused | Consider removing or implementing consistently |
| TD-002 | Testing | LOW | Some extractors lack edge case coverage | Add tests as part of new feature work |
| TD-003 | Documentation | LOW | API reference docs incomplete for some types | Update docs with new features |
### New Technical Debt to Avoid
| Risk | Mitigation |
|------|------------|
| Duplicating URL resolution logic | MUST use `common::url_utils::resolve_url` |
| Inconsistent error handling | MUST use `errors::MicroformatError` variants |
| Missing Python bindings | MUST implement `to_py_dict()` for all new types |
| Incomplete test coverage | MUST achieve 1.5x test/code ratio minimum |
---
## File Structure
### New Files to Create
```
src/
  types/
    rdfa.rs                 # NEW: RDFa type definitions
    manifest.rs             # NEW: Web App Manifest types
  extractors/
    rdfa/
      mod.rs                # NEW: RDFa extraction
      value_resolver.rs     # NEW: Property value resolution
      prefix_handler.rs     # NEW: Prefix/CURIE handling
      tests.rs              # NEW: RDFa tests
    manifest/
      mod.rs                # NEW: Manifest extraction
      tests.rs              # NEW: Manifest tests
```
### Files to Modify
```
src/
  types/
    mod.rs      # ADD: pub mod rdfa; pub mod manifest;
  extractors/
    mod.rs      # ADD: pub mod rdfa; pub mod manifest;
  lib.rs        # ADD: Python bindings, extract_all() integration
  errors.rs     # OPTIONAL: Add new error variants if needed
```
---
## Phase-by-Phase Implementation
### Phase 1: RDFa Core Implementation (5-7 days)
#### Phase 1.1: Type Definitions (1 day)
**Tasks:**
1. Create `src/types/rdfa.rs` with `RdfaItem`, `RdfaValue` types
2. Implement `Default`, `Clone`, `PartialEq`, `Serialize`, `Deserialize`
3. Implement `to_py_dict()` for all types
4. Add unit tests for type conversions
5. Register module in `src/types/mod.rs`
**Acceptance Criteria:**
- All types compile without warnings
- `to_py_dict()` produces valid Python dicts
- Unit tests pass for serialization round-trips
#### Phase 1.2: Basic Extraction (2 days)
**Tasks:**
1. Create `src/extractors/rdfa/mod.rs`
2. Implement basic `extract()` function
3. Handle `vocab`, `typeof`, `property` attributes
4. Extract literal property values
5. Register module in `src/extractors/mod.rs`
**Acceptance Criteria:**
- Extracts simple RDFa markup (vocab + typeof + property)
- Handles Schema.org vocabulary correctly
- Returns `Result<Vec<RdfaItem>>`
#### Phase 1.3: Advanced Features (2 days)
**Tasks:**
1. Implement prefix/CURIE handling
2. Implement `resource`, `about` attribute handling
3. Implement `content` attribute override
4. Implement `datatype` support
5. Handle nested RDFa items (typeof within typeof)
**Acceptance Criteria:**
- CURIE prefixes expand correctly
- Nested items extracted properly
- URI values resolved against base URL
#### Phase 1.4: Python Bindings (0.5 days)
**Tasks:**
1. Add `extract_rdfa()` function to `lib.rs`
2. Add to `extract_all()` aggregation
3. Register in PyO3 module
**Acceptance Criteria:**
- `meta_oxide.extract_rdfa(html)` works from Python
- `extract_all()` includes `rdfa` key when present
#### Phase 1.5: Testing (1-1.5 days)
**Tasks:**
1. Create `src/extractors/rdfa/tests.rs`
2. Unit tests for all extraction functions
3. Integration tests with real-world RDFa examples
4. Edge case tests (malformed markup, missing attributes)
5. Python binding tests
**Test Coverage Target:** 80+ tests
---
### Phase 2: Web App Manifest Implementation (3-4 days)
#### Phase 2.1: Type Definitions (0.5 days)
**Tasks:**
1. Create `src/types/manifest.rs`
2. Define `WebAppManifest`, `ManifestIcon`, etc.
3. Implement serde derives with proper renaming
4. Implement `to_py_dict()` for all types
5. Register module in `src/types/mod.rs`
**Acceptance Criteria:**
- All types derive required traits
- JSON deserialization works correctly
- Python conversion produces valid dicts
#### Phase 2.2: Link Discovery (0.5 days)
**Tasks:**
1. Create `src/extractors/manifest/mod.rs`
2. Implement `extract_link()` function
3. Find `<link rel="manifest">` tags
4. Resolve relative URLs
5. Register module in `src/extractors/mod.rs`
**Acceptance Criteria:**
- Discovers manifest links from HTML
- Handles relative URLs correctly
- Returns `ManifestDiscovery` with href
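One detail worth covering in link discovery is that `rel` holds a space-separated, ASCII case-insensitive token list, so an exact string comparison against `"manifest"` would miss values such as `rel="Manifest"`. The sketch below shows the matching step on already-extracted `(rel, href)` pairs. The function names are hypothetical, and pulling the attributes out of the HTML is assumed to be done with the existing `html_utils` helpers.

```rust
/// Returns true when the `rel` attribute value contains the `manifest` token.
pub fn is_manifest_rel(rel: &str) -> bool {
    rel.split_ascii_whitespace()
        .any(|token| token.eq_ignore_ascii_case("manifest"))
}

/// Picks the first link whose `rel` contains `manifest` and whose `href` is non-empty.
/// `links` holds `(rel, href)` pairs in document order.
pub fn first_manifest_href<'a>(links: &[(&str, &'a str)]) -> Option<&'a str> {
    links
        .iter()
        .find(|(rel, href)| is_manifest_rel(rel) && !href.trim().is_empty())
        .map(|(_, href)| *href)
}
```

The returned `href` is still relative at this point; resolving it against the base URL remains a separate step.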
#### Phase 2.3: JSON Parsing (1 day)
**Tasks:**
1. Implement `parse_manifest()` function
2. Handle all standard manifest fields
3. Resolve relative URLs in manifest (icons, start_url, etc.)
4. Handle missing/optional fields gracefully
**Acceptance Criteria:**
- Parses valid manifest.json correctly
- Handles partial manifests
- Resolves relative URLs
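To make the URL-resolution task concrete, here is a rough sketch of walking every URL-bearing field in a parsed manifest. The trimmed-down `Manifest`, `Icon`, and `Shortcut` types are stand-ins for the real ones, and the resolver is passed in as a closure so the sketch stays independent of the signature of `common::url_utils::resolve_url`, which is not shown in this plan. Note that manifest members are generally resolved relative to the manifest's own URL rather than the page URL.

```rust
/// Trimmed-down stand-ins for `ManifestIcon`/`ManifestImage`, `ManifestShortcut`
/// and `WebAppManifest`, keeping only URL-bearing fields.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Icon {
    pub src: String,
}

#[derive(Debug, Default, Clone, PartialEq)]
pub struct Shortcut {
    pub url: String,
    pub icons: Vec<Icon>,
}

#[derive(Debug, Default, Clone, PartialEq)]
pub struct Manifest {
    pub start_url: Option<String>,
    pub scope: Option<String>,
    pub icons: Vec<Icon>,
    pub screenshots: Vec<Icon>,
    pub shortcuts: Vec<Shortcut>,
}

/// Applies `resolve` to every URL field, in place.
/// `resolve` is a placeholder for the project's shared URL resolver.
pub fn resolve_manifest_urls<F>(manifest: &mut Manifest, resolve: F)
where
    F: Fn(&str) -> String,
{
    if let Some(url) = manifest.start_url.as_mut() {
        *url = resolve(url.as_str());
    }
    if let Some(url) = manifest.scope.as_mut() {
        *url = resolve(url.as_str());
    }
    for icon in manifest.icons.iter_mut().chain(manifest.screenshots.iter_mut()) {
        icon.src = resolve(&icon.src);
    }
    for shortcut in &mut manifest.shortcuts {
        shortcut.url = resolve(&shortcut.url);
        for icon in &mut shortcut.icons {
            icon.src = resolve(&icon.src);
        }
    }
}
```

Keeping the field walk in one function makes it harder to forget nested URLs such as shortcut icons when new manifest members are added.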
#### Phase 2.4: Python Bindings (0.5 days)
**Tasks:**
1. Add `extract_manifest()` for link discovery
2. Add `parse_manifest()` for JSON parsing
3. Add to `extract_all()` aggregation
4. Register in PyO3 module
**Acceptance Criteria:**
- Python API works as expected
- `extract_all()` includes `manifest` key
#### Phase 2.5: Testing (1-1.5 days)
**Tasks:**
1. Create `src/extractors/manifest/tests.rs`
2. Unit tests for link discovery
3. Unit tests for JSON parsing
4. Real-world manifest examples
5. Edge cases and error handling
**Test Coverage Target:** 50+ tests
---
### Phase 3: Integration and Documentation (2-3 days)
#### Phase 3.1: Integration Testing (1 day)
**Tasks:**
1. Integration tests with combined formats
2. Performance benchmarking
3. Memory usage verification
4. Python integration tests
#### Phase 3.2: Documentation (1 day)
**Tasks:**
1. Update README.md with new features
2. Add API reference for new functions
3. Create usage examples
4. Update ROADMAP.md
#### Phase 3.3: Final Review (0.5-1 day)
**Tasks:**
1. Run full test suite
2. Run clippy with all features
3. Review for SOLID compliance
4. Address any remaining TODOs
5. Version bump preparation
---
## Testing Strategy
### Test Categories
| Category | Description | Target Count |
|----------|-------------|--------------|
| **Unit Tests** | Individual function testing | 80+ |
| **Integration Tests** | Combined format extraction | 20+ |
| **Edge Case Tests** | Malformed input, boundaries | 30+ |
| **Real-World Tests** | Actual website examples | 15+ |
| **Python Tests** | Binding verification | 10+ |
### RDFa Test Scenarios
```rust
// Basic extraction
#[test] fn test_rdfa_basic_vocab_typeof_property()
#[test] fn test_rdfa_schema_org_person()
#[test] fn test_rdfa_schema_org_product()
// Prefix handling
#[test] fn test_rdfa_prefix_definition()
#[test] fn test_rdfa_curie_expansion()
#[test] fn test_rdfa_default_vocab()
// Value types
#[test] fn test_rdfa_literal_value()
#[test] fn test_rdfa_resource_value()
#[test] fn test_rdfa_content_override()
#[test] fn test_rdfa_datatype_handling()
// Nested items
#[test] fn test_rdfa_nested_typeof()
#[test] fn test_rdfa_deeply_nested()
// Edge cases
#[test] fn test_rdfa_empty_html()
#[test] fn test_rdfa_no_rdfa_markup()
#[test] fn test_rdfa_malformed_attributes()
#[test] fn test_rdfa_missing_vocab()
// Real-world
#[test] fn test_rdfa_wikipedia_example()
#[test] fn test_rdfa_government_site()
```
### Manifest Test Scenarios
```rust
// Link discovery
#[test] fn test_manifest_link_discovery()
#[test] fn test_manifest_relative_url()
#[test] fn test_manifest_no_link()
// JSON parsing
#[test] fn test_manifest_full_parse()
#[test] fn test_manifest_minimal()
#[test] fn test_manifest_icons_array()
#[test] fn test_manifest_shortcuts()
// URL resolution
#[test] fn test_manifest_resolve_icon_urls()
#[test] fn test_manifest_resolve_start_url()
// Edge cases
#[test] fn test_manifest_invalid_json()
#[test] fn test_manifest_empty_json()
#[test] fn test_manifest_extra_fields()
```
---
## Risk Assessment
### High Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| RDFa spec complexity | High | High | Start with subset, expand iteratively |
| Nested RDFa edge cases | Medium | High | Extensive test coverage |
| Performance with large documents | Medium | Medium | Benchmark and optimize |
### Medium Risks
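| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Manifest spec evolution | Medium | Medium | serde ignores unknown fields by default; add a `#[serde(flatten)]` map field if unknown members must be preserved |
| Python binding edge cases | Low | Medium | Comprehensive Python tests |
| Integration conflicts | Low | Medium | Incremental integration |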
### Low Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Breaking existing tests | Low | Low | Run full suite frequently |
| Documentation gaps | Medium | Low | Document while implementing |
---
## Detailed Todo List
### Phase 1: RDFa Implementation
```
## Phase 1.1: RDFa Type Definitions (1 day)
- [ ] (M) Create src/types/rdfa.rs with RdfaItem struct
- [ ] (S) Add RdfaValue enum (Literal, Resource, Item, TypedLiteral)
- [ ] (S) Implement Default trait for RdfaItem
- [ ] (S) Implement Clone, PartialEq, Debug for all types
- [ ] (M) Implement Serialize/Deserialize with serde
- [ ] (M) Implement to_py_dict() for RdfaItem
- [ ] (M) Implement to_py_dict() for RdfaValue
- [ ] (S) Add module to src/types/mod.rs
- [ ] (S) Unit tests for RdfaItem creation
- [ ] (S) Unit tests for RdfaValue variants
- [ ] (M) Unit tests for serde round-trip
- [ ] (M) Unit tests for to_py_dict()
## Phase 1.2: RDFa Basic Extraction (2 days)
- [ ] (M) Create src/extractors/rdfa/mod.rs
- [ ] (M) Implement extract() function signature
- [ ] (M) Implement find_rdfa_roots() helper
- [ ] (L) Implement extract_item() for single root
- [ ] (M) Implement extract_properties() for descendants
- [ ] (M) Implement extract_property_value() helper
- [ ] (S) Handle vocab attribute
- [ ] (S) Handle typeof attribute (single and multiple)
- [ ] (M) Handle property attribute
- [ ] (S) Handle about attribute
- [ ] (S) Add module to src/extractors/mod.rs
- [ ] (M) Tests: basic vocab + typeof + property
- [ ] (M) Tests: Schema.org Person
- [ ] (M) Tests: Schema.org Article
- [ ] (S) Tests: multiple items on page
## Phase 1.3: RDFa Advanced Features (2 days)
- [ ] (L) Implement prefix_handler module
- [ ] (M) Parse prefix attribute syntax
- [ ] (M) Implement CURIE expansion
- [ ] (M) Handle default prefixes (schema:, foaf:, dc:)
- [ ] (M) Implement resource attribute handling
- [ ] (S) Implement content attribute override
- [ ] (M) Implement datatype attribute support
- [ ] (L) Implement nested item extraction
- [ ] (M) Handle property scope correctly
- [ ] (M) URL resolution for resource values
- [ ] (M) Tests: prefix definitions
- [ ] (M) Tests: CURIE expansion
- [ ] (M) Tests: nested typeof
- [ ] (M) Tests: content override
- [ ] (S) Tests: datatype handling
## Phase 1.4: RDFa Python Bindings (0.5 days)
- [ ] (M) Add extract_rdfa() pyfunction to lib.rs
- [ ] (M) Add proper docstring with examples
- [ ] (M) Add to pymodule registration
- [ ] (M) Add to extract_all() aggregation
- [ ] (S) Test: Python import and basic call
- [ ] (S) Test: Python with base_url
- [ ] (S) Test: extract_all() includes rdfa
## Phase 1.5: RDFa Testing (1-1.5 days)
- [ ] (M) Create src/extractors/rdfa/tests.rs
- [ ] (S) Tests: empty HTML
- [ ] (S) Tests: no RDFa markup
- [ ] (M) Tests: malformed typeof
- [ ] (M) Tests: missing vocab
- [ ] (M) Tests: deeply nested (5+ levels)
- [ ] (M) Tests: multiple vocabs on page
- [ ] (L) Tests: real-world Wikipedia example
- [ ] (L) Tests: real-world government site
- [ ] (M) Tests: Unicode in values
- [ ] (S) Tests: HTML entities
- [ ] (M) Verify test count >= 80
```
### Phase 2: Web App Manifest Implementation
```
## Phase 2.1: Manifest Type Definitions (0.5 days)
- [ ] (M) Create src/types/manifest.rs
- [ ] (M) Define WebAppManifest struct
- [ ] (S) Define ManifestIcon struct
- [ ] (S) Define RelatedApplication struct
- [ ] (S) Define ManifestImage struct
- [ ] (S) Define ManifestShortcut struct
- [ ] (S) Define ManifestDiscovery struct
- [ ] (M) Implement serde derives with proper field renaming
- [ ] (M) Implement to_py_dict() for WebAppManifest
- [ ] (S) Implement to_py_dict() for sub-types
- [ ] (S) Add module to src/types/mod.rs
- [ ] (M) Unit tests for JSON deserialization
- [ ] (M) Unit tests for to_py_dict()
## Phase 2.2: Manifest Link Discovery (0.5 days)
- [ ] (M) Create src/extractors/manifest/mod.rs
- [ ] (M) Implement extract_link() function
- [ ] (S) Find <link rel="manifest"> tags
- [ ] (M) Handle relative URL resolution
- [ ] (S) Return ManifestDiscovery
- [ ] (S) Add module to src/extractors/mod.rs
- [ ] (S) Tests: link discovery basic
- [ ] (S) Tests: relative URL
- [ ] (S) Tests: no manifest link
## Phase 2.3: Manifest JSON Parsing (1 day)
- [ ] (M) Implement parse_manifest() function
- [ ] (M) Parse all standard fields
- [ ] (M) Handle icons array with URL resolution
- [ ] (M) Handle shortcuts array
- [ ] (M) Handle related_applications array
- [ ] (M) Handle screenshots array
- [ ] (S) Handle optional fields gracefully
- [ ] (M) Resolve relative URLs in manifest
- [ ] (M) Tests: full manifest parse
- [ ] (M) Tests: minimal manifest
- [ ] (S) Tests: icons only
- [ ] (S) Tests: shortcuts
- [ ] (S) Tests: invalid JSON
## Phase 2.4: Manifest Python Bindings (0.5 days)
- [ ] (M) Add extract_manifest() pyfunction
- [ ] (M) Add parse_manifest() pyfunction
- [ ] (M) Add to pymodule registration
- [ ] (M) Add to extract_all() aggregation
- [ ] (S) Test: Python extract_manifest()
- [ ] (S) Test: Python parse_manifest()
- [ ] (S) Test: extract_all() includes manifest
## Phase 2.5: Manifest Testing (1-1.5 days)
- [ ] (M) Create src/extractors/manifest/tests.rs
- [ ] (S) Tests: empty manifest
- [ ] (S) Tests: extra unknown fields
- [ ] (M) Tests: all icon sizes
- [ ] (M) Tests: related applications
- [ ] (L) Tests: real-world PWA manifest
- [ ] (M) Tests: Twitter PWA manifest
- [ ] (S) Tests: malformed JSON recovery
- [ ] (M) Verify test count >= 50
```
### Phase 3: Integration and Documentation
```
## Phase 3.1: Integration Testing (1 day)
- [ ] (M) Integration test: RDFa + JSON-LD on same page
- [ ] (M) Integration test: Manifest + Open Graph
- [ ] (M) Integration test: All formats combined
- [ ] (L) Performance benchmark with large HTML
- [ ] (M) Memory usage verification
- [ ] (M) Python integration tests
## Phase 3.2: Documentation (1 day)
- [ ] (M) Update README.md with RDFa section
- [ ] (M) Update README.md with Manifest section
- [ ] (M) Add RDFa examples to docs/examples.md
- [ ] (M) Add Manifest examples to docs/examples.md
- [ ] (M) Update docs/api-reference.md
- [ ] (S) Update ROADMAP.md to mark complete
- [ ] (S) Update CHANGELOG.md
## Phase 3.3: Final Review (0.5-1 day)
- [ ] (S) Run cargo test --all-features
- [ ] (S) Run cargo clippy --all-features
- [ ] (S) Run cargo doc --no-deps
- [ ] (M) Review all new code for SOLID compliance
- [ ] (S) Check for any remaining TODO comments
- [ ] (S) Prepare version bump (0.1.0)
```
---
## Success Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| Test count | 150+ new tests | `cargo test -- --list \| grep -c ': test$'` |
| Test coverage | 85%+ | Coverage tool |
| Clippy warnings | 0 | `cargo clippy --all-features` |
| Doc coverage | 100% public items | `cargo doc --no-deps` with the `missing_docs` lint enabled |
| Performance | <10ms for typical page | Benchmarks |
---
## Agent Assignment Guide
### For Staff Rust Engineer
**Primary Tasks:**
- Phase 1.1: Type Definitions
- Phase 1.2-1.3: Core Extraction Logic
- Phase 2.1-2.3: Manifest Implementation
- Phase 3.3: Final Review
**Key Files:**
- `src/types/rdfa.rs`
- `src/types/manifest.rs`
- `src/extractors/rdfa/mod.rs`
- `src/extractors/manifest/mod.rs`
### For Staff Python Engineer
**Primary Tasks:**
- Phase 1.4: RDFa Python Bindings
- Phase 2.4: Manifest Python Bindings
- All Python integration tests
**Key Files:**
- `src/lib.rs` (Python binding functions)
- Python test files
### For Technical Writer
**Primary Tasks:**
- Phase 3.2: Documentation
- README updates
- API reference updates
**Key Files:**
- `README.md`
- `docs/examples.md`
- `docs/api-reference.md`
- `CHANGELOG.md`
---
## Appendix A: RDFa Specification Reference
### Attribute Priority
1. `content` attribute (highest priority)
2. `resource`, `href`, `src` attributes (for URIs)
3. Element text content (lowest priority)
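The priority list above can be expressed as a single resolution function. This is a sketch under assumptions: `PropertyAttrs` is a hypothetical bundle of the attributes read from one `property` element, the enum is a cut-down copy of `RdfaValue`, and applying `datatype` only to literals (never to URI values) is a choice made for this example that the plan does not specify.

```rust
/// Cut-down copy of `RdfaValue` without the nested-item variant.
#[derive(Debug, Clone, PartialEq)]
pub enum RdfaValue {
    Literal(String),
    Resource(String),
    TypedLiteral { value: String, datatype: String },
}

/// Hypothetical bundle of attributes read from one `property` element.
#[derive(Debug, Default, Clone)]
pub struct PropertyAttrs<'a> {
    pub content: Option<&'a str>,
    pub datatype: Option<&'a str>,
    pub resource: Option<&'a str>,
    pub href: Option<&'a str>,
    pub src: Option<&'a str>,
    pub text: &'a str,
}

pub fn resolve_value(attrs: &PropertyAttrs<'_>) -> RdfaValue {
    // Wrap a literal, attaching the datatype when one is present (assumption).
    let literal = |value: &str| match attrs.datatype {
        Some(datatype) => RdfaValue::TypedLiteral {
            value: value.to_string(),
            datatype: datatype.to_string(),
        },
        None => RdfaValue::Literal(value.to_string()),
    };

    // 1. `content` has the highest priority.
    if let Some(content) = attrs.content {
        return literal(content);
    }
    // 2. URI-bearing attributes, tried in the listed order.
    if let Some(uri) = attrs.resource.or(attrs.href).or(attrs.src) {
        return RdfaValue::Resource(uri.to_string());
    }
    // 3. Fall back to the element's text content.
    literal(attrs.text.trim())
}
```

For instance, an element with both `content="2025-11-25"` and an `href` resolves to the literal `2025-11-25`, because `content` is checked first.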
### Common Prefixes
| Prefix | Namespace |
|--------|-----------|
| `schema` | https://schema.org/ |
| `foaf` | http://xmlns.com/foaf/0.1/ |
| `dc` | http://purl.org/dc/terms/ |
| `og` | http://ogp.me/ns# |
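As an illustration of how the prefix table could feed CURIE expansion, the sketch below parses a `prefix` attribute value and expands terms against the resulting map. All function names are hypothetical, and treating a term with an unknown prefix as an already-absolute IRI is a simplifying assumption rather than full RDFa processing.

```rust
use std::collections::HashMap;

/// The prefixes from the table above, as a starting map.
pub fn default_prefixes() -> HashMap<String, String> {
    [
        ("schema", "https://schema.org/"),
        ("foaf", "http://xmlns.com/foaf/0.1/"),
        ("dc", "http://purl.org/dc/terms/"),
        ("og", "http://ogp.me/ns#"),
    ]
    .iter()
    .map(|(prefix, ns)| (prefix.to_string(), ns.to_string()))
    .collect()
}

/// Parses a `prefix` attribute such as `"ex: http://example.org/ns# foaf: http://xmlns.com/foaf/0.1/"`.
pub fn parse_prefixes(attr: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    let mut tokens = attr.split_whitespace();
    while let Some(token) = tokens.next() {
        // Each mapping is a `name:` token followed by an IRI token.
        if let Some(name) = token.strip_suffix(':') {
            if let Some(iri) = tokens.next() {
                map.insert(name.to_string(), iri.to_string());
            }
        }
    }
    map
}

/// Expands `prefix:reference` CURIEs; bare terms are joined to `vocab` when present.
pub fn expand_term(term: &str, prefixes: &HashMap<String, String>, vocab: Option<&str>) -> String {
    if let Some((prefix, reference)) = term.split_once(':') {
        return match prefixes.get(prefix) {
            Some(ns) => format!("{}{}", ns, reference),
            // Unknown prefix: assume the term is already an absolute IRI.
            None => term.to_string(),
        };
    }
    match vocab {
        Some(v) => format!("{}{}", v, term),
        None => term.to_string(),
    }
}
```

Page-level declarations can be layered over the defaults with `prefixes.extend(parse_prefixes(attr))`, so a local definition overrides a default with the same prefix.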
### Example RDFa Markup
```html
<div vocab="https://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
  <span property="jobTitle">Professor</span>
  <a property="url" href="https://example.com/jane">Homepage</a>
  <div property="address" typeof="PostalAddress">
    <span property="streetAddress">123 Main St</span>
    <span property="addressLocality">Springfield</span>
  </div>
</div>
```
---
## Appendix B: Web App Manifest Reference
### Example Manifest
```json
{
  "name": "My Progressive Web App",
  "short_name": "MyPWA",
  "description": "A sample PWA",
  "start_url": "/",
  "display": "standalone",
  "theme_color": "#3367D6",
  "background_color": "#FFFFFF",
  "icons": [
    {
      "src": "/icons/icon-192.png",
      "sizes": "192x192",
      "type": "image/png"
    },
    {
      "src": "/icons/icon-512.png",
      "sizes": "512x512",
      "type": "image/png"
    }
  ],
  "shortcuts": [
    {
      "name": "New Item",
      "url": "/new",
      "icons": [{"src": "/icons/new.png", "sizes": "96x96"}]
    }
  ]
}
```
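The plan keeps `sizes` as a raw string, which matches the manifest JSON. Callers that need numeric dimensions could parse it separately; the helper below is a sketch of that, not part of the planned API. `IconSize` and `parse_sizes` are hypothetical names. The value is treated as a space-separated list where each token is either `any` or `WIDTHxHEIGHT`, and tokens that fit neither form are skipped.

```rust
/// One entry from an icon's `sizes` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IconSize {
    /// The `any` keyword (for example, scalable icons).
    Any,
    /// An explicit pixel size such as `192x192`.
    Pixels { width: u32, height: u32 },
}

/// Parses a `sizes` string, silently dropping malformed tokens.
pub fn parse_sizes(sizes: &str) -> Vec<IconSize> {
    sizes
        .split_whitespace()
        .filter_map(|token| {
            if token.eq_ignore_ascii_case("any") {
                return Some(IconSize::Any);
            }
            // Accept both `x` and `X` as the separator.
            let lower = token.to_ascii_lowercase();
            let (width, height) = lower.split_once('x')?;
            Some(IconSize::Pixels {
                width: width.parse().ok()?,
                height: height.parse().ok()?,
            })
        })
        .collect()
}
```

With the example manifest above, `parse_sizes("192x192")` gives a single `Pixels { width: 192, height: 192 }` entry.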
---
**Document End**