meta_oxide 0.1.1

Universal metadata extraction library supporting 13 formats (HTML Meta, Open Graph, Twitter Cards, JSON-LD, Microdata, Microformats, RDFa, Dublin Core, Web App Manifest, oEmbed, rel-links, Images, SEO) with 7 language bindings
# Architecture

This document describes the architecture and design decisions of MetaOxide.

## Overview

MetaOxide is a Rust library with Python bindings (via PyO3) for extracting microformats from HTML. Parsing, data types, error handling, and per-format extractors live in separate modules, so a new microformat type can be added without changing the existing ones.

## Component Architecture

```
┌─────────────────────────────────────────────┐
│           Python Interface (PyO3)           │
│  extract_microformats(), extract_hcard()   │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│              Core Library                   │
│  ┌───────────┐  ┌──────────┐  ┌──────────┐ │
│  │  Parser   │  │  Types   │  │  Errors  │ │
│  └─────┬─────┘  └──────────┘  └──────────┘ │
│        │                                    │
│  ┌─────▼─────────────────────────────────┐ │
│  │         Extractors Module             │ │
│  │  ┌────────┐ ┌─────────┐ ┌──────────┐ │ │
│  │  │ h-card │ │ h-entry │ │ h-event  │ │ │
│  │  └────────┘ └─────────┘ └──────────┘ │ │
│  └───────────────────────────────────────┘ │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│         External Dependencies               │
│   scraper, pyo3, serde, url, thiserror     │
└─────────────────────────────────────────────┘
```

## Core Components

### 1. Parser Module (`src/parser.rs`)

The parser module is responsible for:
- Parsing HTML using the `scraper` crate
- Identifying microformat root classes (h-*)
- Extracting properties (p-*, u-*, dt-*, e-*)
- Resolving relative URLs
- Building the microformat data structure

**Key Functions:**
- `parse_html()`: Main entry point for parsing
- `parse_microformat_item()`: Extracts a single microformat item
- `extract_properties()`: Recursively extracts properties
- `resolve_url()`: Resolves relative URLs against a base URL
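
The prefix-based identification step described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/parser.rs`: the names `ClassKind`, `classify_class`, and `classify_attr` are hypothetical, and the name check is a simplification of the microformats2 rules (lowercase letters and hyphens only, no vendor prefixes).

```rust
/// Kind of a class name, decided by its microformats2 prefix.
/// Hypothetical type for illustration; not part of MetaOxide's API.
#[derive(Debug, PartialEq, Eq)]
pub enum ClassKind<'a> {
    Root(&'a str),     // h-*  (keeps the full class, e.g. "h-card")
    Text(&'a str),     // p-*  (keeps the name after the prefix)
    Url(&'a str),      // u-*
    DateTime(&'a str), // dt-*
    Embedded(&'a str), // e-*
    Other,
}

/// Simplified name check: starts with a lowercase letter and contains
/// only lowercase letters and hyphens. This rejects utility classes
/// such as "h-4" or "p-2" that merely look like microformat classes.
fn is_valid_name(name: &str) -> bool {
    name.starts_with(|c: char| c.is_ascii_lowercase())
        && name.chars().all(|c| c.is_ascii_lowercase() || c == '-')
}

/// Classify a single class name by splitting at the first hyphen.
pub fn classify_class(class: &str) -> ClassKind<'_> {
    let (prefix, name) = match class.split_once('-') {
        Some(parts) => parts,
        None => return ClassKind::Other,
    };
    if !is_valid_name(name) {
        return ClassKind::Other;
    }
    match prefix {
        "h" => ClassKind::Root(class),
        "p" => ClassKind::Text(name),
        "u" => ClassKind::Url(name),
        "dt" => ClassKind::DateTime(name),
        "e" => ClassKind::Embedded(name),
        _ => ClassKind::Other,
    }
}

/// Classify every class in a `class` attribute, dropping unrelated ones.
pub fn classify_attr(attr: &str) -> Vec<ClassKind<'_>> {
    attr.split_whitespace()
        .map(classify_class)
        .filter(|kind| *kind != ClassKind::Other)
        .collect()
}
```

All returned names borrow from the input string, so classifying an attribute allocates only the result vector.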

### 2. Types Module (`src/types.rs`)

Defines the data structures used throughout the library:

**Generic Types:**
- `MicroformatItem`: Generic microformat representation
- `PropertyValue`: Enum for different property value types (text, URL, nested)

**Specific Microformat Types:**
- `HCard`: Contact information
- `HEntry`: Blog posts and articles
- `HEvent`: Events

All types implement:
- `Serialize`/`Deserialize` for JSON conversion
- `to_py_dict()` for Python interop via PyO3

### 3. Extractors Module (`src/extractors/`)

Specialized extractors for each microformat type:

- **hcard.rs**: Extracts h-card microformats (contact info)
- **hentry.rs**: Extracts h-entry microformats (blog posts)
- **hevent.rs**: Extracts h-event microformats (events)

Each extractor:
1. Uses CSS selectors to find relevant elements
2. Extracts specific properties for that microformat type
3. Handles nested microformats (e.g., author h-card in h-entry)
4. Returns strongly-typed Rust structs
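
As a rough sketch of steps 3 and 4, the following converts a generic item into typed structs and resolves a nested author h-card inside an h-entry. The field layouts of `MicroformatItem`, `HCard`, and `HEntry` and the functions `to_hcard` and `to_hentry` are assumptions made for illustration; the real definitions in `src/types.rs` and `src/extractors/` may differ.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the library's generic types (assumed layout).
#[derive(Debug, Clone, PartialEq)]
pub enum PropertyValue {
    Text(String),
    Url(String),
    Nested(Box<MicroformatItem>),
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct MicroformatItem {
    pub types: Vec<String>,
    pub properties: HashMap<String, Vec<PropertyValue>>,
}

// Hypothetical typed results with only a few fields each.
#[derive(Debug, PartialEq, Default)]
pub struct HCard {
    pub name: Option<String>,
    pub url: Option<String>,
}

#[derive(Debug, PartialEq, Default)]
pub struct HEntry {
    pub name: Option<String>,
    pub author: Option<HCard>,
}

impl MicroformatItem {
    /// First value of a property, if any.
    fn first(&self, key: &str) -> Option<&PropertyValue> {
        self.properties.get(key)?.first()
    }

    /// First value of a property as a string; nested items yield `None`.
    fn first_string(&self, key: &str) -> Option<String> {
        match self.first(key)? {
            PropertyValue::Text(s) | PropertyValue::Url(s) => Some(s.clone()),
            PropertyValue::Nested(_) => None,
        }
    }
}

/// Build an `HCard` if the item is typed `h-card`.
pub fn to_hcard(item: &MicroformatItem) -> Option<HCard> {
    if !item.types.iter().any(|t| t == "h-card") {
        return None;
    }
    Some(HCard {
        name: item.first_string("name"),
        url: item.first_string("url"),
    })
}

/// Build an `HEntry`, converting a nested author h-card when present.
pub fn to_hentry(item: &MicroformatItem) -> Option<HEntry> {
    if !item.types.iter().any(|t| t == "h-entry") {
        return None;
    }
    let author = match item.first("author") {
        Some(PropertyValue::Nested(inner)) => to_hcard(inner),
        _ => None,
    };
    Some(HEntry {
        name: item.first_string("name"),
        author,
    })
}
```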

### 4. Error Handling (`src/errors.rs`)

Centralized error handling using `thiserror`:

```rust
pub enum MicroformatError {
    ParseError(String),
    InvalidUrl(url::ParseError),
    MissingProperty(String),
    InvalidStructure(String),
    ExtractionFailed(String),
}
```

Errors are converted to Python exceptions when crossing the FFI boundary.
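
For readers unfamiliar with `thiserror`, the sketch below hand-writes roughly what its derive would provide: a `Display` message per variant plus a `std::error::Error` impl. It is dependency-free, so `InvalidUrl` carries a `String` instead of `url::ParseError`; the message wording and the `require_property` helper are hypothetical and not taken from `src/errors.rs`.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
pub enum MicroformatError {
    ParseError(String),
    InvalidUrl(String), // stand-in for url::ParseError in this sketch
    MissingProperty(String),
    InvalidStructure(String),
    ExtractionFailed(String),
}

// Hand-written equivalent of per-variant `#[error("...")]` attributes.
// The message texts are assumptions.
impl fmt::Display for MicroformatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ParseError(msg) => write!(f, "HTML parse error: {}", msg),
            Self::InvalidUrl(msg) => write!(f, "invalid URL: {}", msg),
            Self::MissingProperty(name) => {
                write!(f, "missing required property: {}", name)
            }
            Self::InvalidStructure(msg) => write!(f, "invalid structure: {}", msg),
            Self::ExtractionFailed(msg) => write!(f, "extraction failed: {}", msg),
        }
    }
}

impl Error for MicroformatError {}

/// Hypothetical helper: look up a property or report which one is missing.
pub fn require_property<'a>(
    props: &[(&'a str, &'a str)],
    name: &str,
) -> Result<&'a str, MicroformatError> {
    props
        .iter()
        .find(|(key, _)| *key == name)
        .map(|(_, value)| *value)
        .ok_or_else(|| MicroformatError::MissingProperty(name.to_string()))
}
```

Because the enum implements `std::error::Error`, callers can propagate it with `?` and a binding layer can turn the `Display` text into an exception message.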

### 5. Python Interface (`src/lib.rs`)

PyO3 bindings that expose Rust functionality to Python:

- `extract_microformats()`: Extract all microformats
- `extract_hcard()`: Extract h-card only
- `extract_hentry()`: Extract h-entry only
- `extract_hevent()`: Extract h-event only

## Design Decisions

### Why Rust + PyO3?

1. **Performance**: Rust compiles to native code, so HTML parsing runs at speeds comparable to C
2. **Safety**: Memory safety without garbage collection
3. **Python Integration**: PyO3 makes it easy to create Python packages
4. **Type Safety**: Strong typing catches errors at compile time

### Modular Extractor Design

Each microformat type has its own extractor module because:

1. **Separation of Concerns**: Each format has unique properties
2. **Maintainability**: Easy to add new formats without affecting existing ones
3. **Testing**: Can test each format independently
4. **Extensibility**: Leaves room for user-defined extractors (see Planned Features)
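
One way to picture this modularity is a shared trait with one implementation per format. The sketch below is an assumption about how such a design could look, not MetaOxide's actual interface: `FormatExtractor`, `Element`, `run_all`, and the property lists are all hypothetical, and `Element` is a toy stand-in for a parsed DOM node that only looks at direct children.

```rust
/// Toy stand-in for a parsed DOM element (hypothetical).
#[derive(Debug, Default)]
pub struct Element {
    pub classes: Vec<String>,
    pub text: String,
    pub children: Vec<Element>,
}

/// Hypothetical common interface implemented once per microformat type.
pub trait FormatExtractor {
    /// Root class this extractor handles, e.g. "h-card".
    fn root_class(&self) -> &'static str;
    /// Property classes this extractor collects.
    fn wanted(&self) -> &'static [&'static str];

    /// Default extraction: (class, text) pairs from direct children.
    fn extract(&self, root: &Element) -> Vec<(String, String)> {
        let mut out = Vec::new();
        for child in &root.children {
            for class in &child.classes {
                if self.wanted().contains(&class.as_str()) {
                    out.push((class.clone(), child.text.clone()));
                }
            }
        }
        out
    }
}

pub struct HCardExtractor;
impl FormatExtractor for HCardExtractor {
    fn root_class(&self) -> &'static str { "h-card" }
    fn wanted(&self) -> &'static [&'static str] { &["p-name", "u-url"] }
}

pub struct HEventExtractor;
impl FormatExtractor for HEventExtractor {
    fn root_class(&self) -> &'static str { "h-event" }
    fn wanted(&self) -> &'static [&'static str] { &["p-name", "dt-start"] }
}

/// Run every registered extractor over the top-level elements.
pub fn run_all(
    extractors: &[Box<dyn FormatExtractor>],
    doc: &[Element],
) -> Vec<(&'static str, Vec<(String, String)>)> {
    let mut results = Vec::new();
    for element in doc {
        for extractor in extractors {
            if element.classes.iter().any(|c| c == extractor.root_class()) {
                results.push((extractor.root_class(), extractor.extract(element)));
            }
        }
    }
    results
}
```

Under this arrangement, supporting another format means adding one more `impl FormatExtractor` and registering it; the existing implementations and their tests stay untouched.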

### CSS Selectors vs Manual Parsing

We use the `scraper` crate with CSS selectors because:

1. **Readability**: `.h-card` is clearer than manual DOM traversal
2. **Reliability**: `scraper` builds on `html5ever`, the HTML5 parser from the Servo project
3. **Standards-Compliant**: Follows CSS selector specifications
4. **Performance**: Selector matching is delegated to Servo's `selectors` crate

### Property Value Types

The `PropertyValue` enum has three variants:

```rust
pub enum PropertyValue {
    Text(String),      // p-* properties
    Url(String),       // u-* properties
    Nested(Box<MicroformatItem>),  // nested microformats
}
```

This design:
- Preserves type information
- Enables proper URL resolution
- Supports nested structures
- Converts cleanly to Python types
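
To illustrate how the variants keep type information through nesting, the sketch below walks an item recursively and flattens it into dotted paths tagged with the value kind. The `MicroformatItem` layout (a `BTreeMap` of property lists, chosen here for deterministic ordering) and the `flatten` function are assumptions for illustration only.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone)]
pub enum PropertyValue {
    Text(String),                 // p-* properties
    Url(String),                  // u-* properties
    Nested(Box<MicroformatItem>), // nested microformats
}

// Assumed layout; the real struct in src/types.rs may differ.
#[derive(Debug, Clone, Default)]
pub struct MicroformatItem {
    pub types: Vec<String>,
    pub properties: BTreeMap<String, Vec<PropertyValue>>,
}

/// Flatten an item into ("path", "kind:value") pairs, e.g.
/// ("author.name", "text:Jane Doe").
pub fn flatten(item: &MicroformatItem) -> Vec<(String, String)> {
    let mut out = Vec::new();
    walk(item, "", &mut out);
    out
}

fn walk(item: &MicroformatItem, prefix: &str, out: &mut Vec<(String, String)>) {
    for (key, values) in &item.properties {
        let path = if prefix.is_empty() {
            key.clone()
        } else {
            format!("{}.{}", prefix, key)
        };
        for value in values {
            match value {
                PropertyValue::Text(s) => out.push((path.clone(), format!("text:{}", s))),
                PropertyValue::Url(s) => out.push((path.clone(), format!("url:{}", s))),
                // Recurse into the nested item, extending the path.
                PropertyValue::Nested(inner) => walk(inner, &path, out),
            }
        }
    }
}
```

A converter to Python objects could follow the same recursion, mapping `Text` and `Url` to strings and `Nested` to a nested dictionary.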

## Data Flow

### Extraction Flow

1. **Input**: HTML string + optional base URL
2. **Parse**: HTML is parsed into a DOM tree
3. **Identify**: Find elements with microformat classes
4. **Extract**: For each root element:
   - Identify type classes (h-*)
   - Extract properties (p-*, u-*, dt-*, e-*)
   - Resolve relative URLs
   - Handle nested microformats
5. **Convert**: Convert to Python objects
6. **Return**: Python dictionaries/lists

### Example Flow Diagram

```
HTML Input
    ↓
scraper::Html::parse_document()
    ↓
Find .h-card elements
    ↓
For each h-card:
  ├─ Extract .p-name → PropertyValue::Text
  ├─ Extract .u-url → PropertyValue::Url (resolved)
  ├─ Extract .u-email → PropertyValue::Url
  └─ Build HCard struct
    ↓
Convert to PyDict
    ↓
Return to Python
```

## Performance Considerations

### Optimization Strategies

1. **Single-Pass Parsing**: Parse HTML once, extract all formats
2. **Lazy Evaluation**: Only extract the requested microformat types
3. **Zero-Copy Where Possible**: Use string slices instead of clones
4. **Release Optimizations**: Release builds use link-time optimization (LTO) and optimization level 3
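
The zero-copy strategy can be illustrated with `std::borrow::Cow`: return a slice of the input when no change is needed and allocate only when the text must be rewritten. The function `normalize_text` is a hypothetical example and is not taken from MetaOxide's source.

```rust
use std::borrow::Cow;

/// Trim the input and collapse internal whitespace runs to single spaces.
/// Borrows from `input` when trimming alone is enough; allocates a new
/// `String` only when the interior has to be rewritten.
pub fn normalize_text(input: &str) -> Cow<'_, str> {
    let trimmed = input.trim();
    let needs_rewrite = trimmed.contains("  ")
        || trimmed.chars().any(|c| c.is_whitespace() && c != ' ');
    if !needs_rewrite {
        // Common case: a plain slice of the original buffer, no allocation.
        return Cow::Borrowed(trimmed);
    }
    Cow::Owned(trimmed.split_whitespace().collect::<Vec<_>>().join(" "))
}
```

For typical property values such as `"Jane Doe"` the borrowed branch is taken, so the only copy happens later, when the value is handed to an owned structure.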

### Memory Management

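- Rust's ownership system frees memory deterministically and rules out use-after-free and double-free bugs
- Python objects are reference-counted by the interpreter; PyO3 ties access to them to holding the GIL
- `scraper` builds the complete DOM in memory before extraction; streaming for very large documents is a planned feature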

## Testing Strategy

### Unit Tests

Each extractor has unit tests in the same file:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_extract_hcard() {
        // Test implementation
    }
}
```

### Integration Tests

Integration tests live in the `tests/` directory and exercise the library end to end.

### Python Tests

Test Python bindings with pytest:

```python
import meta_oxide

def test_extract_hcard():
    html = '<div class="h-card"><span class="p-name">Jane Doe</span></div>'
    result = meta_oxide.extract_hcard(html)
    assert len(result) == 1
```

## Future Enhancements

### Planned Features

1. **More Microformats**: h-feed, h-review, h-product
2. **Streaming Parser**: For very large documents
3. **Custom Extractors**: Allow users to define custom formats
4. **Validation**: Validate microformat structure
5. **Performance Metrics**: Built-in benchmarking

### Extensibility Points

1. **Custom Property Extractors**: Override default extraction logic
2. **Custom Validators**: Add validation rules
3. **Custom Converters**: Convert to different output formats
4. **Plugin System**: Load extractors dynamically

## Dependencies

### Core Dependencies

- **pyo3**: Python bindings and FFI
- **scraper**: HTML parsing with CSS selectors
- **serde**: Serialization framework
- **url**: URL parsing and resolution
- **thiserror**: Error handling

### Why These Dependencies?

- **scraper**: HTML5-compliant parsing (via `html5ever`) with CSS selector support
- **pyo3**: The de facto standard, community-maintained Rust bindings for Python
- **serde**: Industry-standard serialization
- **url**: WHATWG URL standard implementation

## Contributing

See [Development Guide](development.md) for information on:
- Setting up the development environment
- Adding new microformat types
- Running tests
- Submitting pull requests