url-preview 0.6.0

# URL Preview Library - Project Status

## Overview

URL Preview is a high-performance Rust library for generating rich URL previews with support for:
- Standard metadata extraction (Open Graph, Twitter Cards, etc.)
- Browser-based rendering for JavaScript-heavy sites
- LLM-powered structured data extraction
- Platform-specific handlers (Twitter/X, GitHub)
- Comprehensive caching and concurrency control

## Current Version: 0.6.0

## Feature Status

### ✅ Core Features (Stable)

1. **Basic Preview Generation**
   - HTML fetching and parsing
   - Metadata extraction (title, description, images)
   - Open Graph and Twitter Card support
   - Favicon detection
   - Character encoding handling

2. **Caching System**
   - Thread-safe DashMap-based cache
   - Configurable capacity and TTL
   - LRU eviction policy
   - Feature flag: `cache`

3. **Security & Validation**
   - URL validation
   - Content size limits
   - Request timeouts
   - Redirect following with limits

4. **Platform Handlers**
   - GitHub: Repository statistics via API
   - Twitter/X: oEmbed API integration
   - Feature flags: `github`, `twitter`

### ✅ Browser Rendering (Implemented)

**Status**: Fully implemented with MCP (Model Context Protocol) integration

**Components**:
- `McpClient`: Complete JSON-RPC communication layer
- `BrowserFetcher`: High-level browser automation
- `BrowserPreviewService`: Integrated preview generation

**Key Features**:
- Automatic SPA detection
- JavaScript rendering support
- Browser usage policies (Always/Never/Auto)
- Playwright integration via MCP
- Screenshot capabilities

**Example Usage**:
```rust
let mcp_config = McpConfig {
    enabled: true,
    server_command: vec!["npx", "-y", "@playwright/mcp@latest"],
    transport: McpTransport::Stdio,
    browser_timeout: 30,
    max_sessions: 5,
};

let browser_service = BrowserPreviewService::new(mcp_config, BrowserUsagePolicy::Auto);
browser_service.initialize().await?;
let preview = browser_service.generate_preview(url).await?;
```

**Run Examples**:
```bash
cargo run --example browser_preview --features browser
cargo run --example test_browser_simple --features browser
```

### ✅ LLM Integration (Implemented)

**Status**: Core functionality implemented with provider system

**Providers**:
- ✅ `MockProvider`: For testing
- ✅ `OpenAIProvider`: Full implementation with function calling
- 🚧 `AnthropicProvider`: Placeholder (awaiting official SDK)
- ✅ `ClaudeCompatProvider`: Claude via OpenAI-compatible endpoints
- ✅ `ClaudeCodeProvider`: Claude via claude-code-sdk (pending SDK fix)
- 🚧 `LocalProvider`: Placeholder for Ollama/local models

**Key Features**:
- Type-safe structured data extraction
- Automatic JSON schema generation
- Multiple content formats (HTML, Markdown, Text)
- Token usage tracking
- Optional caching support

**Example Usage**:
```rust
#[derive(Serialize, Deserialize, JsonSchema)]
struct ProductInfo {
    name: String,
    price: Option<String>,
    description: String,
}

let provider = Arc::new(OpenAIProvider::new(api_key));
let extractor = LLMExtractor::new(provider);
let result = extractor.extract::<ProductInfo>(url, &fetcher).await?;
```

**Run Examples**:
```bash
# With mock provider
cargo run --example test_llm_extraction --features "llm cache"

# With OpenAI
OPENAI_API_KEY=your_key cargo run --example test_llm_extraction --features "llm cache"

# Combined with browser
cargo run --example test_llm_with_browser --features "browser llm"
```

### 📝 Claude API Support

Claude can be used through multiple methods:

1. **Claude Code SDK** (Direct CLI integration):
   ```rust
   let provider = Arc::new(ClaudeCodeProvider::new());
   let extractor = LLMExtractor::new(provider);
   ```
   Note: Requires fixing a compilation issue in claude-code-sdk

2. **OpenRouter** (Recommended for API access):
   ```bash
   OPENROUTER_API_KEY=your_key cargo run --example claude_via_openrouter --features llm
   ```

3. **OpenAI-compatible proxies**: Helicone, LiteLLM, One API

4. **ClaudeCompatProvider**: Built-in provider for OpenAI-compatible endpoints

See `doc/claude_code_integration.md` for detailed integration guide.

## Project Structure

```
url-preview/
├── src/
│   ├── lib.rs                 # Public API exports
│   ├── preview_service.rs     # High-level service interface
│   ├── fetcher.rs            # HTTP client wrapper
│   ├── extractor.rs          # Metadata extraction logic
│   ├── cache.rs              # Caching implementation
│   ├── browser_fetcher.rs    # Browser automation
│   ├── mcp_client.rs         # MCP protocol implementation
│   ├── llm_extractor.rs      # LLM extraction logic
│   └── llm_providers.rs      # LLM provider implementations
├── examples/
│   ├── browser_preview.rs    # Browser rendering demo
│   ├── test_llm_extraction.rs # LLM extraction demo
│   └── ...                   # Many more examples
├── tests/
│   ├── browser_tests.rs      # Browser integration tests
│   ├── llm_tests.rs         # LLM extraction tests
│   └── ...                  # Comprehensive test suite
└── doc/
    ├── testing_guide.md      # Testing documentation
    └── project_status.md     # This file
```

## Dependencies

### Core Dependencies
- `tokio`: Async runtime
- `reqwest`: HTTP client
- `scraper`: HTML parsing
- `url`: URL parsing
- `serde`/`serde_json`: Serialization

### Optional Dependencies
- `dashmap`: Cache implementation (`cache` feature)
- `jsonrpc-core`: MCP protocol (`browser` feature)
- `async-openai`: OpenAI integration (`llm` feature)
- `schemars`: JSON schema generation (`llm` feature)
- `base64`: Screenshot decoding (`browser` feature)

## Testing

### Unit Tests
```bash
cargo test
```

### Feature-Specific Tests
```bash
# Browser features
cargo test --features browser

# LLM features
cargo test --features llm

# All features
cargo test --features full
```

### Integration Tests
- Browser tests require Node.js and npx
- LLM tests can run with mock provider or real API keys

## Known Limitations

1. **Browser Feature**:
   - Requires Node.js and npx installed
   - First run downloads Playwright browsers
   - Some sites may have anti-automation measures

2. **LLM Feature**:
   - OpenAI provider requires API key
   - Anthropic provider is placeholder only
   - Token limits apply based on model

3. **General**:
   - No streaming support for large responses
   - Limited multi-language support
   - No built-in rate limiting

## Future Enhancements

1. **Browser Improvements**:
   - Cookie/session management
   - Custom viewport sizes
   - Network request interception
   - Performance metrics

2. **LLM Enhancements**:
   - Native Anthropic SDK integration
   - Streaming response support
   - Local model support (Ollama)
   - Multi-modal extraction (with screenshots)

3. **Core Features**:
   - WebAssembly support
   - Batch processing optimization
   - Enhanced error recovery
   - Metrics and monitoring

## Performance Considerations

- Concurrent request limit: Configurable (default 10)
- Cache capacity: Configurable (default 1000 entries)
- Browser timeout: 30 seconds default
- LLM timeout: Based on model and content size

## API Stability

- Core preview API: Stable
- Browser API: Stable
- LLM API: Stable (providers may change)
- Internal modules: Subject to change

## Contributing

See CONTRIBUTING.md for guidelines. Key areas for contribution:
- Additional LLM provider implementations
- Platform-specific extractors
- Performance optimizations
- Documentation improvements

## License

MIT License - See LICENSE file for details.