halldyll-parser
High-performance HTML parsing and content extraction library.
Features
- Metadata extraction: Title, description, OpenGraph, Twitter Cards, robots, JSON-LD
- Content extraction: Headings, paragraphs, lists, tables, code blocks, quotes
- Link analysis: Internal/external classification, nofollow detection, URL resolution
- Image extraction: With lazy loading, srcset, and accessibility info
- Text processing: Boilerplate removal, readability scoring, language detection
- Structured data: JSON-LD and Microdata extraction
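To make the link-analysis feature above more concrete, here is a minimal sketch of internal/external classification and nofollow detection. It is not halldyll-parser's implementation: `LinkKind`, `host_of`, `classify_link`, and `is_nofollow` are hypothetical names, and the host extraction is a deliberately simplified string split rather than a full URL parser (no IPv6 literals, for example).

```rust
// Illustrative sketch only; all names here are hypothetical and are
// not part of halldyll-parser's API.

#[derive(Debug, PartialEq)]
enum LinkKind {
    Internal,
    External,
}

/// Extracts the lowercase host from an absolute or protocol-relative
/// http(s) URL. Returns None for relative paths and other schemes.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .or_else(|| url.strip_prefix("//"))?;
    // The authority ends at the first '/', '?', or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// A link is external only when both hosts are known and differ.
/// Relative hrefs (no host) are treated as internal.
fn classify_link(base: &str, href: &str) -> LinkKind {
    match (host_of(base), host_of(href)) {
        (Some(b), Some(h)) if b != h => LinkKind::External,
        _ => LinkKind::Internal,
    }
}

/// True when the `rel` attribute contains the `nofollow` token
/// (tokens are whitespace-separated and case-insensitive).
fn is_nofollow(rel: &str) -> bool {
    rel.split_ascii_whitespace()
        .any(|token| token.eq_ignore_ascii_case("nofollow"))
}
```

With a base of `https://example.com/`, this sketch treats `/about` as internal and `https://other.org/` as external. A production implementation would more likely resolve and compare URLs with a dedicated URL parser.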
Quick Start
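// Import path assumed from the crate name and the identifiers used below
use halldyll_parser::{parse, HtmlParser};

// Quick parse
let html = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>";
let result = parse(html).unwrap();
println!("{:?}", result); // assumes the result type implements Debug

// With base URL for resolving relative links
let parser = HtmlParser::with_base_url("https://example.com").unwrap(); // placeholder URL
let result = parser.parse(html).unwrap();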
Architecture
This crate is organized into focused modules:
- types: All type definitions
- selector: CSS selector utilities and caching
- metadata: Metadata extraction (OG, Twitter, robots, etc.)
- text: Text extraction and processing
- links: Link extraction and analysis
- content: Structured content extraction (headings, lists, tables, etc.)
- parser: Main HtmlParser API
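The selector module is described as providing caching. As a rough sketch of what selector caching generally means, the example below compiles each selector string once and hands out shared references afterwards. It makes no claim about how this crate does it: `CompiledSelector` is a hypothetical stand-in for a real parsed selector, and `SelectorCache` and `compile_count` are invented for illustration.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical stand-in for a parsed CSS selector. A real one would
// hold a parsed representation rather than the source string.
#[derive(Debug)]
struct CompiledSelector {
    source: String,
}

// Hypothetical cache keyed by the selector source text.
struct SelectorCache {
    entries: HashMap<String, Rc<CompiledSelector>>,
    // Counts how many times a selector actually had to be compiled.
    compile_count: usize,
}

impl SelectorCache {
    fn new() -> Self {
        SelectorCache {
            entries: HashMap::new(),
            compile_count: 0,
        }
    }

    /// Returns the cached selector, compiling it on first use only.
    fn get(&mut self, selector: &str) -> Rc<CompiledSelector> {
        if let Some(hit) = self.entries.get(selector) {
            return Rc::clone(hit);
        }
        // Cache miss: "compile" the selector and remember the result.
        self.compile_count += 1;
        let compiled = Rc::new(CompiledSelector {
            source: selector.to_string(),
        });
        self.entries
            .insert(selector.to_string(), Rc::clone(&compiled));
        compiled
    }
}
```

Requesting the same selector twice returns the same shared value and compiles it only once, which is the point of caching when one selector is applied across many documents.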