halldyll-parser

High-performance HTML parsing and content extraction library.

Features

Metadata extraction: Title, description, OpenGraph, Twitter Cards, robots, JSON-LD
Content extraction: Headings, paragraphs, lists, tables, code blocks, quotes
Link analysis: Internal/external classification, nofollow detection, URL resolution
Image extraction: Lazy-loading detection, srcset, and accessibility info
Text processing: Boilerplate removal, readability scoring, language detection
Structured data: JSON-LD and Microdata extraction
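
To make the internal/external link classification listed above more concrete, here is a minimal, standard-library-only sketch of the idea: compare the host of each link against the host of the base URL. This is an illustration, not the crate's implementation. The names `LinkKind`, `host_of`, and `classify` are hypothetical, and the host parsing is deliberately simplified (it ignores userinfo, IPv6 literals, and non-HTTP schemes such as mailto:).

```rust
/// Hypothetical classification result; not a type exported by halldyll-parser.
#[derive(Debug, PartialEq)]
enum LinkKind {
    Internal,
    External,
}

/// Returns the host of an absolute http(s) URL, or None for relative URLs
/// and other schemes. Simplified: no userinfo or IPv6 handling.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first path, query, or fragment delimiter.
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    // Drop an optional ":port" suffix.
    let host = rest[..end].split(':').next().unwrap_or("");
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// A link is external only when both URLs have a host and the hosts differ
/// (compared case-insensitively). Relative links count as internal.
fn classify(base: &str, href: &str) -> LinkKind {
    match (host_of(base), host_of(href)) {
        (Some(b), Some(h)) if !b.eq_ignore_ascii_case(h) => LinkKind::External,
        _ => LinkKind::Internal,
    }
}
```

With these definitions, `classify("https://example.com", "/about")` yields `LinkKind::Internal`, while a link to a different host yields `LinkKind::External`. A production implementation would more likely rely on a full URL parser than on string splitting.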

Quick Start

use halldyll_parser::{HtmlParser, parse};

// Quick parse
let html = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>";
let result = parse(html).unwrap();
println!("Title: {:?}", result.metadata.title);

// With base URL for resolving relative links
let parser = HtmlParser::with_base_url("https://example.com").unwrap();
let result = parser.parse(html).unwrap();

Architecture

This crate is organized into focused modules:

types: All type definitions
selector: CSS selector utilities and caching
metadata: Metadata extraction (OG, Twitter, robots, etc.)
text: Text extraction and processing
links: Link extraction and analysis
content: Structured content extraction (headings, lists, tables, etc.)
parser: Main HtmlParser API

halldyll-parser 0.1.0
