halldyll-parser 0.1.0

HTML/CSS parsing and content extraction for the halldyll scraper

halldyll-parser

High-performance HTML parsing and content extraction library.

Features

  • Metadata extraction: Title, description, OpenGraph, Twitter Cards, robots, JSON-LD
  • Content extraction: Headings, paragraphs, lists, tables, code blocks, quotes
  • Link analysis: Internal/external classification, nofollow detection, URL resolution
  • Image extraction: Lazy-loading attributes, srcset, and accessibility info
  • Text processing: Boilerplate removal, readability scoring, language detection
  • Structured data: JSON-LD and Microdata extraction
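
To make the link-analysis feature concrete, here is a minimal standard-library sketch of internal/external classification and nofollow detection. It is not the crate's implementation: `LinkKind`, `host_of`, `classify`, and `is_nofollow` are hypothetical names chosen for illustration, and the host parsing is deliberately simplified (http/https and protocol-relative URLs only, no IPv6 literals).

```rust
/// Hypothetical classification result; not a type from halldyll-parser.
#[derive(Debug, PartialEq, Eq)]
enum LinkKind {
    Internal,
    External,
}

/// Extract the lowercased host from an absolute http(s) or
/// protocol-relative URL. Returns None for relative URLs and other schemes.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .or_else(|| url.strip_prefix("//"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Classify `href` relative to the page's base URL.
/// Links without a host (relative paths) are treated as internal.
fn classify(href: &str, base_url: &str) -> LinkKind {
    match host_of(href) {
        None => LinkKind::Internal,
        Some(link_host) => match host_of(base_url) {
            Some(base_host) if base_host == link_host => LinkKind::Internal,
            _ => LinkKind::External,
        },
    }
}

/// A `rel` attribute is a space-separated, case-insensitive token list.
fn is_nofollow(rel: &str) -> bool {
    rel.split_ascii_whitespace()
        .any(|token| token.eq_ignore_ascii_case("nofollow"))
}
```

A production version would more likely resolve links with a full URL parser and also handle schemes such as mailto:, which this sketch lumps in with relative links.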

Quick Start

use halldyll_parser::{HtmlParser, parse};

// Quick parse
let html = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>";
let result = parse(html).unwrap();
println!("Title: {:?}", result.metadata.title);

// With base URL for resolving relative links
let parser = HtmlParser::with_base_url("https://example.com").unwrap();
let result = parser.parse(html).unwrap();

Architecture

This crate is organized into focused modules:

  • types: All type definitions
  • selector: CSS selector utilities and caching
  • metadata: Metadata extraction (OG, Twitter, robots, etc.)
  • text: Text extraction and processing
  • links: Link extraction and analysis
  • content: Structured content extraction (headings, lists, tables, etc.)
  • parser: Main HtmlParser API
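
The selector module is described as providing caching. The general idea can be sketched as follows, assuming a cache keyed by the selector string so that each selector is compiled at most once. `SelectorCache` and `CompiledSelector` are hypothetical stand-ins, not types from this crate, and the "compilation" here is only a whitespace split standing in for real selector parsing.

```rust
use std::collections::HashMap;

/// Stand-in for a compiled selector; a real one would hold a parsed AST.
#[derive(Debug, Clone, PartialEq)]
struct CompiledSelector {
    parts: Vec<String>,
}

/// Hypothetical cache that compiles each selector string at most once.
#[derive(Default)]
struct SelectorCache {
    compiled: HashMap<String, CompiledSelector>,
    /// Number of times a selector actually had to be compiled.
    compile_count: usize,
}

impl SelectorCache {
    /// Return the cached selector, compiling and storing it on first use.
    fn get(&mut self, selector: &str) -> &CompiledSelector {
        if !self.compiled.contains_key(selector) {
            self.compile_count += 1;
            // Placeholder "compilation": split on whitespace.
            let parsed = CompiledSelector {
                parts: selector.split_whitespace().map(str::to_owned).collect(),
            };
            self.compiled.insert(selector.to_owned(), parsed);
        }
        &self.compiled[selector]
    }
}
```

Repeated lookups of the same selector string then skip the parsing step, which matters when the same selectors run against many pages.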