halldyll-media 0.1.0

Media extraction (images, videos, links) for halldyll scraper

halldyll-media

Comprehensive media extraction library for web scraping.

This crate extracts and processes the following kinds of media from HTML documents:

  • Images: With srcset, lazy loading, and placeholder detection
  • Videos: HTML5 video and embedded platforms (YouTube, Vimeo, etc.)
  • Audio: HTML5 audio and streaming platforms (Spotify, SoundCloud, etc.)
  • Documents: PDF, Word, Excel, PowerPoint, and other document formats
  • Embedded: iframes, social embeds, maps, and widgets
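To make the responsive-image item above concrete, here is a minimal sketch of how a `srcset` attribute can be split into candidates. It is not the crate's implementation: `SrcsetCandidate`, `parse_srcset`, and `widest` are hypothetical names introduced only for illustration, and the parser assumes URLs contain no commas (so it would mis-split `data:` URIs).

```rust
/// One entry of a `srcset` attribute.
/// Hypothetical type for illustration, not part of halldyll-media.
#[derive(Debug, PartialEq)]
pub struct SrcsetCandidate {
    pub url: String,
    /// Width descriptor in pixels, e.g. `480w`.
    pub width: Option<u32>,
    /// Pixel-density descriptor, e.g. `2x`.
    pub density: Option<f32>,
}

/// Splits a `srcset` value into candidates.
/// Simplification: entries are separated on every comma, so URLs that
/// themselves contain commas are not handled.
pub fn parse_srcset(srcset: &str) -> Vec<SrcsetCandidate> {
    srcset
        .split(',')
        .filter_map(|entry| {
            let mut parts = entry.split_whitespace();
            // An entry without a URL (e.g. a trailing comma) is skipped.
            let url = parts.next()?.to_string();
            let mut width = None;
            let mut density = None;
            if let Some(desc) = parts.next() {
                if let Some(w) = desc.strip_suffix('w') {
                    width = w.parse().ok();
                } else if let Some(x) = desc.strip_suffix('x') {
                    density = x.parse().ok();
                }
            }
            Some(SrcsetCandidate { url, width, density })
        })
        .collect()
}

/// Returns the candidate with the largest width descriptor, if any.
pub fn widest(candidates: &[SrcsetCandidate]) -> Option<&SrcsetCandidate> {
    candidates
        .iter()
        .filter(|c| c.width.is_some())
        .max_by_key(|c| c.width)
}
```

Calling `parse_srcset("small.jpg 480w, large.jpg 1024w")` and passing the result to `widest` selects the `large.jpg` entry. A scraper would typically keep the widest candidate when it wants the highest-resolution version of an image.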

Quick Start

use halldyll_media::MediaExtractor;

fn main() {
    let html = r#"
        <img src="image.jpg" alt="Photo">
        <video src="video.mp4"></video>
    "#;

    // Base URL of the page the HTML was taken from.
    let extractor = MediaExtractor::new()
        .with_base_url("https://example.com");

    // extract_all returns a Result; handle the error instead of panicking in real code.
    let media = extractor.extract_all(html).expect("media extraction failed");
    println!("Found {} images", media.images.len());
    println!("Found {} videos", media.videos.len());
}

Modules

  • [types]: Core types, errors, and configuration
  • [images]: Image extraction with responsive image support
  • [videos]: Video extraction with embedded platform detection
  • [audio]: Audio extraction with streaming platform support
  • [documents]: Document extraction (PDF, Office, etc.)
  • [embedded]: Embedded content extraction (iframes, social, widgets)
  • [downloader]: Async media downloading with hashing
  • [extractor]: Main MediaExtractor API
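The videos, audio, and embedded modules all mention platform detection. As a rough sketch of the idea, and not the crate's actual code, an embed URL can be classified by its host name. The `Platform` enum, `host_of`, and `detect_platform` below are hypothetical names, and the domain list is an assumption that covers only the platforms named in this document.

```rust
/// Platforms named in the feature list.
/// Hypothetical enum for illustration, not part of halldyll-media.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Platform {
    YouTube,
    Vimeo,
    Spotify,
    SoundCloud,
    Unknown,
}

/// Extracts the host from an absolute (`https://...`) or
/// protocol-relative (`//...`) URL. Returns None for relative paths.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .split_once("://")
        .map(|(_, r)| r)
        .or_else(|| url.strip_prefix("//"))?;
    // Cut at the first path, query, or fragment delimiter.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo and port.
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Classifies a URL by matching its host against known domains.
/// The domain list is an assumption and deliberately short.
pub fn detect_platform(url: &str) -> Platform {
    let lowered = match host_of(url) {
        Some(h) => h.to_ascii_lowercase(),
        None => return Platform::Unknown,
    };
    let host = lowered.as_str();
    // Match the domain itself or any subdomain of it.
    let matches = |domain: &str| host == domain || host.ends_with(&format!(".{}", domain));

    if matches("youtube.com") || matches("youtu.be") || matches("youtube-nocookie.com") {
        Platform::YouTube
    } else if matches("vimeo.com") {
        Platform::Vimeo
    } else if matches("spotify.com") {
        Platform::Spotify
    } else if matches("soundcloud.com") {
        Platform::SoundCloud
    } else {
        Platform::Unknown
    }
}
```

Matching on the host rather than on the whole URL avoids false positives such as `https://example.com/?u=youtube.com`, where a platform name appears only in the query string.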