halldyll-media
Comprehensive media extraction library for web scraping.
This crate extracts and processes several kinds of media from HTML documents:
- Images: With srcset, lazy loading, and placeholder detection
- Videos: HTML5 video and embedded platforms (YouTube, Vimeo, etc.)
- Audio: HTML5 audio and streaming platforms (Spotify, SoundCloud, etc.)
- Documents: PDF, Word, Excel, PowerPoint, and other document formats
- Embedded: iframes, social embeds, maps, and widgets
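The images entry above mentions `srcset` support. As a rough illustration of what that involves, here is a minimal, standard-library-only sketch of splitting a `srcset` value into candidates. The names `Descriptor`, `SrcsetCandidate`, and `parse_srcset` are hypothetical and are not taken from this crate's API. The comma-splitting rule is a simplification of the HTML specification's algorithm.

```rust
/// Descriptor attached to one `srcset` candidate (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Descriptor {
    /// Width in pixels, written as `480w`.
    Width(u32),
    /// Pixel density, written as `2x`.
    Density(f32),
    /// No descriptor was given for this candidate.
    Unspecified,
}

/// One candidate parsed from a `srcset` attribute (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct SrcsetCandidate {
    pub url: String,
    pub descriptor: Descriptor,
}

/// Parses a `srcset` value with a simplified rule: candidates are split on
/// commas, so URLs that themselves contain commas are not handled.
/// Candidates with an unrecognized descriptor are skipped.
pub fn parse_srcset(srcset: &str) -> Vec<SrcsetCandidate> {
    srcset
        .split(',')
        .filter_map(|candidate| {
            let mut parts = candidate.split_whitespace();
            // An empty candidate (for example after a trailing comma) is dropped.
            let url = parts.next()?.to_string();
            let descriptor = match parts.next() {
                None => Descriptor::Unspecified,
                Some(raw) => parse_descriptor(raw)?,
            };
            Some(SrcsetCandidate { url, descriptor })
        })
        .collect()
}

/// Parses a single `Nw` or `Nx` descriptor; returns `None` for anything else.
fn parse_descriptor(raw: &str) -> Option<Descriptor> {
    if let Some(width) = raw.strip_suffix('w') {
        width.parse::<u32>().ok().map(Descriptor::Width)
    } else if let Some(density) = raw.strip_suffix('x') {
        density.parse::<f32>().ok().map(Descriptor::Density)
    } else {
        None
    }
}
```

Calling `parse_srcset("small.jpg 480w, large.jpg 1024w")` yields two candidates with `Width` descriptors. A real extractor would also need to resolve each URL against the page's base URL.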
Quick Start
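// Assumed import path, derived from the crate name.
use halldyll_media::MediaExtractor;

let html = r#"
<img src="image.jpg" alt="Photo">
<video src="video.mp4"></video>
"#;

// The base URL is a placeholder value.
let extractor = MediaExtractor::new()
    .with_base_url("https://example.com");
let media = extractor.extract_all(html).unwrap();
// Assumes the result type implements Debug.
println!("{:?}", media);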
Modules
- [types]: Core types, errors, and configuration
- [images]: Image extraction with responsive image support
- [videos]: Video extraction with embedded platform detection
- [audio]: Audio extraction with streaming platform support
- [documents]: Document extraction (PDF, Office, etc.)
- [embedded]: Embedded content extraction (iframes, social, widgets)
- [downloader]: Async media downloading with hashing
- [extractor]: Main MediaExtractor API
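
To make the document-extraction idea concrete, the following sketch classifies a link by its file extension using only the standard library. It is an assumption about how such detection could work, not the crate's actual implementation. `DocumentKind` and `classify_document` are hypothetical names.

```rust
/// Document categories mentioned in this overview (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocumentKind {
    Pdf,
    Word,
    Excel,
    PowerPoint,
}

/// Guesses the document kind from a URL or path by looking at the file
/// extension. Returns `None` when there is no recognized extension.
pub fn classify_document(url: &str) -> Option<DocumentKind> {
    // Drop the query string and fragment, if any.
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    // Keep only the last path segment, i.e. the file name.
    let file_name = path.rsplit('/').next().unwrap_or(path);
    // Take everything after the final dot as the extension.
    let (_, extension) = file_name.rsplit_once('.')?;

    match extension.to_ascii_lowercase().as_str() {
        "pdf" => Some(DocumentKind::Pdf),
        "doc" | "docx" => Some(DocumentKind::Word),
        "xls" | "xlsx" => Some(DocumentKind::Excel),
        "ppt" | "pptx" => Some(DocumentKind::PowerPoint),
        _ => None,
    }
}
```

Extension matching is cheap but only a heuristic: a server can serve a PDF from a URL with no extension, so a fuller implementation would likely also check the `Content-Type` header when downloading.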