halldyll-media 0.1.0

Media extraction (images, videos, links) for halldyll scraper

Repository: Mr-soloDev/halldyll-Scrapper
Versions
- 0.1.0 (2026-01-20)

halldyll-media

Comprehensive media extraction library for web scraping.

This crate provides functionality to extract and process various types of media from HTML documents:

Images: With srcset, lazy loading, and placeholder detection
Videos: HTML5 video and embedded platforms (YouTube, Vimeo, etc.)
Audio: HTML5 audio and streaming platforms (Spotify, SoundCloud, etc.)
Documents: PDF, Word, Excel, PowerPoint, and other document formats
Embedded: iframes, social embeds, maps, and widgets
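
To make the srcset support mentioned above more concrete, here is a minimal, std-only sketch of how a `srcset` attribute value can be split into candidates. It is not taken from halldyll-media: `Descriptor`, `SrcsetCandidate`, and `parse_srcset` are hypothetical names used only for illustration, and the crate's own types may look quite different. The sketch splits naively on commas, so it would mishandle URLs that themselves contain commas (for example some `data:` URIs).

```rust
/// Width or pixel-density descriptor attached to a srcset candidate.
/// Hypothetical type for illustration; not part of halldyll-media.
#[derive(Debug, PartialEq)]
enum Descriptor {
    Width(u32),   // e.g. "480w"
    Density(f32), // e.g. "2x"
}

/// One URL plus its optional descriptor (hypothetical type).
#[derive(Debug, PartialEq)]
struct SrcsetCandidate {
    url: String,
    descriptor: Option<Descriptor>,
}

/// Splits a srcset value such as "small.jpg 480w, large.jpg 2x"
/// into candidates. Empty entries are skipped; unrecognized
/// descriptors are kept as `None` rather than rejected.
fn parse_srcset(srcset: &str) -> Vec<SrcsetCandidate> {
    srcset
        .split(',')
        .filter_map(|entry| {
            let mut parts = entry.split_whitespace();
            // An entry with no URL (e.g. a trailing comma) is dropped here.
            let url = parts.next()?.to_string();
            let descriptor = parts.next().and_then(|d| {
                if let Some(w) = d.strip_suffix('w') {
                    w.parse().ok().map(Descriptor::Width)
                } else if let Some(x) = d.strip_suffix('x') {
                    x.parse().ok().map(Descriptor::Density)
                } else {
                    None
                }
            });
            Some(SrcsetCandidate { url, descriptor })
        })
        .collect()
}
```

With this sketch, `parse_srcset("small.jpg 480w, large.jpg 2x")` returns two candidates: the first carries `Descriptor::Width(480)` and the second `Descriptor::Density(2.0)`.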

Quick Start

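use halldyll_media::MediaExtractor;

// Sample HTML containing one image and one video.
let html = r#"
    <img src="image.jpg" alt="Photo">
    <video src="video.mp4"></video>
"#;

// Build an extractor and set the base URL of the page the HTML came from.
let extractor = MediaExtractor::new()
    .with_base_url("https://example.com");

// unwrap() panics if extraction fails; handle the error in real code.
let media = extractor.extract_all(html).unwrap();
println!("Found {} images", media.images.len());
println!("Found {} videos", media.videos.len());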

Modules

[types]: Core types, errors, and configuration
[images]: Image extraction with responsive image support
[videos]: Video extraction with embedded platform detection
[audio]: Audio extraction with streaming platform support
[documents]: Document extraction (PDF, Office, etc.)
[embedded]: Embedded content extraction (iframes, social, widgets)
[downloader]: Async media downloading with hashing
[extractor]: Main MediaExtractor API
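
The embedded platform detection described for the [videos] and [audio] modules can be pictured as a host-based classification of the embed URL. The following is a rough, std-only sketch under that assumption; it does not reproduce the crate's implementation, and `Platform`, `host_of`, and `detect_platform` are hypothetical names. The host parsing is deliberately simple and does not handle IPv6 literals or scheme-relative URLs.

```rust
/// Hypothetical platform labels; not types from halldyll-media.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Platform {
    YouTube,
    Vimeo,
    Spotify,
    SoundCloud,
    Unknown,
}

/// Returns the host of an absolute URL, or `None` if there is no scheme.
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    // The authority ends at the first path, query, or fragment delimiter.
    let authority = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    // Drop any userinfo ("user@") and port (":443").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Classifies a URL by its host, matching the domain itself or any subdomain.
fn detect_platform(url: &str) -> Platform {
    let host = match host_of(url) {
        Some(h) => h.to_ascii_lowercase(),
        None => return Platform::Unknown,
    };
    // Requiring a leading dot avoids false matches such as "notyoutube.com".
    let matches = |domain: &str| host == domain || host.ends_with(&format!(".{domain}"));

    if matches("youtube.com") || matches("youtu.be") || matches("youtube-nocookie.com") {
        Platform::YouTube
    } else if matches("vimeo.com") {
        Platform::Vimeo
    } else if matches("spotify.com") {
        Platform::Spotify
    } else if matches("soundcloud.com") {
        Platform::SoundCloud
    } else {
        Platform::Unknown
    }
}
```

For instance, `detect_platform("https://player.vimeo.com/video/1")` evaluates to `Platform::Vimeo`, while a relative path such as `"video.mp4"` falls through to `Platform::Unknown` because it has no host.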