§halldyll-media
Comprehensive media extraction library for web scraping.
This crate provides functionality to extract and process various types of media from HTML documents:
- Images: With srcset, lazy loading, and placeholder detection
- Videos: HTML5 video and embedded platforms (YouTube, Vimeo, etc.)
- Audio: HTML5 audio and streaming platforms (Spotify, SoundCloud, etc.)
- Documents: PDF, Word, Excel, PowerPoint, and other document formats
- Embedded: iframes, social embeds, maps, and widgets
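To make the `srcset` support mentioned above concrete, here is a minimal, standard-library-only sketch of how a `srcset` attribute can be split into candidates and the largest one selected. It is not the crate's implementation: `Candidate`, `parse_srcset`, and `best_candidate` are hypothetical names chosen for illustration, and the parser assumes URLs contain no commas.

```rust
// Illustrative sketch only: these names are hypothetical and are not
// halldyll-media's own types or functions.

/// One candidate from a `srcset` attribute.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    /// Width descriptor in pixels, e.g. `480w`.
    pub width: Option<u32>,
    /// Pixel-density descriptor, e.g. `2x`.
    pub density: Option<f32>,
}

/// Parse a `srcset` value by splitting on commas.
/// Simplification: URLs that themselves contain commas are not handled.
pub fn parse_srcset(srcset: &str) -> Vec<Candidate> {
    srcset
        .split(',')
        .filter_map(|part| {
            let mut tokens = part.split_whitespace();
            // Empty segments (e.g. after a trailing comma) are skipped here.
            let url = tokens.next()?.to_string();
            let mut width = None;
            let mut density = None;
            if let Some(desc) = tokens.next() {
                if let Some(w) = desc.strip_suffix('w') {
                    width = w.parse().ok();
                } else if let Some(x) = desc.strip_suffix('x') {
                    density = x.parse().ok();
                }
            }
            Some(Candidate { url, width, density })
        })
        .collect()
}

/// Pick the widest candidate; if none has a width descriptor,
/// fall back to the highest pixel density (missing density counts as 1.0).
pub fn best_candidate(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.width.is_some())
        .max_by_key(|c| c.width)
        .or_else(|| {
            candidates.iter().max_by(|a, b| {
                let da = a.density.unwrap_or(1.0);
                let db = b.density.unwrap_or(1.0);
                da.partial_cmp(&db).unwrap_or(std::cmp::Ordering::Equal)
            })
        })
}
```

Calling `parse_srcset("small.jpg 480w, large.jpg 1200w")` and passing the result to `best_candidate` selects the `large.jpg` entry. The crate's own `SrcsetEntry` and `get_best_image_url` may use different rules.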
§Quick Start
use halldyll_media::{MediaExtractor, MediaType};
let html = r#"
<img src="image.jpg" alt="Photo">
<video src="video.mp4"></video>
"#;
let extractor = MediaExtractor::new()
    .with_base_url("https://example.com");
let media = extractor.extract_all(html).unwrap();
println!("Found {} images", media.images.len());
println!("Found {} videos", media.videos.len());
§Modules
- types: Core types, errors, and configuration
- images: Image extraction with responsive image support
- videos: Video extraction with embedded platform detection
- audio: Audio extraction with streaming platform support
- documents: Document extraction (PDF, Office, etc.)
- embedded: Embedded content extraction (iframes, social, widgets)
- downloader: Async media downloading with hashing
- extractor: Main MediaExtractor API
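As an illustration of the kind of classification the `documents` module description implies, the following sketch maps a link to a document category by its file extension. It uses only the standard library and is an assumption about one reasonable approach, not the crate's logic: `DocKind` and `classify_document` are hypothetical names, and the extension table is deliberately small.

```rust
// Illustrative sketch only: `DocKind` and `classify_document` are
// hypothetical and are not part of halldyll-media.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocKind {
    Pdf,
    Word,
    Excel,
    PowerPoint,
}

/// Classify a URL or path by the extension of its last path segment.
/// The query string and fragment are ignored; matching is case-insensitive.
pub fn classify_document(url: &str) -> Option<DocKind> {
    // Drop anything after `?` or `#`.
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    // Take the last path segment, then the text after its final dot.
    let file = path.rsplit('/').next()?;
    let ext = file.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(DocKind::Pdf),
        "doc" | "docx" => Some(DocKind::Word),
        "xls" | "xlsx" => Some(DocKind::Excel),
        "ppt" | "pptx" => Some(DocKind::PowerPoint),
        _ => None,
    }
}
```

Extension matching is a heuristic: a server can deliver a PDF from a URL with no extension, so a real extractor would likely also consult the `Content-Type` header where available.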
Re-exports§
pub use types::MediaError;
pub use types::MediaResult;
pub use types::MediaType;
pub use types::ImageFormat;
pub use types::ImageLoading;
pub use types::ImageMedia;
pub use types::SrcsetEntry;
pub use types::VideoMedia;
pub use types::VideoSource;
pub use types::VideoTrack;
pub use types::TrackKind;
pub use types::VideoPlatform;
pub use types::AudioMedia;
pub use types::AudioSource;
pub use types::AudioPlatform;
pub use types::DocumentMedia;
pub use types::DocumentType;
pub use types::EmbeddedMedia;
pub use types::EmbedPlatform;
pub use types::EmbedType;
pub use types::LinkMedia;
pub use types::LinkType;
pub use types::MediaConfig;
pub use types::DownloadConfig;
pub use types::DownloadResult;
pub use types::ExtractedMedia;
pub use images::extract_images;
pub use images::get_image_urls;
pub use images::has_images;
pub use images::filter_placeholders;
pub use images::get_best_image_url;
pub use videos::extract_videos;
pub use videos::get_video_urls;
pub use videos::has_videos;
pub use videos::youtube_thumbnail;
pub use videos::youtube_embed_url;
pub use audio::extract_audio;
pub use audio::get_audio_urls;
pub use audio::has_audio;
pub use audio::spotify_embed_url;
pub use documents::extract_documents;
pub use documents::get_document_urls;
pub use documents::has_documents;
pub use documents::get_pdfs;
pub use documents::get_office_docs;
pub use embedded::extract_embeds;
pub use embedded::get_embed_urls;
pub use embedded::has_embeds;
pub use embedded::detect_embed_platform;
pub use embedded::filter_by_platform;
pub use embedded::get_maps;
pub use embedded::get_code_embeds;
pub use links::extract_links;
pub use links::get_link_urls;
pub use links::has_links;
pub use links::get_internal_links;
pub use links::get_external_links;
pub use links::get_download_links;
pub use links::get_nofollow_links;
pub use links::get_mailto_links;
pub use links::extract_emails;
pub use links::extract_phones;
pub use links::get_unique_domains;
pub use downloader::MediaDownloader;
pub use downloader::download_bytes;
pub use downloader::download_with_hash;
pub use downloader::download_to_base64;
pub use downloader::save_to_file;
pub use downloader::compute_sha256;
pub use downloader::detect_media_type;
pub use downloader::url_to_filename;
pub use downloader::is_downloadable;
pub use extractor::MediaExtractor;
pub use extractor::MediaExtractorBuilder;
pub use extractor::MediaCounts;
pub use extractor::extract_media;
pub use extractor::has_any_media;
pub use extractor::count_all_media;
pub use extractor::get_all_media_urls;
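The signatures of the re-exported helpers are not shown on this page. As a rough idea of what YouTube helpers such as `youtube_thumbnail` and `youtube_embed_url` typically compute, the sketch below extracts a video ID from common URL shapes and builds the thumbnail and embed URLs from it. All three function names here (`youtube_id`, `thumbnail_url_for`, `embed_url_for`) are hypothetical and standard-library-only; the crate's real functions may accept different inputs and return different types.

```rust
// Illustrative sketch only: these functions are hypothetical stand-ins,
// not the halldyll-media API.

/// Extract the video ID from `youtu.be/<id>`, `youtube.com/embed/<id>`,
/// or `youtube.com/watch?...v=<id>` URLs. Other shapes return `None`.
pub fn youtube_id(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.strip_prefix("www.").unwrap_or(rest);

    let id = if let Some(tail) = rest.strip_prefix("youtu.be/") {
        tail
    } else if let Some(tail) = rest.strip_prefix("youtube.com/embed/") {
        tail
    } else if let Some(query) = rest.strip_prefix("youtube.com/watch?") {
        // Find the `v=` parameter anywhere in the query string.
        query.split('&').find_map(|pair| pair.strip_prefix("v="))?
    } else {
        return None;
    };

    // Cut off any trailing query, fragment, or path component.
    let id = id
        .split(|c: char| c == '?' || c == '&' || c == '#' || c == '/')
        .next()?;
    if id.is_empty() {
        None
    } else {
        Some(id)
    }
}

/// Build the high-quality default thumbnail URL for a video ID.
pub fn thumbnail_url_for(id: &str) -> String {
    format!("https://img.youtube.com/vi/{}/hqdefault.jpg", id)
}

/// Build the iframe embed URL for a video ID.
pub fn embed_url_for(id: &str) -> String {
    format!("https://www.youtube.com/embed/{}", id)
}
```

For example, `youtube_id("https://youtu.be/dQw4w9WgXcQ?si=abc")` yields `Some("dQw4w9WgXcQ")`, which can then be passed to `thumbnail_url_for` or `embed_url_for`. The sketch ignores other hosts such as `m.youtube.com` and `music.youtube.com`.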
Modules§
- audio: Audio extraction for halldyll-media
- documents: Document extraction for halldyll-media
- downloader: Async media downloader for halldyll-media
- embedded: Embedded content extraction for halldyll-media
- extractor: Main media extractor API for halldyll-media
- images: Image extraction for halldyll-media
- links: Link extraction for halldyll-media
- types: Core types for halldyll-media
- videos: Video extraction for halldyll-media