Crate halldyll_media

§halldyll-media

Comprehensive media extraction library for web scraping.

This crate extracts and processes the following kinds of media from HTML documents:

  • Images: With srcset, lazy loading, and placeholder detection
  • Videos: HTML5 video and embedded platforms (YouTube, Vimeo, etc.)
  • Audio: HTML5 audio and streaming platforms (Spotify, SoundCloud, etc.)
  • Documents: PDF, Word, Excel, PowerPoint, and other document formats
  • Embedded: iframes, social embeds, maps, and widgets
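To make the srcset support mentioned for images more concrete, here is a minimal standalone sketch of how a `srcset` attribute value can be split into candidates. It uses only the standard library and is not the crate's implementation: `Candidate`, `parse_srcset`, and `widest` are hypothetical names invented for this illustration, and the parser assumes URLs do not contain commas.

```rust
/// One entry of a `srcset` attribute (hypothetical type for this sketch,
/// unrelated to the crate's own `SrcsetEntry`).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    /// Width descriptor in pixels, e.g. `480w`.
    pub width: Option<u32>,
    /// Pixel-density descriptor, e.g. `2x`.
    pub density: Option<f32>,
}

/// Split a `srcset` value into candidates.
/// Simplification: entries are separated on every comma, so URLs that
/// themselves contain commas are not handled.
pub fn parse_srcset(srcset: &str) -> Vec<Candidate> {
    srcset
        .split(',')
        .filter_map(|part| {
            let mut tokens = part.split_whitespace();
            // An entry without a URL (e.g. an empty string) is skipped.
            let url = tokens.next()?.to_string();
            let mut width = None;
            let mut density = None;
            if let Some(desc) = tokens.next() {
                if let Some(w) = desc.strip_suffix('w') {
                    width = w.parse().ok();
                } else if let Some(x) = desc.strip_suffix('x') {
                    density = x.parse().ok();
                }
            }
            Some(Candidate { url, width, density })
        })
        .collect()
}

/// Return the candidate with the largest width descriptor,
/// falling back to the first candidate when none declares a width.
pub fn widest(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.width.is_some())
        .max_by_key(|c| c.width)
        .or_else(|| candidates.first())
}
```

Calling `parse_srcset("small.jpg 480w, large.jpg 1200w")` and passing the result to `widest` selects the `large.jpg` entry. A production parser would follow the HTML specification's candidate-parsing algorithm more closely.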

§Quick Start

use halldyll_media::MediaExtractor;

// Sample markup containing one image and one video.
let html = r#"
    <img src="image.jpg" alt="Photo">
    <video src="video.mp4"></video>
"#;

// Base URL of the page the HTML came from.
let extractor = MediaExtractor::new()
    .with_base_url("https://example.com");

// unwrap() panics if extraction fails; handle the error in real code.
let media = extractor.extract_all(html).unwrap();
println!("Found {} images", media.images.len());
println!("Found {} videos", media.videos.len());

§Modules

  • types: Core types, errors, and configuration
  • images: Image extraction with responsive image support
  • videos: Video extraction with embedded platform detection
  • audio: Audio extraction with streaming platform support
  • documents: Document extraction (PDF, Office, etc.)
  • embedded: Embedded content extraction (iframes, social, widgets)
  • links: Link extraction (emails, phone numbers, domains)
  • downloader: Async media downloading with hashing
  • extractor: Main MediaExtractor API
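As an illustration of the kind of classification the documents module describes (PDF, Office, etc.), the following standalone sketch maps a URL to a document kind by its file extension. It is an assumption-laden example, not the crate's code: `DocKind` and `classify_document` are hypothetical names, and the extension table is only a plausible subset.

```rust
/// Coarse document categories (hypothetical enum for this sketch,
/// unrelated to the crate's own `DocumentType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocKind {
    Pdf,
    Word,
    Excel,
    PowerPoint,
    Other,
}

/// Classify a URL or path by its file extension.
/// Returns `None` when the extension is missing or not a known document type.
pub fn classify_document(url: &str) -> Option<DocKind> {
    // Ignore the query string and fragment, if any.
    let path = url.split(|c| c == '?' || c == '#').next().unwrap_or(url);
    // Keep only the last path segment, then split off its extension.
    let file = path.rsplit('/').next()?;
    let (_, ext) = file.rsplit_once('.')?;
    let kind = match ext.to_ascii_lowercase().as_str() {
        "pdf" => DocKind::Pdf,
        "doc" | "docx" => DocKind::Word,
        "xls" | "xlsx" => DocKind::Excel,
        "ppt" | "pptx" => DocKind::PowerPoint,
        // Assumed examples of "other document formats".
        "odt" | "ods" | "odp" | "rtf" => DocKind::Other,
        _ => return None,
    };
    Some(kind)
}
```

Extension matching is cheap but can be fooled by misleading file names; a stricter approach would also inspect the response's `Content-Type` header once the file is downloaded.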

Re-exports§

pub use types::MediaError;
pub use types::MediaResult;
pub use types::MediaType;
pub use types::ImageFormat;
pub use types::ImageLoading;
pub use types::ImageMedia;
pub use types::SrcsetEntry;
pub use types::VideoMedia;
pub use types::VideoSource;
pub use types::VideoTrack;
pub use types::TrackKind;
pub use types::VideoPlatform;
pub use types::AudioMedia;
pub use types::AudioSource;
pub use types::AudioPlatform;
pub use types::DocumentMedia;
pub use types::DocumentType;
pub use types::EmbeddedMedia;
pub use types::EmbedPlatform;
pub use types::EmbedType;
pub use types::LinkMedia;
pub use types::LinkType;
pub use types::MediaConfig;
pub use types::DownloadConfig;
pub use types::DownloadResult;
pub use types::ExtractedMedia;
pub use images::extract_images;
pub use images::get_image_urls;
pub use images::has_images;
pub use images::filter_placeholders;
pub use images::get_best_image_url;
pub use videos::extract_videos;
pub use videos::get_video_urls;
pub use videos::has_videos;
pub use videos::youtube_thumbnail;
pub use videos::youtube_embed_url;
pub use audio::extract_audio;
pub use audio::get_audio_urls;
pub use audio::has_audio;
pub use audio::spotify_embed_url;
pub use documents::extract_documents;
pub use documents::get_document_urls;
pub use documents::has_documents;
pub use documents::get_pdfs;
pub use documents::get_office_docs;
pub use embedded::extract_embeds;
pub use embedded::get_embed_urls;
pub use embedded::has_embeds;
pub use embedded::detect_embed_platform;
pub use embedded::filter_by_platform;
pub use embedded::get_maps;
pub use embedded::get_social_embeds;
pub use embedded::get_code_embeds;
pub use links::extract_emails;
pub use links::extract_phones;
pub use links::get_unique_domains;
pub use downloader::MediaDownloader;
pub use downloader::download_bytes;
pub use downloader::download_with_hash;
pub use downloader::download_to_base64;
pub use downloader::save_to_file;
pub use downloader::compute_sha256;
pub use downloader::detect_media_type;
pub use downloader::url_to_filename;
pub use downloader::is_downloadable;
pub use extractor::MediaExtractor;
pub use extractor::MediaExtractorBuilder;
pub use extractor::MediaCounts;
pub use extractor::extract_media;
pub use extractor::has_any_media;
pub use extractor::count_all_media;
pub use extractor::get_all_media_urls;
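Several of the re-exported helpers deal with YouTube URLs. The sketch below shows, using only the standard library, one way such helpers could work: extract the video ID from common URL shapes, then build a thumbnail or embed URL from it. The function names `youtube_id`, `thumbnail_url_for`, and `embed_url_for` are hypothetical and chosen for this example; the crate's own `youtube_thumbnail` and `youtube_embed_url` may have different signatures and behavior.

```rust
/// Extract the video ID from a YouTube URL (hypothetical helper).
/// Handles `youtube.com/watch?v=ID`, `youtube.com/embed/ID`, and `youtu.be/ID`.
pub fn youtube_id(url: &str) -> Option<&str> {
    // Strip the scheme and an optional "www." prefix.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.strip_prefix("www.").unwrap_or(rest);

    let candidate = if let Some(p) = rest.strip_prefix("youtu.be/") {
        p
    } else if let Some(p) = rest.strip_prefix("youtube.com/embed/") {
        p
    } else if let Some(p) = rest.strip_prefix("youtube.com/watch?") {
        // Look for the `v=` query parameter.
        p.split('&').find_map(|kv| kv.strip_prefix("v="))?
    } else {
        return None;
    };

    // Cut the ID at the first delimiter that can follow it.
    let id = candidate
        .split(|c| c == '?' || c == '&' || c == '#' || c == '/')
        .next()?;
    if id.is_empty() {
        None
    } else {
        Some(id)
    }
}

/// Build the high-quality default thumbnail URL for a video ID.
pub fn thumbnail_url_for(id: &str) -> String {
    format!("https://img.youtube.com/vi/{}/hqdefault.jpg", id)
}

/// Build the iframe embed URL for a video ID.
pub fn embed_url_for(id: &str) -> String {
    format!("https://www.youtube.com/embed/{}", id)
}
```

For example, `youtube_id("https://youtu.be/dQw4w9WgXcQ")` yields the ID that `thumbnail_url_for` and `embed_url_for` then turn into full URLs. Other host variants (such as mobile or music subdomains) are deliberately left out of this sketch.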

Modules§

  • audio: Audio extraction for halldyll-media
  • documents: Document extraction for halldyll-media
  • downloader: Async media downloader for halldyll-media
  • embedded: Embedded content extraction for halldyll-media
  • extractor: Main media extractor API for halldyll-media
  • images: Image extraction for halldyll-media
  • links: Link extraction for halldyll-media
  • types: Core types for halldyll-media
  • videos: Video extraction for halldyll-media