Module documents

Document extraction for halldyll-media

Extracts documents from HTML with support for:

PDF links
Office documents (Word, Excel, PowerPoint)
E-books (EPUB)
Download links
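As a rough illustration of how links to the file types listed above could be told apart, the sketch below classifies a URL by its file extension. It uses only the standard library. `DocKind` and `classify_url` are hypothetical names introduced here, not part of halldyll-media, and the crate's real detection logic may differ (for example, generic download links cannot be recognised by extension alone and are left out).

```rust
/// Hypothetical document categories mirroring the list above;
/// these are NOT the crate's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DocKind {
    Pdf,
    Word,
    Excel,
    PowerPoint,
    Epub,
}

/// Guess the document kind from a URL's file extension (assumed heuristic).
pub fn classify_url(url: &str) -> Option<DocKind> {
    // Drop any query string or fragment before inspecting the path.
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    // Take the last path segment, then the text after its final dot.
    let file = path.rsplit('/').next()?;
    let (_, ext) = file.rsplit_once('.')?;
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(DocKind::Pdf),
        "doc" | "docx" => Some(DocKind::Word),
        "xls" | "xlsx" => Some(DocKind::Excel),
        "ppt" | "pptx" => Some(DocKind::PowerPoint),
        "epub" => Some(DocKind::Epub),
        _ => None,
    }
}
```

Matching on the lowercased extension keeps the check case-insensitive, so `report.PDF?dl=1` and `report.pdf` are treated alike.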

Functions

count_by_type: Count documents by type
extract_documents: Extract all documents from an HTML document
extract_documents_from_html: Extract documents from an HTML string
get_document_urls: Get all document URLs from HTML
get_office_docs: Get Office documents
get_pdfs: Get PDFs only
has_documents: Check if HTML has documents
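The signatures of these functions are not shown on this page, so the sketch below does not call them. It is a standalone, standard-library approximation of what helpers such as `get_document_urls`, `count_by_type`, and `has_documents` might do conceptually. All `sketch_*` names, the extension list, and the naive `href="..."` scan are assumptions made for illustration; a real implementation would more likely use an HTML parser.

```rust
use std::collections::HashMap;

// Extensions treated as documents in this sketch (an assumption).
const DOC_EXTENSIONS: [&str; 8] =
    ["pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "epub"];

/// Return the lowercase extension of `url` if it is a known document type.
fn doc_extension(url: &str) -> Option<String> {
    // Ignore any query string or fragment.
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    let file = path.rsplit('/').next()?;
    let (_, ext) = file.rsplit_once('.')?;
    let ext = ext.to_ascii_lowercase();
    let known = DOC_EXTENSIONS.iter().any(|e| *e == ext.as_str());
    if known { Some(ext) } else { None }
}

/// Collect double-quoted `href` values that point at document files.
/// This is a naive string scan, not a real HTML parser.
pub fn sketch_document_urls(html: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find("href=\"") {
        // Skip past the 6-byte `href="` prefix.
        rest = &rest[start + 6..];
        let Some(end) = rest.find('"') else { break };
        let url = &rest[..end];
        if doc_extension(url).is_some() {
            urls.push(url.to_string());
        }
        rest = &rest[end + 1..];
    }
    urls
}

/// Count document links per lowercase extension.
pub fn sketch_count_by_type(html: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for url in sketch_document_urls(html) {
        if let Some(ext) = doc_extension(&url) {
            *counts.entry(ext).or_insert(0) += 1;
        }
    }
    counts
}

/// True if at least one document link is present.
pub fn sketch_has_documents(html: &str) -> bool {
    !sketch_document_urls(html).is_empty()
}
```

Building the counting and presence checks on top of a single URL-collection step keeps the three helpers consistent with each other, which is one plausible reason a module would expose them side by side.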