Module documents

Document extraction for halldyll-media

Extracts documents from HTML with support for:

  • PDF links
  • Office documents (Word, Excel, PowerPoint)
  • E-books (EPUB)
  • Download links
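As an illustration of how links to the formats above might be told apart, here is a minimal sketch that classifies a URL by its file extension. It is not the crate's implementation: `DocKind` and `classify_url` are hypothetical names introduced for this example, and the extension list is an assumption.

```rust
/// Hypothetical document categories for this sketch; the crate's own types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DocKind {
    Pdf,
    Word,
    Excel,
    PowerPoint,
    Epub,
}

/// Classify a URL by the extension of its last path segment.
/// Returns `None` when the extension is missing or not a known document type.
pub fn classify_url(url: &str) -> Option<DocKind> {
    // Drop "?query" and "#fragment" so "report.pdf?dl=1" still matches.
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url);
    // Look only at the last path segment, e.g. "report.pdf".
    let file = path.rsplit('/').next().unwrap_or(path);
    // Split off the extension; bail out if there is no dot.
    let (_, ext) = file.rsplit_once('.')?;
    // Compare case-insensitively so ".PDF" and ".pdf" are treated alike.
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(DocKind::Pdf),
        "doc" | "docx" => Some(DocKind::Word),
        "xls" | "xlsx" => Some(DocKind::Excel),
        "ppt" | "pptx" => Some(DocKind::PowerPoint),
        "epub" => Some(DocKind::Epub),
        _ => None,
    }
}
```

Extension matching is cheap but misses download links whose URL carries no file extension; detecting those would need other signals, such as the `download` attribute or the response's content type.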

Functions

  • count_by_type: Count documents by type
  • extract_documents: Extract all documents from HTML document
  • extract_documents_from_html: Extract documents from HTML string
  • get_document_urls: Get all document URLs from HTML
  • get_office_docs: Get Office documents
  • get_pdfs: Get PDFs only
  • has_documents: Check if HTML has documents
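
The signatures of these functions are not shown on this page, so the sketch below does not call them. It only illustrates, with the standard library alone, the kind of work that collecting document URLs and counting them by type involves. `href_values`, `doc_extension`, and `count_by_extension` are hypothetical helpers, and the string scan is a deliberately naive stand-in for a real HTML parser.

```rust
use std::collections::HashMap;

/// Collect the values of double-quoted `href="..."` attributes.
/// Naive string scan for illustration only; it does not parse HTML.
pub fn href_values(html: &str) -> Vec<&str> {
    const NEEDLE: &str = "href=\"";
    let mut out = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(NEEDLE) {
        let after = &rest[start + NEEDLE.len()..];
        match after.find('"') {
            Some(end) => {
                out.push(&after[..end]);
                rest = &after[end + 1..];
            }
            // Unterminated attribute value: stop scanning.
            None => break,
        }
    }
    out
}

/// Lowercased extension of the URL's last path segment, if it is in the
/// (assumed) list of document extensions.
fn doc_extension(url: &str) -> Option<String> {
    const KNOWN: [&str; 8] = ["pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "epub"];
    let path = url.split(|c: char| c == '?' || c == '#').next()?;
    let file = path.rsplit('/').next()?;
    let (_, ext) = file.rsplit_once('.')?;
    let ext = ext.to_ascii_lowercase();
    if KNOWN.contains(&ext.as_str()) {
        Some(ext)
    } else {
        None
    }
}

/// Count linked documents per extension, skipping links that are not documents.
pub fn count_by_extension(html: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for url in href_values(html) {
        if let Some(ext) = doc_extension(url) {
            *counts.entry(ext).or_insert(0) += 1;
        }
    }
    counts
}
```

With these helpers, a check in the spirit of `has_documents` reduces to `!count_by_extension(html).is_empty()`, and a PDF-only filter keeps just the links whose extension is `pdf`.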