Module cli_utils

High-level command interface for the CLI. This module contains the main logic for each CLI command.

Functions

extract_article
Extract a complete article (title/date metadata + body) from one page.
extract_article_hybrid
Extract a complete article using a prepared HybridExtractor (which may be backed by a trained NodeClassifier or the heuristic). Title and date come from the baseline metadata extractor; the body falls back to the baseline body if the hybrid selector returns nothing.
extract_batch
Extract a batch of HTML files, with site profile support.
extract_domain_from_url
Extract domain from URL
extract_single
Extract an article from a single HTML file.
load_html_files_recursive
Load HTML files recursively
read_html_file
Read HTML file with UTF-8 error handling
read_url_from_json
Read URL from JSON file
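
The three I/O helpers listed above (`read_html_file`, `load_html_files_recursive`, `extract_domain_from_url`) are described only by one-line summaries, so the following is a minimal sketch of what such helpers could look like using only the standard library. The names `read_html_lossy`, `collect_html_files`, and `domain_of` are hypothetical stand-ins, and the signatures and behavior are assumptions; they are not the actual `cli_utils` API.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `read_html_file` (assumed behavior):
/// read raw bytes and replace invalid UTF-8 sequences with U+FFFD
/// instead of failing on them.
fn read_html_lossy(path: &Path) -> io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

/// Hypothetical stand-in for `load_html_files_recursive` (assumed behavior):
/// push every `.html` / `.htm` file found below `dir` into `out`.
fn collect_html_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into subdirectories.
            collect_html_files(&path, out)?;
        } else if path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| {
                e.eq_ignore_ascii_case("html") || e.eq_ignore_ascii_case("htm")
            })
        {
            out.push(path);
        }
    }
    Ok(())
}

/// Hypothetical stand-in for `extract_domain_from_url` (assumed behavior):
/// strip the scheme, path, query, fragment, userinfo, and port.
/// This naive version does not handle bracketed IPv6 hosts.
fn domain_of(url: &str) -> Option<&str> {
    // Drop "scheme://" if present.
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    // Keep only the authority part.
    let authority = rest.split(&['/', '?', '#'][..]).next()?;
    // Drop "user:password@" if present.
    let host = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    // Drop ":port" if present.
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}
```

Lossy decoding is one plausible reading of "UTF-8 error handling": a scraped page with a few bad bytes still yields usable text rather than aborting a whole batch. A real implementation might instead detect the declared charset or report the error to the caller.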