High-level command interface for the CLI. This module contains the main logic for each CLI command.
Functions
- extract_article - Extract a complete article (title/date metadata + body) from one page.
- extract_article_hybrid - Extract a complete article using a prepared HybridExtractor (which may be backed by a trained NodeClassifier or the heuristic). Title/date come from the baseline metadata extractor; falls back to the baseline body if the hybrid selector returns nothing.
- extract_batch - Extract batch of HTML files with site profile support
- extract_domain_from_url - Extract domain from URL
- extract_single - Extract article from single HTML file.
- load_html_files_recursive - Load HTML files recursively
- read_html_file - Read HTML file with UTF-8 error handling
- read_url_from_json - Read URL from JSON file
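The helper functions listed above are described only by their one-line summaries, so their real signatures are not known here. As a rough illustration of what "UTF-8 error handling", "recursive loading", and "domain from URL" could mean in practice, the following std-only sketch uses hypothetical names (read_html_lossy, collect_html_files, is_html, domain_from_url) that are assumptions and not the module's actual API. The actual implementation may well use a dedicated URL-parsing crate and different error types.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for "read HTML file with UTF-8 error handling":
/// invalid byte sequences are replaced with U+FFFD instead of failing.
fn read_html_lossy(path: &Path) -> io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

/// True if the path has an .html or .htm extension (case-insensitive).
fn is_html(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("html") || ext.eq_ignore_ascii_case("htm"))
        .unwrap_or(false)
}

/// Hypothetical stand-in for "load HTML files recursively":
/// walks `dir` depth-first and appends every HTML file path to `out`.
fn collect_html_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_html_files(&path, out)?;
        } else if is_html(&path) {
            out.push(path);
        }
    }
    Ok(())
}

/// Simplified domain extraction (assumption: no IPv6 literals).
/// Strips the scheme, path/query/fragment, userinfo, and port.
fn domain_from_url(url: &str) -> Option<&str> {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() { None } else { Some(host) }
}
```

Under these assumptions, a batch command could call collect_html_files on an input directory, pass each resulting path to read_html_lossy, and use domain_from_url on the page's URL to pick a site profile. The lossy read trades strictness for robustness: a page with a few bad bytes is still processed rather than skipped.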