extrablatt_v2
This is a fork of the original "extrablatt" repository with updated dependencies.
Customizable article scraping & curation library and CLI. Also runs in Wasm.
The original project has basic Wasm support. A basic Wasm example, with some CORS limitations, is available at https://mattsse.github.io/extrablatt/
Inspired by newspaper.
HTML scraping is done via select.rs.
Features
- News URL identification
- Text extraction
- Top image extraction
- All image extraction
- Keyword extraction
- Author extraction
- Publishing date
- References
Customizable for specific news sites/layouts via the Extractor trait.
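The customization pattern can be sketched as a trait with default methods that a site-specific type overrides. Everything below is hypothetical and for illustration only: `TitleExtractor`, `text_between`, `DefaultExtractor` and `HeadlineSite` are made-up names, the crate's actual `Extractor` trait may have different methods and signatures, and naive string search stands in for select.rs so the sketch has no dependencies.

```rust
/// Hypothetical trait for illustration only; the real `Extractor` trait in
/// extrablatt_v2 has its own methods and signatures (see the crate docs).
trait TitleExtractor {
    /// Default heuristic: text of the first `<title>` element, if any.
    fn title(&self, html: &str) -> Option<String> {
        text_between(html, "<title>", "</title>")
    }
}

/// Return the trimmed text between the first `open` and the next `close`.
/// Naive string search, used here instead of a real HTML parser.
fn text_between(html: &str, open: &str, close: &str) -> Option<String> {
    let start = html.find(open)? + open.len();
    let end = start + html[start..].find(close)?;
    let text = html[start..end].trim();
    if text.is_empty() {
        None
    } else {
        Some(text.to_string())
    }
}

/// Uses the default behaviour unchanged.
struct DefaultExtractor;
impl TitleExtractor for DefaultExtractor {}

/// Site-specific override: this made-up site puts the headline in `<h1>`,
/// so prefer that and fall back to `<title>`.
struct HeadlineSite;
impl TitleExtractor for HeadlineSite {
    fn title(&self, html: &str) -> Option<String> {
        text_between(html, "<h1>", "</h1>")
            .or_else(|| text_between(html, "<title>", "</title>"))
    }
}

fn main() {
    let html = "<html><head><title>Site | Home</title></head>\
                <body><h1>Big Story</h1></body></html>";
    println!("{:?}", DefaultExtractor.title(html));
    println!("{:?}", HeadlineSite.title(html));
}
```

Only the method whose heuristic does not fit a given layout needs overriding; the defaults cover the rest.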
Differences from the original extrablatt
- Updated dependencies
- More heuristics for extracting the article body, authors, and other data
- Reorganized code structure
- More ideas borrowed from newspaper4k
- Configurable number of threads
- Proxy support - route requests through HTTP/HTTPS/SOCKS5 proxies if needed
- I do not use the WASM or CLI parts of this fork, so they are mostly untouched and I can't guarantee they work as expected.
Documentation
Full documentation: https://docs.rs/extrablatt_v2
Example
Extract all Articles from news outlets.
use Extrablatt;
use StreamExt;
async
Proxy Support
Route all HTTP requests through a proxy server:
use Extrablatt;
async
Supported proxy formats:
- http://host:port - HTTP proxy
- https://host:port - HTTPS proxy
- socks5://host:port - SOCKS5 proxy
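Before handing a proxy string to the library, it can help to check that it matches one of these formats. The following is a minimal, std-only sketch. `ProxyKind` and `classify_proxy` are hypothetical names that are not part of extrablatt_v2, and the check is an assumption about what counts as well-formed (a known scheme, a non-empty host, a numeric port), not a description of how the crate parses proxies.

```rust
/// Proxy kinds matching the three supported formats.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ProxyKind {
    Http,
    Https,
    Socks5,
}

/// Hypothetical helper (not part of extrablatt_v2): classify a proxy URL by
/// its scheme and check that a non-empty host and a numeric port follow it.
fn classify_proxy(url: &str) -> Option<ProxyKind> {
    let (scheme, rest) = url.split_once("://")?;
    let kind = match scheme {
        "http" => ProxyKind::Http,
        "https" => ProxyKind::Https,
        "socks5" => ProxyKind::Socks5,
        _ => return None,
    };
    // Expect `host:port`; split at the last colon so the port is isolated.
    let (host, port) = rest.trim_end_matches('/').rsplit_once(':')?;
    if host.is_empty() || port.parse::<u16>().is_err() {
        return None;
    }
    Some(kind)
}

fn main() {
    let candidates = [
        "http://127.0.0.1:8080",
        "socks5://localhost:1080",
        "ftp://host:21",
    ];
    for url in candidates {
        println!("{} -> {:?}", url, classify_proxy(url));
    }
}
```

Rejecting unknown schemes early gives a clearer error than a connection failure later in the request.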
Testing Proxy Manually
Use mitmproxy via Docker to verify requests go through the proxy:
# Terminal 1: Start mitmproxy
# Terminal 2: Run the test example
You should see the HTTP request appear in mitmproxy's console, proving traffic is routed through the proxy.
=== Proxy Test ===
Target URL: http://httpbin.org/ip
Proxy: Some("http://127.0.0.1:8080")
Configuring proxy: http://127.0.0.1:8080
Connecting...
SUCCESS: Connected through proxy!
If using mitmproxy, you should see the request in the proxy console.
Note: HTTPS requests through mitmproxy will fail with certificate errors (expected behavior - mitmproxy intercepts SSL). For testing, use HTTP URLs or configure your system to trust mitmproxy's CA certificate.
Command Line
Install
Usage
USAGE:
    extrablatt_v2 <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.
Extract a set of specific articles and store the result as JSON
License
Licensed under either of these:
- Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)