extrablatt_v2

This is a fork of the original "extrablatt" repository with updated dependencies.

Customizable article scraping & curation library and CLI. Also runs in Wasm.

The original project partially supports Wasm. A basic Wasm example, subject to some CORS limitations, is available at https://mattsse.github.io/extrablatt/

Inspired by newspaper.

HTML scraping is done via select.rs.

Features

News url identification
Text extraction
Top image extraction
All image extraction
Keyword extraction
Author extraction
Publishing date
References
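
As a rough illustration of what a feature such as keyword extraction involves, here is a minimal frequency-based sketch using only the standard library. It is not taken from this crate: the function name `top_keywords`, the stopword list, and the minimum token length are all assumptions made for the example, and the crate's own heuristics may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper (not part of extrablatt_v2): returns up to `limit`
/// keywords, most frequent first. Ties are broken alphabetically so the
/// result is deterministic.
fn top_keywords(text: &str, stopwords: &HashSet<&str>, limit: usize) -> Vec<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();

    // Split on anything that is not a letter or digit, then lowercase.
    for word in text
        .split(|c: char| !c.is_alphanumeric())
        .map(str::to_lowercase)
    {
        // Skip empty fragments, very short tokens, and stopwords.
        if word.len() < 3 || stopwords.contains(word.as_str()) {
            continue;
        }
        *counts.entry(word).or_insert(0) += 1;
    }

    // Sort by descending count, then alphabetically.
    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.into_iter().take(limit).map(|(word, _)| word).collect()
}

fn main() {
    // Tiny illustrative stopword list; a real one would be much longer.
    let stopwords: HashSet<&str> = ["the", "and", "was", "for"].iter().copied().collect();
    let text = "The election results: the election was close, \
                and turnout for the election was high.";
    println!("{:?}", top_keywords(text, &stopwords, 2));
}
```

In practice the text passed in would be the extracted article body, and the stopword list would depend on the article's language.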

Customizable for specific news sites/layouts via the Extractor trait.
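
The general pattern behind this kind of customization can be sketched as a trait with default heuristics that a site-specific type overrides. The example below is a self-contained illustration only: `TitleExtractor`, `text_between`, `DefaultSite`, and `HeadlineSite` are hypothetical names, and the trait operates on a raw HTML string rather than a select.rs document. See the crate documentation for the actual `Extractor` trait and its method signatures.

```rust
/// Hypothetical stand-in for an extractor trait.
/// This is NOT the crate's actual `Extractor` API.
trait TitleExtractor {
    /// Default heuristic: the text between the first <title> and </title>.
    fn title(&self, html: &str) -> Option<String> {
        text_between(html, "<title>", "</title>")
    }
}

/// Returns the trimmed text between the first `open` and the next `close`.
fn text_between(html: &str, open: &str, close: &str) -> Option<String> {
    let start = html.find(open)? + open.len();
    let end = html[start..].find(close)? + start;
    let text = html[start..end].trim();
    if text.is_empty() {
        None
    } else {
        Some(text.to_string())
    }
}

/// Uses only the default heuristics.
struct DefaultSite;
impl TitleExtractor for DefaultSite {}

/// A site whose headline is assumed to live in <h1 class="headline">.
struct HeadlineSite;
impl TitleExtractor for HeadlineSite {
    fn title(&self, html: &str) -> Option<String> {
        text_between(html, "<h1 class=\"headline\">", "</h1>")
            // Fall back to the default heuristic if the site-specific one fails.
            .or_else(|| DefaultSite.title(html))
    }
}

fn main() {
    let html = "<html><head><title>Some News | Politics</title></head>\
                <body><h1 class=\"headline\">Budget vote delayed</h1></body></html>";
    println!("{:?}", DefaultSite.title(html));
    println!("{:?}", HeadlineSite.title(html));
}
```

Keeping the defaults in the trait means a site-specific implementation only has to override the pieces where the generic heuristics pick the wrong element.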

Differences from the original extrablatt

Updated dependencies
More heuristics for extracting the article body, authors, and other data
Reorganized code structure
More ideas drawn from newspaper4k

Documentation

Full documentation: https://docs.rs/extrablatt_v2

Example

Extract all articles from a news outlet.

use extrablatt_v2::Extrablatt;
use futures::StreamExt;

#[tokio::main]
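async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the site for the given news outlet URL.
    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    // Turn the site into a stream that yields one result per article.
    let mut stream = site.into_stream();

    while let Some(article) = stream.next().await {
        match article {
            Ok(article) => println!("article '{:?}'", article.content.title),
            // Report articles that failed to download or parse and keep going.
            Err(err) => eprintln!("{:?}", err),
        }
    }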

    Ok(())
}

Command Line

Install

cargo install extrablatt_v2 --features="cli"

Usage

USAGE:
    extrablatt_v2 <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.

Extract a set of specific articles and store the result as JSON:

extrablatt_v2 article "https://www.example.com/article1.html" "https://www.example.com/article2.html" -o "articles.json"

License

Licensed under either of these:

Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)

extrablatt_v2 0.3.1