extrablatt_v2 0.3.2

extrablatt_v2
=====================
[![Crates.io](https://img.shields.io/crates/v/extrablatt_v2.svg)](https://crates.io/crates/extrablatt_v2)
[![Documentation](https://docs.rs/extrablatt_v2/badge.svg)](https://docs.rs/extrablatt_v2)

This is fork of [an original repository](https://github.com/mattsse/extrablatt) "extrablatt" with some updated dependencies.

Customizable article scraping & curation library and CLI.
Also runs in Wasm.

Original project kinda supports WASM:
Basic Wasm example with some CORS limitations: [https://mattsse.github.io/extrablatt/](https://mattsse.github.io/extrablatt/)


Inspired by [newspaper](https://github.com/codelucas/newspaper).

Html Scraping is done via [select.rs](https://github.com/utkarshkukreti/select.rs).

## Features


* News url identification
* Text extraction
* Top image extraction
* All image extraction
* Keyword extraction
* Author extraction
* Publishing date
* References

Customizable for specific news sites/layouts via the `Extractor` trait.

## Diffences from original extrablatt


* Updated dependencies
* More heuristics for article body/authors and etc data extraction
* Reoganized code structure
* More references to [newspaper4k](https://github.com/AndyTheFactory/newspaper4k) ideas
* Configurable threads num
* I am not used to use WASM or CLI in this fork, so those parts are mostly untouched and I can't guarantee they work as expected.

## Documentation


Full Documentation [https://docs.rs/extrablatt_v2](https://docs.rs/extrablatt_v2)

## Example


Extract all Articles from news outlets.

````rust
use extrablatt_v2::Extrablatt;
use futures::StreamExt;

#[tokio::main]

async fn main() -> Result<(), Box<dyn std::error::Error>> {

    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    let mut stream = site.into_stream();
    
    while let Some(article) = stream.next().await {
        if let Ok(article) = article {
            println!("article '{:?}'", article.content.title)
        } else {
            println!("{:?}", article);
        }
    }

    Ok(())
}
````

## Command Line


### Install


```bash
cargo install extrablatt_v2 --features="cli"
```

### Usage 


```text
USAGE:
    extrablatt_v2 <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.

```

### Extract a set of specific articles and store the result as json


````bash
extrablatt_v2 article "https://www.example.com/article1.html", "https://www.example.com/article2.html" -o "articles.json"
````

## License


Licensed under either of these:

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)