[−][src]Crate extrablatt
Extrablatt: Customizable article scraping & curation.
A library to gather and scrape news articles. This functionality is also
provided as a CLI utility. Extrablatt
supports targeted article scraping and non-targeted gathering of articles
either limited by category or unrestricted for an entire website. Extrablatt
allows for customized extraction of the textual content of an article by
means of the crate::extract::Extractor
trait.
Examples
Extract all content from a single news article
extrablatt::Article::get("http://example.com/interesting-article.html");
A crate::Category
represents a broader collection of articles, like Sports or Politics. Categores usually have their own designated page, e.g. https://some-news.com/sports
. By targeting specific crate::Category
, Extrablatt tries to identify all news articles on the categorie's page and then scrape them afterwards.
use futures::stream::StreamExt; let mut stream = extrablatt::Category::new("https://some-news.com/sports".parse().unwrap()) .into_stream() .await?; while let Some(article) = stream.next().await { //... }
The streaming of articles is made possible by crate::ArticleStream
,
which lets you turn any website into futures::stream::Stream
fo
articles.
crate::Extrablatt
is essentially a cache of articles of an entire
news site. It first tries to identify all categories from the main page and
then download and and scrape every article. Failed downloading attempts can
easily be repeated.
use futures::stream::StreamExt; let mut site = extrablatt::Extrablatt::builder("https://some-news.com/")?.build().await?; site.download_all_remaining_categories().await; for(url, content) in site.download_articles().await.successes() { // ... }
However crate::Extrablatt
can also be consumed and turned in to an
crate::ArticleStream
covering the whole site.
use futures::stream::StreamExt; let site = extrablatt::Extrablatt::builder("https://some-news.com/")?.build().await?; let mut stream = site.into_stream(); while let Some(article) = stream.next().await { if let Ok(article) = article { println!("article '{:?}'", article.content.title) } else { println!("{:?}", article); } }
By default, extraction of valuable information is done using the
crate::DefaultExtractor
which only uses the default implementation
of the crate::Extractor
trait. To customize the content extraction,
the trait must be implemented for a custom extractor.
use extrablatt::{Extractor, Language}; use extrablatt::select::{predicate::Attr, document::Document}; use std::borrow::Cow; pub struct MyExtractor { //... } impl Extractor for MyExtractor { fn text<'a>(&self, doc: &'a Document, lang: Language) -> Option<Cow<'a, str>> { // identify and extract the article's text from the `doc`, // e.g. by finding the very node that holds the text doc.find(Attr("id", "article")).next().map(|n| n.text().into()) } } let extractor = MyExtractor{}; let article = extrablatt::Article::get_with_extractor( "http://example.com/interesting-article.html", &extractor, ) .await .unwrap();
Re-exports
pub use select; |
pub use crate::article::Article; |
pub use crate::article::PureArticle; |
pub use crate::category::Category; |
pub use crate::extrablatt::ArticleStream; |
pub use crate::extrablatt::Config; |
pub use crate::extrablatt::Extrablatt; |
pub use crate::extrablatt::ExtrablattBuilder; |
pub use crate::extract::DefaultExtractor; |
pub use crate::extract::Extractor; |
pub use crate::language::Language; |
Modules
article | |
category | |
clean | |
date | |
extrablatt | |
extract | |
image | |
language | |
nlp | |
text | |
video |