Crate scrapyard

Automatic web scraper and RSS generator library

Quickstart

Get started by initialising the crate and starting the feed update loop inside an async runtime:

use std::path::PathBuf;

use actix_web::{App, HttpServer};
use scrapyard::Feeds;

#[tokio::main]
async fn main() {
    // initialise values (None = use the default config directory)
    scrapyard::init(None).await;

    // load feeds from a config file,
    // or create a default config file
    let feeds_path = PathBuf::from("feeds.json");
    let feeds = Feeds::load_json(&feeds_path).await
        .unwrap_or_else(|| {
            let default = Feeds::default();
            default.save_json();
            default
        });

    // start the event loop; this will not block
    feeds.start_loop().await;

    // as long as the program is running (here, an actix-web server),
    // the feeds will be updated regularly
    HttpServer::new(|| App::new())
        .bind(("0.0.0.0", 8080)).unwrap()
        .run().await.unwrap();
}

Configuration

By default, config files can be found in ~/.config/scrapyard (Linux), /Users/[Username]/Library/Application Support/scrapyard (macOS) or C:\Users\[Username]\AppData\Roaming\scrapyard (Windows).

To change the config directory location, specify the path:

let config_path = PathBuf::from("/my/special/path");
scrapyard::init(Some(config_path)).await;

Here are all the options in the main configuration file scrapyard.json.

{
    "store": String, // e.g. /home/user/.local/share/scrapyard/
    "max-retries": Number, // number of retries before giving up
    "request-timeout": Number, // number of seconds before giving up on a request
    "script-timeout": Number // number of seconds before giving up on the extractor script
}
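
For example, a filled-in scrapyard.json might look like this (the values below are illustrative, not built-in defaults):

{
    "store": "/home/user/.local/share/scrapyard/",
    "max-retries": 3,
    "request-timeout": 30,
    "script-timeout": 60
}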

Adding feeds

To add feeds, edit feeds.json.

{
    "origin": String, // origin of the feed
    "label": String, // text id of the feed
    "max-length": Number, // maximum number of items allowed in the feed
    "fetch-length": Number, // maximum number of items allowed to be fetched each interval
    "interval": Number, // number of seconds between fetches
    "idle-limit": Number, // number of seconds without requests to that feed before fetching stops
    "sort": Boolean, // whether to sort by publish date
    "extractor": [String], // all command line args to run the extractor, e.g. ["node", "extractor.js"]

    "title": String, // displayed feed title
    "link": String, // displayed feed source url
    "description": String, // displayed feed description
    "fetch": Boolean // whether the crate should fetch the content, or let the script do it
}
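
As a concrete illustration, a single feed entry could look like this (all values are made up):

{
    "origin": "https://example.com/blog",
    "label": "example-blog",
    "max-length": 50,
    "fetch-length": 10,
    "interval": 3600,
    "idle-limit": 86400,
    "sort": true,
    "extractor": ["node", "extractor.js"],

    "title": "Example Blog",
    "link": "https://example.com/blog",
    "description": "Posts scraped from the example blog",
    "fetch": true
}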

You can also include any additional fields from PseudoChannel to override the default empty values.

Getting feeds

Looking at the functions under FeedOption, there are two types of fetch functions.

Force fetching always requests a new copy of the feed, ignoring the fetch interval. Lazy fetching only fetches a new copy when the existing copy is out of date. This distinction is particularly relevant when the crate is used without the auto-fetch loop.

Extractor scripts

An extractor script must accept one command line argument and print a single JSON response to stdout; in JavaScript, a normal console.log() is enough.

The argument specifies a file path; that file contains the arguments for the scraper.

Contents of the input file:

{
    "url": String, // origin of the info fetched
    "webstr": String?, // response from the url, only present if feed.fetch = true
    "preexists": [PseudoItem], // don't output these again, to avoid duplication
    "lengthLeft": Number, // number of items left before the fetch-length quota is met

    // plus everything from the feed's entry in feeds.json
}

Expected output:

{
    "items": [PseudoItem], // list of items extracted
    "continuation": String? // optionally continue fetching in the next url
}
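
To make the protocol concrete, here is a minimal extractor sketch in Rust; any language works, since the crate simply runs the command line given in "extractor". The sketch assumes serde_json is available and only shows the plumbing: a real script would scrape the url and build PseudoItems instead of returning an empty list.

use std::{env, fs};

fn main() {
    // the first argument is the path to a JSON file with the scraper input
    let input_path = env::args().nth(1).expect("missing input file path");
    let raw = fs::read_to_string(&input_path).expect("cannot read input file");
    let input: serde_json::Value = serde_json::from_str(&raw).expect("invalid input JSON");

    // fields from the input schema above
    let _url = input["url"].as_str().unwrap_or_default();
    let _length_left = input["lengthLeft"].as_u64().unwrap_or(0);

    // a real extractor would scrape _url and build PseudoItems here;
    // this sketch emits none and omits "continuation", so fetching stops
    let output = serde_json::json!({ "items": [] });

    // the single JSON response goes to stdout
    println!("{output}");
}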

Macros

Structs

Enums

Statics

  • Self identifier of the crate: scrapyard X.Y.Z (git 123abcd)
  • Fetch locks to avoid duplicated fetching
  • Holds global master config

Traits

  • A convenience trait to quickly load and save files

Functions

  • Initialise all OnceLocks