Crate scrapyard

Automatic web scraper and RSS generator library

§Quickstart

Get started by creating an event loop.

use std::path::PathBuf;

use scrapyard::{Feeds, Saveable};

#[tokio::main]
async fn main() {
    // initialise values
    scrapyard::init(None).await;

    // load feeds from a config file,
    // or create a default config file
    let feeds_path = PathBuf::from("feeds.json");
    let feeds = Feeds::load_json(&feeds_path).await
        .unwrap_or_else(|| {
            let default = Feeds::default();
            default.save_json();
            default
        });

    // start the event loop; this will not block
    feeds.start_loop().await;

    // as long as the program is running,
    // the feeds will be updated regularly;
    // HttpServer here stands in for your web server of choice
    HttpServer::new(|| {})
        .bind(("0.0.0.0", 8080)).unwrap()
        .run().await.unwrap();
}

§Configuration

By default, config files can be found in ~/.config/scrapyard (Linux), /Users/[Username]/Library/Application Support/scrapyard (Mac) or C:\Users\[Username]\AppData\Roaming\scrapyard (Windows).

To change the config directory location, specify the path:

let config_path = PathBuf::from("/my/special/path");
scrapyard::init(Some(config_path)).await;

Here are all the options in the main configuration file scrapyard.json.

{
    "store": String, // i.e. /home/user/.local/share/scrapyard/
    "max-retries": Number, // number of retries before giving up
    "request-timeout": Number, // number of seconds before giving up request
    "script-timeout": Number, // number of seconds before giving up on the extractor script
}
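As a concrete illustration, a filled-in scrapyard.json might look like this (all values are illustrative, not defaults):

```json
{
    "store": "/home/user/.local/share/scrapyard/",
    "max-retries": 3,
    "request-timeout": 30,
    "script-timeout": 60
}
```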

§Adding feeds

To add feeds, edit feeds.json.

{
    "origin": String, // origin of the feed
    "label": String, // text id of the feed
    "max-length": Number, // maximum number of items allowed in the feed
    "fetch-length": Number, // maximum number of items allowed to be fetched each interval
    "interval": Number, // number of seconds between fetches
    "idle-limit": Number, // number of seconds without requests to that feed before fetching stops
    "sort": Boolean, // to sort by publish date or not
    "extractor": [String], // all command line args to run the extractor, i.e. ["node", "extractor.js"]

    "title": String, // displayed feed title
    "link": String, // displayed feed source url
    "description": String, // displayed feed description
    "fetch": Boolean // should the crate fetch the content, or let the script do it
}

You can also include additional fields in PseudoChannel to overwrite default empty values.
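As a concrete illustration, a feeds.json describing a single feed might look like the following sketch (assuming the file holds an array of such feed objects; all values are illustrative):

```json
[
    {
        "origin": "https://example.com/blog",
        "label": "example-blog",
        "max-length": 100,
        "fetch-length": 10,
        "interval": 3600,
        "idle-limit": 86400,
        "sort": true,
        "extractor": ["node", "extractor.js"],

        "title": "Example Blog",
        "link": "https://example.com/blog",
        "description": "Posts scraped from the example blog",
        "fetch": true
    }
]
```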

§Getting feeds

Referencing the functions under FeedOption, there are two types of fetch functions.

Force fetching always requests a new copy of the feed, ignoring the fetch interval. Lazy fetching only fetches a new copy when the existing copy is out of date. This is particularly relevant when used without the auto-fetch loop.

§Extractor scripts

An extractor script must accept one command line argument and print a single JSON response to stdout; a normal console.log() in JS will do.

The argument specifies a file path; that file contains the arguments for the scraper.

Input file contents:

{
    "url": String, // origin of the info fetched
    "webstr": String?, // response from the url, only if feed.fetch = true
    "preexists": [ PseudoItem ], // don't output these again to avoid duplication
    "lengthLeft": Number // maximum length before the fetch-length quota is met
     
    // plus everything from the feed's entry in feeds.json
}

Expected output:

{
    "items": [PseudoItem], // list of items extracted
    "continuation": String? // optionally continue fetching in the next url
}

Macros§

take_lock

Structs§

FeedOption
Specific scraping options for a single feed
Feeds
Array of feeds to fetch
FetchedMeta
Feed metadata
ItemizerArg
JSON arguments for the scraper script
ItemizerRes
JSON response expected from the scraper script
MasterConfig
Main config file
PseudoCategory
Serde-implemented version of rss::Category
PseudoChannel
Serde-implemented version of rss::Channel
PseudoCloud
Serde-implemented version of rss::Cloud
PseudoEnclosure
Serde-implemented version of rss::Enclosure
PseudoGuid
Serde-implemented version of rss::Guid
PseudoImage
Serde-implemented version of rss::Image
PseudoItem
Serde-implemented version of rss::Item
PseudoItemCache
A vector of PseudoItem for saving as JSON
PseudoSource
Serde-implemented version of rss::Source
PseudoTextInput
Serde-implemented version of rss::TextInput

Enums§

Error

Statics§

IDENT
Self identifier of the crate: scrapyard X.Y.Z (git 123abcd)
LOCKS
Fetch locks to avoid duplicated fetching
MASTER
Holds global master config

Traits§

Saveable
A convenience trait to quickly load and save files

Functions§

init
Initialise all OnceLocks