Module jupiter::repository

The repository is responsible for storing, processing and updating data files used by other parts of Jupiter.

On disk, the repository is simply a local directory. Its contents can be inspected and updated using the commands listed below; updates are performed via HTTP downloads (e.g. from an AWS S3 bucket).

Once the contents of a file change, one or more loaders are invoked to process the updated data. To control which loaders process which file, metadata files are placed in the directory loaders. Each of these is a YAML file which configures a loader.

Example

A loader descriptor could look like this. Note that each loader descriptor has a namespace assigned so that (e.g. when cloning data from git or S3) multiple Jupiter instances can operate on the same data source, even if each of them uses only a part of the datasets.

loader: 'idb-csv'         # Use the CsvLoader
file: '/my-data/test.csv' # Process this file
namespace: 'core'         # Only perform a load if the "core" namespace is enabled in the config
table: 'test'             # Load the contents into the IDB test table.
indices: ['code']         # Create a lookup index for the column code.
fulltextIndices: ['name'] # Create a fulltext search index for the column name.
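The namespace check described above can be sketched in a few lines of Rust. This is only an illustration: `LoaderDescriptor`, `active_descriptors` and `example_descriptor` are hypothetical names that mirror the YAML keys of the example, not Jupiter's internal types, and the set of enabled namespaces is assumed to come from the configuration.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of a loader descriptor. The fields mirror the
/// YAML keys of the example above; Jupiter's own types may look different.
#[derive(Debug, Clone, PartialEq)]
struct LoaderDescriptor {
    loader: String,
    file: String,
    namespace: String,
    table: String,
    indices: Vec<String>,
    fulltext_indices: Vec<String>,
}

/// Returns only those descriptors whose namespace is enabled, so that several
/// instances can share one data source while loading different subsets.
fn active_descriptors<'a>(
    descriptors: &'a [LoaderDescriptor],
    enabled_namespaces: &HashSet<String>,
) -> Vec<&'a LoaderDescriptor> {
    descriptors
        .iter()
        .filter(|descriptor| enabled_namespaces.contains(&descriptor.namespace))
        .collect()
}

/// Builds the descriptor shown in the YAML example.
fn example_descriptor() -> LoaderDescriptor {
    LoaderDescriptor {
        loader: "idb-csv".to_string(),
        file: "/my-data/test.csv".to_string(),
        namespace: "core".to_string(),
        table: "test".to_string(),
        indices: vec!["code".to_string()],
        fulltext_indices: vec!["name".to_string()],
    }
}
```

With only the "core" namespace enabled, a descriptor assigned to any other namespace is skipped, while the example descriptor remains active.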

A set of base loaders is provided to load various data files into one or more Infograph tables:

  • IdbYamlLoader is used to load a given YAML file into an IDB table. This loader can be referenced as idb-yaml.
  • IdbJsonLoader is used to load a given JSON file into an IDB table. This loader can be referenced as idb-json.
  • IdbCsvLoader is used to load a given CSV file into an IDB table. This loader can be referenced as idb-csv.

Additionally a loader is defined to load Infograph Sets:

  • IdbYamlSetLoader is used to load a given YAML file into the appropriate IDB sets. This loader can be referenced as idb-yaml-sets.

Additional loaders can be registered via register_loader.
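Conceptually, registering a loader associates a name (as used in the loader field of a descriptor) with a shared loader instance. The following is a minimal sketch of such a registry, not Jupiter's implementation: the `Loader` trait, `LineCountLoader` and `LoaderRegistry` are hypothetical stand-ins, and only the name `register_loader` and its name-plus-`Arc` argument shape are taken from this page.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a loader. Jupiter's real loader trait differs.
trait Loader: Send + Sync {
    /// Processes the given file contents and returns a short status summary.
    fn load(&self, contents: &str) -> String;
}

/// Toy loader which only counts the lines of the file it is given.
struct LineCountLoader;

impl Loader for LineCountLoader {
    fn load(&self, contents: &str) -> String {
        format!("{} lines", contents.lines().count())
    }
}

/// Maps loader names (as referenced by descriptors) to loader instances.
#[derive(Default)]
struct LoaderRegistry {
    loaders: HashMap<String, Arc<dyn Loader>>,
}

impl LoaderRegistry {
    /// Registers (or replaces) the loader known under the given name.
    fn register_loader(&mut self, name: String, loader: Arc<dyn Loader>) {
        self.loaders.insert(name, loader);
    }

    /// Looks up the loader a descriptor refers to, if it is registered.
    fn find(&self, name: &str) -> Option<Arc<dyn Loader>> {
        self.loaders.get(name).cloned()
    }
}
```

Storing the loaders behind `Arc<dyn Loader>` lets several actors share one instance without copying it, which is presumably why the real `register_loader` call also takes an `Arc`.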

Architecture

The repository is split up into three actors. The foreground actor (frontend) receives all commands and processes them. As it is in charge of handling all incoming commands, none of them may block for a long time. This actor also owns the actual list of files in the repository.

The background actor is in charge of handling all tasks which are potentially long-running. This is mainly the task of downloading files and storing them on disk. To manage concurrency properly, it is also in charge of scanning the repository contents, deleting files, and executing the actual loaders for each file.

The loaders actor is in charge of managing the loaders and detecting which of them are active and which need to be reloaded or unloaded because their underlying data or metadata has changed.
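The split between a non-blocking foreground and a background worker can be illustrated with plain threads and channels. This is a simplified sketch under several assumptions: it uses `std::sync::mpsc` instead of Jupiter's async runtime, it omits the loaders actor, and `BackgroundTask`, `BackgroundEvent`, `Foreground` and `spawn_background` are hypothetical names chosen for this example.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical task handed from the foreground to the background actor.
enum BackgroundTask {
    Store { file: String, contents: String },
    Shutdown,
}

/// Hypothetical event reported back from the background to the foreground.
#[derive(Debug, PartialEq)]
enum BackgroundEvent {
    FileStored { file: String, size: usize },
}

/// Spawns the background actor, which processes tasks one after another.
fn spawn_background(
    events: Sender<BackgroundEvent>,
) -> (Sender<BackgroundTask>, thread::JoinHandle<()>) {
    let (tasks_tx, tasks_rx) = channel::<BackgroundTask>();
    let handle = thread::spawn(move || {
        for task in tasks_rx {
            match task {
                BackgroundTask::Store { file, contents } => {
                    // A real implementation would write the contents to disk here.
                    let size = contents.len();
                    if events.send(BackgroundEvent::FileStored { file, size }).is_err() {
                        break;
                    }
                }
                BackgroundTask::Shutdown => break,
            }
        }
    });
    (tasks_tx, handle)
}

/// The foreground owns the file list and only forwards slow work.
struct Foreground {
    files: Vec<(String, usize)>,
    tasks: Sender<BackgroundTask>,
}

impl Foreground {
    /// Handles a store command without blocking: the work is only queued.
    fn store(&self, file: &str, contents: &str) {
        let task = BackgroundTask::Store {
            file: file.to_string(),
            contents: contents.to_string(),
        };
        self.tasks.send(task).expect("background actor stopped");
    }

    /// Updates the file list once the background reports a result.
    fn handle_event(&mut self, event: BackgroundEvent) {
        match event {
            BackgroundEvent::FileStored { file, size } => {
                self.files.retain(|(name, _)| name != &file);
                self.files.push((file, size));
            }
        }
    }
}
```

Because the file list is only ever modified by the foreground in response to events, no locking is needed around it; the channels serialize all communication between the two sides.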

Commands

  • REPO.SCAN: REPO.SCAN re-scans the repository contents on the local disk. A scan happens automatically at startup, so this command is only required if the contents of the repository are changed by an external process.
  • REPO.FETCH: REPO.FETCH file url instructs the background actor to fetch a file from the given url. Note that the file will only be fetched if it has been modified on the server since it was last fetched.
  • REPO.STORE: REPO.STORE file contents stores the given string contents in a file.
  • REPO.FETCH_FORCED: REPO.FETCH_FORCED file url also fetches the given file, but doesn’t perform any “last modified” checks as REPO.FETCH would.
  • REPO.LIST: REPO.LIST lists all files in the repository. Note that this will yield a more or less human-readable output, whereas REPO.LIST raw will return an array which provides a child array per file containing filename, file size and last modified.
  • REPO.DELETE: REPO.DELETE file deletes the given file from the repository.
  • REPO.INC_EPOCH: REPO.INC_EPOCH immediately increments the epoch counter of the foreground actor and schedules a background task to increment the background epoch. Calling this after some repository tasks have been submitted can be used to determine if all tasks have been handled.
  • REPO.EPOCHS: REPO.EPOCHS reads the foreground and background epoch. By calling first REPO.INC_EPOCH and then REPO.EPOCHS, one can determine if the background actor is currently working (downloading files or performing loader tasks) or if everything is handled. As INC_EPOCH is handled via the background loop, the returned epochs will differ as long as the background actor is processing other tasks. Once the foreground epoch and the background epoch are the same, one can assume that all repository tasks have been handled.
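The epoch mechanism behind REPO.INC_EPOCH and REPO.EPOCHS can be modelled with two counters and a task queue. The sketch below illustrates the idea only and is not Jupiter's code: `Task`, `Epochs`, `spawn_worker`, `inc_epoch` and `is_idle` are hypothetical names, and a task that waits for a signal stands in for a long-running download or loader run.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Arc;
use std::thread;

/// Hypothetical tasks processed in order by the background actor.
enum Task {
    /// Stands in for a long-running task; finishes once the gate is signalled.
    WaitFor(Receiver<()>),
    /// Bumps the background epoch once all earlier tasks are done.
    IncEpoch,
}

/// The two counters reported by an epochs query.
struct Epochs {
    foreground: AtomicU64,
    background: AtomicU64,
}

/// Spawns the background actor working through the task queue.
fn spawn_worker(epochs: Arc<Epochs>) -> (Sender<Task>, thread::JoinHandle<()>) {
    let (tasks_tx, tasks_rx) = channel::<Task>();
    let handle = thread::spawn(move || {
        for task in tasks_rx {
            match task {
                Task::WaitFor(gate) => {
                    let _ = gate.recv();
                }
                Task::IncEpoch => {
                    epochs.background.fetch_add(1, Ordering::SeqCst);
                }
            }
        }
    });
    (tasks_tx, handle)
}

/// Increments the foreground epoch at once and queues the background increment.
fn inc_epoch(epochs: &Epochs, tasks: &Sender<Task>) {
    epochs.foreground.fetch_add(1, Ordering::SeqCst);
    tasks.send(Task::IncEpoch).expect("background actor stopped");
}

/// Both epochs match only after the background has worked off its queue.
fn is_idle(epochs: &Epochs) -> bool {
    epochs.foreground.load(Ordering::SeqCst) == epochs.background.load(Ordering::SeqCst)
}
```

Since the increment is queued behind all previously submitted tasks, the background epoch can only catch up once those tasks have finished, which is what makes comparing the two values a usable "all done" check.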

Testing

As the repository not only relies on the correct behaviour of each actor but also on their orchestration, we provide most tests as integration tests below to ensure that the whole system works as expected.

Examples

In order to use a repository within Jupiter, simply create it, register the desired loaders and install it:


use std::sync::Arc;
// Builder, Server and IdbYamlLoader also have to be imported from their
// respective Jupiter modules.

#[tokio::main]
async fn main() {
    // Create a platform...
    let platform = Builder::new().enable_all().build().await;

    // Create a repository...
    let repository = jupiter::repository::create(&platform);

    // Install Jupiter or custom loaders...
    repository.register_loader("yaml".to_string(), Arc::new(IdbYamlLoader::new(platform.clone())));

    // Install the repository (this has to be done last, as this will perform the initial
    // repository scan, therefore the loaders have to be known)...
    jupiter::repository::install(platform.clone(), repository);

    // Start the main event loop...
    platform.require::<Server>().event_loop();
}

Modules

  • Defines loaders which are in charge of processing repository files.

Structs

  • A repository is the central instance which connects the several actors together.
  • Represents a file within the repository.

Enums

  • Represents events which are sent back from the background worker to the frontend.
  • Events like this will be broadcast by the repository to all listeners which registered themselves via Repository::listener.

Functions

  • create: Creates a new repository and installs some standard loaders.
  • install: Inserts the repository into the given platform and starts all required actors.