fetch-data

Fetch data files from a URL, but only if needed. Verify contents via SHA256.

Fetch-Data checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash.

Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now-local file.

use fetch_data::sample_file;

let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85
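
The check-then-download-then-verify flow described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: fetch_if_needed and Outcome are hypothetical names, and the hash and download steps are passed in as closures (assumed to be SHA256 and an HTTP GET in real use) so that the sketch needs no external crates.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// What a call had to do to produce a valid local file.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// A local copy already existed; nothing was downloaded.
    AlreadyPresent,
    /// The file was missing, so it was downloaded first.
    Downloaded,
}

/// Hypothetical helper: make sure `path` exists locally and matches
/// `expected_hash`. `hash` and `download` are injected stand-ins for
/// SHA256 and an HTTP GET (assumptions of this sketch).
pub fn fetch_if_needed(
    path: &Path,
    expected_hash: &str,
    hash: impl Fn(&[u8]) -> String,
    download: impl FnOnce() -> io::Result<Vec<u8>>,
) -> io::Result<Outcome> {
    let outcome = if path.exists() {
        Outcome::AlreadyPresent
    } else {
        // Only touch the network when there is no local copy.
        fs::write(path, download()?)?;
        Outcome::Downloaded
    };

    // Verify on every call, whether or not a download happened.
    let actual = hash(&fs::read(path)?);
    if actual != expected_hash {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "hash mismatch for {}: expected {}, found {}",
                path.display(),
                expected_hash,
                actual
            ),
        ));
    }
    Ok(outcome)
}
```

Because the download closure runs only when the file is missing, a second call with the same path does no network work but still re-checks the hash.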

Features

  • Thread-safe – allowing it to be used with Rust’s multithreaded testing framework.
  • Inspired by Python’s popular Pooch and our PySnpTools filecache module.
  • Avoids run-times such as Tokio (by using ureq to download files via blocking I/O).

Suggested Usage

You can set up FetchData many ways. Here are the steps – followed by sample code – for one setup.

  • Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)

  • As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:

    • the URL root from which to download the files
    • the name of an environment variable that gives the local data directory in which to store the files
    • a qualifier, organization, and application – used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
  • As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.

use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};

#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
    include_str!("../registry.txt"),
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);

/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, FetchDataError> {
    STATIC_FETCH_DATA.fetch_file(path)
}

You can now use your sample_file function to download your files as needed.
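
To make the registry format from the first step concrete, here is a minimal sketch of reading such a file into a map from file name to hash. It assumes one "name hash" pair per line and skips blank lines; parse_registry is a hypothetical helper written for illustration, and the crate's own parser may accept more (or less) than this.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parse registry text into a map from file name to hash.
/// Assumes one whitespace-delimited `name hash` pair per line.
pub fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    for (index, line) in contents.lines().enumerate() {
        let mut fields = line.split_whitespace();
        match (fields.next(), fields.next(), fields.next()) {
            // Blank line: nothing to record.
            (None, _, _) => continue,
            // Exactly two fields: a file name and its hash.
            (Some(name), Some(hash), None) => {
                map.insert(name.to_string(), hash.to_string());
            }
            // One field, or more than two: report the offending line.
            _ => return Err(format!("line {}: expected `name hash`", index + 1)),
        }
    }
    Ok(map)
}
```

A lookup in the resulting map gives the hash that a downloaded or local file is expected to match.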

Registry Creation

You can create your registry.txt file many ways. Here are the steps – followed by sample code – for one way to create it.

  • Upload your data files to the Internet.
    • For example, Fetch-Data puts its sample data files in tests/data, so they upload to this GitHub folder. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"].
  • As shown below, write code that
    • Creates a FetchData instance without registry contents.
    • Lists the files in your data directory.
    • Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
  • Print this string, then manually paste it into a file called registry.txt.

use fetch_data::{FetchData, dir_to_file_list};

let fetch_data = FetchData::new(
    "", // registry_contents ignored
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");

Notes

  • Feature requests and contributions are welcome.

  • Don’t use our sample sample_file. Define your own sample_file that knows where to find your data files.

  • The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.

  • Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.

  • You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.

  • Additional stand-alone functions can download files and hash files.

  • Fetch-Data always does binary downloads to maintain consistent line endings across OSs.

  • The Bed-Reader genomics crate uses Fetch-Data.

  • To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.

  • Debugging this crate under Windows can cause an “Oops! The debug adapter has terminated abnormally” exception. This is some kind of LLVM, Windows, NVIDIA(?) problem via ureq.

  • This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
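
The note above about FetchData::new never failing describes a deferred-error pattern. A minimal sketch of that pattern follows; LazyFetcher and its fields are hypothetical stand-ins and not the crate's actual types. The constructor stores any setup error, and the first fallible method call returns it.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for a fetcher whose constructor cannot fail.
pub struct LazyFetcher {
    // Either the resolved data directory or the setup error, kept for later.
    data_dir: Result<PathBuf, String>,
}

impl LazyFetcher {
    /// Never fails: a bad configuration is stored instead of returned,
    /// which lets the value be built in a static initializer.
    pub fn new(data_dir: &str) -> Self {
        let data_dir = if data_dir.is_empty() {
            Err("data directory is not configured".to_string())
        } else {
            Ok(PathBuf::from(data_dir))
        };
        LazyFetcher { data_dir }
    }

    /// Fallible methods surface the stored setup error before doing any work.
    pub fn fetch_file(&self, name: &str) -> Result<PathBuf, String> {
        let dir = self.data_dir.as_ref().map_err(|e| e.clone())?;
        Ok(dir.join(name))
    }
}
```

The trade-off is that a misconfiguration is reported at first use rather than at construction, which is what makes the pattern suitable for a global static.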

Structs

Used to fetch data files from a URL, if needed. It verifies file contents via a hash.

Used to create temporary directories.

Enums

All possible errors returned by this crate and the crates it depends on.

All errors specific to this crate.

Functions

List all the files in a local directory.

Download a file from a URL.

If necessary, retrieve a file from a URL, checking its hash.

Download a file from a URL and compute its hash.

Compute the hash (SHA256) of a local file.

A sample sample_file. Don’t use this. Instead, define your own sample_file function that knows how to fetch your data files.

Attribute Macros

Used to construct a global FetchData instance.