§fetch-data
Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility.
Fetch-Data checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash.
Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file.
use fetch_data::sample_file;
let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85
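The check-then-download behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `fetch_if_needed` is a hypothetical name, and the `hash_file` and `download` parameters are stand-ins for whatever hashing and downloading routines the caller supplies. The sketch also assumes that a local file with the wrong hash is reported as an error.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical sketch: download `path` only if it is missing, then verify it.
/// Returns `Ok(true)` if a download happened and `Ok(false)` if the local
/// copy was reused. `hash_file` and `download` are caller-supplied stand-ins.
pub fn fetch_if_needed<H, D>(
    path: &Path,
    expected_hash: &str,
    hash_file: H,
    download: D,
) -> io::Result<bool>
where
    H: Fn(&Path) -> io::Result<String>,
    D: FnOnce(&Path) -> io::Result<()>,
{
    let downloaded = if path.exists() {
        false
    } else {
        // Make sure the local data directory exists before downloading.
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        download(path)?;
        true
    };

    // Verify the hash whether the file was already local or just downloaded.
    let actual = hash_file(path)?;
    if actual.eq_ignore_ascii_case(expected_hash) {
        Ok(downloaded)
    } else {
        Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "hash mismatch for {}: expected {expected_hash}, found {actual}",
                path.display()
            ),
        ))
    }
}
```

Passing the hashing and downloading steps as closures keeps the sketch free of external dependencies and makes the control flow easy to exercise in a test.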
§Features
- Thread-safe – allowing it to be used with Rust’s multithreaded testing framework.
- Inspired by Python’s popular Pooch and our PySnpTools filecache module.
- Avoids run-times such as Tokio (by using ureq to download files via blocking I/O).
§Suggested Usage
You can set up FetchData in many ways. Here are the steps – followed by sample code – for one setup.
- Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)
- As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:
  - the URL root from which to download the files
  - an environment variable naming the local data directory in which to store the files
  - a qualifier, organization, and application – used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
- As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.
use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};
#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
    include_str!("../registry.txt"),
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);
/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, Box<FetchDataError>> {
    STATIC_FETCH_DATA.fetch_file(path)
}
You can now use your sample_file function to download your files as needed.
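The choice of local data directory described in the steps above (environment variable first, platform default otherwise) might look roughly like the following. This is only a sketch: `resolve_data_dir` is a hypothetical helper, and the `fallback` argument stands in for the directory that the crate derives from the qualifier, organization, and application.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical sketch: prefer the directory named by the environment
/// variable `env_key`; otherwise use `fallback` (a stand-in for the
/// platform-specific project directory).
pub fn resolve_data_dir(env_key: &str, fallback: PathBuf) -> PathBuf {
    match env::var_os(env_key) {
        // A set, non-empty variable wins.
        Some(value) if !value.is_empty() => PathBuf::from(value),
        // Unset or empty: fall back to the default location.
        _ => fallback,
    }
}
```

Under this sketch, a call such as `resolve_data_dir("BAR_APP_DATA_DIR", default_dir)` returns `default_dir` unless the variable is set, which lets users or test runs redirect the cache without code changes.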
§Registry Creation
You can create your registry.txt file many ways. Here are the steps – followed by sample code – for one way to create it.
- Upload your data files to the Internet.
  - For example, Fetch-Data puts its sample data files in tests/data, so they upload to this GitHub folder. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"]
- As shown below, write code that
  - Creates a FetchData instance without registry contents.
  - Lists the files in your data directory.
  - Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
- Print this string, then manually paste it into a file called registry.txt.
use fetch_data::{FetchData, dir_to_file_list};
let fetch_data = FetchData::new(
    "", // registry_contents ignored
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");
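To make the registry format concrete, here is a small sketch of how a whitespace-delimited list of file names and hashes could be read back. It assumes one `<file name> <hash>` pair per line and skips blank lines. `parse_registry` is a hypothetical function written for illustration, and the crate's own parser may accept or reject different inputs.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: parse registry text into a map from file name to
/// lower-cased hash. Assumes one `<file name> <hash>` pair per line.
pub fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    for (index, line) in contents.lines().enumerate() {
        let mut fields = line.split_whitespace();
        let (name, hash) = match (fields.next(), fields.next(), fields.next()) {
            // Blank line: nothing to record.
            (None, _, _) => continue,
            // Exactly two fields: a file name and its hash.
            (Some(name), Some(hash), None) => (name, hash),
            // One field, or more than two: reject the line.
            _ => return Err(format!("line {}: expected `<file name> <hash>`", index + 1)),
        };
        if map.insert(name.to_string(), hash.to_lowercase()).is_some() {
            return Err(format!("line {}: duplicate entry for {name}", index + 1));
        }
    }
    Ok(map)
}
```

Storing hashes in lower case means a later comparison does not depend on how the hex digits were capitalized when the registry was written.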
§Notes
- Feature requests and contributions are welcome.
- Don’t use our sample sample_file. Define your own sample_file that knows where to find your data files.
- The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.
- Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.
- You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.
- Additional stand-alone functions can download files and hash files.
- Fetch-Data always does binary downloads to maintain consistent line endings across OSs.
- The Bed-Reader genomics crate uses Fetch-Data.
- To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.
- Debugging this crate under Windows can cause an “Oops! The debug adapter has terminated abnormally” exception. This is some kind of LLVM, Windows, NVIDIA(?) problem via ureq.
- This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
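The note above about FetchData::new never failing describes a deferred-error pattern. The following sketch shows the general idea with a hypothetical `LazyFetcher` type. It is not the crate's code, and the "empty directory" check is an invented example of a construction-time problem.

```rust
use std::path::PathBuf;

/// Hypothetical sketch of a constructor that never fails.
pub struct LazyFetcher {
    // Either the usable state or the error captured during construction.
    state: Result<PathBuf, String>,
}

impl LazyFetcher {
    /// Always returns a value; a bad argument is stored, not returned.
    pub fn new(data_dir: &str) -> Self {
        let state = if data_dir.is_empty() {
            Err("data directory must not be empty".to_string())
        } else {
            Ok(PathBuf::from(data_dir))
        };
        Self { state }
    }

    /// The stored construction error, if any, surfaces on first real use.
    pub fn file_path(&self, name: &str) -> Result<PathBuf, String> {
        match &self.state {
            Ok(dir) => Ok(dir.join(name)),
            Err(message) => Err(message.clone()),
        }
    }
}
```

Because construction cannot fail, such a value can be built in a static initializer, and callers still see the original problem through the ordinary Result of the methods they call.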
§Project Links
Structs§
- FetchData - Used to fetch data files from a URL, if needed. It verifies file contents via a hash.
Enums§
- FetchDataError - All possible errors returned by this crate and the crates it depends on.
- FetchDataSpecificError - All errors specific to this crate.
Functions§
- dir_to_file_list - List all the files in a local directory.
- download - Download a file from a URL.
- fetch - If necessary, retrieve a file from a URL, checking its hash.
- hash_download - Download a file from a URL and compute its hash.
- hash_file - Compute the hash (SHA256) of a local file.
- sample_file - A sample sample_file. Don’t use this. Instead, define your own sample_file function that knows how to fetch your data files.
Attribute Macros§
- ctor - Used to construct global FetchData instance.