fetch-data
Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility.
Fetch-Data checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash.
Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file.
use fetch_data::sample_file;

let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85
# Ok::<(), Box<dyn std::error::Error>>(())
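The check-then-download behavior described above can be sketched with the standard library alone. This is a minimal illustration, not Fetch-Data's implementation: `fetch_if_needed` is a hypothetical name, the hashing and downloading steps are passed in as closures, and treating a hash mismatch on an existing file as an error is an assumption.

```rust
use std::io;
use std::path::Path;

/// Hypothetical helper (not part of Fetch-Data's API).
/// Returns `Ok(true)` if a download happened, `Ok(false)` if the local copy was reused.
pub fn fetch_if_needed<H, D>(
    local_path: &Path,
    expected_hash: &str,
    hash_file: H,
    download: D,
) -> io::Result<bool>
where
    H: Fn(&Path) -> io::Result<String>,
    D: FnOnce(&Path) -> io::Result<()>,
{
    // Download only when the file is not already in the local data directory.
    let downloaded = if local_path.exists() {
        false
    } else {
        download(local_path)?;
        true
    };

    // Verify the file whether it was just downloaded or was already present.
    let actual_hash = hash_file(local_path)?;
    if !actual_hash.eq_ignore_ascii_case(expected_hash) {
        // Assumption: a mismatch is reported as an error rather than retried.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("hash mismatch for {}", local_path.display()),
        ));
    }
    Ok(downloaded)
}
```

Passing the hash and download steps as closures keeps the sketch free of dependencies; a real caller would supply a SHA256 routine and an HTTP download.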
Features
- Thread-safe -- allowing it to be used with Rust's multithreaded testing framework.
- Inspired by Python's popular Pooch and our PySnpTools filecache module.
- Avoids run-times such as Tokio (by using ureq to download files via blocking I/O).
Suggested Usage
You can set up FetchData many ways. Here are the steps -- followed by sample code -- for one setup.
- Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)
- As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:
  - the URL root from which to download the files
  - the name of an environment variable that gives the local data directory in which to store the files
  - a qualifier, organization, and application -- used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
- As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.
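use fetch_data::{FetchData, FetchDataError};
use std::path::{Path, PathBuf};

static STATIC_FETCH_DATA: FetchData = FetchData::new(
    include_str!("../registry.txt"), "https://example.com/data/", // registry contents; placeholder URL root
    "BAR_APP_DATA_DIR", "com", "Foo Corp", "Bar App", // placeholder env var, qualifier, organization, application
);

/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, Box<FetchDataError>> {
    STATIC_FETCH_DATA.fetch_file(path)
}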
You can now use your sample_file function to download your files as needed.
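To make the registry format concrete, here is a minimal sketch of reading such a file. It assumes one `<file> <hash>` pair per line, as Pooch writes them; `parse_registry` is a hypothetical function for illustration and is not how Fetch-Data itself parses the registry.

```rust
use std::collections::HashMap;

/// Hypothetical parser (not Fetch-Data's own code).
/// Maps each file name to its lowercase hash string.
pub fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut registry = HashMap::new();
    for (index, line) in contents.lines().enumerate() {
        let mut fields = line.split_whitespace();
        match (fields.next(), fields.next(), fields.next()) {
            // Blank line: nothing to record.
            (None, _, _) => continue,
            // Exactly two fields: the file name, then its hash.
            (Some(name), Some(hash), None) => {
                registry.insert(name.to_string(), hash.to_lowercase());
            }
            // Anything else is treated as malformed in this sketch.
            _ => return Err(format!("line {}: expected `<file> <hash>`", index + 1)),
        }
    }
    Ok(registry)
}
```

Storing hashes in lowercase lets a later comparison ignore the case in which the registry was written.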
Registry Creation
You can create your registry.txt file many ways. Here are the steps -- followed by sample code -- for one way to create it.
- Upload your data files to the Internet.
  - For example, Fetch-Data puts its sample data files in tests/data, so they upload to this GitHub folder. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"]
- As shown below, write code that
  - Creates a FetchData instance without registry contents.
  - Lists the files in your data directory.
  - Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
- Print this string, then manually paste it into a file called registry.txt.
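use fetch_data::{dir_to_file_list, FetchData};

let fetch_data = FetchData::new(
    "", // registry contents (not needed to generate a registry)
    "https://example.com/data/", // placeholder: root URL of your uploaded files
    "BAR_APP_DATA_DIR", "com", "Foo Corp", "Bar App", // placeholder env var, qualifier, organization, application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");
# use fetch_data::FetchDataError; // '#' needed for doctest
# Ok::<(), Box<FetchDataError>>(())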
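For readers curious about what the listing and formatting steps involve, the following standard-library sketch shows one way to do them. Both `list_file_names` and `format_registry` are hypothetical stand-ins; the crate's own functions may behave differently (for example, in how they treat subdirectories), and the hashes are assumed to be computed elsewhere.

```rust
use std::ffi::OsString;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for a directory-listing helper.
pub fn list_file_names(dir: &Path) -> io::Result<Vec<OsString>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // Keep regular files only; subdirectories are skipped.
        if entry.file_type()?.is_file() {
            names.push(entry.file_name());
        }
    }
    // read_dir order is platform-dependent, so sort for a stable result.
    names.sort();
    Ok(names)
}

/// Hypothetical formatter: one "<file> <hash>" line per entry.
pub fn format_registry(entries: &[(String, String)]) -> String {
    entries
        .iter()
        .map(|(name, hash)| format!("{} {}\n", name, hash))
        .collect()
}
```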
Notes
- Feature requests and contributions are welcome.
- Don't use our sample sample_file. Define your own sample_file that knows where to find your data files.
- The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.
- Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.
- You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.
- Additional stand-alone functions can download files and hash files.
- Fetch-Data always does binary downloads to maintain consistent line endings across OSs.
- The Bed-Reader genomics crate uses Fetch-Data.
- To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.
- Debugging this crate under Windows can cause an "Oops! The debug adapter has terminated abnormally" exception. This is some kind of LLVM, Windows, NVIDIA(?) problem via ureq.
- This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
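The note above about FetchData::new never failing describes a general pattern: a constructor that records its error and reports it on first use. A minimal sketch of that pattern follows. `LazyData` and its methods are hypothetical and use only the standard library; they are not Fetch-Data's actual types.

```rust
use std::path::PathBuf;

/// Hypothetical type illustrating the pattern; not Fetch-Data's implementation.
pub struct LazyData {
    // Either the usable root directory or the error hit during construction.
    root: Result<PathBuf, String>,
}

impl LazyData {
    /// Never fails: a bad `root` is remembered rather than returned.
    pub fn new(root: &str) -> Self {
        let root = if root.is_empty() {
            Err("data directory must not be empty".to_string())
        } else {
            Ok(PathBuf::from(root))
        };
        LazyData { root }
    }

    /// Any stored construction error surfaces here, when the value is used.
    pub fn path_for(&self, file_name: &str) -> Result<PathBuf, String> {
        match &self.root {
            Ok(dir) => Ok(dir.join(file_name)),
            Err(message) => Err(message.clone()),
        }
    }
}
```

Because `new` returns the value itself rather than a `Result`, it can be used where no error handling is possible, such as the initializer of a global.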