Crate fetch_data
fetch-data
Fetch data files from a URL, but only if needed. Verify contents via SHA256.
Fetch-Data checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash.
Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file.
use fetch_data::sample_file;
let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85
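The check-then-download behavior described above can be sketched with the standard library alone. This is an illustration of the idea, not the crate's implementation: `fetch_if_needed` and `toy_hash` are hypothetical names, the download is passed in as a closure instead of a real HTTP call, and a toy FNV-1a hash stands in for SHA256 (which `std` does not provide).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Toy 64-bit FNV-1a hash, standing in for SHA256 (an assumption for this sketch).
fn toy_hash(bytes: &[u8]) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{h:016x}")
}

/// Hypothetical helper: make sure `path` exists with the expected hash,
/// calling `download` only when the file is missing.
/// Returns `Ok(true)` if a download happened.
fn fetch_if_needed<D>(path: &Path, expected: &str, download: D) -> io::Result<bool>
where
    D: FnOnce() -> io::Result<Vec<u8>>,
{
    let downloaded = if path.exists() {
        false
    } else {
        fs::write(path, download()?)?;
        true
    };
    // Verify the file whether it was already local or just downloaded.
    if toy_hash(&fs::read(path)?) != expected {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "hash mismatch"));
    }
    Ok(downloaded)
}
```

Calling `fetch_if_needed` a second time with the same path skips the download closure but still re-checks the hash, which mirrors the "always verifies" behavior stated above.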
Features
- Thread-safe – allowing it to be used with Rust’s multithreaded testing framework.
- Inspired by Python’s popular Pooch and our PySnpTools filecache module.
- Avoids runtimes such as Tokio (by using ureq to download files via blocking I/O).
Suggested Usage
You can set up FetchData many ways. Here are the steps, followed by sample code, for one setup.
- Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)
- As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:
  - the URL root from which to download the files
  - an environment variable telling the local data directory in which to store the files
  - a qualifier, organization, and application, used to create a local data directory when the environment variable is not set. See crate ProjectsDir for details.
- As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.
use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};
#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
include_str!("../registry.txt"),
"https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
"BAR_APP_DATA_DIR", // env_key
"com", // qualifier
"Foo Corp", // organization
"Bar App", // application
);
/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, FetchDataError> {
STATIC_FETCH_DATA.fetch_file(path)
}
You can now use your sample_file function to download your files as needed.
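The rule for choosing the local data directory (environment variable first, default directory otherwise) can be sketched as follows. The helper names `choose_dir` and `resolve_data_dir` are hypothetical, and the default directory is passed in directly; how the crate derives it from the qualifier, organization, and application is not shown here.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper: prefer the environment variable's value when it is
/// present and non-empty; otherwise use the default directory.
fn choose_dir(env_value: Option<OsString>, default_dir: PathBuf) -> PathBuf {
    match env_value {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => default_dir,
    }
}

/// Hypothetical wrapper that reads the variable named by `env_key`.
fn resolve_data_dir(env_key: &str, default_dir: PathBuf) -> PathBuf {
    choose_dir(env::var_os(env_key), default_dir)
}
```

With this sketch, `resolve_data_dir("BAR_APP_DATA_DIR", default)` returns the variable's value when it is set and `default` otherwise. Treating an empty value as unset is a choice made for this sketch, not documented behavior of the crate.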
Registry Creation
You can create your registry.txt file many ways. Here are the steps, followed by sample code, for one way to create it.
- Upload your data files to the Internet.
  - For example, Fetch-Data puts its sample data files in tests/data, so they upload to this GitHub folder. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"]
- As shown below, write code that
  - Creates a FetchData instance without registry contents.
  - Lists the files in your data directory.
  - Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
- Print this string, then manually paste it into a file called registry.txt.
use fetch_data::{FetchData, dir_to_file_list};
let fetch_data = FetchData::new(
"", // registry_contents ignored
"https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
"BAR_APP_DATA_DIR", // env_key
"com", // qualifier
"Foo Corp", // organization
"Bar App", // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");
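To make the registry format concrete, here is a minimal sketch of a parser for such text. It assumes one entry per line with the file name first and the hash second, as in Pooch; `parse_registry` is a hypothetical function, not part of this crate, and it does not handle file names containing spaces.

```rust
use std::collections::HashMap;

/// Hypothetical parser for Pooch-style registry text: one entry per line,
/// file name first, then its hash, separated by whitespace.
fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    for (i, line) in contents.lines().enumerate() {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next(), parts.next()) {
            // Blank line: nothing to record.
            (None, _, _) => continue,
            // Exactly two fields: store the hash in lowercase for easy comparison.
            (Some(name), Some(hash), None) => {
                map.insert(name.to_string(), hash.to_lowercase());
            }
            // One field or more than two: report the 1-based line number.
            _ => return Err(format!("line {}: expected `<file> <hash>`", i + 1)),
        }
    }
    Ok(map)
}
```

Looking up a file name in the returned map gives the hash that a downloaded copy would be checked against.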
Notes
- Feature requests and contributions are welcome.
- Don’t use our sample sample_file. Define your own sample_file that knows where to find your data files.
- The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.
- Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.
- You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.
- Additional stand-alone functions can download files and hash files.
- Fetch-Data always does binary downloads to maintain consistent line endings across OSs.
- The Bed-Reader genomics crate uses Fetch-Data.
- To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.
- Debugging this crate under Windows can cause an “Oops! The debug adapter has terminated abnormally” exception. This is some kind of LLVM, Windows, NVIDIA(?) problem via ureq.
- This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
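The note above about FetchData::new never failing describes a general pattern: the constructor stores any error and the first real operation returns it. A minimal sketch of that pattern follows. `LazyFetcher` is a hypothetical stand-in written for illustration; it does not reflect the crate's internal fields or error types.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for a fetcher whose constructor cannot fail.
struct LazyFetcher {
    // Either the resolved data directory or the error found during construction.
    state: Result<PathBuf, String>,
}

impl LazyFetcher {
    /// Never fails: a bad configuration is stored rather than returned.
    fn new(data_dir: &str) -> Self {
        let state = if data_dir.is_empty() {
            Err("data directory must not be empty".to_string())
        } else {
            Ok(PathBuf::from(data_dir))
        };
        LazyFetcher { state }
    }

    /// The stored error, if any, surfaces here on first use.
    fn fetch_file(&self, name: &str) -> Result<PathBuf, String> {
        let dir = self.state.as_ref().map_err(|e| e.clone())?;
        Ok(dir.join(name))
    }
}
```

Because `new` returns a plain value instead of a `Result`, it can be used where an initializer has nowhere to propagate an error, such as a static global; callers still see the problem as an ordinary `Err` from `fetch_file`.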
Project Links
Structs
Used to fetch data files from a URL, if needed. It verifies file contents via a hash.
Used to create temporary directories.
Enums
All possible errors returned by this crate and the crates it depends on.
All errors specific to this crate.
Functions
List all the files in a local directory.
Download a file from a URL.
If necessary, retrieve a file from a URL, checking its hash.
Download a file from a URL and compute its hash.
Compute the hash (SHA256) of a local file.
A sample sample_file. Don’t use this. Instead, define your own sample_file function that knows how to fetch your data files.
Attribute Macros
Used to construct a global FetchData instance.