This crate provides a simple way to download files. In particular this crate aims to make it easy to download and cache files that do not change over time, for example reference image files, ML models, example audio files or common password lists.
§Downloading a file
For example, to download the plaintext version of RFC 2068, construct a
DownloadRequest with the URL and SHA-256 checksum and pass it to the
get function.
If you know that the file has already been downloaded, you can use get_cached.
use data_downloader::{get, get_cached, DownloadRequest};
// Define where to get the file from
let rfc_link = &DownloadRequest {
url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
sha256_hash: &hex_literal::hex!(
"D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
),
};
// Get the binary contents of the file
let rfc: Vec<u8> = get(rfc_link)?;
// Convert the file to a String
let as_text = String::from_utf8(rfc)?;
assert!(as_text.contains("The Hypertext Transfer Protocol (HTTP) is an application-level"));
assert!(as_text.contains("protocol for distributed, collaborative, hypermedia information"));
assert!(as_text.contains("systems."));
// Get the binary contents of the file directly from disk
let rfc: Vec<u8> = get_cached(rfc_link)?;
get_path can be used to get a PathBuf to the file.
One of the design goals of this crate is to verify the integrity of the
downloaded files, so the SHA-256 checksum of every download is checked.
If a file is loaded from the cache on disk, its SHA-256 checksum is also
verified. For get_path, however, the checksum is not verified, because even
if it were, you would still be exposed to a time-of-check to time-of-use
(TOCTOU) race.
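The reasoning about get_path can be illustrated with a small sketch. When the bytes are read once and the check runs on those same in-memory bytes, the data that was verified is the data that gets used. A path cannot give that guarantee, because the file may change between the check and a later open. The helper read_verified and its verify closure are hypothetical and not part of this crate. A real check would compare a SHA-256 digest; here a simple stand-in predicate is used to keep the sketch free of dependencies.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not part of data_downloader): read the file once,
/// then verify exactly the bytes that are returned to the caller.
fn read_verified(path: &Path, verify: impl Fn(&[u8]) -> bool) -> io::Result<Vec<u8>> {
    let bytes = fs::read(path)?;
    if verify(&bytes) {
        Ok(bytes)
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"))
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("read_verified_example.txt");
    fs::write(&path, b"example contents")?;

    // Stand-in predicate; a real implementation would compare a SHA-256 digest.
    let bytes = read_verified(&path, |b| b.starts_with(b"example"))?;
    println!("verified {} bytes", bytes.len());

    fs::remove_file(&path)?;
    Ok(())
}
```

Because the verified buffer is the one handed back, nothing can modify the data between the check and its use. Code that only receives a path has to reopen the file and loses this property.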
The get, get_cached and get_path functions use a default
directory to cache the downloads. This allows multiple applications to share
their cached downloads. If you need more configurability, you can use
DownloaderBuilder and set the storage directory manually using
DownloaderBuilder::storage_dir. The default storage directory is a
platform-specific cache directory, or a platform-specific temporary directory
if the cache directory is not available.
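The fallback described above can be sketched with the standard library alone. This is an illustration of the idea, not the crate's actual code: the function choose_storage_dir and the subdirectory name are hypothetical, and the optional cache directory is passed in as a parameter (in practice it could come from something like dirs::cache_dir(), which returns an Option).

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper: prefer the platform cache directory and fall back
/// to the platform temporary directory when no cache directory is available.
/// `app_dir_name` is an assumed subdirectory name used only for illustration.
fn choose_storage_dir(cache_dir: Option<PathBuf>, app_dir_name: &str) -> PathBuf {
    let base = cache_dir.unwrap_or_else(env::temp_dir);
    base.join(app_dir_name)
}

fn main() {
    // A cache directory is available: it is used as the base.
    let with_cache = choose_storage_dir(Some(PathBuf::from("/home/user/.cache")), "example_downloads");
    println!("{}", with_cache.display());

    // No cache directory: the temporary directory is used instead.
    let without_cache = choose_storage_dir(None, "example_downloads");
    println!("{}", without_cache.display());
}
```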
§Included DownloadRequests
The files module contains some predefined DownloadRequests for your
convenience.
§Pitfalls
When manually changing a DownloadRequest, the SHA-256 sum
needs to be changed too. If it is not, the result is a
DownloadRequest that looks as if it downloads a specific file, but whose
download can never succeed because of the checksum mismatch. Worse,
the wrong file can be loaded from the cache. For example, here the
DownloadRequest above was changed but only the url was adapted. Since
sha256_hash is not set to the correct value, this will
return rfc2068.txt from the cache. This is a user error: the developer
has to ensure that they specify the correct SHA-256 checksum for a
DownloadRequest.
let rfc7168 = &DownloadRequest {
url: "https://www.rfc-editor.org/rfc/rfc7168.txt",
sha256_hash: &hex_literal::hex!(
"D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
),
};
let rfc2068 = &DownloadRequest {
url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
sha256_hash: &hex_literal::hex!(
"D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
),
};
assert_eq!(get(rfc7168)?, get(rfc2068)?);
§ZIP Support
When the zip feature of this crate is enabled, InZipDownloadRequest
becomes available and can be used to download files contained in ZIP
archives.
let request = InZipDownloadRequest {
parent: &DownloadRequest {
url: "https://github.com/tillarnold/data_downloader/archive/refs/tags/v0.1.0.zip",
sha256_hash: &hex_literal::hex!(
"3A1929ABF26E7E03A93206D1D068B36C3F7182F304CF317FD71766113DDA5C4E"
),
},
path: "data_downloader-0.1.0/src/files/ml/whisper_cpp.rs",
sha256_hash: &hex_literal::hex!(
"a6e18802876c198b9b99c33ce932890e57f01e0eab9ec19ac8ab2908025d1ae2"
),
};
let result = get(&request).unwrap();
let str = String::from_utf8(result).unwrap();
println!("{}", str);
This example downloads an old version of this crate’s source code from GitHub as a ZIP file and extracts an individual source file from it.
§Status of this crate
This is an early release. As such breaking changes are expected at some point. There are also some implementation limitations including but not limited to:
- The downloading is rather primitive. Failed downloads are simply retried a fixed number of times and no continuation of interrupted downloads is implemented.
- Only one URL is used per DownloadRequest; it’s not currently possible to specify multiple possible locations for a file.
- The crate uses blocking IO. As such, there is currently no WASM support.
Contributions to improve this are welcome.
Nevertheless this crate should be suitable for simple use cases.
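The retry behaviour mentioned in the limitations can be pictured with a minimal sketch. The function retry_fixed is hypothetical and is not taken from this crate's source. It only shows the general shape of retrying an operation a fixed number of times, with no backoff and no resumption of partial downloads.

```rust
/// Hypothetical helper: run `attempt` up to `max_attempts` times and return
/// the first success, or the last error if every attempt fails.
/// The closure receives the 1-based attempt number.
fn retry_fixed<T, E>(
    max_attempts: u32,
    mut attempt: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "at least one attempt is required");
    let mut last_err = None;
    for n in 1..=max_attempts {
        match attempt(n) {
            Ok(value) => return Ok(value),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("loop ran at least once"))
}

fn main() {
    // Simulated download that fails twice and succeeds on the third attempt.
    let mut calls = 0;
    let result: Result<&str, String> = retry_fixed(3, |n| {
        calls += 1;
        if n < 3 {
            Err(format!("attempt {n} failed"))
        } else {
            Ok("file contents")
        }
    });
    println!("{result:?} after {calls} calls");
}
```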
§Dependencies
This crate uses the following dependencies:
- dirs to find platform-specific temporary and cache directories
  - Implementing this manually would only cause incompatibilities.
- reqwest to issue HTTP requests
  - An HTTP library is definitely required to allow this crate to download files. reqwest is widely used in the Rust community; it is, however, a rather big dependency as it is very fully featured. It might be worth investigating smaller HTTP client libraries in the future.
- sha2 to hash files
  - To ensure the integrity of the files, a collision-resistant cryptographic hash function is required. SHA-256 is generally considered the standard for such a use case. The sha2 crate by the RustCrypto organization is the de facto standard implementation of SHA-2 for Rust.
- hex-literal to conveniently specify the SHA-256 sums
  - Technically this dependency could be removed if we specified the SHA-256 in the predefined DownloadRequest directly as &[u8] slice literals. However, the library is maintained by the RustCrypto organization and as such can be regarded as trustworthy.
- thiserror to conveniently derive Error
  - This library is also very widely used and maintained by David Tolnay, a highly regarded member of the Rust community. Once data_downloader has sufficiently matured, it might be a good idea to stop using thiserror and instead directly use the generated implementations in the code. This would potentially reduce build times. This has low priority, however, especially while the crate::Error type is still changing frequently.
- zip to unzip zip files (only enabled with the zip feature)
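To make the role of hex-literal concrete, the following sketch shows what a hexadecimal SHA-256 string looks like once it is turned into raw bytes. The function decode_sha256_hex is hypothetical and works at runtime, whereas the hex! macro does the conversion at compile time. It is meant only to illustrate the byte array that ends up in sha256_hash, not how the crate or the macro is implemented.

```rust
/// Hypothetical helper: decode a 64-character hexadecimal string into the
/// 32 raw bytes of a SHA-256 digest. Returns None on wrong length or on
/// any character that is not a hexadecimal digit.
fn decode_sha256_hex(hex: &str) -> Option<[u8; 32]> {
    let bytes = hex.as_bytes();
    if bytes.len() != 64 {
        return None;
    }
    let mut out = [0u8; 32];
    for (i, pair) in bytes.chunks_exact(2).enumerate() {
        let hi = (pair[0] as char).to_digit(16)?;
        let lo = (pair[1] as char).to_digit(16)?;
        out[i] = (hi * 16 + lo) as u8;
    }
    Some(out)
}

fn main() {
    // The checksum of rfc2068.txt used in the examples of this documentation.
    let hash = decode_sha256_hex(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF",
    )
    .expect("valid hex");
    assert_eq!(hash[0], 0xD6);
    assert_eq!(hash[31], 0xCF);
    println!("{hash:02X?}");
}
```

Writing the 32 bytes out by hand as a slice literal would give the same value, but the hexadecimal form can be copied directly from the output of common checksum tools.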
Modules§
- files: DownloadRequests for useful files
Structs§
- DownloadRequest: A file to be downloaded
- Downloadable: A thing that can be downloaded
- Downloader: Configurable Downloader
- DownloaderBuilder: A builder for constructing a Downloader
- HashMismatch: A hash was not as expected
Enums§
- Error: Error type for data_downloader
Functions§
- get: Get the file contents and if the file has not yet been downloaded, download it.
- get_cached: Get the file contents and fail with an IO error if the file is not yet downloaded
- get_path: Computes the full path to the file and if the file has not yet been downloaded, download it.