This crate provides a simple way to download files. In particular this crate aims to make it easy to download and cache files that do not change over time, for example reference image files, ML models, example audio files or common password lists.
§Downloading a file
As an example: to download the plaintext version of RFC 2068 you construct a DownloadRequest with the URL and SHA-256 checksum and then use the get function. If you know that the file was already downloaded you can use get_cached.
use data_downloader::{get, get_cached, DownloadRequest};

// Define where to get the file from
let rfc_link = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

// Get the binary contents of the file
let rfc: Vec<u8> = get(rfc_link)?;

// Convert the file to a String
let as_text = String::from_utf8(rfc)?;
assert!(as_text.contains("The Hypertext Transfer Protocol (HTTP) is an application-level"));
assert!(as_text.contains("protocol for distributed, collaborative, hypermedia information"));
assert!(as_text.contains("systems."));

// Get the binary contents of the file directly from disk
let rfc: Vec<u8> = get_cached(rfc_link)?;
get_path can be used to get a PathBuf to the file.
One of the design goals of this crate is to verify the integrity of the downloaded files; as such, the SHA-256 checksum of every download is checked. If a file is loaded from the cache on disk, the SHA-256 checksum is verified as well. For get_path, however, the checksum is not verified, because even if it were you would still be exposed to a time-of-check to time-of-use (TOCTOU) race: the file could change between the check and your later read of the path.
The get, get_cached and get_path functions use a default directory to cache the downloads. This allows multiple applications to share their cached downloads. If you need more configurability you can use DownloaderBuilder and set the storage directory manually using DownloaderBuilder::storage_dir. The default storage directory is a platform specific cache directory, or a platform specific temporary directory if the cache directory is not available.
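The fallback described above can be sketched with the standard library alone. This is a minimal sketch and not the crate's implementation: resolve_storage_dir and the data_downloader subdirectory name are hypothetical, and the platform cache directory is passed in as an Option (in the crate it would presumably be looked up through dirs).

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper: use the cache directory if one is known,
/// otherwise fall back to the platform temporary directory.
fn resolve_storage_dir(cache_dir: Option<PathBuf>) -> PathBuf {
    let base = cache_dir.unwrap_or_else(env::temp_dir);
    // The subdirectory name is an assumption made for this sketch.
    base.join("data_downloader")
}
```

Passing None models a platform without a usable cache directory; the result then lives below std::env::temp_dir().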
§Included DownloadRequests
The files module contains some predefined DownloadRequests for your convenience.
§Pitfalls
When you manually change a DownloadRequest, for example its url, the SHA-256 checksum must be changed as well. If it is not, the result is a DownloadRequest that looks as if it downloads a specific file, but the download can never succeed because of the checksum mismatch, while a different file with the stale checksum can still be loaded from the cache. In the example below the DownloadRequest from above was changed, but only the url was adapted. Since sha256_hash is not set to the correct value, this returns rfc2068.txt from the cache when that file has been downloaded before. This is a user error: the developer has to ensure that they specify the correct SHA-256 checksum for a DownloadRequest.
let rfc7168 = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc7168.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

let rfc2068 = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

assert_eq!(get(rfc7168)?, get(rfc2068)?);
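This behaviour is what one would expect if cached files are looked up by checksum alone. The following toy in-memory model illustrates that idea. It is an assumption about the mechanism, not the crate's code: Request, ToyCache and their methods are hypothetical stand-ins, and no hashing or networking takes place.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a download request.
struct Request {
    url: &'static str,
    hash: &'static str,
}

/// Toy cache that, by assumption, is keyed only by the expected checksum.
struct ToyCache {
    entries: HashMap<&'static str, Vec<u8>>,
}

impl ToyCache {
    fn new() -> Self {
        ToyCache { entries: HashMap::new() }
    }

    /// Store contents under the request's checksum, as if after a verified download.
    fn store(&mut self, request: &Request, contents: &[u8]) {
        self.entries.insert(request.hash, contents.to_vec());
    }

    /// Look up contents; the URL plays no part in the lookup.
    fn lookup(&self, request: &Request) -> Option<&Vec<u8>> {
        self.entries.get(request.hash)
    }
}
```

In this model, storing the contents for one request and then looking up a second request with a different url but the same hash returns the first request's contents, which mirrors the pitfall.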
§ZIP Support
When the zip feature of this crate is enabled, InZipDownloadRequest becomes available and can be used to download files contained in ZIP archive files.
let request = InZipDownloadRequest {
    parent: &DownloadRequest {
        url: "https://github.com/tillarnold/data_downloader/archive/refs/tags/v0.1.0.zip",
        sha256_hash: &hex_literal::hex!(
            "3A1929ABF26E7E03A93206D1D068B36C3F7182F304CF317FD71766113DDA5C4E"
        ),
    },
    path: "data_downloader-0.1.0/src/files/ml/whisper_cpp.rs",
    sha256_hash: &hex_literal::hex!(
        "a6e18802876c198b9b99c33ce932890e57f01e0eab9ec19ac8ab2908025d1ae2"
    ),
};

let result = get(&request).unwrap();
let str = String::from_utf8(result).unwrap();
println!("{}", str);
This example downloads an old version of this crate’s source code from GitHub as a ZIP file and extracts an individual source file from it.
§Status of this crate
This is an early release. As such breaking changes are expected at some point. There are also some implementation limitations including but not limited to:
- The downloading is rather primitive. Failed downloads are simply retried a fixed number of times and no continuation of interrupted downloads is implemented.
- Only one URL is used per DownloadRequest; it is currently not possible to specify multiple possible locations for a file.
- The crate uses blocking IO. As such there is currently no WASM support.
Contributions to improve this are welcome.
Nevertheless this crate should be suitable for simple use cases.
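The "retried a fixed number of times" strategy from the list above can be sketched as follows. This is a generic illustration and not the crate's code: retry is a hypothetical helper, and the closure stands in for a single download attempt.

```rust
/// Hypothetical helper: run `attempt` up to `max_attempts` times and
/// return the first success, or the last error if every attempt fails.
fn retry<T, E>(max_attempts: u32, mut attempt: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    assert!(max_attempts > 0, "at least one attempt is required");
    let mut last = attempt();
    for _ in 1..max_attempts {
        if last.is_ok() {
            break;
        }
        // Every retry starts from scratch; nothing is resumed.
        last = attempt();
    }
    last
}
```

Because each attempt starts over, an interrupted download of a large file repeats all of its work, which is the limitation the first list item points out.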
§Dependencies
This crate uses the following dependencies:
- dirs to find platform specific temporary and cache directories
  - Implementing this manually would only cause incompatibilities.
- reqwest to issue HTTP requests
  - An HTTP library is definitely required to allow this crate to download files. reqwest is widely used in the Rust community; it is however a rather big dependency as it is very fully featured. It might be worth investigating smaller HTTP client libraries in the future.
- sha2 to hash files
  - To ensure the integrity of the files a collision resistant cryptographic hash function is required. SHA-256 is generally considered the standard for such a use case. The sha2 crate by the RustCrypto organization is the de facto standard implementation of SHA-2 for Rust.
- hex-literal to conveniently specify the SHA-256 sums
  - Technically this dependency could be removed if we specified the SHA-256 in the predefined DownloadRequest directly as &[u8] slice literals. However the library is maintained by the RustCrypto organization and as such can be regarded as trustworthy.
- thiserror to conveniently derive Error
  - This library is also very widely used and maintained by David Tolnay, a highly regarded member of the Rust community. Once data_downloader has sufficiently matured it might be a good idea to stop using thiserror and instead directly use the generated implementations in the code. This would potentially reduce build times. This has however low priority, especially while the crate::Error type is still changing frequently.
- zip to unzip zip files (only enabled with the zip feature)
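To give an idea of what replacing thiserror with hand-written implementations would involve, here is a minimal sketch. SketchError and its variants are hypothetical and chosen only for illustration; the real crate::Error has its own definition.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type used only for this sketch.
#[derive(Debug)]
enum SketchError {
    Io(std::io::Error),
    HashMismatch { expected: String, actual: String },
}

// Written by hand instead of being derived through attributes.
impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Io(err) => write!(f, "IO error: {err}"),
            SketchError::HashMismatch { expected, actual } => {
                write!(f, "hash mismatch: expected {expected}, got {actual}")
            }
        }
    }
}

impl Error for SketchError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            SketchError::Io(err) => Some(err),
            SketchError::HashMismatch { .. } => None,
        }
    }
}

// Lets `?` convert an std::io::Error into SketchError.
impl From<std::io::Error> for SketchError {
    fn from(err: std::io::Error) -> Self {
        SketchError::Io(err)
    }
}
```

The cost is a few dozen lines that have to be kept in sync with the enum by hand, which is presumably why this is deferred while the error type is still changing.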
Modules§
- files - DownloadRequests for useful files
Structs§
- DownloadRequest - A file to be downloaded
- Downloadable - A thing that can be downloaded
- Downloader - Configurable Downloader
- DownloaderBuilder - A builder for constructing a Downloader
- HashMismatch - A hash was not as expected
Enums§
- Error - Error type for data_downloader
Functions§
- get - Get the file contents and if the file has not yet been downloaded, download it.
- get_cached - Get the file contents and fail with an IO error if the file is not yet downloaded
- get_path - Computes the full path to the file and if the file has not yet been downloaded, download it.