Crate data_downloader

This crate provides a simple way to download files. In particular, it aims to make it easy to download and cache files that do not change over time, such as reference image files, ML models, example audio files, or common password lists.

§Downloading a file

For example, to download the plaintext version of RFC 2068, construct a DownloadRequest with the URL and SHA-256 checksum and pass it to the get function.

If you know that the file has already been downloaded, you can use get_cached instead.

use data_downloader::{get, get_cached, DownloadRequest};

// Define where to get the file from
let rfc_link = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

// Get the binary contents of the file
let rfc: Vec<u8> = get(rfc_link)?;

// Convert the file to a String
let as_text = String::from_utf8(rfc)?;
assert!(as_text.contains("The Hypertext Transfer Protocol (HTTP) is an application-level"));
assert!(as_text.contains("protocol for distributed, collaborative, hypermedia information"));
assert!(as_text.contains("systems."));

// Get the binary contents of the file directly from disk
let rfc: Vec<u8> = get_cached(rfc_link)?;

get_path can be used to get a PathBuf to the file.

One of the design goals of this crate is to verify the integrity of the downloaded files, so the SHA-256 checksum of every download is checked. The checksum is also verified when a file is loaded from the cache on disk. For get_path, however, the checksum is not verified: even if it were, the file could still change between the check and your later use of the path, which is a time-of-check to time-of-use (TOCTOU) vulnerability.

The get, get_cached and get_path functions use a default directory to cache the downloads, which allows multiple applications to share their cached downloads. If you need more configurability, you can use DownloaderBuilder and set the storage directory manually using DownloaderBuilder::storage_dir. The default storage directory is a platform-specific cache directory, or a platform-specific temporary directory if the cache directory is not available.
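
The fallback rule for the default storage directory can be pictured with a small, dependency-free sketch. This is not the crate's implementation: resolve_storage_dir, its cache_dir parameter (standing in for whatever a crate such as dirs reports) and the app_subdir name are assumptions made for illustration.

```rust
use std::env;
use std::path::PathBuf;

/// Picks a storage directory: the platform cache directory if one is known,
/// otherwise the platform temporary directory.
///
/// Hypothetical helper: `cache_dir` stands in for the value a crate such as
/// `dirs` would report, and `app_subdir` is an assumed subdirectory name.
fn resolve_storage_dir(cache_dir: Option<PathBuf>, app_subdir: &str) -> PathBuf {
    cache_dir
        // Fall back to the temporary directory when no cache directory exists.
        .unwrap_or_else(env::temp_dir)
        .join(app_subdir)
}
```

Passing Some(path) yields a subdirectory of that path, while passing None yields a subdirectory of std::env::temp_dir().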

§Included DownloadRequests

The files module contains some predefined DownloadRequests for your convenience.

§Pitfalls

When you change a DownloadRequest by hand, you must also update its SHA-256 checksum. Otherwise you end up with a DownloadRequest that looks as if it downloads a specific file, but the download itself can never succeed because of the checksum mismatch, while the wrong file can still be loaded from the cache. In the example below, the DownloadRequest from above was changed, but only the url was adapted. Since sha256_hash is not set to the correct value, this returns rfc2068.txt from the cache. This is a user error: the developer has to ensure that each DownloadRequest specifies the correct SHA-256 checksum.

let rfc7168 = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc7168.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

let rfc2068 = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

assert_eq!(get(rfc7168)?, get(rfc2068)?);
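
The assertion above is consistent with a cache that finds files by their checksum rather than by their URL. A minimal sketch of that behavior, under that assumption, might look as follows. The Request and HashKeyedCache types, the 4-byte placeholder hash and the fetch callback are hypothetical and not part of this crate's API, and the sketch deliberately omits the actual checksum verification.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a download request (not the crate's real type).
struct Request {
    url: &'static str,
    /// Shortened placeholder for a 32-byte SHA-256 checksum.
    hash: [u8; 4],
}

/// Toy cache that stores file contents under their checksum only.
struct HashKeyedCache {
    entries: HashMap<[u8; 4], Vec<u8>>,
}

impl HashKeyedCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached contents for the request's checksum. Only on a
    /// cache miss is `fetch` called with the request's URL. No checksum
    /// verification is performed in this sketch.
    fn get(&mut self, request: &Request, fetch: impl Fn(&str) -> Vec<u8>) -> Vec<u8> {
        self.entries
            .entry(request.hash)
            .or_insert_with(|| fetch(request.url))
            .clone()
    }
}
```

With two requests that share a hash but differ in url, the second call to get never reaches fetch and returns whatever the first call stored, which mirrors the pitfall described above.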

§ZIP Support

When the zip feature of this crate is enabled, InZipDownloadRequest becomes available and can be used to download files contained in ZIP archives.

let request = InZipDownloadRequest {
    parent: &DownloadRequest {
        url: "https://github.com/tillarnold/data_downloader/archive/refs/tags/v0.1.0.zip",
        sha256_hash: &hex_literal::hex!(
            "3A1929ABF26E7E03A93206D1D068B36C3F7182F304CF317FD71766113DDA5C4E"
        ),
    },
    path: "data_downloader-0.1.0/src/files/ml/whisper_cpp.rs",
    sha256_hash: &hex_literal::hex!(
        "a6e18802876c198b9b99c33ce932890e57f01e0eab9ec19ac8ab2908025d1ae2"
    ),
};
let result = get(&request).unwrap();
let str = String::from_utf8(result).unwrap();
println!("{}", str);

This example downloads an old version of this crate’s source code from GitHub as a ZIP file and extracts an individual source file from it.

§Status of this crate

This is an early release, so breaking changes are expected at some point. There are also some implementation limitations, including but not limited to:

  • The downloading is rather primitive. Failed downloads are simply retried a fixed number of times and no continuation of interrupted downloads is implemented.
  • Only one URL is used per DownloadRequest. It is not currently possible to specify multiple possible locations for a file.
  • The crate uses blocking IO. As such, there is currently no WASM support.

Contributions to improve this are welcome.

Nevertheless, this crate should be suitable for simple use cases.
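
To make the "retried a fixed number of times" limitation concrete, here is a generic sketch of such a retry loop. It is an illustration only and not the crate's code: the retry function, its parameters and the absence of any delay between attempts are assumptions.

```rust
/// Calls `attempt` up to `max_attempts` times and returns the first success,
/// or the last error if every attempt fails.
///
/// Hypothetical helper: there is no backoff and no resumption of partial
/// work, matching the "rather primitive" strategy described above.
fn retry<T, E>(max_attempts: u32, mut attempt: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    assert!(max_attempts > 0, "at least one attempt is required");
    let mut tries = 0;
    loop {
        tries += 1;
        match attempt() {
            Ok(value) => return Ok(value),
            // Attempts remain, so discard this error and try again.
            Err(_) if tries < max_attempts => continue,
            // Out of attempts: report the most recent error.
            Err(err) => return Err(err),
        }
    }
}
```

A caller would wrap a single download attempt in the closure, for example retry(3, || download_once(url)), where download_once is a placeholder for the actual request.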

§Dependencies

This crate uses the following dependencies:

  • dirs to find platform specific temporary and cache directories
    • Implementing this manually would only cause incompatibilities
  • reqwest to issue HTTP requests
    • An HTTP library is definitely required to allow this crate to download files. reqwest is widely used in the Rust community, but it is a rather big dependency because it is very fully featured. It might be worth investigating smaller HTTP client libraries in the future.
  • sha2 to hash files
    • To ensure the integrity of the files, a collision-resistant cryptographic hash function is required. SHA-256 is generally considered the standard for such a use case. The sha2 crate by the RustCrypto organization is the de facto standard implementation of SHA-2 for Rust.
  • hex-literal to conveniently specify the SHA-256 sums
    • Technically this dependency could be removed if we specified the SHA-256 in the predefined DownloadRequest directly as &[u8] slice literals. However the library is maintained by the RustCrypto organization and as such can be regarded as trustworthy.
  • thiserror to conveniently derive Error
    • This library is also very widely used and is maintained by David Tolnay, a highly regarded member of the Rust community. Once data_downloader has sufficiently matured, it might be a good idea to stop using thiserror and instead use the generated implementations directly in the code. This would potentially reduce build times. It has low priority, however, especially while the crate::Error type is still changing frequently.
  • zip to unzip zip files (only enabled with the zip feature)
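
As a rough illustration of the hex-literal trade-off mentioned above, the sketch below writes the RFC 2068 checksum from the first example as a plain byte array and pairs it with a small runtime decoder. The hex! macro does the equivalent conversion at compile time. The decode_hex function and the RFC2068_SHA256 constant are hypothetical names and not part of this crate.

```rust
/// Decodes a hex string into bytes. Returns None if the input has an odd
/// length or contains a character that is not an ASCII hex digit.
fn decode_hex(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        // Every character is ASCII here, so slicing by byte index is safe.
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// The RFC 2068 checksum from the first example, written without hex-literal.
const RFC2068_SHA256: [u8; 32] = [
    0xD6, 0xC4, 0xE4, 0x71, 0x38, 0x9F, 0x2D, 0x30,
    0x9A, 0xB1, 0xF9, 0x08, 0x81, 0x57, 0x65, 0x42,
    0xC7, 0x42, 0xF9, 0x5B, 0x11, 0x53, 0x36, 0xA3,
    0x46, 0x44, 0x7D, 0x05, 0x2E, 0x04, 0x77, 0xCF,
];
```

The array form needs no extra dependency but is harder to compare by eye against a checksum printed by a tool such as sha256sum, which is the convenience that hex-literal provides.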

Modules§

files
DownloadRequests for useful files

Structs§

DownloadRequest
A file to be downloaded
Downloadable
A thing that can be downloaded
Downloader
Configurable Downloader
DownloaderBuilder
A builder for constructing a Downloader
HashMismatch
A hash was not as expected

Enums§

Error
Error type for data_downloader

Functions§

get
Get the file contents and if the file has not yet been downloaded, download it.
get_cached
Get the file contents and fail with an IO error if the file is not yet downloaded.
get_path
Computes the full path to the file and if the file has not yet been downloaded, download it.