datahugger
A tool for fetching data from a DOI or URL.
Supported data repositories:
| Source | Website | Notes | Examples |
|---|---|---|---|
| Dataverse | dataverse.org | Supported Dataverse repositories | example |
| OSF | osf.io | — | example |
| GitHub ✨(new) | github.com | Use a GitHub API token to get a higher rate limit | example |
| Hugging Face ✨(new) | huggingface.co | — | example |
| arXiv | arxiv.org | — | example |
| Hal | hal.science | — | example |
| Zenodo | zenodo.org | — | example |
| Dryad | datadryad.org | Bearer token required to download data (see API instructions for obtaining your API key) | example |
| DataONE | dataone.org | Supported DataONE repositories; requests to its umbrella repositories may be slow | example |
Open an issue if a data repository you want to use is not yet supported.
Install

Prebuilt binaries via shell:

```shell
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EOSC-Data-Commons/datahugger-ng/releases/download/v0.5.4/datahugger-installer.sh | sh
```

Prebuilt binaries via PowerShell:

```powershell
powershell -ExecutionPolicy Bypass -c "irm https://github.com/EOSC-Data-Commons/datahugger-ng/releases/download/v0.5.4/datahugger-installer.ps1 | iex"
```

Via Homebrew:

```shell
brew install unkcpz/tap/datahugger
```

Via cargo:

```shell
cargo install datahugger
```

To use the Python library, install it via pip:

```shell
pip install datahugger-ng
```
Usage
CLI
To download all data from a dataset, run:

```shell
datahugger download https://osf.io/3ua2c/ --to /tmp/a-blackhole
```

```
⠉ Crawling osfstorage/final_model_results_combined/single_species_models_final/niche_additive/Procyon lotor_2025-05-09.rdata...
⠲ Crawling osfstorage/final_model_results_combined/single_species_models_final/niche_additive...
⠈ Crawling osfstorage/final_model_results_combined/single_species_models_final...
⠒ Crawling osfstorage/final_model_results_combined...
⠐ Crawling osfstorage...
o/f/c/event-cbg-intersection.csv [==>---------------------] 47.20 MB/688.21 MB ( 4.92 MB/s, 2m)
o/f/m/a/Corvus corax.pdf [=======>----------------] 80.47 kB/329.85 kB ( 438.28 kB/s, 1s)
o/f/m/a/Lynx rufus.pdf [------------------------] 0 B/326.02 kB ( 0 B/s, 0s)
o/f/m/a/Ursus arctos.pdf [------------------------] 0 B/319.05 kB ( 0 B/s, 0s)
```
See more examples in the CLI Examples section below.
Python
You can also use it as a Python library:

```shell
pip install datahugger-ng
```

Check the Python API docs for more examples.
The download is very efficient because the underlying Rust implementation leverages all available CPU cores and maximizes the use of your bandwidth.
Use the `limit` parameter to control concurrency; by default it is set to 0, which means no limit.
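The effect of a concurrency limit can be illustrated with a plain asyncio sketch. This stand-in code does not use the datahugger-ng API; the semaphore simply plays the role of the `limit` setting, and `download_one` is a hypothetical placeholder for a file transfer:

```python
import asyncio

async def download_one(name: str, sem: asyncio.Semaphore, active: list, peak: list) -> str:
    # The semaphore caps how many "downloads" run at once, like the limit parameter.
    async with sem:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # stand-in for the actual network transfer
        active[0] -= 1
    return name

async def main(limit: int) -> int:
    # limit == 0 means "no limit"; here we model that with a very large cap.
    sem = asyncio.Semaphore(limit if limit > 0 else 1_000_000)
    active, peak = [0], [0]
    files = [f"file-{i}" for i in range(20)]
    await asyncio.gather(*(download_one(f, sem, active, peak) for f in files))
    return peak[0]  # highest number of simultaneous downloads observed

print(asyncio.run(main(limit=4)))  # → 4: never more than 4 transfers in flight
```

With `limit=0` all 20 stand-in transfers run at once; with `limit=4` at most 4 are in flight, which is what polite crawling of a rate-limited repository wants.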
Besides the API for downloading the files in a dataset, we also provide a low-level Python API for implementing custom operations after files are crawled.
Crawl datasets efficiently and asynchronously with our Rust-powered crawler, fully utilizing all CPU cores and your network bandwidth.
Simply resolve a dataset, stream its entries with `async for`, and handle them concurrently as they arrive.
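The crawl-and-consume pattern described above can be sketched with a stand-in async generator. Note that `crawl_entries` and the `Entry` type below are hypothetical placeholders, not the actual datahugger-ng API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Entry:
    # Minimal stand-in for a crawled file entry; the real type comes from datahugger-ng.
    path: str
    size: int

async def crawl_entries():
    # Hypothetical stand-in for the crawler: it yields entries as they are
    # discovered, instead of waiting for the whole dataset tree to be crawled.
    for i in range(3):
        await asyncio.sleep(0)  # simulate asynchronous discovery
        yield Entry(path=f"osfstorage/file-{i}.csv", size=1024 * i)

async def handle(entry: Entry) -> str:
    # Print, download, or run any async operation on the returned entry.
    await asyncio.sleep(0)
    return entry.path

async def main() -> list[str]:
    tasks = []
    async for entry in crawl_entries():                    # entries stream in as they arrive
        tasks.append(asyncio.create_task(handle(entry)))   # handle them concurrently
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))
```

The key point is that `handle` starts as soon as each entry arrives, so slow entries do not block processing of the ones already discovered.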
Rust SDK
- `trait DatasetBackend` for adding support for a new data repository in your own Rust crate.
- `impl Dataset` interface for adding new operations in your own crate.
Python SDK
The Python SDK is mainly for downstream Python libraries to implement extra operations on files (e.g. storing metadata into a DB).
See the Python API docs for more details.
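As a sketch of such a downstream operation, here is a minimal example that records file metadata in SQLite. The entry dicts are stand-ins; in practice they would come from the datahugger-ng crawler:

```python
import sqlite3

# Stand-in crawled entries; real ones would come from the datahugger-ng crawler.
entries = [
    {"path": "osfstorage/data.csv", "size": 688_210_000},
    {"path": "osfstorage/model.rdata", "size": 329_850},
]

def store_metadata(db_path: str, entries) -> int:
    """Upsert file metadata into a SQLite table and return the row count."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER)")
    with con:  # commits on success
        con.executemany(
            "INSERT OR REPLACE INTO files (path, size) VALUES (:path, :size)", entries
        )
    count = con.execute("SELECT COUNT(*) FROM files").fetchone()[0]
    con.close()
    return count

print(store_metadata(":memory:", entries))  # → 2
```

Using the path as a primary key with `INSERT OR REPLACE` makes re-running a crawl idempotent: re-crawled files update their row instead of duplicating it.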
Caveats:
The following architecture cannot yet be installed from PyPI:
- target: s390x
CLI Examples
GitHub - avoid hitting API rate limits using a Personal Access Token (PAT)
To get higher rate limits, export your GitHub PAT before downloading:
If you log in with the `gh` CLI, you can get a token via `gh auth token`.
https://github.com/EOSC-Data-Commons/datahugger-ng
Datadryad API key configuration and download
Datadryad requires a bearer token to access data. First, follow the API instructions to get your key: you need a Dryad account, and your API secret can be found in your profile. By default it expires in 10 hours.
https://datadryad.org/dataset/doi:10.5061/dryad.mj8m0
Datasets without limitations
- Huggingface datasets - simple download
https://huggingface.co/datasets/HuggingFaceFW/finepdfs
- Dataverse - simple download
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KBHLOD
- OSF - simple download
- arXiv - simple download
https://arxiv.org/abs/2101.00001v1
- Zenodo - simple download
https://zenodo.org/records/17867222
- Hal.science
- DataONE - may be slow for umbrella repositories
https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2542JB2X
Notes:
- `--to /tmp/...` sets the download target directory.
- `--limit 10` limits concurrency for polite crawling and downloading.
- Datasets from data repositories that have rate limits or require auth are highlighted with PAT / API key instructions.
- Others can be downloaded directly without credentials.
Roadmap
- asynchronously stream file crawling results into the pipeline with exceptional performance
- a resolver that resolves a URL to the repository record handler.
- an expressive progress bar when the binary runs in the CLI.
- clear interface to add support for data repositories that lack machine-readable API specifications (e.g., no FAIRiCAT or OAI-PMH support).
- devops: CI with both Rust and Python tests.
- devops: `json_extract` helper function for easily resolving a value from a path in a serde JSON value.
- clear interface for adding operations on crawl results beyond downloading.
- strong error handling and logging to avoid interruptions (using the `exn` crate).
- sharable client connection to reduce the cost of repeated reconnections.
- automatically resolve a DOI into the dataset source URL.
- detailed benchmarks to show its performance (might not be needed; the CLI download is already roughly 1000x faster, for example for the dataset https://osf.io/3ua2c/).
- single-pass streaming with checksum computation by plugging a hasher into the pipeline.
- all repos already supported by py-datahugger:
- DataONE (the repositories themselves are very slow in responding to HTTP requests).
- GitHub repo download (supports collapsing and downloading folders).
- zenodo
- datadryad
- arxiv
- MendeleyDataset
- HuggingFaceDataset
- HAL
- CERNBox
- OSFDataset
- Many Dataverse dataset
- Bgee Database
- compact but extremely expressive README.
- crates.io + Python docs.
- a bit of detail for each data repo, showing whether FAIRiCAT is supported, etc.
- at crates.io, show how to use generics to add new repos or new ops.
- test python bindings in filemetrix/filefetcher.
- rust api doc on docs.rs
- doc on gh-pages?
- python binding (crawl function) that emits a stream for async use on the Python side.
- python binding that allows setting the HTTP client from a config, setting a token, etc.
- zip extract support.
- onedata support through signposting, fairicat?
- not only download, but a versatile metadata fetcher
- not only download, but scanning to compute the file type using libmagic.
- one EOSC target data repo not included in the original py-datahugger (HAL?).
- use this to build a FAIRiCAT converter service for dogfooding.
- python bindings
- cli that can do all py-datahugger do.
- not only the local FS, but also S3 (using OpenDAL?).
- semaphore config that can intuitively estimate the maximum resources used (already partially taken care of by the for_each_concurrent limit).
- support for less popular data repositories, implemented as use cases arrive (we need your help!).
- FigShareDataset (https://api.figshare.com/v2)
- DSpaceDataset
- SeaNoeDataset
- PangaeaDataset
- B2ShareDataset
- DjehutyDataset
Development
The development environment can be managed with devenv using Nix. Enter a full environment with:

```shell
devenv shell -v
```

You can also use your own Rust setup; we don't enforce or test a specific Rust MSRV yet.
Make a new release
For a PyPI release:
- update the version number in `python/Cargo.toml`. The version does not need to be in sync with the Rust crate version.
- trigger the `pypi-publish` CI workflow manually.

The binary release and the crates.io release share the same version number.
```shell
# commit and push to main (can be done with a PR)
git commit -am "release: version 0.1.0"
git push

# actually push the tag up (this triggers dist's CI)
git tag v0.1.0
git push --tags
```
The crates.io build can also be triggered manually via the `crate-publish` CI workflow, but it will not run the final crates.io upload.
Ack
- This project was originally inspired by https://github.com/J535D165/datahugger.
License
All contributions must retain this attribution.
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)