datahugger 0.5.2

Tool for fetching data and metadata from DOI or URL.

Supported data repositories:

| Source | Website | Notes | Examples |
| --- | --- | --- | --- |
| Dataverse | dataverse.org | Supported Dataverse repositories | example |
| OSF | osf.io | | example |
| GitHub ✨ (new) | github.com | Use a GitHub API token to get a higher rate limit | example |
| Hugging Face ✨ (new) | huggingface.co | | example |
| arXiv | arxiv.org | | example |
| Hal | hal.science | | example |
| Zenodo | zenodo.org | | example |
| Dryad | datadryad.org | Bearer token required to download data (see the API instructions for obtaining your API key) | example |
| DataONE | dataone.org | Supported DataONE repositories; requests to its umbrella repositories may be slow | example |

Open an issue if a data repository you want to use is not yet supported.

Install

Prebuilt binaries via shell installer:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EOSC-Data-Commons/datahugger-ng/releases/download/v0.5.2/datahugger-installer.sh | sh

Prebuilt binaries via PowerShell installer:

powershell -ExecutionPolicy Bypass -c "irm https://github.com/EOSC-Data-Commons/datahugger-ng/releases/download/v0.5.2/datahugger-installer.ps1 | iex"

Via Homebrew:

brew install unkcpz/tap/datahugger

Via Cargo (builds from source):

cargo install datahugger

To use the Python library, install it via:

pip install datahugger-ng

Usage

CLI

To download all data from a dataset, run:

datahugger download https://osf.io/3ua2c/ --to /tmp/a-blackhole
⠉ Crawling osfstorage/final_model_results_combined/single_species_models_final/niche_additive/Procyon lotor_2025-05-09.rdata...
⠲ Crawling osfstorage/final_model_results_combined/single_species_models_final/niche_additive...
⠈ Crawling osfstorage/final_model_results_combined/single_species_models_final...
⠒ Crawling osfstorage/final_model_results_combined...
⠐ Crawling osfstorage...
o/f/c/event-cbg-intersection.csv   [==>---------------------] 47.20 MB/688.21 MB (   4.92 MB/s,  2m)
o/f/m/a/Corvus corax.pdf           [=======>----------------] 80.47 kB/329.85 kB ( 438.28 kB/s,  1s)
o/f/m/a/Lynx rufus.pdf             [------------------------]      0 B/326.02 kB (       0 B/s,  0s)
o/f/m/a/Ursus arctos.pdf           [------------------------]      0 B/319.05 kB (       0 B/s,  0s)

See more examples in the CLI Examples section below.

Python

You can use datahugger as a Python library:

pip install datahugger-ng

Check the Python API docs for more examples.

from pathlib import Path

from datahugger import resolve

# Destination directory for the downloaded files.
tmp_path = Path("/tmp/dataverse_example")
tmp_path.mkdir(parents=True, exist_ok=True)

ds = resolve(
    "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KBHLOD"
)
ds.download_with_validation(dst_dir=tmp_path, limit=20)

assert sorted([i.name for i in tmp_path.iterdir()]) == [
    "ECM_matrix.py",
    "Markov_comp.py",
    "Markov_learning.py",
    "tutorial1.py",
    "tutorial2.py",
    "tutorial3.py",
    "tutorial4.py",
]

The download is very efficient because the underlying Rust implementation leverages all available CPU cores and maximizes the use of your bandwidth. Use the limit parameter to control concurrency; by default it is set to 0, which means no limit.
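
For example, a minimal sketch of a polite download that caps concurrency (the Zenodo URL and the /tmp destination here are just illustrative):

from pathlib import Path

from datahugger import resolve

ds = resolve("https://zenodo.org/records/17867222")
dst = Path("/tmp/zenodo_example")
dst.mkdir(parents=True, exist_ok=True)

# limit=10 caps the number of concurrent transfers (0, the default, means no limit).
ds.download_with_validation(dst_dir=dst, limit=10)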

Besides the API for downloading the files in a dataset, we also provide a low-level Python API for implementing custom operations after files are crawled. Crawl datasets efficiently and asynchronously with our Rust-powered crawler, fully utilizing all CPU cores and your network bandwidth.
Simply resolve a dataset, stream its entries with async for, and process entries concurrently as they arrive:

import asyncio
from datahugger import resolve, DOIResolver

async def main():
    doi_resolver = DOIResolver(timeout=30)
    url = doi_resolver.resolve("10.34894/0B7ZLK")
    ds = resolve(url)
    async for entry in ds.crawl_file():
        # print or any async operation on the returned entry
        print("crawl:", entry)

asyncio.run(main())
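
To process entries concurrently rather than one at a time, you can fan each one out to its own task as it streams in. A minimal sketch using only the resolve and crawl_file APIs shown above; handle_entry is a hypothetical coroutine you would supply:

import asyncio

from datahugger import resolve

async def handle_entry(entry):
    # Hypothetical per-entry work: filter, transform, or store the entry.
    print("handled:", entry)

async def main():
    ds = resolve("https://osf.io/3ua2c/")
    tasks = []
    async for entry in ds.crawl_file():
        # Schedule processing without blocking the crawl loop.
        tasks.append(asyncio.create_task(handle_entry(entry)))
    await asyncio.gather(*tasks)

asyncio.run(main())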

Rust SDK

  • the trait DatasetBackend, for adding support for a new data repository in your own Rust crate.
  • the Dataset interface, for adding new operations in your own crate.

Python SDK

The Python SDK is mainly for downstream Python libraries that implement extra operations on crawled files (e.g., storing metadata in a database).

See the Python API docs for more details.
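
As a sketch of such a downstream operation, the snippet below records each crawled entry in a local SQLite table. It assumes only the resolve and crawl_file APIs shown above; str(entry) is a placeholder for whatever structured metadata fields your application extracts:

import asyncio
import sqlite3

from datahugger import resolve

async def main():
    db = sqlite3.connect("/tmp/crawl_metadata.db")
    db.execute("CREATE TABLE IF NOT EXISTS entries (info TEXT)")
    ds = resolve("https://osf.io/3ua2c/")
    async for entry in ds.crawl_file():
        # Placeholder serialization; a real consumer would extract
        # structured fields (name, size, checksum, ...) from the entry.
        db.execute("INSERT INTO entries (info) VALUES (?)", (str(entry),))
    db.commit()
    db.close()

asyncio.run(main())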

Caveats:

The following architecture cannot yet be installed from PyPI:

  • target: s390x

CLI Examples

GitHub - avoid hitting API rate limits using a Personal Access Token (PAT)

To get higher rate limits, export your GitHub PAT before downloading. If you log in with the gh CLI, you can obtain a token with gh auth token.

https://github.com/EOSC-Data-Commons/datahugger-ng

export GITHUB_TOKEN="your_personal_access_token" 
datahugger download https://github.com/EOSC-Data-Commons/datahugger-ng --to /tmp/github_download/

Datadryad API key config and download

Dryad requires a bearer token to access data. First, follow the API instructions to get your key. You need a Dryad account; you can find your API secret in your profile, and it expires after 10 hours by default.

https://datadryad.org/dataset/doi:10.5061/dryad.mj8m0

export DRYAD_API_KEY="your_api_key"
datahugger download https://datadryad.org/dataset/doi:10.5061/dryad.mj8m0 --to /tmp/dryad_download/

Datasets without access restrictions

  • Hugging Face datasets - simple download

https://huggingface.co/datasets/HuggingFaceFW/finepdfs

datahugger download https://huggingface.co/datasets/HuggingFaceFW/finepdfs --to /tmp/hf_download/

  • Dataverse - simple download

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KBHLOD

datahugger download https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KBHLOD --to /tmp/dataverse_download/

  • OSF - simple download

datahugger download https://osf.io/3ua2c --to /tmp/osf_download/ --limit 10

  • arXiv - simple download

https://arxiv.org/abs/2101.00001v1

datahugger download https://arxiv.org/abs/2101.00001v1 --to /tmp/arxiv_download/

  • Zenodo - simple download

https://zenodo.org/records/17867222

datahugger download https://zenodo.org/records/17867222 --to /tmp/zenodo_download/

  • Hal.science - simple download

datahugger download https://hal.science/cel-01830944 --to /tmp/hal_download/

  • DataONE - may be slow for umbrella repositories

https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2542JB2X

datahugger download https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2542JB2X --to /tmp/dataone_download/
Notes:

  • --to /tmp/... sets the download target directory.

  • --limit 10 caps the concurrency for polite crawling and downloading.

  • Datasets from repositories with rate limits or authentication are shown above with PAT / API key instructions.

  • The others can be downloaded directly without credentials.

Roadmap

  • asynchronously stream file-crawling results into the pipeline with exceptional performance
  • resolver to resolve URLs to repository record handlers.
  • expressive progress bar when the binary runs in the CLI.
  • clear interface to add support for data repositories that lack machine-readable API specifications (e.g., no FAIRiCAT or OAI-PMH support).
  • devops: CI with both Rust and Python tests.
  • devops: json_extract helper function for easily resolving serde JSON values from a path.
  • clear interface for adding operations on crawl results beyond download.
  • strong error-handling mechanism and logging to avoid interruptions (using the exn crate).
  • sharable client connection to reduce the cost of repeated reconnections.
  • automatically resolve a DOI into the dataset source URL.
  • detailed benchmarks to show its performance (might not be needed; the CLI download is already ~1000x faster for, e.g., the dataset https://osf.io/3ua2c/).
  • single-pass streaming that computes checksums by plugging a hasher into the pipeline.
  • all repositories already supported by py-datahugger:
    • DataONE (the repositories themselves are very slow to respond to HTTP requests).
    • GitHub repo download (supports collapsing folders and downloading them).
    • Zenodo
    • Dryad
    • arXiv
    • MendeleyDataset
    • HuggingFaceDataset
    • HAL
    • CERNBox
    • OSFDataset
    • many Dataverse datasets
    • Bgee database
  • compact but extremely expressive README:
    • crates.io + Python docs.
    • some detail per data repository, showing whether FAIRiCAT is supported, etc.
    • on crates.io, show how to use generics to add new repositories or new operations.
  • test Python bindings in filemetrix/filefetcher.
  • Rust API docs on docs.rs.
  • docs on gh-pages?
  • Python binding (crawl function) that emits a stream for async use on the Python side.
  • Python binding allowing the HTTP client to be set from a config, a token to be set, etc.
  • zip extraction support.
  • Onedata support through Signposting or FAIRiCAT?
  • not only download, but a versatile metadata fetcher.
  • not only download, but scanning to detect file types using libmagic.
  • support for one EOSC target data repository not included in the original py-datahugger (HAL?).
  • use this to build a FAIRiCAT converter service for dogfooding.
  • Python bindings.
  • CLI that can do everything py-datahugger does.
  • not only the local FS, but also S3 (using OpenDAL?).
  • semaphore config that makes it intuitive to estimate the maximum resources used (already partially handled by the for_each_concurrent limit).
  • support for less popular data repositories, implemented as use cases arrive (we need your help!).

Development

The development environment can be managed with devenv using nix. Enter a full environment with:

devenv shell -v

You can also use your own Rust setup; we don't enforce or test a specific Rust MSRV yet.

Make a new release

For a PyPI release:

  • Update the version number in python/Cargo.toml. This version doesn't need to be in sync with the Rust crate version.
  • Trigger the pypi-publish CI workflow manually.

Binary releases and crates.io releases share the same version number.

# commit and push to main (can be done with a PR)
git commit -am "release: version 0.1.0"
git push

# actually push the tag up (this triggers dist's CI)
git tag v0.1.0
git push --tags

The crates.io build can be triggered manually via the crate-publish CI workflow, but it will not run the final crates.io upload.

Ack

License

All contributions must retain this attribution.