# `RefMan`: a simple biological reference manager
Bioinformatics involves juggling lots of files, particularly reference datasets (FASTA, GenBank, EMBL, Oh My!) with associated annotation and genomic range data. `refman` evolved out of [our internal](https://dho.pathology.wisc.edu/) desire to simplify accessing references from many sources--both official and bespoke--as well as my own desire to write more Rustπ¦.
`refman` can be thought of as a simpler and less general-purpose implementation of what [SciDataFlow](https://github.com/vsbuffalo/scidataflow) does. For uses cases beyond getting a few reference datasets from disparate places, I highly recommend giving SciDataFlow a try. But if you're like me and your head is spinning with all the different combinations of all the reference datasets each of your projects needs, and you want a fast way of pulling these combinations together, `refman` is for you!
## Installation
### Precompiled Binary Releases - _Coming soon!_
Precompiled static binaries for a variety of platforms ~~are~~ will soon be available in [`refman`'s Github releases](https://github.com/nrminor/refman/releases).
### Crates.io
Since v1.0.0, `refman` has been available on [crates.io](https://crates.io/crates/refman). Assuming you have the [Rust toolchain](https://www.rust-lang.org/tools/install) installed, simply install it with `cargo install refman`.
### Build from source
If you have git available, are on a unix system, and have the [Rust toolchain](https://www.rust-lang.org/tools/install) installed, you can download and compile `refman` from source with the following:
```bash
# download the source code with git clone
git clone https://github.com/nrminor/refman
# change into the project root directory
cd refman
# compile and install it onto your $PATH with cargo
cargo install --path="."
```
### Use it through Python
`refman` also has a Python API that can soon be installed with `pip install py-refman`. See [the `py-refman` Github repository](https://github.com/nrminor/py-refman) for more information about its usage.
## Quick Start
`refman` centers around a workflow of three subcommands: `refman init`, `refman register`, and `refman download`. These commands initialize a project with metadata, register dataset URLs, and download datasets respectively. Use `--help` or `-h` on each subcommand to explore the command line interface. The top-level interface will look like this when you run `refman -h`:
```
β βββ ββ ββ ββββ βββ βββ βββ β
β ββββ ββ ββββββββ ββββββββ ββ ββ ββββ ββ ββ β
β βββ ββββ ββββ ββ ββββ ββ β β β
β βββ βββ ββββββββ ββββββββ β β ββ ββ ββ β
β ββββ ββ ββ ββββββββ ββββ ββ ββββ ββ βββ β
refman (v1.0.0)
------------------------------------------------------------
`refman` is a simple command-line tool for managing biological reference datasets often
used in bioinformatics. These datasets may include raw sequence files, files encoding
annotations on those sequences, etc. `refman` makes it easier to manage and download
these kinds of files globally on the user's machine, or on a per-project basis. It
uses a human-readable TOML file to track which files it's managing, which can be shared
between users to aid scientific reproducibility.
Usage: refman [OPTIONS] [COMMAND]
Commands:
init Initialize a registry for the current project without registering any datasets. [aliases: i, new]
register Register a new file or set of files with a given dataset label. [aliases: r, reg]
remove Remove the files associated with a given dataset label [aliases: rm]
list List all previously registered reference datasets [aliases: l]
download Download one or many reference datasets registered in the refman registry. [aliases: d, dl, down, get, fetch]
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase logging verbosity
-q, --quiet... Decrease logging verbosity
-h, --help Print help (see more with '--help')
-V, --version Print version
```
## Detailed Usage
`refman`'s first deployment was in the bioinformatic pipeline [`oneroof`](https://github.com/nrminor/oneroof), which is run routinely with different reference datasets depending on the input data. These datasets were registered in the pipeline's [`refman.toml`](https://github.com/nrminor/oneroof/blob/main/refman.toml) file with the same series of commands that would be used with any project. To demonstrate this workflow, those commands are reproduced here.
#### Project Initialization
First, to use `refman` as part of `oneroof`'s setup, we initialize a project, like so:
```bash
refman init -t oneroof -d "Reference files needed for routine runs on viral amplicon data from SARS-CoV-2 and H5N1"
```
This will create a `refman.toml` that looks like this:
```toml
[project]
title = "oneroof"
description = "Reference files needed for routine runs on viral amplicon data from SARS-CoV-2 and H5N1"
last_modified = "2025-03-19T17:24:04.673076Z"
global = false
datasets = []
```
No datasets have been registered yet. This `refman.toml` also uses the tool's default behavior, which includes making each `refman.toml` local to a project instead of global to a machine. Support for global usage will increase with time, but in general we recommend that `refman` is used on a per-project basis to avoid confusion.
#### Registering URLs with Datasets
Next, datasets for a few `oneroof` configurations were registered, like so:
```bash
# first, datasets for running oneroof on SARS-CoV-2 amplicons with the QIASeq Direct Enhanced Boosted primer set
refman register -l "sc2-qiaseq" \
--fasta "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/MN908947.3.fasta" \
--genbank "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/MN908947.3.gbk" \
--gff "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/MN908947.3_corrected_orf1.gff" \
--bed "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/qiaseq_direct_boosted.bed"
# second, datasets for H5N1 amplicons from our own bespoke H5N1 tiled primer set
refman register -l "h5n1-B-custom" \
--fasta "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/custom_reference.fasta?contentDisposition=attachment" \
--genbank "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/annotation-custom.gbk?contentDisposition=attachment" \
--bed "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/final_truth_no_dashes.bed?contentDisposition=attachment"
# and third, a simpler dataset for H5N1 whole-segment amplicons for sequencing on Oxford Nanopore instruments
refman register -l "h5n1-B-segmental" \
--fasta "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/h5_cattle_genome_root_segments.fasta?contentDisposition=attachment"
```
Note that, as documented in the help menu for `refman`, `reg` and `r` are aliases for the `register` subcommand. All `refman` subcommands have shorthand aliases. Also, URLs _must be provided between quotes_.
Before completing the registration process, `refman` uses the Rust [`lychee` library](https://crates.io/crates/lychee-lib) to check that each provided URL is valid and points to a resource that exists. This prevents invalid entries to `refman.toml` when managed through the command-line interface.
#### Downloading Datasets
Once these datasets are registered, they can later be deserialized from `refman.toml` and used to download those resources. For `oneroof`, this most often involves downloading datasets for a SARS-CoV-2 run, like so:
```bash
refman download sc2-qiaseq -d assets
```
This will download all the files in the dataset labeled "sc2-qiaseq" (registered above) and place them in a destination directory called "assets". Like in the `register` subcommand, URLs will be checked for validity before being used to download files. Note that the dataset label used is case-sensitive and must exactly match a dataset registered with `refman`.
Keep in mind that if you're coming to a new project with datasets managed with `refman`, you can always list what's available with `refman list`, and list full URLs for particular projects with `refman list <LABEL>`.
## Roadmap
`refman` reached v1.0.0 as a minimum viable product, but it's still a work in progress. Features on the roadmap include:
- [ ] more black-box-, white-box-, and especially doc-tests
- [x] link-checking with [lychee_lib](https://docs.rs/lychee-lib/latest/lychee_lib/)
- [x] download progress bars
- [ ] a GitHub workflow for generating releases so that static binaries are available
- [x] publication on [crates.io](https://crates.io/)
- [ ] the ability to symlink locations in the local filesystem as opposed to only pulling from the internet
- [ ] expanded metadata fields or file formats, e.g., VCFs
- [ ] validation that an entry in a given file format is actually that format
- [ ] file checksums
- [ ] tasks/rules that tell `refman` to do some operation on a file once it's downloaded, potentially in an embedded scripting language like [Lua](https://www.lua.org/)
- [ ] stable config file format
- [ ] a global dotfile format with higher precedence than the tool's current defaults
- [x] [a python API](https://github.com/nrminor/py-refman) with a slightly restricted feature set that is [pip-installable]()
If you're interested in speeding any of these or other features along, or find any bugs, please reach out in [the repo's issues](https://github.com/nrminor/refman/issues)!
#### Non-goals
1. Maximal performance. The `refman` code contains a few clones here and there where it's convenient, though never for large amounts of data.
2. Minimal dependencies. I've used this project in part to explore some interesting [crates](https://crates.io/) from the Rust ecosystem that I haven't used previously.
## Citation(s)
Coming soon.