biov 0.1.13

A uv-style tool manager for bioinformatics: reproducible Docker-backed tools with digest-pinned lockfiles (installs as `bv`)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
# bv

**A `uv`-style tool manager for bioinformatics.**

`bv` installs bioinformatics tools as containers, pins them to exact digests in a lockfile, and makes any analysis environment reproducible with a single `bv sync`. Works with Docker on laptops and Apptainer/Singularity on HPC clusters; the same manifest, the same lockfile, either backend.

```sh
bv add blast hmmer mmseqs2      # resolve from registry, pull images
bv run blastn -version          # call any binary by name directly
bv exec snakemake --cores 4     # run scripts with all tools on PATH
bv sync                         # reproduce the exact environment anywhere
```

---

## Quickstart

Requires Docker or Apptainer/Singularity and `git`. No other dependencies.

### Install a runtime

Pick whichever fits your machine. Docker is typical on laptops; Apptainer is typical on shared HPC nodes.

```sh
# Docker (rootless, Linux). On a GPU box you'll also want nvidia-container-toolkit.
curl -fsSL https://get.docker.com/rootless | sh
systemctl --user enable --now docker

# Apptainer (no root needed, works on most HPC clusters)
conda install -c conda-forge apptainer
```

### Install bv

```sh
curl -fsSL https://raw.githubusercontent.com/mlberkeley/bv/main/install.sh | sh
```

Or with Cargo:

```sh
cargo install biov
```

### Five commands to a reproducible analysis

```sh
bv doctor                    # check environment and available runtimes

bv add blast hmmer           # pull tools, write bv.toml and bv.lock

bv run blastn -version       # call binaries directly by name
bv run hmmbuild -h

bv list                      # show installed tools with tier and digest

bv sync                      # on any other machine: reproduce exactly
```

### Example: homology search pipeline (two tools)

`bv run` mounts your current directory as `/workspace` inside the container.

```sh
mkdir homology-project && cd homology-project

# Download a sample protein sequence (human p53, ~400 aa)
curl -sL "https://rest.uniprot.org/uniprotkb/P04637.fasta" -o p53.fasta

# Add both tools at once
bv add blast hmmer

# Step 1: build a BLAST protein database
bv run makeblastdb \
    -in /workspace/p53.fasta \
    -dbtype prot \
    -out /workspace/p53_db

# Step 2: BLAST search (tabular output)
bv run blastp \
    -query /workspace/p53.fasta \
    -db    /workspace/p53_db \
    -out   /workspace/blast_hits.tsv \
    -outfmt 6

# Step 3: build an HMM profile from the BLAST hits
bv run hmmbuild /workspace/p53.hmm /workspace/p53.fasta

# Step 4: search with the HMM profile
bv run hmmsearch \
    /workspace/p53.hmm \
    /workspace/p53.fasta \
    > /workspace/hmmer_hits.txt

cat blast_hits.tsv
```

`bv run <binary>` looks up the binary name in the project's binary index and routes to the right container automatically. No need to specify the tool name.

Your project directory:

```
homology-project/
  bv.toml          # declares blast + hmmer
  bv.lock          # pinned image digests and binary index
  .bv/bin/         # generated shims (gitignored)
  p53.fasta
  p53_db.*         # BLAST database files
  blast_hits.tsv
  p53.hmm
  hmmer_hits.txt
```

Commit the project files; collaborators reproduce the exact environment:

```sh
git add bv.toml bv.lock
git commit -m "pin analysis environment"

# On another machine:
git clone <your-repo> && cd homology-project
bv sync          # pulls exact pinned images by digest, regenerates shims
bv run blastp -query /workspace/p53.fasta ...
```

---

## Using tools from scripts and pipelines

### bv exec

`bv exec` runs any command with all project binaries prepended to `PATH`. It is the right form for scripts, Makefiles, and CI.

```sh
bv exec python3 pipeline.py
bv exec snakemake --cores 4
bv exec -- bash -c "blastn -query foo.fa | sort -k11 -n"
```

On Unix, `bv exec` replaces itself with the child process via `exec(2)`. Signals, exit codes, and HPC schedulers see the child directly; there is no extra layer in `ps`.

Makefile:

```make
results.tsv: query.fa db.phr
	bv exec blastn -query $< -db db -out $@ -outfmt 6
```

Snakemake:

```python
rule align:
    input:  "reads.fastq.gz"
    output: "aligned.bam"
    shell:
        "bv exec bwa mem -t {threads} ref.fa {input} "
        "| bv exec samtools sort -o {output}"
```

### bv shell

`bv shell` starts an interactive subshell with all project binaries on PATH. The prompt changes to show the active project.

```sh
bv shell
(bv:homology-project) $ blastn -query p53.fasta -db p53_db -out hits.tsv -outfmt 6
(bv:homology-project) $ hmmsearch p53.hmm p53.fasta > hmmer_hits.txt
(bv:homology-project) $ exit
$
```

Exiting the subshell returns to the original environment cleanly. `BV_ACTIVE` is set to the project name while inside, so scripts can detect activation.

```sh
bv shell --shell zsh    # explicit shell choice
```

### Binary routing

Every binary a tool exposes is listed in `bv.lock` and gets a shim in `.bv/bin/`. `bv run <binary>` and `bv exec <binary>` both route through this index.

```sh
bv list --binaries
```

```
  Binary        Tool
  ----------------------------
  blastn        blast 2.15.0
  blastp        blast 2.15.0
  makeblastdb   blast 2.15.0
  tblastn       blast 2.15.0
  hmmbuild      hmmer 3.3.2
  hmmsearch     hmmer 3.3.2
  hmmscan       hmmer 3.3.2
```

If two tools expose the same binary name, `bv lock` fails with a clear error. Resolve it in `bv.toml`:

```toml
[binary_overrides]
samtools = "samtools"   # this tool wins when multiple tools expose samtools
```

---

## Discovery: `bv search` and the registry website

```sh
# Search for tools by name, description, or I/O type
bv search blast
bv search fasta                # find tools that accept FASTA input
bv search --tier core          # only core-tier tools
bv search colabfold --tier all # include experimental tier

# Browse the full registry with filters at:
# https://mlberkeley.github.io/bv-registry/
```

Each tool in the registry carries a `tier`:

| Tier | Meaning |
|------|---------|
| `core` | Typed I/O complete, from a recognized publisher, actively maintained |
| `community` | Typed I/O present, basic checks pass |
| `experimental` | Basic checks pass; may lack typed I/O. Hidden by default. |

---

## Typed I/O and tool introspection

Manifests declare typed inputs and outputs from the `bv-types` vocabulary. This powers composition, validation, and integrations.

```sh
# Human-readable schema
bv show blast

# Stable JSON output (for scripting)
bv show blast --format json

# MCP tool descriptor (for Claude and other AI assistants)
bv show blast --format mcp

# JSON Schema for the tool's inputs
bv show blast --format json-schema
```

Example MCP output:
```json
{
  "name": "blast",
  "description": "BLAST+ Basic Local Alignment Search Tool",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "FASTA file path" },
      "db":    { "type": "string", "description": "BLAST database directory" }
    }
  }
}
```

---

## Backend selection: Docker and Apptainer

`bv` auto-detects the available runtime. Docker is preferred on laptops; Apptainer is preferred on HPC clusters where Docker is unavailable.

```sh
bv doctor                         # shows which runtimes are available

bv add blast --backend apptainer  # pull as a SIF file instead of a Docker image
bv run blastn --backend apptainer -version
bv sync       --backend apptainer
```

Pin the backend in `bv.toml`:

```toml
[runtime]
backend = "apptainer"             # docker | apptainer | auto (default)
```

Or use the `BV_BACKEND` environment variable:

```sh
export BV_BACKEND=apptainer
bv add blast && bv run blastn -version
```

GPU support works on both backends:

| Backend | GPU flag |
|---------|----------|
| Docker | `--gpus all` (nvidia-container-toolkit required) |
| Apptainer | `--nv` (uses host NVIDIA libraries) |

The manifest declares the GPU requirement; the runtime handles the flag automatically.

### Cache mounts

Apptainer runs containers with a read-only root filesystem, so any tool that downloads model weights or scratches to disk inside the image will fail (e.g. ColabFold writing to `/cache/colabfold`). bv binds writable host directories into the container for paths that the tool needs to write. The set of paths is resolved in three layers:

1. **Tool manifest** (`cache_paths` in the registry entry) — the tool author's authoritative list. ColabFold's manifest declares `cache_paths = ["/cache/colabfold"]`.
2. **User overrides** (`[[cache]]` in `bv.toml`) — point any container path at a different host directory (e.g. a shared NFS cache).
3. **Apptainer fallbacks** — for tools whose manifest hasn't declared any cache paths yet, bv auto-binds the well-known `/cache` and `/root/.cache` so common bioconda images don't fail outright.

Each container path defaults to a host directory under `~/.cache/bv/<tool>/`. Docker skips the apptainer fallbacks (writable upper layer covers the same need); manifest and user entries apply on both backends.

User overrides in `bv.toml`:

```toml
# applies to every tool; {tool} is replaced with the tool id
[[cache]]
match = "*"
container_path = "/cache"
host_path = "~/.cache/bv/{tool}"

# tool-specific: redirect colabfold weights to a shared cache
[[cache]]
match = "colabfold"
container_path = "/cache/colabfold"
host_path = "/srv/shared/colabfold-weights"
```

Tool authors declare what their image needs in the registry manifest:

```toml
[tool]
id = "colabfold"
# ...
cache_paths = ["/cache/colabfold"]
```

---

## Conformance testing

`bv conformance <tool>` pulls the tool's image and smoke-tests every binary it exposes. For each binary in `[tool.binaries]`, bv tries `--version`, `-version`, `--help`, `-h`, `-v`, `version` (in that order) and considers the binary alive if any of them exits 0. This catches broken images, missing shared libraries, and binaries that segfault on startup.

```sh
bv conformance blast
bv conformance hmmer --backend apptainer
```

Most tools need no extra config. For unusual binaries, add a `[tool.smoke]` block to the manifest:

```toml
[tool.smoke]
probes = { weird-tool = "--check" }   # pin a specific probe arg
skip   = ["server-daemon"]            # binaries with no safe probe arg
```

Conformance runs in CI on every PR to bv-registry. Today it's a smoke check only; running tools on canonical inputs and validating typed outputs is on the v2 roadmap.

---

## Publishing a tool

```sh
# From a local directory with a Dockerfile
bv publish ./my-tool

# From a GitHub repo (auto-clones it)
bv publish github:ohuelab/QuickVina2
bv publish github:user/repo@v2.1.0

# Non-interactive (reads bv-publish.toml)
bv publish . --non-interactive

# Build and inspect the manifest without pushing
bv publish . --no-push --no-pr
```

Interactive example with a new Python tool:

```sh
mkdir my-docking-tool && cd my-docking-tool
cat > requirements.txt << 'EOF'
rdkit
numpy
EOF

bv publish .
#  Detected  requirements.txt (Python)
#  Generated Dockerfile.bv
#
#  Tool name [my-docking-tool]:
#  Version [0.1.0]:
#  Description: Fast molecular docking
#
#  Inputs
#    Add input? [y/n]: y
#    Name: ligand
#    Type (? to list): pdb
#    Mount path [/workspace/ligand]: /workspace/ligand.pdb
#    Add another? [y/n]: n
#
#  Building image as ghcr.io/bv-registry/my-docking-tool:0.1.0 ...
#  PR opened: https://github.com/mlberkeley/bv-registry/pull/143
```

For automated publishing on every GitHub release, add to `.github/workflows/bv-publish.yml`:

```yaml
on:
  release:
    types: [published]
jobs:
  publish:
    uses: mlberkeley/bv/.github/workflows/bv-publish.yml@main
    with:
      tool-name: my-docking-tool
    secrets:
      GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }}
      BV_REGISTRY_TOKEN: ${{ secrets.BV_REGISTRY_TOKEN }}
```

---

## Auto-ingestion from Bioconda: `bv-ingest`

`bv-ingest` scrapes Bioconda recipes and auto-generates draft manifests for any tool that has a BioContainers image. Binary names are extracted from recipe `test.commands` and `build.run_exports` and written into `[tool.binaries]` automatically.

```sh
# Ingest 10 tools from Bioconda (dry run)
bv-ingest run --dry-run --limit 10

# Ingest a specific tool
bv-ingest run --tool samtools

# Review manifests that need typed I/O
bv-ingest review --staging-dir ./staging

# Promote a reviewed manifest to the main registry
bv-ingest promote samtools 1.20
```

The nightly GitHub Actions workflow runs automatically and opens PRs to `bv-registry` for newly discovered tools.

---

## Reference data

For tools that need large reference databases:

```sh
bv add kraken2            # bv add prints what data the tool requires
bv data fetch pdbaa --yes # download (sizes range from MB to TB)
bv run kraken2 ...        # bv run auto-mounts the data
bv data list              # see what is cached locally
```

---

## Project files

**`bv.toml`** declares what you want:

```toml
[project]
name = "homology-project"

[registry]
url = "https://github.com/mlberkeley/bv-registry"

[runtime]
backend = "auto"          # optional; defaults to auto-detect

[[tools]]
id = "blast"
version = "=2.15.0"

[[tools]]
id = "hmmer"
```

**`bv.lock`** pins the exact state, including the binary routing index:

```toml
version = 1

[tools.blast]
tool_id = "blast"
version = "2.15.0"
image_reference = "ncbi/blast:2.15.0"
image_digest = "sha256:abc123..."
manifest_sha256 = "sha256:def456..."
resolved_at = "2024-01-15T10:00:00Z"
binaries = ["blastn", "blastp", "makeblastdb", "tblastn", "tblastx"]

[tools.hmmer]
tool_id = "hmmer"
version = "3.3.2"
image_reference = "quay.io/biocontainers/hmmer:3.3.2--h87f3376_2"
image_digest = "sha256:789abc..."
binaries = ["hmmbuild", "hmmsearch", "hmmscan", "jackhmmer", "phmmer"]

[binary_index]
blastn = "blast"
blastp = "blast"
makeblastdb = "blast"
hmmbuild = "hmmer"
hmmsearch = "hmmer"
hmmscan = "hmmer"
```

Both files belong in version control. `bv run` always uses the pinned digest. `.bv/` (the generated shim directory) is gitignored automatically.

---

## Reproducibility in CI

```yaml
- run: bv sync --frozen    # fails if bv.toml and bv.lock are inconsistent
- run: bv lock --check     # fails if bv.lock would change
- run: bv exec snakemake --cores 4
```

---

## Commands

| Command | Description |
|---------|-------------|
| `bv add <tool>[@ver]` | Add tools and pull their images |
| `bv remove <tool>` | Remove a tool |
| `bv run <binary|tool> [<args>]` | Run a binary or tool in its container |
| `bv exec <command>` | Run a command with all project binaries on PATH |
| `bv shell [--shell <sh>]` | Start an interactive subshell with binaries on PATH |
| `bv list` | Show installed tools with tier, digest, and size |
| `bv list --binaries` | Show the binary routing table |
| `bv search <query>` | Search the registry (text, type, tier filters) |
| `bv show <tool>` | Show typed I/O schema and metadata |
| `bv info <tool>` | Show lockfile-level detail |
| `bv lock [--check]` | Regenerate bv.lock; `--check` exits 1 if anything changed |
| `bv sync [--frozen]` | Pull all locked images and regenerate shims |
| `bv conformance <tool>` | Run the conformance test suite for a tool |
| `bv publish <source>` | Build and publish a tool to bv-registry |
| `bv data fetch <dataset>` | Download a reference dataset |
| `bv data list` | List locally cached datasets |
| `bv doctor` | Check runtimes, hardware, cache, and project state |

---

## The registry

Tools live in [mlberkeley/bv-registry](https://github.com/mlberkeley/bv-registry), a plain git repo of TOML manifests:

```
bv-registry/
  tools/
    blast/2.14.0.toml   2.15.0.toml
    hmmer/3.3.2.toml
    mmseqs2/17.0.0.toml
    colabfold/1.6.0.toml
    proteinmpnn/1.0.1.toml
  data/
    pdbaa/2024_01.toml
  index.json             # generated search index
```

Browse and filter at **https://mlberkeley.github.io/bv-registry/**

A full manifest:

```toml
[tool]
id = "blast"
version = "2.15.0"
description = "BLAST+ Basic Local Alignment Search Tool"
homepage = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"
license = "Public Domain"
tier = "core"
maintainers = ["github:ncbi"]

[tool.image]
backend = "docker"
reference = "ncbi/blast:2.15.0"

[tool.hardware]
cpu_cores = 4
ram_gb = 8.0
disk_gb = 2.0

[[tool.inputs]]
name = "query"
type = "fasta"
cardinality = "one"
description = "Query sequences in FASTA format"

[[tool.outputs]]
name = "output"
type = "blast_tab"
cardinality = "one"
description = "Tabular alignment results (outfmt 6)"

[tool.entrypoint]
command = "blastn"
args_template = "-query {query} -db {db} -out {output} -num_threads {cpu_cores}"

[tool.binaries]
exposed = [
  "blastn", "blastp", "tblastn", "tblastx",
  "makeblastdb", "blastdbcmd", "blastdb_aliastool",
]

```

The `[tool.smoke]` block is optional and only needed for unusual binaries. Most manifests omit it.

### Multi-script tools (subcommands)

ML repos typically bundle several entry scripts (`train.py`, `sample.py`, `eval.py`). Declare these as `[tool.subcommands]` instead of cramming them through one `[tool.entrypoint]`:

```toml
[tool]
id = "genie2"
version = "1.0.0"

[tool.image]
backend = "docker"
reference = "ghcr.io/aqlaboratory/genie2:1.0.0"

[tool.hardware]
cpu_cores = 4
ram_gb = 16.0

[tool.hardware.gpu]
required = true
min_vram_gb = 16
cuda_version = "12.1"

[tool.subcommands]
train                = ["python", "genie/train.py"]
sample_unconditional = ["python", "genie/sample_unconditional.py"]
sample_scaffold      = ["python", "genie/sample_scaffold.py"]
```

Use:

```sh
bv run genie2 sample_unconditional --name base --epoch 40 --scale 0.6 --outdir outputs
bv run genie2 train --devices 1 --num_nodes 1 --config runs/example/configuration
bv run genie2                                 # no args -> prints subcommand list
```

Subcommand names stay namespaced under the tool id, so generic names (`train`, `eval`) don't collide across tools. They are not added to PATH and don't appear in `bv list --binaries` (use `[tool.binaries]` for that). A manifest must declare `[tool.entrypoint]`, `[tool.subcommands]`, or both.

The default registry is used automatically. Override with `--registry <url>` or `BV_REGISTRY=<url>` for private registries.

---

## Workspace layout

| Crate | Role |
|---|---|
| `bv-cli` | Binary, clap CLI, command implementations |
| `bv-core` | Manifest/lockfile types, cache layout, errors |
| `bv-runtime` | `ContainerRuntime` trait + Docker implementation |
| `bv-runtime-apptainer` | Apptainer/Singularity implementation |
| `bv-index` | `IndexBackend` trait + Git registry implementation |
| `bv-types` | Bioinformatics type vocabulary (20 types) |
| `bv-conformance` | Conformance test runner for registry manifests |

---

## Development

```sh
git clone https://github.com/mlberkeley/bv
cd bv
cargo build
cargo test
cargo test --test integration -- --include-ignored   # needs Docker or Apptainer
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.

---

## License

Apache-2.0. See [LICENSE](LICENSE).