biors 0.47.9

Command-line tools for bio-rs biological AI model input workflows.

# bio-rs

[![CI](https://github.com/bio-rs/bio-rs/workflows/CI/badge.svg)](https://github.com/bio-rs/bio-rs/actions)
[![Release](https://github.com/bio-rs/bio-rs/actions/workflows/release.yml/badge.svg)](https://github.com/bio-rs/bio-rs/actions/workflows/release.yml)
[![Benchmark](https://img.shields.io/badge/benchmark-UniProt%20FASTA-blue)](benchmarks/fasta_vs_biopython.md)
[![Contracts](https://img.shields.io/badge/contracts-JSON%20v0-blue)](docs/cli-contract.md)
[![License: MIT/Apache-2.0](https://img.shields.io/badge/License-MIT%2FApache--2.0-blue.svg)](LICENSE-MIT)

bio-rs turns protein, DNA, and RNA FASTA into validated, tokenized, model-ready inputs for bio-AI workflows.

```txt
FASTA -> validated sequence -> token IDs -> model-ready JSON
```
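
To make the stages concrete, here is a minimal sketch of the same flow for a single protein record. It is an illustration only, not the `biors-core` API: the function name `record_to_json`, the JSON field names, and the strict fail-on-first-invalid-residue behavior are all assumptions, and the record ID is assumed to need no JSON escaping.

```rust
/// Protein-20 residues in token-ID order (see "Protein-20 alphabet").
const ALPHABET: &str = "ACDEFGHIKLMNPQRSTVWY";

/// Hypothetical helper: runs one record through validate -> tokenize -> JSON.
/// Returns `Err` with the 0-based position and value of the first residue
/// that is not in the alphabet.
fn record_to_json(id: &str, sequence: &str) -> Result<String, (usize, char)> {
    let mut token_ids = Vec::with_capacity(sequence.len());
    for (pos, residue) in sequence.chars().enumerate() {
        // The alphabet is ASCII, so the byte offset returned by `find`
        // is also the token ID.
        match ALPHABET.find(residue.to_ascii_uppercase()) {
            Some(token_id) => token_ids.push(token_id.to_string()),
            None => return Err((pos, residue)),
        }
    }
    // Hand-rolled JSON keeps the sketch dependency-free; the field names
    // are illustrative and do not follow the real JSON v0 contract.
    Ok(format!(
        "{{\"id\":\"{}\",\"input_ids\":[{}]}}",
        id,
        token_ids.join(",")
    ))
}
```

Under these assumptions, `record_to_json("seq1", "ACD")` yields `{"id":"seq1","input_ids":[0,1,2]}`, while `record_to_json("seq1", "AXC")` returns `Err((1, 'X'))`.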

DNA and RNA FASTA validation, tokenization, model-input generation, workflow generation, Python/WASM/MCP bindings, package artifact validation, and benchmark regression guards are supported through explicit nucleotide profiles. Package generation and Python/Hugging Face conversion remain protein-first; see the [sequence-kind support matrix](docs/sequence-kind-support.md) before making broad full-support claims.

> Status: pre-1.0 CLI and JSON contract stabilization.

## Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

- local CLIs
- CI pipelines
- servers
- browsers
- agents

bio-rs focuses on the boring but important layer before inference:

- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.

## Quickstart

```bash
cargo install biors --version 0.47.9
biors tokenize examples/protein.fasta
biors workflow --max-length 8 examples/protein.fasta
biors batch validate --kind auto examples/
biors tokenizer inspect --profile protein-20-special
biors dataset inspect --source uniprot --version 2026_02 --split train examples/
```

Full commands, demos, and install options: [docs/quickstart.md](docs/quickstart.md)

## Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Current release posture: no current-version numeric throughput claim is made for
`0.47.9`. The committed benchmark artifact is historical performance evidence
from `biors-core v0.20.0`; rerun and commit a fresh artifact before using these
numbers as evidence for a later release.

The current code includes Criterion regression guards for fixed-length
model-input construction and selected backend smoke paths, but those
implementation changes are not represented by the historical FASTA table below.
The benchmark artifact records which promoted surfaces have committed numeric
coverage and which are explicit non-claims.

### Historical FASTA benchmark reference

Historical FASTA benchmark baseline (recorded on `biors-core v0.20.0`; not current-version performance evidence):

| Dataset | Matched workload | bio-rs core mean | Biopython mean | bio-rs speedup |
|---|---|---:|---:|---:|
| Human proteome | Parse + validation | **0.036s** | 0.584s | **16.09x** |
| Human proteome | Parse + tokenization | **0.061s** | 0.587s | **9.68x** |
| 100MB+ FASTA | Parse + validation | **0.294s** | 3.994s | **13.59x** |
| 100MB+ FASTA | Parse + tokenization | **0.492s** | 4.040s | **8.22x** |
| Many short records | Parse + validation | **0.007s** | 0.204s | **28.35x** |
| Many short records | Parse + tokenization | **0.010s** | 0.205s | **20.54x** |
| Single long sequence | Parse + validation | **0.005s** | 0.176s | **34.48x** |
| Single long sequence | Parse + tokenization | **0.007s** | 0.177s | **26.67x** |

Benchmark details:

- Datasets:
  - UniProt human reference proteome (`UP000005640`, `9606`)
  - 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
  - 20,000 short 48-residue records generated from the same proteome residue stream
  - one 960,000-residue sequence generated from the same proteome residue stream
- Matched workloads:
  - pure parse
  - parse plus validation
  - parse plus tokenization
- Current best recorded raw throughput:
  - human proteome parse + validation: `315.4M residues/s`, `360.6 MB/s`
  - 100MB+ FASTA parse + validation: `350.8M residues/s`, `401.1 MB/s`
  - human proteome parse + tokenization: `189.0M residues/s`, `216.1 MB/s`
  - 100MB+ FASTA parse + tokenization: `209.7M residues/s`, `239.8 MB/s`
- Benchmark doc: [benchmarks/fasta_vs_biopython.md](benchmarks/fasta_vs_biopython.md)
- Benchmark script: [scripts/benchmark_fasta_vs_biopython.py](scripts/benchmark_fasta_vs_biopython.py)

This benchmark measures `biors-core` directly and excludes CLI startup and JSON
serialization overhead. It is still workload-specific, not a broad claim that
bio-rs is faster than Biopython across every FASTA workload or researcher input
shape. Until the artifact is refreshed for `0.47.9`, the numeric table above
remains a historical reference.

## What works today

`biors-core` provides the Rust engine and data contracts. `biors` provides the CLI surface.

### Sequence handling
- FASTA parsing and normalization with buffered reader APIs
- Protein/DNA/RNA validation with per-record kind detection (`--kind auto`)
- Line and record-index diagnostics with residue warning/error reporting
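
As a rough illustration of what line and record-index diagnostics can look like, the sketch below scans FASTA text and reports every residue outside a given alphabet. The `Diagnostic` struct and `validate_fasta` function are hypothetical names for this example and are not the types exposed by `biors-core`.

```rust
/// Hypothetical diagnostic shape for this sketch.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    record_index: usize, // 0-based index of the record in the input
    line: usize,         // 1-based line number in the input
    residue: char,       // the offending residue as written
}

/// Scans FASTA text and reports every residue that is not in `alphabet`.
/// Comparison is case-insensitive for ASCII letters.
fn validate_fasta(text: &str, alphabet: &str) -> Vec<Diagnostic> {
    let mut diagnostics = Vec::new();
    let mut record_index: Option<usize> = None;
    for (i, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.starts_with('>') {
            // A header line starts the next record.
            record_index = Some(record_index.map_or(0, |n| n + 1));
            continue;
        }
        // Sequence lines before the first header are skipped in this sketch.
        let Some(idx) = record_index else { continue };
        for residue in line.chars() {
            if !alphabet.contains(residue.to_ascii_uppercase()) {
                diagnostics.push(Diagnostic {
                    record_index: idx,
                    line: i + 1,
                    residue,
                });
            }
        }
    }
    diagnostics
}
```

For the input `">a\nACDE\n>b\nACXZ\n"` checked against the protein-20 alphabet, this sketch reports two diagnostics, both for record index `1` on line `4`, one for `X` and one for `Z`.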

### Tokenization
- `protein-20` tokenization with stable IDs
- `protein-20-special` tokenization with UNK/PAD/CLS/SEP/MASK special tokens
- `dna-iupac` and `rna-iupac` tokenization with stable canonical base IDs
- `dna-iupac-special` and `rna-iupac-special` tokenization with UNK/PAD/CLS/SEP/MASK special tokens
- JSON tokenizer config loading and inspection
- Hugging Face tokenizer config conversion
- Positional token alignment preserved with explicit unknown-token IDs
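
The positional-alignment point can be sketched as follows: each residue produces exactly one token, and residues outside the alphabet map to an explicit unknown ID instead of being dropped. The value `UNK_ID = 20` and the function `tokenize_aligned` are assumptions for illustration; the real IDs come from the tokenizer profile. The sketch also assumes ASCII input.

```rust
const ALPHABET: &[u8] = b"ACDEFGHIKLMNPQRSTVWY";

/// Hypothetical unknown-token ID, chosen here as the first ID after the
/// 20 residue IDs. The real profile may assign a different value.
const UNK_ID: u32 = 20;

/// Tokenizes residue by residue so that `ids[i]` always corresponds to
/// byte `i` of `sequence`. Also returns the positions mapped to `UNK_ID`.
fn tokenize_aligned(sequence: &str) -> (Vec<u32>, Vec<usize>) {
    let mut ids = Vec::with_capacity(sequence.len());
    let mut unknown_positions = Vec::new();
    for (pos, byte) in sequence.bytes().enumerate() {
        let upper = byte.to_ascii_uppercase();
        match ALPHABET.iter().position(|&r| r == upper) {
            Some(id) => ids.push(id as u32),
            None => {
                // Keep the slot so downstream positions do not shift.
                ids.push(UNK_ID);
                unknown_positions.push(pos);
            }
        }
    }
    (ids, unknown_positions)
}
```

With these assumptions, `tokenize_aligned("ACXD")` returns the IDs `[0, 1, 20, 2]` and the unknown positions `[2]`, so the output length always equals the input length.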

### Model input
- `model-input` CLI: profile-aware `input_ids`, `attention_mask`, and truncation metadata for protein, DNA, and RNA token profiles
- `workflow` CLI: profile-aware validation → tokenization → model input with readiness issues and reproducibility provenance
- `pipeline` CLI: no-config validate → tokenize → export, or config-driven (TOML/YAML/JSON) workflows with lockfile generation
- `debug` CLI: step-by-step per-record inspection with compact residue markers
- Checked and unchecked model-input builders with safety checks for unresolved residues
- Python, WASM, MCP, package artifact validation, and regression benchmarks cover nucleotide model-ready workflows. Package skeleton/conversion helpers remain protein-first; see [Protein, DNA, and RNA support](docs/sequence-kind-support.md).
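
For readers unfamiliar with fixed-length model inputs, the sketch below shows one common way to derive `input_ids`, `attention_mask`, and truncation metadata from a token list. The `ModelInput` struct, the `build_model_input` function, and `PAD_ID = 21` are assumptions for this example; they do not reproduce the `biors-core` builders or the `model-input` JSON schema.

```rust
/// Hypothetical padding ID; the real value comes from the tokenizer profile.
const PAD_ID: u32 = 21;

/// Illustrative output shape, not the actual JSON contract.
#[derive(Debug, PartialEq)]
struct ModelInput {
    input_ids: Vec<u32>,
    attention_mask: Vec<u8>,
    truncated: bool,
    original_length: usize,
}

/// Truncates or right-pads `token_ids` to exactly `max_length` entries.
/// The mask is 1 for real tokens and 0 for padding.
fn build_model_input(token_ids: &[u32], max_length: usize) -> ModelInput {
    let kept = token_ids.len().min(max_length);
    let mut input_ids = token_ids[..kept].to_vec();
    let mut attention_mask = vec![1u8; kept];
    // `resize` is a no-op when the vectors already have `max_length` items.
    input_ids.resize(max_length, PAD_ID);
    attention_mask.resize(max_length, 0);
    ModelInput {
        input_ids,
        attention_mask,
        truncated: token_ids.len() > max_length,
        original_length: token_ids.len(),
    }
}
```

In this sketch, three tokens with `max_length = 5` produce `input_ids` of `[0, 1, 2, 21, 21]` and a mask of `[1, 1, 1, 0, 0]`, while four tokens with `max_length = 2` are cut to two entries with `truncated` set to `true`.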

### Batch and dataset operations
- `batch validate`: multiple files, recursive directories, quoted globs
- `dataset inspect`: dataset descriptors, sample mapping, file SHA-256 provenance
- `cache inspect` and guarded `cache clean` for the local artifact store

### Package management
- Manifest inspection, validation, and migration (v0 → v1)
- Schema compatibility checks and canonical diffs
- SHA-256 checksum verification and fixture verification
- Python project to bio-rs package skeleton conversion
- Runtime bridge planning reports, backend execution abstraction contracts, and
  guarded external-process backend adapters
- Optional Candle backend crate for CPU safetensors linear-probe inference
- Model artifact metadata and runtime/model compatibility checks in package
  bridge reports
- Transport-agnostic service interface contract for service hosts, without
  bundling a server runtime
- Typed validation issue codes and manifest enums

### External interfaces
- `biors-python`: PyO3 bindings for Python integration and notebook workflows
- `biors-wasm`: WebAssembly/JavaScript bindings with TypeScript definitions
- `biors-mcp-server`: local MCP server crate for agent-callable sequence tools
- `service contract`: offline JSON route/schema contract for caller-owned
  service hosts

### Utilities
- `diff`: canonical JSON/raw comparison with SHA-256 hashes
- `doctor`: readiness checks for the core CLI, WASM, Python, packages, releases, and benchmarks
- `completions`: shell completion generation
- JSON success/error envelopes for all commands

## Documentation

- [Quickstart](docs/quickstart.md): install and first commands
- [Installation and distribution](docs/install.md): cargo, binaries, completions
- [CLI contract](docs/cli-contract.md): commands, JSON envelopes, exit codes
- [Package format](docs/package-format.md): manifest layout and research metadata
- [Package conversion](docs/package-conversion.md): HF/Python project conversion path
- [Candle backend](docs/candle-backend.md): optional Candle runtime crate
- [Service interface](docs/service-interface.md): service-host contract and runtime boundary
- [Protein, DNA, and RNA support](docs/sequence-kind-support.md): public support matrix by surface
- [Pipeline config](docs/pipeline-config.md): config-driven static preprocessing workflows
- [Error code registry](docs/error-codes.md)
- [Rust API](docs/rust-api.md)
- [Python API](docs/python-api.md)
- [WASM API](docs/wasm-api.md)
- [Versioning policy](docs/versioning.md)
- [JSON schemas](schemas)
- [Citation metadata](CITATION.cff)

## Not yet

These are roadmap directions, not current capabilities:

- hosted web workflows
- pretrained model-specific inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows

## Development

Run checks:

```bash
scripts/check.sh
```

Run the faster local commit gate:

```bash
scripts/check-fast.sh
```

The check suite runs:

- `cargo fmt`
- shell and Python syntax checks for repo scripts
- benchmark Markdown regeneration check
- release workflow publish-order invariant check
- Rust checks
- `biors-core` `wasm32-unknown-unknown` build check
- tests
- `cargo clippy` with warnings denied

Reproduce the FASTA benchmark:

```bash
cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json
```

The benchmark script updates both `benchmarks/fasta_vs_biopython.json` and
`benchmarks/fasta_vs_biopython.md`. `scripts/check-benchmark-docs.sh` verifies
that the Markdown report still matches the JSON artifact.

Compare two benchmark artifacts:

```bash
python scripts/compare-benchmark-artifacts.py before.json after.json
```

Run the Rust library example:

```bash
cargo run -p biors-core --example tokenize
```

## Workspace

```txt
packages/
  rust/
    biors/                 CLI
    biors-backend-candle/  Optional Candle runtime backend
    biors-core/            Core engine + contracts
    biors-mcp-server/      Local MCP server
    biors-python/          PyO3 bindings
    biors-wasm/            WASM/JS bindings

schemas/
  batch-validation-output.v0.json
  cache-output.v0.json
  cli-error.v0.json
  cli-success.v0.json
  dataset-inspect-output.v0.json
  doctor-output.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  output-diff.v0.json
  pipeline-config.v0.json
  pipeline-lock.v0.json
  pipeline-output.v0.json
  sequence-workflow-output.v0.json
  sequence-debug-output.v0.json
  service-interface-output.v0.json
  service-model-input-request.v0.json
  service-package-compatibility-request.v0.json
  service-package-request.v0.json
  service-sequence-inspect-request.v0.json
  service-sequence-tokenize-request.v0.json
  service-sequence-validate-request.v0.json
  package-bridge-output.v0.json
  package-compatibility-output.v0.json
  package-conversion-output.v0.json
  package-diff-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-manifest.v1.json
  package-migration-output.v0.json
  package-skeleton-output.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenizer-conversion-output.v0.json
  tokenizer-inspect-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  model-input-contract/
    protein.fasta
    protein-20-special.config.json
    protein-20-special.expected.json
    reference-python-parity.json
  python/
    esm_from_biors_json.py
    pandas_numpy_friendly.py
    protbert_from_biors_json.py
    reference_preprocess.py
  protein-package/
    models/
    docs/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
    pipelines/
  pipeline/
    protein.toml
    protein.yaml
    protein.json
    pipeline.lock
```

## Protein-20 alphabet

```txt
A C D E F G H I K L M N P Q R S T V W Y
```

Token IDs follow that order, starting at `0`.
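
The ordering rule can be expressed as a simple lookup. This is a minimal sketch under the stated ordering, with a hypothetical helper name (`protein20_id`); it is not the `biors-core` tokenizer API and it handles uppercase residues only.

```rust
/// The 20 residues listed above, in token-ID order.
const PROTEIN_20: &[u8; 20] = b"ACDEFGHIKLMNPQRSTVWY";

/// Returns the token ID for an uppercase residue, or `None` if the residue
/// is not part of the alphabet. Illustrative helper only.
fn protein20_id(residue: u8) -> Option<u8> {
    PROTEIN_20
        .iter()
        .position(|&r| r == residue)
        .map(|index| index as u8)
}
```

Under this sketch, `protein20_id(b'A')` is `Some(0)`, `protein20_id(b'Y')` is `Some(19)`, and a residue outside the alphabet such as `X` yields `None`.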

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for local setup, checks, and PR expectations.

## License

Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software
or publications, cite the repository and version via [CITATION.cff](CITATION.cff).