# bio-rs

[![CI](https://github.com/bio-rs/bio-rs/workflows/CI/badge.svg)](https://github.com/bio-rs/bio-rs/actions)
[![Release](https://github.com/bio-rs/bio-rs/actions/workflows/release.yml/badge.svg)](https://github.com/bio-rs/bio-rs/actions/workflows/release.yml)
[![Benchmark](https://img.shields.io/badge/benchmark-UniProt%20FASTA-blue)](benchmarks/fasta_vs_biopython.md)
[![Contracts](https://img.shields.io/badge/contracts-JSON%20v0-blue)](docs/public-contract-1.0-candidates.md)
[![License: MIT/Apache-2.0](https://img.shields.io/badge/License-MIT%2FApache--2.0-blue.svg)](LICENSE-MIT)

bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.

```txt
FASTA -> validated protein/DNA/RNA sequence -> protein token ids -> model-ready JSON
```

> Status: pre-1.0 CLI and JSON contract stabilization.

## Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

- local CLIs
- CI pipelines
- servers
- browsers
- agents

bio-rs focuses on the boring but important layer before inference:

- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.
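
To make the parse-and-validate step concrete, here is a minimal sketch in plain Rust. It is not the `biors-core` API: `Record`, `Diagnostic`, `ALLOWED_RESIDUES`, and `parse_and_validate` are hypothetical names chosen for illustration, and the sketch assumes protein input restricted to the 20 standard residues.

```rust
/// Hypothetical record type for illustration; not the biors-core API.
#[derive(Debug, PartialEq)]
struct Record {
    id: String,
    sequence: String,
}

/// Hypothetical diagnostic carrying the position of an unexpected residue.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    line: usize,         // 1-based line number in the input
    record_index: usize, // 0-based index of the record being read
    residue: char,
}

/// Assumption: only the 20 standard amino-acid letters are accepted.
const ALLOWED_RESIDUES: &str = "ACDEFGHIKLMNPQRSTVWY";

/// Parse FASTA text and collect a diagnostic for every residue outside the alphabet.
/// Sequence lines that appear before the first header are skipped in this sketch.
fn parse_and_validate(input: &str) -> (Vec<Record>, Vec<Diagnostic>) {
    let mut records: Vec<Record> = Vec::new();
    let mut diagnostics = Vec::new();
    for (i, raw) in input.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(header) = line.strip_prefix('>') {
            // The identifier is the first whitespace-delimited word of the header.
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            records.push(Record { id, sequence: String::new() });
        } else if !records.is_empty() {
            let record_index = records.len() - 1;
            for residue in line.chars() {
                if !ALLOWED_RESIDUES.contains(residue) {
                    diagnostics.push(Diagnostic { line: i + 1, record_index, residue });
                }
            }
            records[record_index].sequence.push_str(line);
        }
    }
    (records, diagnostics)
}
```

Calling `parse_and_validate(">seq1\nACDE\nFGXH\n")` yields one record with sequence `ACDEFGXH` and one diagnostic pointing at line 3, record 0, residue `X`. Returning diagnostics as data rather than failing on the first problem is what lets a caller report every issue in one pass.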

## Quickstart

```bash
cargo install biors --version 0.37.0
biors tokenize examples/protein.fasta
biors workflow --max-length 8 examples/protein.fasta
biors batch validate --kind auto examples/
biors tokenizer inspect --profile protein-20-special
biors dataset inspect --source uniprot --version 2026_02 --split train examples/
```

Full commands, demos, and install options: [docs/quickstart.md](docs/quickstart.md)

## Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Latest recorded FASTA benchmark baseline:

| Dataset | Matched workload | bio-rs core mean | Biopython mean | bio-rs speedup |
|---|---|---:|---:|---:|
| Human proteome | Parse + validation | **0.036s** | 0.584s | **16.09x** |
| Human proteome | Parse + tokenization | **0.061s** | 0.587s | **9.68x** |
| 100MB+ FASTA | Parse + validation | **0.294s** | 3.994s | **13.59x** |
| 100MB+ FASTA | Parse + tokenization | **0.492s** | 4.040s | **8.22x** |
| Many short records | Parse + validation | **0.007s** | 0.204s | **28.35x** |
| Many short records | Parse + tokenization | **0.010s** | 0.205s | **20.54x** |
| Single long sequence | Parse + validation | **0.005s** | 0.176s | **34.48x** |
| Single long sequence | Parse + tokenization | **0.007s** | 0.177s | **26.67x** |

Benchmark details:

- Datasets:
  - UniProt human reference proteome (`UP000005640`, `9606`)
  - 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
  - 20,000 short 48-residue records generated from the same proteome residue stream
  - one 960,000-residue sequence generated from the same proteome residue stream
- Matched workloads:
  - pure parse
  - parse plus validation
  - parse plus tokenization
- Current best recorded raw throughput:
  - human proteome parse + validation: `315.4M residues/s`, `360.6 MB/s`
  - 100MB+ FASTA parse + validation: `350.8M residues/s`, `401.1 MB/s`
  - human proteome parse + tokenization: `189.0M residues/s`, `216.1 MB/s`
  - 100MB+ FASTA parse + tokenization: `209.7M residues/s`, `239.8 MB/s`
- Benchmark doc: [benchmarks/fasta_vs_biopython.md](benchmarks/fasta_vs_biopython.md)
- Benchmark script: [scripts/benchmark_fasta_vs_biopython.py](scripts/benchmark_fasta_vs_biopython.py)

This benchmark measures `biors-core` directly and excludes CLI startup and JSON
serialization overhead. It is still workload-specific, not a broad claim that
bio-rs is faster than Biopython across every FASTA workload or researcher input
shape.
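
The speedup column reads as the Biopython mean divided by the bio-rs mean. A small sketch of that arithmetic follows; `speedup` is a hypothetical helper, not part of the benchmark script. It assumes the published ratios come from unrounded timings, so recomputing from the rounded cells above may differ in the last digits.

```rust
/// Speedup of a candidate over a baseline: baseline mean time / candidate mean time.
/// Hypothetical helper for illustration; both arguments are mean wall times in seconds.
fn speedup(baseline_mean_s: f64, candidate_mean_s: f64) -> f64 {
    baseline_mean_s / candidate_mean_s
}
```

For the 100MB+ parse + validation row, `speedup(3.994, 0.294)` is about 13.585, which rounds to the 13.59x in the table.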

## What works today

`biors-core` provides the Rust engine and data contracts.

`biors` provides the CLI surface.

Current capabilities:

- FASTA parsing and normalization
- shared FASTA parser/tokenizer scanner with an ASCII fast path and Unicode fallback
- buffered reader APIs for FASTA parse/validate/tokenize paths
- FASTA validation with line and record-index diagnostics
- FASTA record identifier validation
- protein-20 tokenization
- `protein-20-special` tokenization with explicit UNK/PAD/CLS/SEP/MASK policy
- tokenizer JSON config loading
- tokenizer inspection JSON output
- JSON vocab loading for tokenizer contracts
- positional token alignment preserved with explicit unknown-token IDs for unresolved residues
- residue warning/error reporting
- model-ready input records
- attention masks
- padding/truncation policy
- `model-input` CLI output
- `workflow` CLI output that combines validation, tokenization, model input,
  readiness issues, and reproducibility provenance
- workflow provenance hashes for tokenizer vocabulary and output-content
  reproducibility
- `diff` CLI output for canonical JSON/raw output comparison with SHA-256
  hashes and first-difference metadata
- `pipeline` CLI output for no-config validate -> tokenize -> export workflow
  composition
- pipeline lockfile generation for config-driven workflows with package/model
  and runtime provenance pins
- `debug` CLI output for sequence -> token -> model-input step inspection and
  compact residue error visualization
- `batch validate` for multiple files, recursive directory inputs, quoted glob
  inputs, empty-glob errors, and memory-bounded validation summaries
- `dataset inspect` for shared FASTA file/directory/glob input resolution
  before validation or pipeline execution, with dataset descriptors, sample
  mapping, dataset hashes, and file-level SHA-256 provenance
- `cache inspect` and guarded `cache clean` for the local artifact store policy
  used by package and dataset workflows
- `doctor` CLI diagnostics for platform, toolchain, WASM target, and committed fixture readiness
- model-input safety checks for unresolved residues
- explicit checked and unchecked model-input builders
- writer-based CLI success JSON serialization to reduce peak allocations for large outputs
- package manifest inspect/validate
- package manifest migration planning, schema compatibility checks, and canonical diffs
- package manifest v0 to v1 conversion with explicit research metadata input
- Hugging Face tokenizer config conversion to bio-rs tokenizer config
- Python project to bio-rs package skeleton generation with manifest,
  tokenizer, pipeline, fixture, docs, and checksum output
- typed package validation issue codes
- typed package manifest enums for schema version, model format, runtime target, and tensor dtypes
- runtime bridge planning reports
- manifest-relative asset validation
- package preprocessing steps can reference checked pipeline config artifacts
- package path escape rejection for manifest and observation assets
- SHA-256 package and fixture checksum verification
- package fixture verification from observed artifact paths
- structured package fixture mismatch issue codes and first-difference reports
- committed FASTA, tokenizer, manifest, and verification fixtures
- draft model-input contract and reference Python preprocessing parity fixtures
- JSON success/error envelopes
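
As an illustration of the padding/truncation and attention-mask items above, here is a minimal sketch in plain Rust. The `ModelInput` shape, the `build_model_input` name, and the pad ID passed by the caller are assumptions for this example; the actual record layout is defined by the JSON schemas in `schemas/`.

```rust
/// Hypothetical record shape for illustration; not the biors-core contract.
#[derive(Debug, PartialEq)]
struct ModelInput {
    input_ids: Vec<u32>,
    attention_mask: Vec<u8>, // 1 for real tokens, 0 for padding
    truncated: bool,
}

/// Pad or truncate `token_ids` to exactly `max_length` entries.
/// `pad_id` is an assumed value supplied by the caller.
fn build_model_input(token_ids: &[u32], max_length: usize, pad_id: u32) -> ModelInput {
    // Keep at most `max_length` tokens from the front of the sequence.
    let kept = token_ids.len().min(max_length);
    let mut input_ids = token_ids[..kept].to_vec();
    let mut attention_mask = vec![1u8; kept];
    // Fill the remainder with padding and mark it as masked out.
    input_ids.resize(max_length, pad_id);
    attention_mask.resize(max_length, 0);
    ModelInput {
        input_ids,
        attention_mask,
        truncated: token_ids.len() > max_length,
    }
}
```

With an assumed pad ID of `21`, `build_model_input(&[0, 1, 2], 5, 21)` gives `input_ids` of `[0, 1, 2, 21, 21]` and an `attention_mask` of `[1, 1, 1, 0, 0]`. Recording `truncated` explicitly keeps a shortened sequence from passing silently as a complete one.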

## Documentation

- [Quickstart](docs/quickstart.md): install, first commands, demos
- [Launch demo](docs/demo.md): researcher-facing demo workflow
- [Installation and distribution](docs/install.md): cargo, binaries, completions
- [CLI contract](docs/cli-contract.md): commands, JSON envelopes, exit codes
- [Package format](docs/package-format.md): manifest layout and research metadata
- [Package conversion](docs/package-conversion.md): HF/Python project conversion path
- [Pipeline config](docs/pipeline-config.md): config-driven static preprocessing workflows
- [Dataset inputs and artifact store](docs/dataset-inputs.md)
- [Error code registry](docs/error-codes.md)
- [Reliability and input safety](docs/reliability.md)
- [Python interop](docs/python-interop.md)
- [WASM readiness](docs/wasm-readiness.md)
- [1.0 contract candidates](docs/public-contract-1.0-candidates.md)
- [Versioning policy](docs/versioning.md)
- [Schema versioning](docs/schema-versioning.md)
- [Final release checklist](docs/final-release-checklist.md)
- [Changelog](CHANGELOG.md)
- [JSON schemas](schemas)
- [Citation metadata](CITATION.cff)

## Not yet

These are roadmap directions, not current capabilities:

- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows

## Development

Run checks:

```bash
scripts/check.sh
```

Run the faster local commit gate:

```bash
scripts/check-fast.sh
```

The check suite runs:

- `cargo fmt`
- shell and Python syntax checks for repo scripts
- benchmark Markdown regeneration check
- release workflow publish-order invariant check
- Rust checks
- `biors-core` `wasm32-unknown-unknown` build check
- tests
- `cargo clippy` with warnings denied

Reproduce the FASTA benchmark:

```bash
cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json
```

The benchmark script updates both `benchmarks/fasta_vs_biopython.json` and
`benchmarks/fasta_vs_biopython.md`. `scripts/check-benchmark-docs.sh` verifies
that the Markdown report still matches the JSON artifact.

Compare two benchmark artifacts:

```bash
python scripts/compare-benchmark-artifacts.py before.json after.json
```

Run the Rust library example:

```bash
cargo run -p biors-core --example tokenize
```

## Workspace

```txt
packages/
  rust/
    biors/       CLI
    biors-core/  Core engine + contracts

schemas/
  batch-validation-output.v0.json
  cache-output.v0.json
  cli-error.v0.json
  cli-success.v0.json
  dataset-inspect-output.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  output-diff.v0.json
  pipeline-output.v0.json
  pipeline-config.v0.json
  pipeline-lock.v0.json
  sequence-workflow-output.v0.json
  sequence-debug-output.v0.json
  package-bridge-output.v0.json
  package-compatibility-output.v0.json
  package-conversion-output.v0.json
  package-diff-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-manifest.v1.json
  package-migration-output.v0.json
  package-skeleton-output.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenizer-conversion-output.v0.json
  tokenizer-inspect-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  model-input-contract/
    protein-20-special.config.json
    protein-20-special.expected.json
    reference-python-parity.json
  python/
    esm_from_biors_json.py
    pandas_numpy_friendly.py
    protbert_from_biors_json.py
    reference_preprocess.py
  protein-package/
    models/
    docs/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
    pipelines/
  pipeline/
    protein.toml
    protein.yaml
    protein.json
    pipeline.lock
```

## Protein-20 alphabet

```txt
A C D E F G H I K L M N P Q R S T V W Y
```

Token IDs follow that order, starting at `0`.
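
The mapping can be sketched in a few lines of plain Rust. This is an illustration of the ordering rule only, not the `biors-core` tokenizer: `PROTEIN_20`, `token_id`, and `tokenize` are hypothetical names, and residues outside the alphabet are returned as `None` here rather than mapped to an explicit unknown-token ID.

```rust
/// The protein-20 alphabet in the order listed above; a residue's token ID is its index.
const PROTEIN_20: &[u8; 20] = b"ACDEFGHIKLMNPQRSTVWY";

/// Hypothetical helper: map one residue to its token ID.
/// Returns `None` for anything outside the 20-letter alphabet.
fn token_id(residue: char) -> Option<u8> {
    PROTEIN_20
        .iter()
        .position(|&b| b as char == residue)
        .map(|i| i as u8)
}

/// Tokenize a sequence, keeping exactly one output slot per input residue
/// so positions stay aligned even when a residue is not recognized.
fn tokenize(sequence: &str) -> Vec<Option<u8>> {
    sequence.chars().map(token_id).collect()
}
```

Under this rule `A` maps to `0`, `C` to `1`, and `Y` to `19`, while a residue such as `X` has no protein-20 ID.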

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for local setup, checks, and PR expectations.

## License

Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software
or publications, cite the repository and version via [CITATION.cff](CITATION.cff).