bio-rs

bio-rs turns biological FASTA into validated, tokenized, model-ready inputs for bio-AI workflows, covering protein, DNA, and RNA.

FASTA -> validated sequence -> token IDs -> model-ready JSON

DNA and RNA FASTA validation, tokenization, model-input generation, workflow generation, Python/WASM/MCP bindings, package artifact validation, and benchmark regression guards are supported through explicit nucleotide profiles. Package generation and Python/Hugging Face conversion remain protein-first; see the sequence-kind support matrix before making broad full-support claims.

Status: pre-1.0 CLI and JSON contract stabilization.

Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

local CLIs
CI pipelines
servers
browsers
agents

bio-rs focuses on the boring but important layer before inference:

parse biological sequence input
validate it with structured diagnostics
tokenize it into stable IDs
emit machine-readable JSON contracts
keep preprocessing reproducible outside notebooks
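
To make the parse, validate, and emit steps concrete, here is a minimal Python sketch. It is not the bio-rs implementation: the function names, the diagnostic fields, and the report layout are hypothetical, and the real output contracts are the JSON schemas shipped in the repository.

```python
import json

PROTEIN_20 = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Split FASTA text into records, keeping the source line of each residue line.

    Sketch only: residue lines that appear before the first header are ignored.
    """
    records = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            record_id = header.split()[0] if header else ""
            records.append({"id": record_id, "lines": []})
        elif records:
            records[-1]["lines"].append((line_no, line.upper()))
    return records


def validate(records):
    """Return one diagnostic per residue outside the protein-20 alphabet."""
    issues = []
    for index, record in enumerate(records):
        for line_no, chunk in record["lines"]:
            for column, residue in enumerate(chunk, start=1):
                if residue not in PROTEIN_20:
                    issues.append({
                        "record_index": index,  # hypothetical field names
                        "id": record["id"],
                        "line": line_no,
                        "column": column,
                        "residue": residue,
                    })
    return issues


fasta = ">sp|demo one\nACDE\nFGXH\n>sp|other\nKLMN\n"
records = parse_fasta(fasta)
report = {"records": len(records), "issues": validate(records)}
# Sorted keys keep the emitted JSON byte-stable across runs.
print(json.dumps(report, sort_keys=True))
```

The point of the sketch is the shape of the output: every problem carries a record index and a line number, so a CI job or agent can act on it without parsing free-form error text.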

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.

Quickstart

cargo install biors --version 0.47.9
biors tokenize examples/protein.fasta
biors workflow --max-length 8 examples/protein.fasta
biors batch validate --kind auto examples/
biors tokenizer inspect --profile protein-20-special
biors dataset inspect --source uniprot --version 2026_02 --split train examples/

Full commands, demos, and install options: docs/quickstart.md

Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Current release posture: no current-version numeric throughput claim is made for 0.47.9. The committed benchmark artifact is historical performance evidence from biors-core v0.20.0; rerun and commit a fresh artifact before using these numbers as evidence for a later release.

The current code includes Criterion regression guards for fixed-length model-input construction and selected backend smoke paths, but those implementation changes are not represented by the historical FASTA table below. The benchmark artifact records which promoted surfaces have committed numeric coverage and which are explicit non-claims.

Historical FASTA benchmark reference

Historical FASTA benchmark baseline (recorded on biors-core v0.20.0; not current-version performance evidence):

Dataset	Matched workload	bio-rs core mean	Biopython mean	bio-rs speedup
Human proteome	Parse + validation	0.036s	0.584s	16.09x
Human proteome	Parse + tokenization	0.061s	0.587s	9.68x
100MB+ FASTA	Parse + validation	0.294s	3.994s	13.59x
100MB+ FASTA	Parse + tokenization	0.492s	4.040s	8.22x
Many short records	Parse + validation	0.007s	0.204s	28.35x
Many short records	Parse + tokenization	0.010s	0.205s	20.54x
Single long sequence	Parse + validation	0.005s	0.176s	34.48x
Single long sequence	Parse + tokenization	0.007s	0.177s	26.67x
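
The speedup column appears to be the Biopython mean divided by the bio-rs mean. Because the means are printed to three decimals, recomputing the ratio from the printed values does not reproduce the speedup exactly (0.584 / 0.036 gives about 16.22, not 16.09). The Python sketch below, which assumes ordinary round-to-nearest on the printed means, checks that each reported speedup falls inside the range those rounded means allow.

```python
# Rows copied from the historical table above:
# (dataset, workload, bio-rs mean [s], Biopython mean [s], reported speedup)
ROWS = [
    ("Human proteome", "Parse + validation", 0.036, 0.584, 16.09),
    ("Human proteome", "Parse + tokenization", 0.061, 0.587, 9.68),
    ("100MB+ FASTA", "Parse + validation", 0.294, 3.994, 13.59),
    ("100MB+ FASTA", "Parse + tokenization", 0.492, 4.040, 8.22),
    ("Many short records", "Parse + validation", 0.007, 0.204, 28.35),
    ("Many short records", "Parse + tokenization", 0.010, 0.205, 20.54),
    ("Single long sequence", "Parse + validation", 0.005, 0.176, 34.48),
    ("Single long sequence", "Parse + tokenization", 0.007, 0.177, 26.67),
]

HALF_STEP = 0.0005  # assumption: means were rounded to the nearest 0.001 s


def speedup_bounds(biors_mean, biopython_mean):
    """Smallest and largest speedup compatible with the rounded means."""
    low = (biopython_mean - HALF_STEP) / (biors_mean + HALF_STEP)
    high = (biopython_mean + HALF_STEP) / (biors_mean - HALF_STEP)
    return low, high


def is_consistent(biors_mean, biopython_mean, reported, tol=0.005):
    """True if the reported speedup (itself rounded to 2 decimals) fits the bounds."""
    low, high = speedup_bounds(biors_mean, biopython_mean)
    return low - tol <= reported <= high + tol


for dataset, workload, biors_mean, biopython_mean, reported in ROWS:
    low, high = speedup_bounds(biors_mean, biopython_mean)
    status = "ok" if is_consistent(biors_mean, biopython_mean, reported) else "CHECK"
    print(f"{dataset:22} {workload:22} {low:6.2f}..{high:6.2f} reported {reported:6.2f} {status}")
```

The bounds are wide for the sub-10 ms rows, which is a reminder that the short-record and single-sequence speedups are more sensitive to timing noise than the proteome-scale ones.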

Benchmark details:

Datasets:
- UniProt human reference proteome (UP000005640, 9606)
- 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
- 20,000 short 48-residue records generated from the same proteome residue stream
- one 960,000-residue sequence generated from the same proteome residue stream
Matched workloads:
- pure parse
- parse plus validation
- parse plus tokenization
Best recorded raw throughput (historical, not a 0.47.9 claim):
- human proteome parse + validation: 315.4M residues/s, 360.6 MB/s
- 100MB+ FASTA parse + validation: 350.8M residues/s, 401.1 MB/s
- human proteome parse + tokenization: 189.0M residues/s, 216.1 MB/s
- 100MB+ FASTA parse + tokenization: 209.7M residues/s, 239.8 MB/s
Benchmark doc: benchmarks/fasta_vs_biopython.md
Benchmark script: scripts/benchmark_fasta_vs_biopython.py
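
The two throughput units can be cross-checked against each other. Dividing MB/s by M residues/s gives the number of input bytes per residue, which should be roughly constant for inputs derived from the same proteome. This Python sketch assumes that "M" and "MB" use the same scale factor (for example 10^6 for both); if the benchmark uses MiB, the absolute ratio shifts but the consistency check still holds.

```python
# Values copied from the throughput list above:
# name -> (million residues per second, megabytes per second)
THROUGHPUT = {
    "human proteome parse + validation": (315.4, 360.6),
    "100MB+ FASTA parse + validation": (350.8, 401.1),
    "human proteome parse + tokenization": (189.0, 216.1),
    "100MB+ FASTA parse + tokenization": (209.7, 239.8),
}


def bytes_per_residue(m_residues_per_s, mb_per_s):
    """Input bytes consumed per residue, assuming M and MB share one scale factor."""
    return mb_per_s / m_residues_per_s


ratios = {name: bytes_per_residue(*pair) for name, pair in THROUGHPUT.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f} bytes per residue")

spread = max(ratios.values()) - min(ratios.values())
print(f"spread across workloads: {spread:.4f}")
```

All four ratios land near 1.14, and the excess over one byte per residue is presumably FASTA headers and line breaks.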

This benchmark measures biors-core directly and excludes CLI startup and JSON serialization overhead. It is still workload-specific, not a broad claim that bio-rs is faster than Biopython across every FASTA workload or researcher input shape. Until the artifact is refreshed for 0.47.9, the numeric table above remains a historical reference.

What works today

biors-core provides the Rust engine and data contracts. biors provides the CLI surface.

Sequence handling

FASTA parsing and normalization with buffered reader APIs
Protein/DNA/RNA validation with per-record kind detection (--kind auto)
Line and record-index diagnostics with residue warning/error reporting
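
Per-record kind detection can be pictured as an alphabet-membership test. The Python sketch below is a hypothetical heuristic, not the rule bio-rs applies for --kind auto: the alphabets, the DNA-before-RNA-before-protein preference, and the "unknown" fallback are all assumptions made for illustration.

```python
# Assumed alphabets for this sketch; real nucleotide profiles also cover IUPAC codes.
DNA = set("ACGTN")
RNA = set("ACGUN")
PROTEIN_20 = set("ACDEFGHIKLMNPQRSTVWY")


def detect_kind(sequence):
    """Guess the sequence kind from the set of residues it contains.

    Ambiguous inputs such as "ACG" are valid in all three alphabets;
    this sketch resolves them in the order DNA, RNA, protein.
    """
    residues = set(sequence.upper())
    if not residues:
        return "unknown"
    if residues <= DNA:
        return "dna"
    if residues <= RNA:
        return "rna"
    if residues <= PROTEIN_20:
        return "protein"
    return "unknown"


records = {"gene": "ACGTTGCA", "transcript": "ACGUUGCA", "enzyme": "MKTAYIAKQR"}
kinds = {name: detect_kind(seq) for name, seq in records.items()}
print(kinds)
```

Short sequences are inherently ambiguous under any rule of this type, which is one reason to pass an explicit kind when the input type is already known.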

Tokenization

protein-20 tokenization with stable IDs
protein-20-special tokenization with UNK/PAD/CLS/SEP/MASK special tokens
dna-iupac and rna-iupac tokenization with stable canonical base IDs
dna-iupac-special and rna-iupac-special tokenization with UNK/PAD/CLS/SEP/MASK special tokens
JSON tokenizer config loading and inspection
Hugging Face tokenizer config conversion
Positional token alignment preserved with explicit unknown-token IDs
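
Positional alignment means an unrecognized residue is mapped to an explicit unknown-token ID instead of being dropped, so token i always corresponds to residue i. The Python sketch below illustrates the idea. The special-token IDs are an assumption (appended after the 20 residues); run biors tokenizer inspect --profile protein-20-special to see the IDs the profile really uses.

```python
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {residue: index for index, residue in enumerate(RESIDUES)}

# Hypothetical ID assignment: special tokens placed directly after the residues.
SPECIAL = {
    name: len(RESIDUES) + offset
    for offset, name in enumerate(["UNK", "PAD", "CLS", "SEP", "MASK"])
}


def tokenize(sequence):
    """Map each residue to an ID, using UNK for anything outside the alphabet."""
    return [VOCAB.get(residue, SPECIAL["UNK"]) for residue in sequence.upper()]


sequence = "ACXD"
ids = tokenize(sequence)
unknown_positions = [i for i, token in enumerate(ids) if token == SPECIAL["UNK"]]

print(ids)                # one ID per residue, including the unknown one
print(unknown_positions)  # 0-based positions that need attention
```

Because the output length always equals the input length, per-residue annotations and model outputs can be joined back to the original sequence by index.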

Model input

model-input CLI: profile-aware input_ids, attention_mask, and truncation metadata for protein, DNA, and RNA token profiles
workflow CLI: profile-aware validation → tokenization → model input with readiness issues and reproducibility provenance
pipeline CLI: no-config validate → tokenize → export, or config-driven (TOML/YAML/JSON) workflows with lockfile generation
debug CLI: step-by-step per-record inspection with compact residue markers
Checked and unchecked model-input builders with safety checks for unresolved residues
Python, WASM, MCP, package artifact validation, and regression benchmarks cover nucleotide model-ready workflows. Package skeleton/conversion helpers remain protein-first; see Protein, DNA, and RNA support.
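
Fixed-length model input boils down to truncating or padding the token IDs and recording what happened. This Python sketch shows the general technique under stated assumptions: input_ids and attention_mask are named in this README, but the truncation field names and the pad ID of 21 are hypothetical, and the real layout is defined by model-input-output.v0.json.

```python
def build_model_input(token_ids, max_length, pad_id):
    """Truncate or right-pad token IDs to max_length and build the matching mask."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    kept = list(token_ids[:max_length])
    padding = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * padding,
        # 1 marks real tokens, 0 marks padding.
        "attention_mask": [1] * len(kept) + [0] * padding,
        # Hypothetical metadata field names.
        "truncation": {
            "original_length": len(token_ids),
            "truncated": len(token_ids) > max_length,
        },
    }


PAD_ID = 21  # assumption for illustration only

short = build_model_input([0, 1, 2], max_length=5, pad_id=PAD_ID)
long = build_model_input(list(range(10)), max_length=8, pad_id=PAD_ID)
print(short)
print(long["truncation"])
```

Recording the original length alongside the truncated IDs lets downstream code detect silently shortened sequences instead of trusting the tensor shape alone.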

Batch and dataset operations

batch validate: multiple files, recursive directories, quoted globs
dataset inspect: dataset descriptors, sample mapping, file SHA-256 provenance
cache inspect and guarded cache clean for local artifact store
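
File-level SHA-256 provenance can be reproduced independently of bio-rs when you want to double-check a dataset. The Python sketch below hashes every matching file under a directory; the record layout and the *.fasta pattern are assumptions, not the dataset inspect output format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large FASTA inputs never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance(directory, pattern="*.fasta"):
    """Return one provenance record per matching file, in sorted path order."""
    return [
        {"file": path.name, "bytes": path.stat().st_size, "sha256": sha256_file(path)}
        for path in sorted(Path(directory).rglob(pattern))
    ]


# Demo on a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo.fasta").write_bytes(b">demo\nACDE\n")
    rows = provenance(tmp)
print(rows)
```

Sorting the paths keeps the report order stable, which matters if the report itself is later hashed or diffed.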

Package management

Manifest inspection, validation, and migration (v0 → v1)
Schema compatibility checks and canonical diffs
SHA-256 checksum verification and fixture verification
Python project to bio-rs package skeleton conversion
Runtime bridge planning reports, backend execution abstraction contracts, and guarded external-process backend adapters
Optional Candle backend crate for CPU safetensors linear-probe inference
Model artifact metadata and runtime/model compatibility checks in package bridge reports
Transport-agnostic service interface contract for service hosts, without bundling a server runtime
Typed validation issue codes and manifest enums

External interfaces

biors-python: PyO3 bindings for Python integration and notebook workflows
biors-wasm: WebAssembly/JavaScript bindings with TypeScript definitions
biors-mcp-server: local MCP server crate for agent-callable sequence tools
service contract: offline JSON route/schema contract for caller-owned service hosts

Utilities

diff: canonical JSON/raw comparison with SHA-256 hashes
doctor: core CLI, WASM, Python, package, release, and benchmark readiness
completions: shell completion generation
JSON success/error envelopes for all commands
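
A canonical JSON comparison ignores differences that do not change meaning, such as key order and whitespace, and then compares hashes of the canonical form. The Python sketch below shows one common way to do that. The canonicalization rule here (sorted keys, compact separators, UTF-8) is an assumption and may differ from what biors diff uses.

```python
import hashlib
import json


def canonical(value):
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def digest(value):
    """SHA-256 of the canonical UTF-8 serialization."""
    return hashlib.sha256(canonical(value).encode("utf-8")).hexdigest()


def compare(left, right):
    """Report whether two JSON values are canonically equal, with both hashes."""
    left_hash, right_hash = digest(left), digest(right)
    return {"equal": left_hash == right_hash, "left_sha256": left_hash, "right_sha256": right_hash}


a = json.loads('{"input_ids": [0, 1, 2], "id": "demo"}')
b = json.loads('{"id":"demo","input_ids":[0,1,2]}')   # same content, different layout
c = json.loads('{"id":"demo","input_ids":[0,1,3]}')   # one token differs

print(compare(a, b)["equal"])
print(compare(a, c)["equal"])
```

Array order is preserved on purpose: reordering input_ids changes the meaning, while reordering object keys does not.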

Documentation

Quickstart — install and first commands
Installation and distribution — cargo, binaries, completions
CLI contract — commands, JSON envelopes, exit codes
Package format — manifest layout and research metadata
Package conversion — HF/Python project conversion path
Candle backend — optional Candle runtime crate
Service interface — service-host contract and runtime boundary
Protein, DNA, and RNA support — public support matrix by surface
Pipeline config — config-driven static preprocessing workflows
Error code registry
Rust API
Python API
WASM API
Versioning policy
JSON schemas
Citation metadata

Not yet

These are roadmap directions, not current capabilities:

hosted web workflows
pretrained model-specific inference backends
package registry or plugin ecosystem
general-purpose chemistry tooling
structure tooling
no-code or low-code workflows

Development

Run checks:

scripts/check.sh

Run the faster local commit gate:

scripts/check-fast.sh

The check suite runs:

cargo fmt
shell and Python syntax checks for repo scripts
benchmark Markdown regeneration check
release workflow publish-order invariant check
Rust checks
biors-core wasm32-unknown-unknown build check
tests
cargo clippy with warnings denied

Reproduce the FASTA benchmark:

cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json

The benchmark script updates both benchmarks/fasta_vs_biopython.json and benchmarks/fasta_vs_biopython.md. scripts/check-benchmark-docs.sh verifies that the Markdown report still matches the JSON artifact.

Compare two benchmark artifacts:

python scripts/compare-benchmark-artifacts.py before.json after.json

Run the Rust library example:

cargo run -p biors-core --example tokenize

Workspace

packages/
  rust/
    biors/                 CLI
    biors-backend-candle/  Optional Candle runtime backend
    biors-core/            Core engine + contracts
    biors-mcp-server/      Local MCP server
    biors-python/          PyO3 bindings
    biors-wasm/            WASM/JS bindings

schemas/
  batch-validation-output.v0.json
  cache-output.v0.json
  cli-error.v0.json
  cli-success.v0.json
  dataset-inspect-output.v0.json
  doctor-output.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  output-diff.v0.json
  pipeline-config.v0.json
  pipeline-lock.v0.json
  pipeline-output.v0.json
  sequence-workflow-output.v0.json
  sequence-debug-output.v0.json
  service-interface-output.v0.json
  service-model-input-request.v0.json
  service-package-compatibility-request.v0.json
  service-package-request.v0.json
  service-sequence-inspect-request.v0.json
  service-sequence-tokenize-request.v0.json
  service-sequence-validate-request.v0.json
  package-bridge-output.v0.json
  package-compatibility-output.v0.json
  package-conversion-output.v0.json
  package-diff-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-manifest.v1.json
  package-migration-output.v0.json
  package-skeleton-output.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenizer-conversion-output.v0.json
  tokenizer-inspect-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  model-input-contract/
    protein.fasta
    protein-20-special.config.json
    protein-20-special.expected.json
    reference-python-parity.json
  python/
    esm_from_biors_json.py
    pandas_numpy_friendly.py
    protbert_from_biors_json.py
    reference_preprocess.py
  protein-package/
    models/
    docs/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
    pipelines/
  pipeline/
    protein.toml
    protein.yaml
    protein.json
    pipeline.lock

Protein-20 alphabet

A C D E F G H I K L M N P Q R S T V W Y

Token IDs follow that order, starting at 0.
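
As a quick illustration of that rule, the mapping can be written in a few lines of Python. This is a sketch of the ordering described above, not a call into biors-python, and the names are illustrative.

```python
PROTEIN_20 = "ACDEFGHIKLMNPQRSTVWY"

# Token ID = position of the residue in the alphabet, starting at 0.
TOKEN_ID = {residue: index for index, residue in enumerate(PROTEIN_20)}

print(TOKEN_ID["A"], TOKEN_ID["Y"])      # 0 19
print([TOKEN_ID[r] for r in "MKV"])      # [10, 8, 17]
```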

Contributing

See CONTRIBUTING.md for local setup, checks, and PR expectations.

License

Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software or publications, cite the repository and version via CITATION.cff.
