bio-rs
bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.
FASTA -> validated protein sequence -> token ids -> model-ready JSON
Status: v0.9.2 — CLI and JSON contract freeze.
Why bio-rs?
Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:
- local CLIs
- CI pipelines
- servers
- browsers
- agents
bio-rs focuses on the boring but important layer before inference:
- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks
The goal is not to replace Python research workflows.
The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.
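As a rough illustration of the parse and validate steps listed above, here is a minimal Rust sketch. It is not the biors-core API: `Record`, `Diagnostic`, and `parse_fasta` are hypothetical names, and the rules it applies (uppercase normalization, rejecting non-alphabetic characters) are assumptions chosen only to show what line and record-index diagnostics can look like.

```rust
/// One parsed FASTA record (hypothetical type, not the biors-core API).
#[derive(Debug, PartialEq)]
pub struct Record {
    pub id: String,
    pub sequence: String,
}

/// A structured diagnostic: 1-based line number, 0-based record index.
#[derive(Debug, PartialEq)]
pub struct Diagnostic {
    pub line: usize,
    pub record_index: Option<usize>,
    pub message: String,
}

/// Parse FASTA text and collect diagnostics instead of stopping at the
/// first problem. Residues are uppercased (an assumed normalization rule).
pub fn parse_fasta(input: &str) -> (Vec<Record>, Vec<Diagnostic>) {
    let mut records: Vec<Record> = Vec::new();
    let mut diagnostics: Vec<Diagnostic> = Vec::new();

    for (i, raw) in input.lines().enumerate() {
        let line_no = i + 1;
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(header) = line.strip_prefix('>') {
            // The identifier is the first whitespace-delimited header token.
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            if id.is_empty() {
                diagnostics.push(Diagnostic {
                    line: line_no,
                    record_index: Some(records.len()),
                    message: "empty record identifier".to_string(),
                });
            }
            records.push(Record { id, sequence: String::new() });
        } else if records.is_empty() {
            diagnostics.push(Diagnostic {
                line: line_no,
                record_index: None,
                message: "sequence data before first header".to_string(),
            });
        } else {
            let idx = records.len() - 1;
            for ch in line.chars() {
                if ch.is_ascii_alphabetic() {
                    records[idx].sequence.push(ch.to_ascii_uppercase());
                } else {
                    diagnostics.push(Diagnostic {
                        line: line_no,
                        record_index: Some(idx),
                        message: format!("unexpected character {:?}", ch),
                    });
                }
            }
        }
    }
    (records, diagnostics)
}
```

Returning diagnostics alongside the records, rather than failing on the first error, is one way to let a caller report every problem in a file in a single pass.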
Quickstart
Tokenize a FASTA file:
Pipe FASTA through stdin:
Validate FASTA:
Verify package fixture outputs:
Build model-ready input records:
Proof
bio-rs keeps performance claims tied to reproducible in-repo benchmarks.
Latest recorded FASTA benchmark baseline:
| Workflow | Mean time |
|---|---|
| biors tokenize parse + tokenize + full JSON output | 0.385s |
| Biopython parse + protein-20 token/count loop | 0.457s |
| Biopython parse only | 0.056s |
Benchmark details:
- Dataset: UniProt human reference proteome
- Proteome ID: UP000005640
- Taxonomy ID: 9606
- Shape: 20,659 records, 11,456,702 residues
- Benchmark doc: benchmarks/fasta_vs_biopython.md
- Benchmark script: scripts/benchmark_fasta_vs_biopython.py
This is a workload-specific reference-proteome baseline, not a broad claim that bio-rs is faster than Biopython across all FASTA workloads or all researcher input shapes.
What works today
biors-core provides the Rust engine and data contracts.
biors provides the CLI surface.
Current v0.9.2 capabilities:
- FASTA parsing and normalization
- FASTA validation with line and record-index diagnostics
- protein-20 tokenization
- residue warning/error reporting
- model-ready input records
- attention masks
- padding/truncation policy
- model-input CLI output
- model-input safety checks for unresolved residues
- package manifest inspect/validate
- runtime bridge planning reports
- manifest-relative asset validation
- SHA-256 package and fixture checksum verification
- package fixture verification from observed artifact paths
- JSON success/error envelopes
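To make the attention mask and padding/truncation items above concrete, here is a small Rust sketch of one common fixed-length policy. The `ModelInput` struct, its field names, the `build_model_input` function, and the pad ID are all hypothetical; the actual biors-core record shape and policy options are defined by its own contracts.

```rust
/// Hypothetical model-ready record; field names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct ModelInput {
    pub input_ids: Vec<u32>,
    pub attention_mask: Vec<u8>,
    pub truncated: bool,
}

/// Right-truncate or right-pad `token_ids` to exactly `max_length`.
/// The mask is 1 for real tokens and 0 for padding positions.
pub fn build_model_input(token_ids: &[u32], max_length: usize, pad_id: u32) -> ModelInput {
    let truncated = token_ids.len() > max_length;
    let kept = &token_ids[..token_ids.len().min(max_length)];

    let mut input_ids = kept.to_vec();
    let mut attention_mask = vec![1u8; kept.len()];

    // Fill the remaining positions with padding (no-op when already full).
    input_ids.resize(max_length, pad_id);
    attention_mask.resize(max_length, 0);

    ModelInput { input_ids, attention_mask, truncated }
}
```

For example, `build_model_input(&[0, 1, 2], 5, 20)` pads two positions and masks them out, while a four-token input with `max_length` 2 keeps the first two tokens and sets `truncated` to true.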
CLI examples
Inspect FASTA records:
Tokenize FASTA records:
Tokenize a multi-record FASTA file:
Validate FASTA records:
Emit structured JSON errors:
Build model-ready input records:
Inspect a package manifest:
Validate a package manifest:
Plan a runtime bridge from a package manifest:
Verify package fixture observations:
package verify expects the observations file to point at observed output artifact paths:
JSON contracts
Success output uses a stable envelope shape:
FASTA-backed commands keep input_hash in the legacy fnv1a64: format for backward compatibility. Package artifacts and fixture hashes use sha256: in manifests and verification reports.
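For readers unfamiliar with the fnv1a64: prefix, the sketch below shows the standard 64-bit FNV-1a hash in Rust. Which bytes bio-rs actually hashes (raw input, normalized sequence, or something else) and how it encodes the digest are not stated here, so the `input_hash` helper and its 16-digit lowercase hex format are assumptions.

```rust
/// Standard 64-bit FNV-1a over raw bytes.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        // FNV-1a: XOR the byte in first, then multiply by the prime.
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical helper: prefix the digest the way the text describes.
/// The zero-padded lowercase hex encoding is an assumption.
pub fn input_hash(bytes: &[u8]) -> String {
    format!("fnv1a64:{:016x}", fnv1a64(bytes))
}
```

FNV-1a is fast and stable but not collision-resistant, which fits its use here as a legacy input fingerprint while package artifacts rely on SHA-256.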
--json error mode emits structured errors:
Tokenization output is record-oriented:
Public contract docs:
Release history
Delivered:
- 0.6.0: package manifest inspect/validate
- 0.7.0: runtime bridge planning with package bridge
- 0.8.0: fixture verification with package verify
- 0.8.1: documentation, contribution guide, and benchmark baseline hardening
- 0.9.0: CLI and JSON contract freeze baseline
- 0.9.1: model-input CLI, checksum-backed package validation, benchmark refresh, and contract hardening
- 0.9.2: model-input safety hardening for unresolved residues and automated GitHub Release creation
Next:
- 1.0.0: stable public contracts and runtime-facing APIs after enough real-world package validation
Not yet
These are roadmap directions, not current capabilities:
- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows
Development
Run checks:
The check suite runs:
- cargo fmt
- Rust checks
- biors-core wasm32-unknown-unknown build check
- tests
- cargo clippy with warnings denied
Reproduce the FASTA benchmark:
Run the Rust library example:
Workspace
packages/
  rust/
    biors/        CLI
    biors-core/   Core engine + contracts
schemas/
  cli-error.v0.json
  cli-success.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  package-bridge-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenize-output.v0.json
examples/
  protein.fasta
  multi.fasta
  protein-package/
    models/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
Protein-20 alphabet
A C D E F G H I K L M N P Q R S T V W Y
Token IDs follow that order, starting at 0.
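The mapping described above can be sketched in a few lines of Rust. This is an illustration of the stated ordering only: `token_id` and `tokenize` are hypothetical names, and the strict handling here (uppercase only, first unknown residue returned as an error) is an assumption rather than the warning/error behavior of biors-core.

```rust
/// The protein-20 alphabet in token-ID order (A = 0 ... Y = 19).
const PROTEIN_20: &[u8; 20] = b"ACDEFGHIKLMNPQRSTVWY";

/// Map one uppercase residue to its token ID, or None if it is not
/// one of the 20 standard residues.
pub fn token_id(residue: char) -> Option<u8> {
    PROTEIN_20
        .iter()
        .position(|&b| b as char == residue)
        .map(|i| i as u8)
}

/// Tokenize a whole sequence. On failure, return the 0-based position
/// and the offending character of the first unknown residue.
pub fn tokenize(sequence: &str) -> Result<Vec<u8>, (usize, char)> {
    sequence
        .chars()
        .enumerate()
        .map(|(i, c)| token_id(c).ok_or((i, c)))
        .collect()
}
```

With this ordering, `tokenize("ACDY")` yields the IDs 0, 1, 2, 19, and a sequence containing `X` is rejected with the position of that residue.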
Contributing
See CONTRIBUTING.md for local setup, checks, and PR expectations.
License
Dual licensed under MIT OR Apache-2.0.