bio-rs
bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.
FASTA -> validated protein sequence -> token ids -> model-ready JSON
Status: v0.9.5 — CLI and JSON contract freeze.
Why bio-rs?
Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:
- local CLIs
- CI pipelines
- servers
- browsers
- agents
bio-rs focuses on the boring but important layer before inference:
- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks
The goal is not to replace Python research workflows.
The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.
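As a rough illustration of that input layer, the sketch below parses FASTA text, normalizes residues to uppercase, and collects structured diagnostics instead of failing on the first problem. It is standalone Rust and does not use the `biors-core` API: `Record`, `Diagnostic`, and `parse_and_validate` are hypothetical names, and the exact diagnostic fields are assumptions.

```rust
// Hypothetical sketch: these types and functions are illustrative,
// not the biors-core API.

/// One parsed FASTA record.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub id: String,
    pub sequence: String,
}

/// A diagnostic tied to a 1-based input line and a 0-based record index.
#[derive(Debug, PartialEq)]
pub struct Diagnostic {
    pub line: usize,
    pub record_index: usize,
    pub message: String,
}

const PROTEIN_20: &str = "ACDEFGHIKLMNPQRSTVWY";

/// Parse FASTA text and report residues outside the protein-20 alphabet.
/// Parsing continues after a problem so every issue is reported in one pass.
pub fn parse_and_validate(input: &str) -> (Vec<Record>, Vec<Diagnostic>) {
    let mut records: Vec<Record> = Vec::new();
    let mut diagnostics: Vec<Diagnostic> = Vec::new();

    for (i, raw) in input.lines().enumerate() {
        let line_no = i + 1;
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        // A header line starts a new record; the id is the first word.
        if let Some(header) = line.strip_prefix('>') {
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            records.push(Record { id, sequence: String::new() });
            continue;
        }
        let record_index = records.len().saturating_sub(1);
        match records.last_mut() {
            None => diagnostics.push(Diagnostic {
                line: line_no,
                record_index: 0,
                message: "sequence data before first header".to_string(),
            }),
            Some(current) => {
                for ch in line.chars() {
                    // Normalize to uppercase before checking the alphabet.
                    let residue = ch.to_ascii_uppercase();
                    if !PROTEIN_20.contains(residue) {
                        diagnostics.push(Diagnostic {
                            line: line_no,
                            record_index,
                            message: format!("unexpected residue '{}'", ch),
                        });
                    }
                    current.sequence.push(residue);
                }
            }
        }
    }
    (records, diagnostics)
}
```

Calling `parse_and_validate(">sp1 demo\nacde\nFGX\n")` from your own `main` or test yields one record with sequence `ACDEFGX` and one diagnostic pointing at line 3 of record 0.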
Quickstart
Tokenize a FASTA file:
Pipe FASTA through stdin:
Validate FASTA:
Verify package fixture outputs:
Build model-ready input records:
Proof
bio-rs keeps performance claims tied to reproducible in-repo benchmarks.
Latest recorded FASTA benchmark baseline:
| Dataset | Matched workload | bio-rs core mean | Biopython mean | bio-rs speedup |
|---|---|---|---|---|
| Human proteome | Parse + validation | 0.192s | 0.495s | 2.57x |
| Human proteome | Parse + tokenization | 0.182s | 0.499s | 2.74x |
| 100MB+ FASTA | Parse + validation | 1.687s | 4.490s | 2.66x |
| 100MB+ FASTA | Parse + tokenization | 1.625s | 4.488s | 2.76x |
Benchmark details:
- Datasets:
  - UniProt human reference proteome (`UP000005640`, `9606`)
  - 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
- Matched workloads:
  - pure parse
  - parse plus validation
  - parse plus tokenization
- Current best recorded raw throughput:
  - human proteome parse + tokenization: `62.9M residues/s`, `71.9 MB/s`
  - 100MB+ FASTA parse + tokenization: `63.4M residues/s`, `72.5 MB/s`
- Benchmark doc: benchmarks/fasta_vs_biopython.md
- Benchmark script: scripts/benchmark_fasta_vs_biopython.py
This benchmark measures biors-core directly and excludes CLI startup and JSON
serialization overhead. It is still workload-specific, not a broad claim that
bio-rs is faster than Biopython across every FASTA workload or researcher input
shape.
What works today
biors-core provides the Rust engine and data contracts.
biors provides the CLI surface.
Current v0.9.5 capabilities:
- FASTA parsing and normalization
- FASTA validation with line and record-index diagnostics
- protein-20 tokenization
- positional token alignment preserved with explicit unknown-token IDs for unresolved residues
- residue warning/error reporting
- model-ready input records
- attention masks
- padding/truncation policy
- `model-input` CLI output
- model-input safety checks for unresolved residues
- package manifest inspect/validate
- typed package manifest enums for schema version, model format, runtime target, and tensor dtypes
- runtime bridge planning reports
- manifest-relative asset validation
- SHA-256 package and fixture checksum verification
- package fixture verification from observed artifact paths
- JSON success/error envelopes
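To make the attention-mask and padding/truncation items concrete, here is a minimal standalone sketch of turning token IDs into a fixed-length record. It is not the `biors-core` implementation: `ModelInput`, `build_model_input`, the field names, and the right-padding, right-truncation policy are all assumptions chosen for illustration.

```rust
// Hypothetical sketch: field names and policy are assumptions,
// not the biors model-input contract.

/// A fixed-length, model-ready record.
#[derive(Debug, PartialEq)]
pub struct ModelInput {
    pub input_ids: Vec<u32>,
    /// 1 marks a real token, 0 marks padding.
    pub attention_mask: Vec<u8>,
    /// True when tokens were dropped to fit `max_length`.
    pub truncated: bool,
}

/// Truncate on the right to `max_length`, then pad on the right with `pad_id`.
pub fn build_model_input(token_ids: &[u32], max_length: usize, pad_id: u32) -> ModelInput {
    let kept = token_ids.len().min(max_length);

    // Keep the first `kept` tokens and mark each of them as attended.
    let mut input_ids = token_ids[..kept].to_vec();
    let mut attention_mask = vec![1u8; kept];

    // Fill the remainder so every record has exactly `max_length` entries.
    input_ids.resize(max_length, pad_id);
    attention_mask.resize(max_length, 0);

    ModelInput {
        input_ids,
        attention_mask,
        truncated: token_ids.len() > max_length,
    }
}
```

With `max_length = 5` and `pad_id = 21`, the tokens `[0, 1, 2]` become `input_ids = [0, 1, 2, 21, 21]` and `attention_mask = [1, 1, 1, 0, 0]`. Reporting `truncated` explicitly lets a caller reject silently shortened sequences.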
CLI examples
Inspect FASTA records:
Tokenize FASTA records:
Tokenize a multi-record FASTA file:
Validate FASTA records:
Emit structured JSON errors:
Build model-ready input records:
Inspect a package manifest:
Validate a package manifest:
Plan a runtime bridge from a package manifest:
Verify package fixture observations:
package verify expects the observations file to point at observed output artifact paths:
JSON contracts
Success output uses a stable envelope shape:
FASTA-backed commands keep `input_hash` in the legacy `fnv1a64:` format for backward compatibility. Package artifacts and fixture hashes use `sha256:` in manifests and verification reports.
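For readers unfamiliar with the `fnv1a64:` prefix, the sketch below shows the standard 64-bit FNV-1a hash it refers to. The hash function itself is the published algorithm; which bytes bio-rs feeds into it and how the digest is rendered (16 lowercase hex digits here) are assumptions, and `input_hash` is a hypothetical helper name.

```rust
// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte into the state, then multiply by the prime.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical helper: render the digest with the `fnv1a64:` prefix.
/// The zero-padded lowercase hex formatting is an assumption.
pub fn input_hash(bytes: &[u8]) -> String {
    format!("fnv1a64:{:016x}", fnv1a64(bytes))
}
```

FNV-1a is fast and deterministic but not collision-resistant, which fits its role here as a legacy input fingerprint rather than an integrity check.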
--json error mode emits structured errors:
Tokenization output is record-oriented:
Public contract docs:
Release history
Delivered:
- `0.6.0`: package manifest inspect/validate
- `0.7.0`: runtime bridge planning with `package bridge`
- `0.8.0`: fixture verification with `package verify`
- `0.9.5`: core-throughput benchmark harness, matched-workload benchmark refresh, workflow/cache tightening, and git-hook install helper
- `0.9.4`: tokenizer positional alignment preservation, FASTA single-pass tokenization/validation path, typed package manifest enums, and benchmark refresh
- `0.9.3`: release workflow fix for automatic GitHub Release creation after crates publish
- `0.9.2`: model-input safety hardening for unresolved residues and automated GitHub Release creation
- `0.9.1`: model-input CLI, checksum-backed package validation, benchmark refresh, and contract hardening
- `0.9.0`: CLI and JSON contract freeze baseline
- `0.8.1`: documentation, contribution guide, and benchmark baseline hardening
Next:
- `1.0.0`: stable public contracts and runtime-facing APIs after enough real-world package validation
Not yet
These are roadmap directions, not current capabilities:
- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows
Development
Run checks:
The check suite runs:
- `cargo fmt`
- Rust checks
- `biors-core` `wasm32-unknown-unknown` build check
- tests
- `cargo clippy` with warnings denied
Reproduce the FASTA benchmark:
Run the Rust library example:
Workspace
packages/
  rust/
    biors/          CLI
    biors-core/     Core engine + contracts
schemas/
  cli-error.v0.json
  cli-success.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  package-bridge-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenize-output.v0.json
examples/
  protein.fasta
  multi.fasta
  protein-package/
    models/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
Protein-20 alphabet
A C D E F G H I K L M N P Q R S T V W Y
Token IDs follow that order, starting at 0.
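The mapping above can be sketched in a few lines of standalone Rust. This is an illustration, not the `biors-core` tokenizer: `token_id` and `tokenize` are hypothetical names, the unknown-token value `20` is an assumption (the real unknown ID may differ), and uppercasing input before lookup is also assumed.

```rust
/// Protein-20 alphabet in token-ID order: 'A' is 0, 'Y' is 19.
pub const PROTEIN_20: [char; 20] = [
    'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y',
];

/// Assumed ID for residues outside the alphabet; the real value may differ.
pub const UNKNOWN_ID: u32 = 20;

/// Map one residue to its token ID by its position in the alphabet.
pub fn token_id(residue: char) -> u32 {
    let upper = residue.to_ascii_uppercase();
    PROTEIN_20
        .iter()
        .position(|&c| c == upper)
        .map(|index| index as u32)
        .unwrap_or(UNKNOWN_ID)
}

/// Tokenize a sequence, emitting exactly one ID per residue so that
/// token positions stay aligned with sequence positions.
pub fn tokenize(sequence: &str) -> Vec<u32> {
    sequence.chars().map(token_id).collect()
}
```

For example, `tokenize("ACDY")` returns `[0, 1, 2, 19]`, and `tokenize("AXC")` returns `[0, 20, 1]`: the unresolved `X` keeps its slot instead of being dropped, so downstream positions do not shift.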
Contributing
See CONTRIBUTING.md for local setup, checks, and PR expectations.
License
Dual licensed under MIT OR Apache-2.0.