bio-rs
bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.
FASTA -> validated protein/DNA/RNA sequence -> protein token ids -> model-ready JSON
Status: pre-1.0 CLI and JSON contract stabilization.
First 60 Seconds
From a checkout, the shortest credible demo is:
With the published CLI installed, the same flow is:
For a terminal recording or animated CLI capture, run:
That path validates the launch FASTA, tokenizes it, builds model-ready JSON, and verifies a portable package fixture without a separate demo app.
Why bio-rs?
Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:
- local CLIs
- CI pipelines
- servers
- browsers
- agents
bio-rs focuses on the boring but important layer before inference:
- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks
The goal is not to replace Python research workflows.
The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.
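As a rough illustration of the "parse, then validate with structured diagnostics" steps listed above, the sketch below reads FASTA text and collects diagnostics that carry a line number and a record index instead of stopping at the first problem. The `Record` and `Diagnostic` types and the `parse_fasta` function are hypothetical names chosen for this example; they are not the biors-core API, and the acceptance rule (any ASCII letter) is a simplifying assumption.

```rust
// Illustrative sketch only: hypothetical types, not the biors-core API.

#[derive(Debug, PartialEq)]
struct Record {
    id: String,
    sequence: String,
}

#[derive(Debug, PartialEq)]
struct Diagnostic {
    line: usize,         // 1-based line number in the input
    record_index: usize, // 0-based index of the record being read
    message: String,
}

/// Parse FASTA text, upper-casing residues and collecting diagnostics
/// rather than aborting on the first issue.
fn parse_fasta(input: &str) -> (Vec<Record>, Vec<Diagnostic>) {
    let mut records: Vec<Record> = Vec::new();
    let mut diagnostics: Vec<Diagnostic> = Vec::new();

    for (i, raw) in input.lines().enumerate() {
        let line_no = i + 1;
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(header) = line.strip_prefix('>') {
            // The identifier is the first whitespace-delimited word of the header.
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            if id.is_empty() {
                diagnostics.push(Diagnostic {
                    line: line_no,
                    record_index: records.len(),
                    message: "empty record identifier".to_string(),
                });
            }
            records.push(Record { id, sequence: String::new() });
        } else if records.is_empty() {
            diagnostics.push(Diagnostic {
                line: line_no,
                record_index: 0,
                message: "sequence data before first header".to_string(),
            });
        } else {
            let record_index = records.len() - 1;
            for ch in line.chars() {
                // Assumption: any ASCII letter is accepted here; a real
                // validator would check against the sequence alphabet.
                if ch.is_ascii_alphabetic() {
                    records[record_index].sequence.push(ch.to_ascii_uppercase());
                } else {
                    diagnostics.push(Diagnostic {
                        line: line_no,
                        record_index,
                        message: format!("unexpected character {:?}", ch),
                    });
                }
            }
        }
    }
    (records, diagnostics)
}

fn main() {
    let (records, diagnostics) = parse_fasta(">seq1 demo\nacde\nFG1H\n>\nKL\n");
    for r in &records {
        println!("record {:?}: {}", r.id, r.sequence);
    }
    for d in &diagnostics {
        println!("line {}, record {}: {}", d.line, d.record_index, d.message);
    }
}
```

Returning diagnostics alongside the parsed records, rather than a single error, is one way to let a caller report every problem in a file in one pass.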
Quickstart
Install the published CLI:
Tokenize a FASTA file:
Pipe FASTA through stdin:
Validate FASTA:
Validate mixed biological FASTA with per-record kind detection:
Run the launch demo dataset:
Generate a terminal-friendly CLI demo transcript:
Verify package fixture outputs:
Build model-ready input records:
Check local launch-readiness diagnostics:
Generate shell completions:
Proof
bio-rs keeps performance claims tied to reproducible in-repo benchmarks.
Latest recorded FASTA benchmark baseline:
| Dataset | Matched workload | bio-rs core mean | Biopython mean | bio-rs speedup |
|---|---|---|---|---|
| Human proteome | Parse + validation | 0.036s | 0.441s | 12.31x |
| Human proteome | Parse + tokenization | 0.061s | 0.441s | 7.27x |
| 100MB+ FASTA | Parse + validation | 0.291s | 3.972s | 13.67x |
| 100MB+ FASTA | Parse + tokenization | 0.507s | 4.002s | 7.90x |
| Many short records | Parse + validation | 0.007s | 0.057s | 8.25x |
| Many short records | Parse + tokenization | 0.010s | 0.057s | 5.54x |
| Single long sequence | Parse + validation | 0.006s | 0.034s | 5.95x |
| Single long sequence | Parse + tokenization | 0.007s | 0.035s | 4.75x |
Benchmark details:
- Datasets:
  - UniProt human reference proteome (UP000005640, 9606)
  - 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
  - 20,000 short 48-residue records generated from the same proteome residue stream
  - one 960,000-residue sequence generated from the same proteome residue stream
- Matched workloads:
  - pure parse
  - parse plus validation
  - parse plus tokenization
- Current best recorded raw throughput:
  - human proteome parse + validation: 319.9M residues/s, 365.8 MB/s
  - 100MB+ FASTA parse + validation: 354.7M residues/s, 405.6 MB/s
  - human proteome parse + tokenization: 189.0M residues/s, 216.1 MB/s
  - 100MB+ FASTA parse + tokenization: 203.4M residues/s, 232.6 MB/s
- Benchmark doc: benchmarks/fasta_vs_biopython.md
- Benchmark script: scripts/benchmark_fasta_vs_biopython.py
This benchmark measures biors-core directly and excludes CLI startup and JSON
serialization overhead. It is still workload-specific, not a broad claim that
bio-rs is faster than Biopython across every FASTA workload or researcher input
shape.
What works today
biors-core provides the Rust engine and data contracts.
biors provides the CLI surface.
Current capabilities:
- FASTA parsing and normalization
- shared FASTA parser/tokenizer scanner with an ASCII fast path and Unicode fallback
- buffered reader APIs for FASTA parse/validate/tokenize paths
- FASTA validation with line and record-index diagnostics
- FASTA record identifier validation
- protein-20 tokenization
- JSON vocab loading for tokenizer contracts
- positional token alignment preserved with explicit unknown-token IDs for unresolved residues
- residue warning/error reporting
- model-ready input records
- attention masks
- padding/truncation policy
- model-input CLI output
- doctor CLI diagnostics for platform, toolchain, WASM target, and committed fixture readiness
- model-input safety checks for unresolved residues
- explicit checked and unchecked model-input builders
- writer-based CLI success JSON serialization to reduce peak allocations for large outputs
- package manifest inspect/validate
- typed package validation issue codes
- typed package manifest enums for schema version, model format, runtime target, and tensor dtypes
- runtime bridge planning reports
- manifest-relative asset validation
- package path escape rejection for manifest and observation assets
- SHA-256 package and fixture checksum verification
- package fixture verification from observed artifact paths
- structured package fixture mismatch issue codes and first-difference reports
- committed FASTA, tokenizer, manifest, and verification fixtures
- JSON success/error envelopes
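To make the model-input items above more concrete (attention masks, padding/truncation policy, unknown-token IDs, and checked versus unchecked builders), here is a minimal sketch. Everything in it is an assumption for illustration: the `ModelInput` struct, the `build_checked` and `build_unchecked` function names, and the `UNKNOWN_ID` and `PAD_ID` values are hypothetical and may not match the actual biors-core contracts.

```rust
// Illustrative sketch only: hypothetical names and IDs, not the biors-core API.
// Assumption: IDs 0..=19 are protein-20 residues, 20 marks an unresolved
// residue, and 21 is the padding token.
const UNKNOWN_ID: u32 = 20;
const PAD_ID: u32 = 21;

#[derive(Debug, PartialEq)]
struct ModelInput {
    input_ids: Vec<u32>,
    attention_mask: Vec<u8>, // 1 for real tokens, 0 for padding
    truncated: bool,
}

/// Pad or truncate to exactly `max_len` tokens without inspecting the IDs.
/// Unknown tokens keep their position so alignment with the input is preserved.
fn build_unchecked(tokens: &[u32], max_len: usize) -> ModelInput {
    let kept = tokens.len().min(max_len);
    let mut input_ids = tokens[..kept].to_vec();
    let mut attention_mask = vec![1u8; kept];
    input_ids.resize(max_len, PAD_ID);
    attention_mask.resize(max_len, 0);
    ModelInput {
        input_ids,
        attention_mask,
        truncated: tokens.len() > max_len,
    }
}

/// Same as `build_unchecked`, but refuses input containing unresolved residues.
fn build_checked(tokens: &[u32], max_len: usize) -> Result<ModelInput, String> {
    if let Some(pos) = tokens.iter().position(|&t| t == UNKNOWN_ID) {
        return Err(format!("unresolved residue at position {}", pos));
    }
    Ok(build_unchecked(tokens, max_len))
}

fn main() {
    let with_unknown = [0, 1, UNKNOWN_ID];
    println!("{:?}", build_unchecked(&with_unknown, 5));
    println!("{:?}", build_checked(&with_unknown, 5));
    println!("{:?}", build_checked(&[0, 1, 2, 3], 2));
}
```

Keeping the checked and unchecked paths as separate functions makes the caller's choice explicit at the call site instead of hiding it behind a flag.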
CLI examples
Inspect FASTA records:
Tokenize FASTA records:
Tokenize a multi-record FASTA file:
Validate FASTA records:
Emit structured JSON errors:
Build model-ready input records:
Inspect a package manifest:
Validate a package manifest:
Plan a runtime bridge from a package manifest:
Verify package fixture observations:
package verify expects the observations file to point at observed output artifact paths:
JSON contracts
Success output uses a stable envelope shape:
FASTA-backed commands keep input_hash in the legacy fnv1a64: format for backward compatibility. Package artifacts and fixture hashes use sha256: in manifests and verification reports.
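For readers unfamiliar with the hash named by the `fnv1a64:` prefix, the sketch below implements the standard 64-bit FNV-1a algorithm. The `tagged_hash` helper, the 16-digit lowercase hex encoding after the prefix, and the choice of which input bytes are hashed are assumptions made for illustration; the CLI contract docs are the authority on the actual `input_hash` format.

```rust
// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical helper: prefix the digest the way `input_hash` values are
/// tagged. The hex encoding used here is an assumption.
fn tagged_hash(bytes: &[u8]) -> String {
    format!("fnv1a64:{:016x}", fnv1a64(bytes))
}

fn main() {
    println!("{}", tagged_hash(b""));
    println!("{}", tagged_hash(b">seq1\nACDE\n"));
}
```

FNV-1a is a fast non-cryptographic hash, which fits an input fingerprint; it is not a substitute for the SHA-256 checksums used for package artifacts.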
--json error mode emits structured errors:
Tokenization output is record-oriented:
Public contract docs:
- Quickstart
- Launch demo
- Installation and distribution
- CLI contract
- Error code registry
- Reliability and input safety
- 1.0 contract candidates
- Versioning policy
- Final release checklist
- JSON schemas
- Citation metadata
Not yet
These are roadmap directions, not current capabilities:
- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows
Development
Run checks:
Run the faster local commit gate:
The check suite runs:
- cargo fmt
- shell and Python syntax checks for repo scripts
- benchmark Markdown regeneration check
- release workflow publish-order invariant check
- Rust checks
- biors-core wasm32-unknown-unknown build check
- tests
- cargo clippy with warnings denied
Reproduce the FASTA benchmark:
The benchmark script updates both benchmarks/fasta_vs_biopython.json and
benchmarks/fasta_vs_biopython.md. scripts/check-benchmark-docs.sh verifies
that the Markdown report still matches the JSON artifact.
Compare two benchmark artifacts:
Run the Rust library example:
Workspace
packages/
  rust/
    biors/        CLI
    biors-core/   Core engine + contracts
schemas/
  cli-error.v0.json
  cli-success.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  package-bridge-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenize-output.v0.json
examples/
  protein.fasta
  multi.fasta
  protein-package/
    models/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
Protein-20 alphabet
A C D E F G H I K L M N P Q R S T V W Y
Token IDs follow that order, starting at 0.
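The mapping described above can be sketched in a few lines. The `token_id` and `tokenize` functions are hypothetical names for this example, not the biors-core API, and the sketch assumes the sequence has already been normalized to upper case. Residues outside the alphabet map to `None` here; how bio-rs encodes unknown tokens is defined by its tokenizer contract, not by this sketch.

```rust
// Protein-20 alphabet in token-ID order: index in this array == token ID.
const PROTEIN_20: &[u8; 20] = b"ACDEFGHIKLMNPQRSTVWY";

/// Look up the token ID for one residue.
/// Assumes upper-case input; anything outside the alphabet yields None.
fn token_id(residue: char) -> Option<u8> {
    PROTEIN_20
        .iter()
        .position(|&b| b as char == residue)
        .map(|i| i as u8)
}

/// Tokenize a sequence position by position, keeping one slot per residue
/// so unknown residues do not shift the alignment.
fn tokenize(sequence: &str) -> Vec<Option<u8>> {
    sequence.chars().map(token_id).collect()
}

fn main() {
    println!("{:?}", tokenize("MKVX"));
}
```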
Contributing
See CONTRIBUTING.md for local setup, checks, and PR expectations.
License
Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software or publications, cite the repository and version via CITATION.cff.