bio-rs
Rust workspace for practical biological AI input tooling.
Status: v0.9.0 (workspace/package version in Cargo.toml)
This repository focuses on functionality that is already implemented and testable today:
- FASTA parsing (parse_fasta_records)
- protein-20 tokenization (tokenize_fasta_records)
- FASTA validation (biors fasta validate)
- model-ready input shaping (ModelInput)
- package manifest inspect/validate/bridge planning
- fixture verification (package verify)
- frozen CLI JSON success/error envelope candidates
What exists in v0.9.0
Core (biors-core)
biors-core is the engine crate. It contains data contracts and pure Rust logic:
- FASTA record parsing and normalization
- FASTA validation with line and record-index diagnostics
- protein-20 tokenization and residue issue reporting
- model-ready input records with attention masks and padding/truncation policy
- package manifest structs + validation/inspection
- runtime bridge planning report generation
- fixture verification report generation
Use this crate when embedding bio-rs in Rust services, libraries, or tooling.
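As an illustration of the padding/truncation and attention-mask behavior listed above, here is a minimal sketch. It does not use the biors-core API: the ShapedInput struct, the shape_input function, and the pad id are hypothetical names and values chosen for this example, and the real ModelInput contract may differ.

```rust
/// Hypothetical stand-in for a model-ready record; NOT the real `ModelInput` type.
#[derive(Debug, PartialEq)]
struct ShapedInput {
    input_ids: Vec<u8>,
    attention_mask: Vec<u8>,
    truncated: bool,
}

/// Pad or truncate `tokens` to exactly `max_len` entries.
/// `pad_id` is an assumed padding token id supplied by the caller.
fn shape_input(tokens: &[u8], max_len: usize, pad_id: u8) -> ShapedInput {
    let truncated = tokens.len() > max_len;
    let kept = tokens.len().min(max_len);

    // Real tokens get mask value 1; padded positions get 0.
    let mut input_ids = tokens[..kept].to_vec();
    let mut attention_mask = vec![1u8; kept];
    input_ids.resize(max_len, pad_id);
    attention_mask.resize(max_len, 0);

    ShapedInput { input_ids, attention_mask, truncated }
}

fn main() {
    // Three tokens padded to length five with an assumed pad id of 20.
    let shaped = shape_input(&[10, 8, 17], 5, 20);
    println!("{:?}", shaped);
}
```

The sketch pads on the right and truncates from the end, which is one common policy; other policies (left padding, rejecting over-length records) are equally plausible.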
CLI (biors)
biors is the command-line surface built on top of biors-core.
- Reads FASTA/JSON files (or stdin for FASTA)
- Executes core workflows
- Emits machine-readable JSON success envelopes
- Supports JSON error mode with structured error codes
- Uses non-zero exit codes on invalid operations
Use this crate when you need shell-first workflows, scripting, or CI checks.
Release history and roadmap
Delivered
- 0.6.0: package manifest inspect/validate
- 0.7.0: runtime bridge planning (package bridge)
- 0.8.0: fixture verification (package verify)
- 0.8.1: documentation, contribution guide, and benchmark baseline hardening
- 0.9.0: CLI and JSON contract freeze candidates
Next (post-0.9)
- 1.0.0 target: stable contracts and runtime-facing APIs after enough real-world package validation
0.7.0 capability notes are kept only as release history above; all "current" descriptions in this README are aligned to 0.9.0.
Not yet
These are roadmap directions, not current capabilities:
- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry or structure tooling
Quickstart
Inspect FASTA records:
Tokenize FASTA records:
Tokenize FASTA records from stdin:
Tokenize a multi-record FASTA file:
Validate FASTA records:
Emit structured JSON errors:
Inspect a portable model package manifest:
Validate a portable model package manifest:
Plan the portable runtime bridge for a package:
Verify package fixture observations:
Proof asset
This is the smallest reproducible package verification example in the repository.
Command:
Input:
- package manifest: examples/protein-package/manifest.json
- observed fixture map: examples/protein-package/observations.json
- expected output fixture: examples/protein-package/fixtures/tiny.output.json
Output shape:
This proves that a portable package manifest can point to fixture inputs and
expected JSON outputs, and that biors can check observed outputs against that
contract. It is a small contract test, not a model inference benchmark.
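The idea of checking observed outputs against expected fixtures can be sketched as follows. This is an assumption-laden illustration, not the biors implementation: FixtureStatus and verify_fixtures are hypothetical names, fixtures are modeled as plain strings keyed by name, and comparison is exact string equality, whereas the real tool may compare parsed JSON.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-fixture outcome; not the real report schema.
#[derive(Debug, PartialEq)]
enum FixtureStatus {
    Passed,
    Mismatch,
    Missing,
}

/// Compare each expected fixture output with the observed output of the same name.
/// Assumption: outputs are compared as exact strings.
fn verify_fixtures(
    expected: &BTreeMap<String, String>,
    observed: &BTreeMap<String, String>,
) -> BTreeMap<String, FixtureStatus> {
    expected
        .iter()
        .map(|(name, want)| {
            let status = match observed.get(name) {
                Some(got) if got == want => FixtureStatus::Passed,
                Some(_) => FixtureStatus::Mismatch,
                None => FixtureStatus::Missing,
            };
            (name.clone(), status)
        })
        .collect()
}

fn main() {
    let mut expected = BTreeMap::new();
    expected.insert("tiny".to_string(), "{\"tokens\":[10,8,17]}".to_string());

    let mut observed = BTreeMap::new();
    observed.insert("tiny".to_string(), "{\"tokens\":[10,8,17]}".to_string());

    for (name, status) in verify_fixtures(&expected, &observed) {
        println!("{name}: {status:?}");
    }
}
```

Iterating over the expected map rather than the observed one means a fixture with no observation is reported as Missing instead of being silently skipped.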
Evidence and benchmarks
Performance claims should be backed by reproducible data in-repo.
- Benchmark guide and latest recorded result: benchmarks/fasta_vs_biopython.md
- Reproducible benchmark harness: scripts/benchmark_fasta_vs_biopython.py
The benchmark compares FASTA parse+tokenization throughput against a Biopython baseline using the UniProt human reference proteome (UP000005640 / taxonomy 9606).
On the latest recorded run, biors tokenize completed the FASTA parse +
protein-20 tokenization + full JSON output path in 0.291s, while a Biopython
parse + protein-20 token/count baseline took 0.494s.
This is a workload-specific baseline, not a broad claim that bio-rs is faster than Biopython across all FASTA parsing workloads.
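For readers who want the ratio behind those two timings, the arithmetic is shown below. The speedup helper is a hypothetical name used only here, and the result describes this single recorded run and nothing broader.

```rust
/// Ratio of baseline time to candidate time; values above 1.0 mean the candidate was faster.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

fn main() {
    // Timings taken from the latest recorded run quoted above.
    let ratio = speedup(0.494, 0.291);
    println!("{ratio:.2}x"); // prints "1.70x"
}
```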
JSON contracts
CLI success output always uses the success envelope:
--json error mode always emits:
tokenize data is an array of records:
inspect data is a summary object:
package validate data is a validation report:
package bridge data is a runtime bridge report:
package verify data is a fixture verification report:
Development checks
The check suite runs cargo fmt, native Rust checks, a biors-core
wasm32-unknown-unknown build check, tests, and cargo clippy with warnings
denied.
Run the Rust library example:
Workspace
packages/
  rust/
    biors/        CLI
    biors-core/   Core engine + contracts
schemas/
  cli-error.v0.json
  cli-success.v0.json
  inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  tokenize-output.v0.json
examples/
  multi.fasta
  protein-package/
    fixtures/
    observations.json
  protein.fasta
Protein-20 alphabet
A C D E F G H I K L M N P Q R S T V W Y
Token ids follow that order, starting at 0.
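The mapping described above can be sketched in a few lines. This is an illustration rather than the biors-core implementation: token_id and tokenize are hypothetical names, input is assumed to be already uppercase, and unknown residues are reported with their zero-based position instead of being assigned an id.

```rust
/// The protein-20 alphabet in token-id order (A = 0, ..., Y = 19).
const PROTEIN_20: &[u8; 20] = b"ACDEFGHIKLMNPQRSTVWY";

/// Return the token id for one residue, or None if it is outside the alphabet.
fn token_id(residue: char) -> Option<u8> {
    PROTEIN_20
        .iter()
        .position(|&b| b as char == residue)
        .map(|i| i as u8)
}

/// Tokenize a sequence, collecting (position, residue) pairs for anything unrecognized.
/// Assumption: the sequence has already been normalized to uppercase.
fn tokenize(sequence: &str) -> (Vec<u8>, Vec<(usize, char)>) {
    let mut tokens = Vec::new();
    let mut issues = Vec::new();
    for (i, ch) in sequence.chars().enumerate() {
        match token_id(ch) {
            Some(id) => tokens.push(id),
            None => issues.push((i, ch)),
        }
    }
    (tokens, issues)
}

fn main() {
    let (tokens, issues) = tokenize("MKVX");
    println!("tokens = {tokens:?}, issues = {issues:?}");
    // tokens = [10, 8, 17], issues = [(3, 'X')]
}
```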
Contributing
See CONTRIBUTING.md for local setup, checks, and PR expectations.
Public contracts
- CLI contract: docs/cli-contract.md
- Error code registry: docs/error-codes.md
- 1.0 candidates: docs/public-contract-1.0-candidates.md
- Security policy: SECURITY.md
License
Dual licensed under MIT OR Apache-2.0.