# bio-rs
[](https://github.com/bio-rs/bio-rs/actions)
[](https://github.com/bio-rs/bio-rs/actions/workflows/release.yml)
[](benchmarks/fasta_vs_biopython.md)
[](docs/public-contract-1.0-candidates.md)
[](LICENSE-MIT)
bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.
```txt
FASTA -> validated protein sequence -> token ids -> model-ready JSON
```
> Status: pre-1.0 CLI and JSON contract stabilization.
## Why bio-rs?
Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:
- local CLIs
- CI pipelines
- servers
- browsers
- agents
bio-rs focuses on the boring but important layer before inference:
- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks
The goal is not to replace Python research workflows.
The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.
## Quickstart
Install the published CLI:
```bash
cargo install biors --version 0.12.3
biors --version
```
Tokenize a FASTA file:
```bash
biors tokenize examples/protein.fasta
```
Pipe FASTA through stdin:
```bash
printf '>tiny\nACDE\n' | biors tokenize -
```
Validate FASTA:
```bash
biors fasta validate examples/protein.fasta
```
Verify package fixture outputs:
```bash
biors package verify \
examples/protein-package/manifest.json \
examples/protein-package/observations.json
```
Build model-ready input records:
```bash
biors model-input --max-length 8 examples/protein.fasta
```
## Proof
bio-rs keeps performance claims tied to reproducible in-repo benchmarks.
Latest recorded FASTA benchmark baseline:
| Human proteome | Parse + validation | **0.041s** | 0.460s | **11.33x** |
| Human proteome | Parse + tokenization | **0.085s** | 0.459s | **5.42x** |
| 100MB+ FASTA | Parse + validation | **0.323s** | 4.134s | **12.81x** |
| 100MB+ FASTA | Parse + tokenization | **0.741s** | 4.145s | **5.59x** |
| Many short records | Parse + validation | **0.009s** | 0.058s | **6.73x** |
| Many short records | Parse + tokenization | **0.018s** | 0.059s | **3.32x** |
| Single long sequence | Parse + validation | **0.007s** | 0.035s | **4.97x** |
| Single long sequence | Parse + tokenization | **0.009s** | 0.035s | **3.76x** |
Benchmark details:
- Datasets:
- UniProt human reference proteome (`UP000005640`, `9606`)
- 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
- 20,000 short 48-residue records generated from the same proteome residue stream
- one 960,000-residue sequence generated from the same proteome residue stream
- Matched workloads:
- pure parse
- parse plus validation
- parse plus tokenization
- Current best recorded raw throughput:
- human proteome parse + validation: `282.4M residues/s`, `322.9 MB/s`
- 100MB+ FASTA parse + validation: `319.6M residues/s`, `365.4 MB/s`
- human proteome parse + tokenization: `135.3M residues/s`, `154.7 MB/s`
- 100MB+ FASTA parse + tokenization: `139.1M residues/s`, `159.0 MB/s`
- Benchmark doc: [benchmarks/fasta_vs_biopython.md](benchmarks/fasta_vs_biopython.md)
- Benchmark script: [scripts/benchmark_fasta_vs_biopython.py](scripts/benchmark_fasta_vs_biopython.py)
This benchmark measures `biors-core` directly and excludes CLI startup and JSON
serialization overhead. It is still workload-specific, not a broad claim that
bio-rs is faster than Biopython across every FASTA workload or researcher input
shape.
## What works today
`biors-core` provides the Rust engine and data contracts.
`biors` provides the CLI surface.
Current capabilities:
- FASTA parsing and normalization
- shared FASTA parser/tokenizer scanner with an ASCII fast path and Unicode fallback
- buffered reader APIs for FASTA parse/validate/tokenize paths
- FASTA validation with line and record-index diagnostics
- FASTA record identifier validation
- protein-20 tokenization
- JSON vocab loading for tokenizer contracts
- positional token alignment preserved with explicit unknown-token IDs for unresolved residues
- residue warning/error reporting
- model-ready input records
- attention masks
- padding/truncation policy
- `model-input` CLI output
- model-input safety checks for unresolved residues
- explicit checked and unchecked model-input builders
- writer-based CLI success JSON serialization to reduce peak allocations for large outputs
- package manifest inspect/validate
- typed package validation issue codes
- typed package manifest enums for schema version, model format, runtime target, and tensor dtypes
- runtime bridge planning reports
- manifest-relative asset validation
- package path escape rejection for manifest and observation assets
- SHA-256 package and fixture checksum verification
- package fixture verification from observed artifact paths
- structured package fixture mismatch issue codes and first-difference reports
- committed FASTA, tokenizer, manifest, and verification fixtures
- JSON success/error envelopes
## CLI examples
Inspect FASTA records:
```bash
cargo run -p biors -- inspect examples/protein.fasta
```
Tokenize FASTA records:
```bash
cargo run -p biors -- tokenize examples/protein.fasta
```
Tokenize a multi-record FASTA file:
```bash
cargo run -p biors -- tokenize examples/multi.fasta
```
Validate FASTA records:
```bash
cargo run -p biors -- fasta validate examples/protein.fasta
```
Emit structured JSON errors:
```bash
printf 'ACDE\n' | cargo run -p biors -- --json tokenize -
```
Build model-ready input records:
```bash
cargo run -p biors -- model-input --max-length 4 examples/protein.fasta
```
Inspect a package manifest:
```bash
cargo run -p biors -- package inspect examples/protein-package/manifest.json
```
Validate a package manifest:
```bash
cargo run -p biors -- package validate examples/protein-package/manifest.json
```
Plan a runtime bridge from a package manifest:
```bash
cargo run -p biors -- package bridge examples/protein-package/manifest.json
```
Verify package fixture observations:
```bash
cargo run -p biors -- package verify \
examples/protein-package/manifest.json \
examples/protein-package/observations.json
```
`package verify` expects the observations file to point at observed output artifact paths:
```json
[
{
"name": "tiny-protein",
"path": "observed/tiny.output.json"
}
]
```
## JSON contracts
Success output uses a stable envelope shape:
```json
{
"ok": true,
"biors_version": "0.x.y",
"input_hash": "fnv1a64:846a502e5067bc21",
"data": {}
}
```
FASTA-backed commands keep `input_hash` in the legacy `fnv1a64:` format for backward compatibility. Package artifacts and fixture hashes use `sha256:` in manifests and verification reports.
`--json` error mode emits structured errors:
```json
{
"ok": false,
"error": {
"code": "fasta.missing_header",
"message": "FASTA input must start with a header line beginning with '>' at line 1",
"location": {
"line": 1,
"record_index": null
}
}
}
```
Tokenization output is record-oriented:
```json
[
{
"id": "seq1",
"length": 4,
"alphabet": "protein-20",
"valid": true,
"tokens": [0, 1, 2, 3],
"warnings": [],
"errors": []
}
]
```
Public contract docs:
- [Quickstart](docs/quickstart.md)
- [Professional readiness](docs/professional-readiness.md)
- [CLI contract](docs/cli-contract.md)
- [Error code registry](docs/error-codes.md)
- [1.0 contract candidates](docs/public-contract-1.0-candidates.md)
- [1.0 release candidate path](docs/release-candidate-1.0.md)
- [API and schema review](docs/api-review.md)
- [MSRV policy draft](docs/msrv.md)
- [Versioning policy](docs/versioning.md)
- [JSON schemas](schemas)
- [Citation metadata](CITATION.cff)
## Release history
Delivered:
- `0.12.3`: byte-buffered FASTA reader scanning, static residue/token lookup tables, and expanded shape-profile benchmark proof assets
- `0.12.2`: published CLI `--version` support and version-verification docs for reproducible installs
- `0.12.1`: release workflow publish-order guard, published CLI quickstart, professional-readiness audit, and summary-only FASTA inspect path
- `0.12.0`: release-candidate documentation, full workflow e2e coverage, MSRV/citation policy drafts, and release notes
- `0.11.0`: benchmark reproducibility metadata, generated benchmark report checks, and refreshed speed/memory proof assets
- `0.10.0`: fixture and verification hardening with shared byte-aware FASTA scanning, tokenizer invariants, and structured mismatch reports
- `0.9.8`: tokenization lookup and CLI JSON writer performance improvements with refreshed reader-based benchmarks
- `0.9.7`: buffered FASTA reader APIs, typed package validation issues, CLI module refactor, and explicit model-input builder safety
- `0.9.6`: FASTA identifier validation, model-input policy validation, package path escape rejection, and JSON vocab loading
- `0.9.5`: core-throughput benchmark harness, matched-workload benchmark refresh, workflow/cache tightening, and git-hook install helper
- `0.9.4`: tokenizer positional alignment preservation, FASTA single-pass tokenization/validation path, typed package manifest enums, and benchmark refresh
- `0.9.3`: release workflow fix for automatic GitHub Release creation after crates publish
- `0.9.2`: model-input safety hardening for unresolved residues and automated GitHub Release creation
- `0.9.1`: model-input CLI, checksum-backed package validation, benchmark refresh, and contract hardening
- `0.9.0`: CLI and JSON contract freeze baseline
- `0.8.1`: documentation, contribution guide, and benchmark baseline hardening
- `0.8.0`: fixture verification with `package verify`
- `0.7.0`: runtime bridge planning with `package bridge`
- `0.6.0`: package manifest inspect/validate
Next:
- first stable release: stable public contracts and runtime-facing APIs after enough real-world package validation
## Not yet
These are roadmap directions, not current capabilities:
- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows
## Development
Run checks:
```bash
scripts/check.sh
```
Run the faster local commit gate:
```bash
scripts/check-fast.sh
```
The check suite runs:
- `cargo fmt`
- shell and Python syntax checks for repo scripts
- benchmark Markdown regeneration check
- release workflow publish-order invariant check
- Rust checks
- `biors-core` `wasm32-unknown-unknown` build check
- tests
- `cargo clippy` with warnings denied
Reproduce the FASTA benchmark:
```bash
cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json
```
The benchmark script updates both `benchmarks/fasta_vs_biopython.json` and
`benchmarks/fasta_vs_biopython.md`. `scripts/check-benchmark-docs.sh` verifies
that the Markdown report still matches the JSON artifact.
Run the Rust library example:
```bash
cargo run -p biors-core --example tokenize
```
## Workspace
```txt
packages/
rust/
biors/ CLI
biors-core/ Core engine + contracts
schemas/
cli-error.v0.json
cli-success.v0.json
fasta-validation-output.v0.json
inspect-output.v0.json
model-input-output.v0.json
package-bridge-output.v0.json
package-inspect-output.v0.json
package-manifest.v0.json
package-validation-report.v0.json
package-verify-output.v0.json
tokenize-output.v0.json
examples/
protein.fasta
multi.fasta
protein-package/
models/
manifest.json
observations.json
fixtures/
observed/
tokenizers/
vocabs/
```
## Protein-20 alphabet
```txt
A C D E F G H I K L M N P Q R S T V W Y
```
Token IDs follow that order, starting at `0`.
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for local setup, checks, and PR expectations.
## License
Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software
or publications, cite the repository and version via [CITATION.cff](CITATION.cff).