biors-0.15.0 is not a library.

bio-rs

bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.

FASTA -> validated protein/DNA/RNA sequence -> protein token ids -> model-ready JSON

Status: pre-1.0 CLI and JSON contract stabilization.

Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

local CLIs
CI pipelines
servers
browsers
agents

bio-rs focuses on the boring but important layer before inference:

parse biological sequence input
validate it with structured diagnostics
tokenize it into stable IDs
emit machine-readable JSON contracts
keep preprocessing reproducible outside notebooks

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.

Quickstart

Install the published CLI:

cargo install biors --version 0.15.0
biors --version

Tokenize a FASTA file:

biors tokenize examples/protein.fasta

Pipe FASTA through stdin:

printf '>tiny\nACDE\n' | biors tokenize -

Validate FASTA:

biors fasta validate examples/protein.fasta

Validate mixed biological FASTA with per-record kind detection:

biors seq validate examples/protein.fasta

Verify package fixture outputs:

biors package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json

Build model-ready input records:

biors model-input --max-length 8 examples/protein.fasta

Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Latest recorded FASTA benchmark baseline:

Dataset	Matched workload	bio-rs core mean	Biopython mean	bio-rs speedup
Human proteome	Parse + validation	0.036s	0.441s	12.31x
Human proteome	Parse + tokenization	0.061s	0.441s	7.27x
100MB+ FASTA	Parse + validation	0.291s	3.972s	13.67x
100MB+ FASTA	Parse + tokenization	0.507s	4.002s	7.90x
Many short records	Parse + validation	0.007s	0.057s	8.25x
Many short records	Parse + tokenization	0.010s	0.057s	5.54x
Single long sequence	Parse + validation	0.006s	0.034s	5.95x
Single long sequence	Parse + tokenization	0.007s	0.035s	4.75x

Benchmark details:

Datasets:
- UniProt human reference proteome (UP000005640, 9606)
- 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
- 20,000 short 48-residue records generated from the same proteome residue stream
- one 960,000-residue sequence generated from the same proteome residue stream
Matched workloads:
- pure parse
- parse plus validation
- parse plus tokenization
Current best recorded raw throughput:
- human proteome parse + validation: 319.9M residues/s, 365.8 MB/s
- 100MB+ FASTA parse + validation: 354.7M residues/s, 405.6 MB/s
- human proteome parse + tokenization: 189.0M residues/s, 216.1 MB/s
- 100MB+ FASTA parse + tokenization: 203.4M residues/s, 232.6 MB/s
Benchmark doc: benchmarks/fasta_vs_biopython.md
Benchmark script: scripts/benchmark_fasta_vs_biopython.py

This benchmark measures biors-core directly and excludes CLI startup and JSON serialization overhead. It is still workload-specific, not a broad claim that bio-rs is faster than Biopython across every FASTA workload or researcher input shape.

What works today

biors-core provides the Rust engine and data contracts.

biors provides the CLI surface.

Current capabilities:

FASTA parsing and normalization
shared FASTA parser/tokenizer scanner with an ASCII fast path and Unicode fallback
buffered reader APIs for FASTA parse/validate/tokenize paths
FASTA validation with line and record-index diagnostics
FASTA record identifier validation
protein-20 tokenization
JSON vocab loading for tokenizer contracts
positional token alignment preserved with explicit unknown-token IDs for unresolved residues
residue warning/error reporting
model-ready input records
attention masks
padding/truncation policy
model-input CLI output
model-input safety checks for unresolved residues
explicit checked and unchecked model-input builders
writer-based CLI success JSON serialization to reduce peak allocations for large outputs
package manifest inspect/validate
typed package validation issue codes
typed package manifest enums for schema version, model format, runtime target, and tensor dtypes
runtime bridge planning reports
manifest-relative asset validation
package path escape rejection for manifest and observation assets
SHA-256 package and fixture checksum verification
package fixture verification from observed artifact paths
structured package fixture mismatch issue codes and first-difference reports
committed FASTA, tokenizer, manifest, and verification fixtures
JSON success/error envelopes

CLI examples

Inspect FASTA records:

cargo run -p biors -- inspect examples/protein.fasta

Tokenize FASTA records:

cargo run -p biors -- tokenize examples/protein.fasta

Tokenize a multi-record FASTA file:

cargo run -p biors -- tokenize examples/multi.fasta

Validate FASTA records:

cargo run -p biors -- fasta validate examples/protein.fasta

Emit structured JSON errors:

printf 'ACDE\n' | cargo run -p biors -- --json tokenize -

Build model-ready input records:

cargo run -p biors -- model-input --max-length 4 examples/protein.fasta

Inspect a package manifest:

cargo run -p biors -- package inspect examples/protein-package/manifest.json

Validate a package manifest:

cargo run -p biors -- package validate examples/protein-package/manifest.json

Plan a runtime bridge from a package manifest:

cargo run -p biors -- package bridge examples/protein-package/manifest.json

Verify package fixture observations:

cargo run -p biors -- package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json

package verify expects the observations file to point at observed output artifact paths:

[
  {
    "name": "tiny-protein",
    "path": "observed/tiny.output.json"
  }
]

JSON contracts

Success output uses a stable envelope shape:

{
  "ok": true,
  "biors_version": "0.x.y",
  "input_hash": "fnv1a64:846a502e5067bc21",
  "data": {}
}

FASTA-backed commands keep input_hash in the legacy fnv1a64: format for backward compatibility. Package artifacts and fixture hashes use sha256: in manifests and verification reports.

--json error mode emits structured errors:

{
  "ok": false,
  "error": {
    "code": "fasta.missing_header",
    "message": "FASTA input must start with a header line beginning with '>' at line 1",
    "location": {
      "line": 1,
      "record_index": null
    }
  }
}

Tokenization output is record-oriented:

[
  {
    "id": "seq1",
    "length": 4,
    "alphabet": "protein-20",
    "valid": true,
    "tokens": [0, 1, 2, 3],
    "warnings": [],
    "errors": []
  }
]

Public contract docs:

Release history

Delivered:

0.15.0: biological sequence UX polish with biors seq validate, auto-detect-by-default validation, kind-specific messages, and refreshed docs
0.14.0: multi-alphabet FASTA validation with per-record kind assignment, --kind CLI override, mixed-kind summaries, and updated FASTA validation schema
0.13.0: biors-core DNA/RNA validation draft with SequenceKind, IUPAC nucleotide policies, auto-detection, and diagnostic sequence issues
0.12.8: biors-core SRP refactor for package, sequence, verification, and FASTA scanner internals with public API docs refreshed
0.12.7: tokenizer hot-path cleanup, SRP-focused core module split, module-size guardrail, and refreshed benchmark proof assets
0.12.6: borrowed protein-20 vocabulary API, lowercase-aware token lookup fast path, tokenizer module split, and refreshed benchmark proof assets
0.12.5: validation-only FASTA reader path, dead analysis-path cleanup, benchmark artifact comparison helper, and refreshed proof assets
0.12.4: byte-sink FASTA scanner dispatch, tokenization-only sink path, install smoke gate, and refreshed benchmark proof assets
0.12.3: byte-buffered FASTA reader scanning, static residue/token lookup tables, and expanded shape-profile benchmark proof assets
0.12.2: published CLI --version support and version-verification docs for reproducible installs
0.12.1: release workflow publish-order guard, published CLI quickstart, professional-readiness audit, and summary-only FASTA inspect path
0.12.0: release-candidate documentation, full workflow e2e coverage, MSRV/citation policy drafts, and release notes
0.11.0: benchmark reproducibility metadata, generated benchmark report checks, and refreshed speed/memory proof assets
0.10.0: fixture and verification hardening with shared byte-aware FASTA scanning, tokenizer invariants, and structured mismatch reports
0.9.8: tokenization lookup and CLI JSON writer performance improvements with refreshed reader-based benchmarks
0.9.7: buffered FASTA reader APIs, typed package validation issues, CLI module refactor, and explicit model-input builder safety
0.9.6: FASTA identifier validation, model-input policy validation, package path escape rejection, and JSON vocab loading
0.9.5: core-throughput benchmark harness, matched-workload benchmark refresh, workflow/cache tightening, and git-hook install helper
0.9.4: tokenizer positional alignment preservation, FASTA single-pass tokenization/validation path, typed package manifest enums, and benchmark refresh
0.9.3: release workflow fix for automatic GitHub Release creation after crates publish
0.9.2: model-input safety hardening for unresolved residues and automated GitHub Release creation
0.9.1: model-input CLI, checksum-backed package validation, benchmark refresh, and contract hardening
0.9.0: CLI and JSON contract freeze baseline
0.8.1: documentation, contribution guide, and benchmark baseline hardening
0.8.0: fixture verification with package verify
0.7.0: runtime bridge planning with package bridge
0.6.0: package manifest inspect/validate

first stable release: stable public contracts and runtime-facing APIs after enough real-world package validation

Not yet

These are roadmap directions, not current capabilities:

hosted web workflows
Python bindings
model inference backends
package registry or plugin ecosystem
general-purpose chemistry tooling
structure tooling
no-code or low-code workflows

Development

Run checks:

scripts/check.sh

Run the faster local commit gate:

scripts/check-fast.sh

The check suite runs:

cargo fmt
shell and Python syntax checks for repo scripts
benchmark Markdown regeneration check
release workflow publish-order invariant check
Rust checks
biors-core wasm32-unknown-unknown build check
tests
cargo clippy with warnings denied

Reproduce the FASTA benchmark:

cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json

The benchmark script updates both benchmarks/fasta_vs_biopython.json and benchmarks/fasta_vs_biopython.md. scripts/check-benchmark-docs.sh verifies that the Markdown report still matches the JSON artifact.

Compare two benchmark artifacts:

python scripts/compare-benchmark-artifacts.py before.json after.json

Run the Rust library example:

cargo run -p biors-core --example tokenize

Workspace

packages/
  rust/
    biors/       CLI
    biors-core/  Core engine + contracts

schemas/
  cli-error.v0.json
  cli-success.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  package-bridge-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  protein-package/
    models/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/

Protein-20 alphabet

A C D E F G H I K L M N P Q R S T V W Y

Token IDs follow that order, starting at 0.

Contributing

See CONTRIBUTING.md for local setup, checks, and PR expectations.

License

Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software or publications, cite the repository and version via CITATION.cff.

biors 0.15.0