biors 0.9.2

Command-line tools for bio-rs biological AI model input workflows.
biors-0.9.2 is not a library.

bio-rs

CI Release Benchmark Contracts License: MIT/Apache-2.0

bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.

FASTA -> validated protein sequence -> token ids -> model-ready JSON

Status: v0.9.2 — CLI and JSON contract freeze.

Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

  • local CLIs
  • CI pipelines
  • servers
  • browsers
  • agents

bio-rs focuses on the boring but important layer before inference:

  • parse biological sequence input
  • validate it with structured diagnostics
  • tokenize it into stable IDs
  • emit machine-readable JSON contracts
  • keep preprocessing reproducible outside notebooks

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.

Quickstart

Tokenize a FASTA file:

cargo run -p biors -- tokenize examples/protein.fasta

Pipe FASTA through stdin:

printf '>tiny\nACDE\n' | cargo run -p biors -- tokenize -

Validate FASTA:

cargo run -p biors -- fasta validate examples/protein.fasta

Verify package fixture outputs:

cargo run -p biors -- package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json

Build model-ready input records:

cargo run -p biors -- model-input --max-length 8 examples/protein.fasta

Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Latest recorded FASTA benchmark baseline:

Workflow Mean time
biors tokenize parse + tokenize + full JSON output 0.385s
Biopython parse + protein-20 token/count loop 0.457s
Biopython parse only 0.056s

Benchmark details:

This is a workload-specific reference-proteome baseline, not a broad claim that bio-rs is faster than Biopython across all FASTA workloads or all researcher input shapes.

What works today

biors-core provides the Rust engine and data contracts.

biors provides the CLI surface.

Current v0.9.2 capabilities:

  • FASTA parsing and normalization
  • FASTA validation with line and record-index diagnostics
  • protein-20 tokenization
  • residue warning/error reporting
  • model-ready input records
  • attention masks
  • padding/truncation policy
  • model-input CLI output
  • model-input safety checks for unresolved residues
  • package manifest inspect/validate
  • runtime bridge planning reports
  • manifest-relative asset validation
  • SHA-256 package and fixture checksum verification
  • package fixture verification from observed artifact paths
  • JSON success/error envelopes

CLI examples

Inspect FASTA records:

cargo run -p biors -- inspect examples/protein.fasta

Tokenize FASTA records:

cargo run -p biors -- tokenize examples/protein.fasta

Tokenize a multi-record FASTA file:

cargo run -p biors -- tokenize examples/multi.fasta

Validate FASTA records:

cargo run -p biors -- fasta validate examples/protein.fasta

Emit structured JSON errors:

printf 'ACDE\n' | cargo run -p biors -- --json tokenize -

Build model-ready input records:

cargo run -p biors -- model-input --max-length 4 examples/protein.fasta

Inspect a package manifest:

cargo run -p biors -- package inspect examples/protein-package/manifest.json

Validate a package manifest:

cargo run -p biors -- package validate examples/protein-package/manifest.json

Plan a runtime bridge from a package manifest:

cargo run -p biors -- package bridge examples/protein-package/manifest.json

Verify package fixture observations:

cargo run -p biors -- package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json

package verify expects the observations file to point at observed output artifact paths:

[
  {
    "name": "tiny-protein",
    "path": "observed/tiny.output.json"
  }
]

JSON contracts

Success output uses a stable envelope shape:

{
  "ok": true,
  "biors_version": "0.9.2",
  "input_hash": "fnv1a64:846a502e5067bc21",
  "data": {}
}

FASTA-backed commands keep input_hash in the legacy fnv1a64: format for backward compatibility. Package artifacts and fixture hashes use sha256: in manifests and verification reports.

--json error mode emits structured errors:

{
  "ok": false,
  "error": {
    "code": "fasta.missing_header",
    "message": "FASTA input must start with a header line beginning with '>' at line 1",
    "location": {
      "line": 1,
      "record_index": null
    }
  }
}

Tokenization output is record-oriented:

[
  {
    "id": "seq1",
    "length": 4,
    "alphabet": "protein-20",
    "valid": true,
    "tokens": [0, 1, 2, 3],
    "warnings": [],
    "errors": []
  }
]

Public contract docs:

Release history

Delivered:

  • 0.6.0: package manifest inspect/validate
  • 0.7.0: runtime bridge planning with package bridge
  • 0.8.0: fixture verification with package verify
  • 0.8.1: documentation, contribution guide, and benchmark baseline hardening
  • 0.9.0: CLI and JSON contract freeze baseline
  • 0.9.1: model-input CLI, checksum-backed package validation, benchmark refresh, and contract hardening
  • 0.9.2: model-input safety hardening for unresolved residues and automated GitHub Release creation

Next:

  • 1.0.0: stable public contracts and runtime-facing APIs after enough real-world package validation

Not yet

These are roadmap directions, not current capabilities:

  • hosted web workflows
  • Python bindings
  • model inference backends
  • package registry or plugin ecosystem
  • general-purpose chemistry tooling
  • structure tooling
  • no-code or low-code workflows

Development

Run checks:

scripts/check.sh

The check suite runs:

  • cargo fmt
  • Rust checks
  • biors-core wasm32-unknown-unknown build check
  • tests
  • cargo clippy with warnings denied

Reproduce the FASTA benchmark:

cargo build --release -p biors
pip3 install biopython
python3 scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json

Run the Rust library example:

cargo run -p biors-core --example tokenize

Workspace

packages/
  rust/
    biors/       CLI
    biors-core/  Core engine + contracts

schemas/
  cli-error.v0.json
  cli-success.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  package-bridge-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  protein-package/
    models/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/

Protein-20 alphabet

A C D E F G H I K L M N P Q R S T V W Y

Token IDs follow that order, starting at 0.

Contributing

See CONTRIBUTING.md for local setup, checks, and PR expectations.

License

Dual licensed under MIT OR Apache-2.0.