biors 0.9.0

Command-line tools for bio-rs biological AI model input workflows.

bio-rs

License: MIT/Apache-2.0

Rust workspace for practical biological AI input tooling.

Status: v0.9.0 (workspace/package version in Cargo.toml)

This repository focuses on functionality that is already implemented and testable today:

  • FASTA parsing (parse_fasta_records)
  • protein-20 tokenization (tokenize_fasta_records)
  • FASTA validation (biors fasta validate)
  • model-ready input shaping (ModelInput)
  • package manifest inspect/validate/bridge planning
  • fixture verification (package verify)
  • freeze candidates for the CLI JSON success and error envelopes

What exists in v0.9.0

Core (biors-core)

biors-core is the engine crate. It contains data contracts and pure Rust logic:

  • FASTA record parsing and normalization
  • FASTA validation with line and record-index diagnostics
  • protein-20 tokenization and residue issue reporting
  • model-ready input records with attention masks and padding/truncation policy
  • package manifest structs + validation/inspection
  • runtime bridge planning report generation
  • fixture verification report generation

Use this crate when embedding bio-rs in Rust services, libraries, or tooling.
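
To make the "attention masks and padding/truncation policy" item concrete, here is a minimal std-only sketch of one way a token sequence could be shaped to a fixed length. It does not use the biors-core API: PaddedInput, shape_input, and the pad_id parameter are hypothetical names chosen for illustration, and the real ModelInput type may differ.

```rust
/// Hypothetical model-ready record (not the biors-core `ModelInput` type):
/// fixed-length ids plus an attention mask marking real tokens.
#[derive(Debug, PartialEq)]
struct PaddedInput {
    input_ids: Vec<u8>,
    attention_mask: Vec<u8>,
    truncated: bool,
}

/// Pads or truncates `tokens` to exactly `max_len` entries.
/// Assumption: right-side padding with `pad_id`, right-side truncation.
fn shape_input(tokens: &[u8], max_len: usize, pad_id: u8) -> PaddedInput {
    // Keep at most `max_len` real tokens.
    let kept = tokens.len().min(max_len);
    let mut input_ids = tokens[..kept].to_vec();
    // 1 marks a real token, 0 marks padding.
    let mut attention_mask = vec![1u8; kept];
    input_ids.resize(max_len, pad_id);
    attention_mask.resize(max_len, 0);
    PaddedInput {
        input_ids,
        attention_mask,
        truncated: tokens.len() > max_len,
    }
}
```

Under these assumptions, shape_input(&[0, 1, 2, 3], 6, 20) pads two positions and masks them with 0, while a max_len of 2 keeps only the first two tokens and sets truncated.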

CLI (biors)

biors is the command-line surface built on top of biors-core.

  • Reads FASTA and JSON files (FASTA can also be read from stdin by passing -)
  • Executes core workflows
  • Emits machine-readable JSON success envelopes
  • Supports JSON error mode with structured error codes
  • Uses non-zero exit codes on invalid operations

Use this crate when you need shell-first workflows, scripting, or CI checks.

Release history and roadmap

Delivered

  • 0.6.0: package manifest inspect/validate
  • 0.7.0: runtime bridge planning (package bridge)
  • 0.8.0: fixture verification (package verify)
  • 0.8.1: documentation, contribution guide, and benchmark baseline hardening
  • 0.9.0: CLI and JSON contract freeze candidates

Next (post-0.9)

  • 1.0.0 target: stable contracts and runtime-facing APIs after enough real-world package validation

0.7.0 capability notes are kept only as release history above; all "current" descriptions in this README are aligned to 0.9.0.

Not yet

These are roadmap directions, not current capabilities:

  • hosted web workflows
  • Python bindings
  • model inference backends
  • package registry or plugin ecosystem
  • general-purpose chemistry or structure tooling

Quickstart

Inspect FASTA records:

cargo run -p biors -- inspect examples/protein.fasta

Tokenize FASTA records:

cargo run -p biors -- tokenize examples/protein.fasta

Tokenize FASTA records from stdin:

cat examples/protein.fasta | cargo run -p biors -- tokenize -

Tokenize a multi-record FASTA file:

cargo run -p biors -- tokenize examples/multi.fasta

Validate FASTA records:

cargo run -p biors -- fasta validate examples/protein.fasta

Emit structured JSON errors:

printf 'ACDE\n' | cargo run -p biors -- --json tokenize -

Inspect a portable model package manifest:

cargo run -p biors -- package inspect examples/protein-package/manifest.json

Validate a portable model package manifest:

cargo run -p biors -- package validate examples/protein-package/manifest.json

Plan the portable runtime bridge for a package:

cargo run -p biors -- package bridge examples/protein-package/manifest.json

Verify package fixture observations:

cargo run -p biors -- package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json

Proof asset

This is the smallest reproducible package verification example in the repository.

Command:

cargo run -p biors -- package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json

Input:

  • package manifest: examples/protein-package/manifest.json
  • observed fixture map: examples/protein-package/observations.json
  • expected output fixture: examples/protein-package/fixtures/tiny.output.json

Output shape:

{
  "ok": true,
  "biors_version": "0.9.0",
  "data": {
    "package": "protein-seed",
    "fixtures": 1,
    "passed": 1,
    "failed": 0,
    "results": [
      {
        "name": "tiny-protein",
        "input": "fixtures/tiny.fasta",
        "expected_output": "fixtures/tiny.output.json",
        "observed_output": "fixtures/tiny.output.json",
        "status": "passed",
        "issue": null
      }
    ]
  }
}

This proves that a portable package manifest can point to fixture inputs and expected JSON outputs, and that biors can check observed outputs against that contract. It is a small contract test, not a model inference benchmark.
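
The contract check described above can be sketched in a few lines of std-only Rust. This is an illustration, not the biors implementation: FixtureResult, verify_fixture, and tally are hypothetical names, and comparing trimmed text for equality is an assumption about how expected and observed outputs might be matched.

```rust
/// Hypothetical per-fixture result, loosely mirroring the report fields above.
#[derive(Debug, PartialEq)]
struct FixtureResult {
    name: String,
    passed: bool,
    issue: Option<String>,
}

/// Compares expected and observed output text for one fixture.
/// Assumption: the two match when equal after trimming surrounding whitespace.
fn verify_fixture(name: &str, expected: &str, observed: &str) -> FixtureResult {
    let passed = expected.trim() == observed.trim();
    let issue = if passed {
        None
    } else {
        Some("observed output differs from expected output".to_string())
    };
    FixtureResult {
        name: name.to_string(),
        passed,
        issue,
    }
}

/// Returns (fixtures, passed, failed) counts for a set of results.
fn tally(results: &[FixtureResult]) -> (usize, usize, usize) {
    let passed = results.iter().filter(|r| r.passed).count();
    (results.len(), passed, results.len() - passed)
}
```

A stricter variant would parse both sides as JSON and compare values, so that key order and formatting differences do not cause false failures.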

Evidence and benchmarks

Performance claims should be backed by reproducible data in-repo.

  • Benchmark guide and latest recorded result: benchmarks/fasta_vs_biopython.md
  • Reproducible benchmark harness: scripts/benchmark_fasta_vs_biopython.py

The benchmark compares FASTA parse+tokenization throughput against a Biopython baseline using the UniProt human reference proteome (UP000005640 / taxonomy 9606).

On the latest recorded run, biors tokenize completed the FASTA parse + protein-20 tokenization + full JSON output path in 0.291s, while a Biopython parse + protein-20 token/count baseline took 0.494s.

This is a workload-specific baseline, not a broad claim that bio-rs is faster than Biopython across all FASTA parsing workloads.

JSON contracts

CLI success output always uses the success envelope:

{
  "ok": true,
  "biors_version": "0.9.0",
  "input_hash": "fnv1a64:846a502e5067bc21",
  "data": {}
}
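
The fnv1a64: prefix on input_hash suggests a 64-bit FNV-1a digest. The sketch below shows the standard FNV-1a 64 algorithm and formats the result in the same prefix-plus-hex style. Which bytes biors actually hashes (raw input, normalized input, or something else) is not stated here, so treat the input choice and the input_hash_label helper name as assumptions.

```rust
/// Standard 64-bit FNV-1a over a byte slice.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        // FNV-1a: XOR the byte in first, then multiply by the prime.
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical helper: formats a digest like the `input_hash` field above.
/// Assumption: 16 lowercase hex digits after the `fnv1a64:` prefix.
fn input_hash_label(bytes: &[u8]) -> String {
    format!("fnv1a64:{:016x}", fnv1a64(bytes))
}
```

FNV-1a is a fast non-cryptographic hash, so a value like this is suited to detecting changed input between runs, not to tamper resistance.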

--json error mode always emits:

{
  "ok": false,
  "error": {
    "code": "fasta.missing_header",
    "message": "FASTA input must start with a header line beginning with '>' at line 1",
    "location": {
      "line": 1,
      "record_index": null
    }
  }
}
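
As an illustration of how a structured error like the one above might be produced, here is a minimal std-only sketch of a leading-header check. FastaError and check_leading_header are hypothetical names, not biors-core items; skipping blank lines before the first header and accepting empty input are assumptions of this sketch.

```rust
/// Hypothetical error value carrying the same fields as the envelope above.
#[derive(Debug, PartialEq)]
struct FastaError {
    code: &'static str,
    message: String,
    line: usize,
    record_index: Option<usize>,
}

/// Checks that the first non-blank line is a FASTA header (starts with '>').
/// Assumption: blank lines are skipped and empty input is not an error here.
fn check_leading_header(input: &str) -> Result<(), FastaError> {
    for (idx, line) in input.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        if line.starts_with('>') {
            return Ok(());
        }
        // Line numbers are 1-based, matching the `location.line` field.
        let line_no = idx + 1;
        return Err(FastaError {
            code: "fasta.missing_header",
            message: format!(
                "FASTA input must start with a header line beginning with '>' at line {}",
                line_no
            ),
            line: line_no,
            record_index: None,
        });
    }
    Ok(())
}
```

With the Quickstart input ACDE followed by a newline, this check returns an error at line 1 with code fasta.missing_header and no record index, since no record has started yet.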

tokenize data is an array of records:

[
  {
    "id": "seq1",
    "length": 4,
    "alphabet": "protein-20",
    "valid": true,
    "tokens": [0, 1, 2, 3],
    "warnings": [],
    "errors": []
  }
]

inspect data is a summary object:

{
  "records": 1,
  "total_length": 4,
  "valid_records": 1,
  "warning_count": 0,
  "error_count": 0
}
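
One plausible way to derive such a summary from per-record results is sketched below. The struct and function names are hypothetical, and the aggregation rules (summing lengths, counting warnings and errors per record) are assumptions based only on the field names shown in this README.

```rust
/// Hypothetical per-record view with the fields the summary needs.
struct RecordReport {
    length: usize,
    valid: bool,
    warnings: Vec<String>,
    errors: Vec<String>,
}

/// Hypothetical summary mirroring the `inspect` data fields above.
#[derive(Debug, Default, PartialEq)]
struct InspectSummary {
    records: usize,
    total_length: usize,
    valid_records: usize,
    warning_count: usize,
    error_count: usize,
}

/// Folds per-record reports into one summary object.
fn summarize(reports: &[RecordReport]) -> InspectSummary {
    let mut summary = InspectSummary::default();
    for report in reports {
        summary.records += 1;
        summary.total_length += report.length;
        if report.valid {
            summary.valid_records += 1;
        }
        summary.warning_count += report.warnings.len();
        summary.error_count += report.errors.len();
    }
    summary
}
```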

package validate data is a validation report:

{
  "valid": true,
  "issues": []
}

package bridge data is a runtime bridge report:

{
  "ready": true,
  "backend": "onnx-webgpu",
  "target": "browser-wasm-webgpu",
  "execution_provider": "webgpu",
  "blocking_issues": []
}

package verify data is a fixture verification report:

{
  "package": "protein-seed",
  "fixtures": 1,
  "passed": 1,
  "failed": 0,
  "results": [
    {
      "name": "tiny-protein",
      "input": "fixtures/tiny.fasta",
      "expected_output": "fixtures/tiny.output.json",
      "observed_output": "fixtures/tiny.output.json",
      "status": "passed",
      "issue": null
    }
  ]
}

Development checks

scripts/check.sh

The check suite runs cargo fmt, native Rust checks, a biors-core wasm32-unknown-unknown build check, tests, and cargo clippy with warnings denied.

Run the Rust library example:

cargo run -p biors-core --example tokenize

Workspace

packages/
  rust/
    biors/       CLI
    biors-core/  Core engine + contracts
schemas/
  cli-error.v0.json
  cli-success.v0.json
  inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  tokenize-output.v0.json
examples/
  multi.fasta
  protein-package/
    fixtures/
    manifest.json
    observations.json
  protein.fasta

Protein-20 alphabet

A C D E F G H I K L M N P Q R S T V W Y

Token ids follow that order, starting at 0.
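
The mapping can be sketched in std-only Rust as a lookup into the ordered alphabet. This is an illustration of the ordering rule, not the tokenize_fasta_records implementation: token_id and tokenize are hypothetical names, and the sketch matches residues exactly as given, with no case folding or other normalization.

```rust
/// Protein-20 residues in the order listed above; a residue's index is its token id.
const PROTEIN_20: &[u8; 20] = b"ACDEFGHIKLMNPQRSTVWY";

/// Returns the token id for one residue, or `None` if it is outside the alphabet.
/// Assumption: exact, case-sensitive match; no normalization is applied.
fn token_id(residue: char) -> Option<u8> {
    PROTEIN_20
        .iter()
        .position(|&b| b as char == residue)
        .map(|i| i as u8)
}

/// Tokenizes a sequence, collecting (position, residue) pairs for
/// residues that are not in the protein-20 alphabet.
fn tokenize(seq: &str) -> (Vec<u8>, Vec<(usize, char)>) {
    let mut tokens = Vec::new();
    let mut issues = Vec::new();
    for (i, c) in seq.chars().enumerate() {
        match token_id(c) {
            Some(id) => tokens.push(id),
            None => issues.push((i, c)),
        }
    }
    (tokens, issues)
}
```

For the sequence ACDE this yields the tokens 0, 1, 2, 3, which agrees with the tokenize example in the JSON contracts section; Y, the last residue, maps to 19.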

Contributing

See CONTRIBUTING.md for local setup, checks, and PR expectations.

Public contracts

License

Dual licensed under MIT OR Apache-2.0.