
bio-rs

bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.

FASTA -> validated protein/DNA/RNA sequence -> protein token IDs -> model-ready JSON

Status: pre-1.0 CLI and JSON contract stabilization.

Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

local CLIs
CI pipelines
servers
browsers
agents

bio-rs focuses on the boring but important layer before inference:

parse biological sequence input
validate it with structured diagnostics
tokenize it into stable IDs
emit machine-readable JSON contracts
keep preprocessing reproducible outside notebooks
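
As a rough illustration of those steps, the sketch below parses FASTA text, validates residues against the protein-20 alphabet listed later in this README, tokenizes, and emits JSON. It is plain Python, not the biors-core implementation. The function names, diagnostic codes, and JSON field names are all hypothetical and do not reflect the actual bio-rs contracts.

```python
import json

# The 20 standard amino acids, in the order this README gives for protein-20.
PROTEIN_20 = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {residue: i for i, residue in enumerate(PROTEIN_20)}


def parse_fasta(text):
    """Return records as dicts with id, sequence, and 1-based header line."""
    records = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            records.append(
                {"id": header[0] if header else "", "sequence": "", "header_line": line_no}
            )
        elif records:
            # Normalize to uppercase; sequence lines before the first header
            # are skipped in this sketch.
            records[-1]["sequence"] += line.upper()
    return records


def validate(records):
    """Collect structured diagnostics instead of stopping at the first problem."""
    issues = []
    for index, record in enumerate(records):
        if not record["id"]:
            issues.append(
                {"record_index": index, "header_line": record["header_line"], "code": "missing_id"}
            )
        for position, residue in enumerate(record["sequence"]):
            if residue not in TOKEN_ID:
                issues.append(
                    {
                        "record_index": index,
                        "header_line": record["header_line"],
                        "code": "invalid_residue",
                        "position": position,
                        "residue": residue,
                    }
                )
    return issues


def preprocess(text):
    """Parse, validate, tokenize, and serialize to a JSON string."""
    records = parse_fasta(text)
    issues = validate(records)
    tokenized = []
    if not issues:
        tokenized = [
            {"id": r["id"], "token_ids": [TOKEN_ID[c] for c in r["sequence"]]}
            for r in records
        ]
    # sort_keys keeps the serialized output stable across runs.
    return json.dumps({"ok": not issues, "issues": issues, "records": tokenized}, sort_keys=True)


print(preprocess(">seq1 demo\nACDE\nFG\n"))
print(preprocess(">seq2\nAXC\n"))
```

Returning diagnostics as data rather than raising on the first invalid residue is the design choice the list above points at: a caller in CI or a server can decide what to do with each issue.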

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.

Quickstart

cargo install biors --version 0.34.0
biors tokenize examples/protein.fasta
biors workflow --max-length 8 examples/protein.fasta
biors batch validate --kind auto examples/
biors tokenizer inspect --profile protein-20-special

Full commands, demos, and install options: docs/quickstart.md

Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Latest recorded FASTA benchmark baseline:

Dataset	Matched workload	bio-rs core mean	Biopython mean	bio-rs speedup
Human proteome	Parse + validation	0.036s	0.584s	16.09x
Human proteome	Parse + tokenization	0.061s	0.587s	9.68x
100MB+ FASTA	Parse + validation	0.294s	3.994s	13.59x
100MB+ FASTA	Parse + tokenization	0.492s	4.040s	8.22x
Many short records	Parse + validation	0.007s	0.204s	28.35x
Many short records	Parse + tokenization	0.010s	0.205s	20.54x
Single long sequence	Parse + validation	0.005s	0.176s	34.48x
Single long sequence	Parse + tokenization	0.007s	0.177s	26.67x

Benchmark details:

Datasets:
- UniProt human reference proteome (proteome ID UP000005640, NCBI taxonomy ID 9606)
- 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
- 20,000 short 48-residue records generated from the same proteome residue stream
- one 960,000-residue sequence generated from the same proteome residue stream

Matched workloads:
- pure parse
- parse plus validation
- parse plus tokenization

Current best recorded raw throughput:
- human proteome parse + validation: 315.4M residues/s, 360.6 MB/s
- 100MB+ FASTA parse + validation: 350.8M residues/s, 401.1 MB/s
- human proteome parse + tokenization: 189.0M residues/s, 216.1 MB/s
- 100MB+ FASTA parse + tokenization: 209.7M residues/s, 239.8 MB/s

Benchmark doc: benchmarks/fasta_vs_biopython.md
Benchmark script: scripts/benchmark_fasta_vs_biopython.py

This benchmark measures biors-core directly and excludes CLI startup and JSON serialization overhead. The results are workload-specific: they do not claim that bio-rs is faster than Biopython for every FASTA workload or researcher input shape.
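
For readers checking the numbers, the speedup column is the ratio of the two means, and throughput is residues divided by mean time. The short Python check below uses the rounded values printed in the table, so recomputed ratios can differ slightly from the reported speedups, which were presumably derived from unrounded timings. The function names are illustrative only.

```python
# Rounded means in seconds copied from the table:
# (bio-rs core mean, Biopython mean, reported speedup).
ROWS = {
    "100MB+ FASTA, parse + validation": (0.294, 3.994, 13.59),
    "100MB+ FASTA, parse + tokenization": (0.492, 4.040, 8.22),
}


def speedup(biors_mean_s, biopython_mean_s):
    """How many times faster bio-rs is for the same workload."""
    return biopython_mean_s / biors_mean_s


def residues_per_second(residue_count, mean_s):
    """Raw throughput for a dataset of known size."""
    return residue_count / mean_s


for name, (biors_s, biopython_s, reported) in ROWS.items():
    ratio = speedup(biors_s, biopython_s)
    print(f"{name}: recomputed {ratio:.2f}x, reported {reported:.2f}x")

# The single long sequence has 960,000 residues; 0.005 s is the rounded mean,
# so this figure is only approximate.
print(f"{residues_per_second(960_000, 0.005) / 1e6:.0f}M residues/s")
```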

What works today

biors-core provides the Rust engine and data contracts.

biors provides the CLI surface.

Current capabilities:

FASTA parsing and normalization
shared FASTA parser/tokenizer scanner with an ASCII fast path and Unicode fallback
buffered reader APIs for FASTA parse/validate/tokenize paths
FASTA validation with line and record-index diagnostics
FASTA record identifier validation
protein-20 tokenization
protein-20-special tokenization with explicit UNK/PAD/CLS/SEP/MASK policy
tokenizer JSON config loading
tokenizer inspection JSON output
JSON vocab loading for tokenizer contracts
positional token alignment preserved with explicit unknown-token IDs for unresolved residues
residue warning/error reporting
model-ready input records
attention masks
padding/truncation policy
model-input CLI output
workflow CLI output that combines validation, tokenization, model input, readiness issues, and reproducibility provenance
workflow provenance hashes for tokenizer vocabulary and output-content reproducibility
diff CLI output for canonical JSON/raw output comparison with SHA-256 hashes and first-difference metadata
pipeline CLI output for no-config validate -> tokenize -> export workflow composition
pipeline lockfile generation for config-driven workflows with package/model and runtime provenance pins
debug CLI output for sequence -> token -> model-input step inspection and compact residue error visualization
batch validate for multiple files, recursive directory inputs, quoted glob inputs, empty-glob errors, and memory-bounded validation summaries
doctor CLI diagnostics for platform, toolchain, WASM target, and committed fixture readiness
model-input safety checks for unresolved residues
explicit checked and unchecked model-input builders
writer-based CLI success JSON serialization to reduce peak allocations for large outputs
package manifest inspect/validate
typed package validation issue codes
typed package manifest enums for schema version, model format, runtime target, and tensor dtypes
runtime bridge planning reports
manifest-relative asset validation
package preprocessing steps can reference checked pipeline config artifacts
package path escape rejection for manifest and observation assets
SHA-256 package and fixture checksum verification
package fixture verification from observed artifact paths
structured package fixture mismatch issue codes and first-difference reports
committed FASTA, tokenizer, manifest, and verification fixtures
draft model-input contract and reference Python preprocessing parity fixtures
JSON success/error envelopes
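
Several of the capabilities above (special tokens, explicit unknown-token IDs, padding/truncation policy, attention masks) come together when a model-ready record is built. A minimal Python sketch of that idea follows. The special-token IDs, the truncation rule, and the output field names are assumptions made for illustration; they are not taken from the protein-20-special profile or the bio-rs schemas.

```python
PROTEIN_20 = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {residue: i for i, residue in enumerate(PROTEIN_20)}

# Hypothetical special-token IDs placed after the 20 residue IDs.
UNK, PAD, CLS, SEP = 20, 21, 22, 23


def build_model_input(sequence, max_length):
    """Tokenize, truncate, and right-pad to exactly max_length tokens."""
    if max_length < 2:
        raise ValueError("max_length must leave room for CLS and SEP")
    # Unresolved residues keep their position through an explicit UNK ID.
    residue_ids = [TOKEN_ID.get(r, UNK) for r in sequence.upper()]
    # Truncate residues so CLS + residues + SEP fits in max_length.
    residue_ids = residue_ids[: max_length - 2]
    input_ids = [CLS] + residue_ids + [SEP]
    # 1 marks real tokens, 0 marks padding.
    attention_mask = [1] * len(input_ids)
    pad_count = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD] * pad_count,
        "attention_mask": attention_mask + [0] * pad_count,
    }


print(build_model_input("ACDX", max_length=8))
print(build_model_input("ACDEFGHIKL", max_length=8))
```

Mapping an unresolved residue to an unknown-token ID instead of dropping it keeps token positions aligned with residue positions, which is the property the capability list calls positional token alignment.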

Documentation

Quickstart — install, first commands, demos
Launch demo — researcher-facing demo workflow
Installation and distribution — cargo, binaries, completions
CLI contract — commands, JSON envelopes, exit codes
Package format — manifest layout and research metadata
Pipeline config — config-driven static preprocessing workflows
Error code registry
Reliability and input safety
Python interop
WASM readiness
1.0 contract candidates
Versioning policy
Schema versioning
Final release checklist
Changelog
JSON schemas
Citation metadata

Not yet

These are roadmap directions, not current capabilities:

hosted web workflows
Python bindings
model inference backends
package registry or plugin ecosystem
general-purpose chemistry tooling
structure tooling
no-code or low-code workflows

Development

Run checks:

scripts/check.sh

Run the faster local commit gate:

scripts/check-fast.sh

The check suite runs:

cargo fmt
shell and Python syntax checks for repo scripts
benchmark Markdown regeneration check
release workflow publish-order invariant check
Rust checks
biors-core wasm32-unknown-unknown build check
tests
cargo clippy with warnings denied

Reproduce the FASTA benchmark:

cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json

The benchmark script updates both benchmarks/fasta_vs_biopython.json and benchmarks/fasta_vs_biopython.md. scripts/check-benchmark-docs.sh verifies that the Markdown report still matches the JSON artifact.

Compare two benchmark artifacts:

python scripts/compare-benchmark-artifacts.py before.json after.json

Run the Rust library example:

cargo run -p biors-core --example tokenize

Workspace

packages/
  rust/
    biors/       CLI
    biors-core/  Core engine + contracts

schemas/
  batch-validation-output.v0.json
  cli-error.v0.json
  cli-success.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  output-diff.v0.json
  pipeline-output.v0.json
  pipeline-config.v0.json
  pipeline-lock.v0.json
  sequence-workflow-output.v0.json
  sequence-debug-output.v0.json
  package-bridge-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-manifest.v1.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenizer-inspect-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  model-input-contract/
    protein-20-special.config.json
    protein-20-special.expected.json
    reference-python-parity.json
  python/
    esm_from_biors_json.py
    pandas_numpy_friendly.py
    protbert_from_biors_json.py
    reference_preprocess.py
  protein-package/
    models/
    docs/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
    pipelines/
  pipeline/
    protein.toml
    protein.yaml
    protein.json
    pipeline.lock

Protein-20 alphabet

A C D E F G H I K L M N P Q R S T V W Y

Token IDs follow that order, starting at 0.
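
That rule can be written out directly. The snippet below is a Python illustration of the mapping, not the biors-core API; the helper names encode and decode are hypothetical.

```python
PROTEIN_20 = "A C D E F G H I K L M N P Q R S T V W Y".split()

# Token IDs follow alphabet order, starting at 0.
TOKEN_ID = {residue: index for index, residue in enumerate(PROTEIN_20)}


def encode(sequence):
    """Map residues to IDs; raises KeyError for anything outside protein-20."""
    return [TOKEN_ID[residue] for residue in sequence]


def decode(token_ids):
    """Inverse of encode for IDs in the range 0..19."""
    return "".join(PROTEIN_20[i] for i in token_ids)


print(TOKEN_ID["A"], TOKEN_ID["Y"])
print(encode("MKV"))
print(decode(encode("ACDY")))
```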

Contributing

See CONTRIBUTING.md for local setup, checks, and PR expectations.

License

Dual-licensed under MIT OR Apache-2.0. If you use bio-rs in research software or publications, cite the repository and version via CITATION.cff.
