biors 0.5.0

Command-line tools for bio-rs biological AI model input workflows.
The biors crate is a command-line binary, not a library; the library API lives in biors-core.

bio-rs

Rust tools for validating protein FASTA input and tokenizing FASTA records into stable protein-20 token ids.

Features

  • FASTA parsing for one or more protein sequences
  • protein-20 residue validation
  • lowercase sequence normalization
  • ambiguous residue reporting for X, B, Z, J, U, and O
  • invalid residue reporting
  • JSON array output from the CLI
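The three residue categories in the list above can be sketched as a small classifier. This is an illustration, not the biors-core implementation: the `ResidueClass` enum, the `classify_residue` function, and both constants are hypothetical names, and it assumes that lowercase normalization means folding input to uppercase before validation.

```rust
/// Hypothetical classification of a single residue character.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResidueClass {
    /// One of the 20 standard amino acids.
    Standard,
    /// One of the ambiguous codes listed above: X, B, Z, J, U, O.
    Ambiguous,
    /// Anything else.
    Invalid,
}

/// The protein-20 alphabet as listed in this README.
const STANDARD_RESIDUES: &str = "ACDEFGHIKLMNPQRSTVWY";
/// The ambiguous residue codes listed in this README.
const AMBIGUOUS_RESIDUES: &str = "XBZJUO";

/// Classify one residue.
/// Assumption: lowercase letters are folded to uppercase first.
pub fn classify_residue(residue: char) -> ResidueClass {
    let upper = residue.to_ascii_uppercase();
    if STANDARD_RESIDUES.contains(upper) {
        ResidueClass::Standard
    } else if AMBIGUOUS_RESIDUES.contains(upper) {
        ResidueClass::Ambiguous
    } else {
        ResidueClass::Invalid
    }
}
```

Under these assumptions, `classify_residue('a')` is `Standard`, `classify_residue('X')` is `Ambiguous`, and `classify_residue('*')` is `Invalid`. How the real tools map these categories onto the `warnings` and `errors` fields is not shown here.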

Quickstart

Inspect FASTA records:

cargo run -p biors -- inspect examples/protein.fasta

Tokenize FASTA records:

cargo run -p biors -- tokenize examples/protein.fasta

Tokenize FASTA records from stdin:

cat examples/protein.fasta | cargo run -p biors -- tokenize -

Tokenize a multi-record FASTA file:

cargo run -p biors -- tokenize examples/multi.fasta

Use the Rust library:

# Cargo.toml
[dependencies]
biors-core = "0.5.0"

use biors_core::{summarize_tokenized_proteins, tokenize_fasta_records};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse and tokenize a single-record FASTA string.
    let tokenized = tokenize_fasta_records(">seq1\nACDE\n")?;
    // Aggregate per-record results into a summary.
    let summary = summarize_tokenized_proteins(&tokenized);

    assert_eq!(summary.records, 1);
    assert_eq!(tokenized[0].tokens, vec![0, 1, 2, 3]);
    Ok(())
}

JSON Contracts

tokenize always emits an array of records:

[
  {
    "id": "seq1",
    "length": 4,
    "alphabet": "protein-20",
    "valid": true,
    "tokens": [0, 1, 2, 3],
    "warnings": [],
    "errors": []
  }
]

inspect always emits a summary object:

{
  "records": 1,
  "total_length": 4,
  "valid_records": 1,
  "warning_count": 0,
  "error_count": 0
}
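To relate the two contracts, here is a minimal sketch of how a summary with these fields could be derived from tokenized records. The `Record` and `Summary` structs and the `summarize` function are hypothetical stand-ins, not the biors-core types. The sketch assumes that `warning_count` and `error_count` are totals across all records and that warnings and errors can be modeled as strings, neither of which the example output confirms.

```rust
/// Hypothetical subset of one tokenize record (only the fields the summary needs).
pub struct Record {
    pub length: usize,
    pub valid: bool,
    // Assumption: entries are modeled as plain strings for illustration.
    pub warnings: Vec<String>,
    pub errors: Vec<String>,
}

/// Hypothetical mirror of the inspect summary object.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Summary {
    pub records: usize,
    pub total_length: usize,
    pub valid_records: usize,
    pub warning_count: usize,
    pub error_count: usize,
}

/// Fold a slice of records into a summary.
/// Assumption: counts are summed over every record.
pub fn summarize(records: &[Record]) -> Summary {
    let mut summary = Summary::default();
    for record in records {
        summary.records += 1;
        summary.total_length += record.length;
        if record.valid {
            summary.valid_records += 1;
        }
        summary.warning_count += record.warnings.len();
        summary.error_count += record.errors.len();
    }
    summary
}
```

For the single `seq1` record shown earlier, this produces the same numbers as the summary object above.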

Checks

scripts/check.sh

The check suite runs cargo fmt, native Rust checks, a biors-core wasm32-unknown-unknown build check, tests, and cargo clippy with warnings denied.

Run the Rust library example:

cargo run -p biors-core --example tokenize

Workspace

packages/
  rust/
    biors/       CLI
    biors-core/  FASTA parsing and tokenization library
examples/
  multi.fasta
  protein.fasta

Protein-20

A C D E F G H I K L M N P Q R S T V W Y

Token ids follow that order, starting at 0.
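The ordering rule can be illustrated with a short standalone sketch. The `PROTEIN_20_ORDER` constant and the `token_id` and `tokenize_sequence` functions are hypothetical names for this example, not the biors-core API, and the sketch assumes the input is already uppercase.

```rust
/// The protein-20 alphabet in the order listed above.
const PROTEIN_20_ORDER: [char; 20] = [
    'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y',
];

/// Return the token id of a residue: its zero-based position in the alphabet.
/// Residues outside protein-20 yield `None`.
pub fn token_id(residue: char) -> Option<u8> {
    PROTEIN_20_ORDER
        .iter()
        .position(|&r| r == residue)
        .map(|index| index as u8)
}

/// Tokenize a whole sequence, returning `None` if any residue is not in protein-20.
/// Assumption: the sequence is already uppercase.
pub fn tokenize_sequence(sequence: &str) -> Option<Vec<u8>> {
    sequence.chars().map(token_id).collect()
}
```

With this mapping, `tokenize_sequence("ACDE")` yields `Some(vec![0, 1, 2, 3])`, matching the tokens shown in the JSON example, and `token_id('Y')` is `Some(19)`.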

License

Dual licensed under MIT OR Apache-2.0.