biors 0.12.3

Command-line tools for bio-rs biological AI model input workflows.
# bio-rs

[![CI](https://github.com/bio-rs/bio-rs/workflows/CI/badge.svg)](https://github.com/bio-rs/bio-rs/actions)
[![Release](https://github.com/bio-rs/bio-rs/actions/workflows/release.yml/badge.svg)](https://github.com/bio-rs/bio-rs/actions/workflows/release.yml)
[![Benchmark](https://img.shields.io/badge/benchmark-UniProt%20FASTA-blue)](benchmarks/fasta_vs_biopython.md)
[![Contracts](https://img.shields.io/badge/contracts-JSON%20v0-blue)](docs/public-contract-1.0-candidates.md)
[![License: MIT/Apache-2.0](https://img.shields.io/badge/License-MIT%2FApache--2.0-blue.svg)](LICENSE-MIT)

bio-rs turns biological sequences into validated, model-ready inputs for bio-AI workflows.

```txt
FASTA -> validated protein sequence -> token ids -> model-ready JSON
```

> Status: pre-1.0 CLI and JSON contract stabilization.

## Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

- local CLIs
- CI pipelines
- servers
- browsers
- agents

bio-rs focuses on the boring but important layer before inference:

- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.

## Quickstart

Install the published CLI:

```bash
cargo install biors --version 0.12.3
biors --version
```

Tokenize a FASTA file:

```bash
biors tokenize examples/protein.fasta
```

Pipe FASTA through stdin:

```bash
printf '>tiny\nACDE\n' | biors tokenize -
```

Validate FASTA:

```bash
biors fasta validate examples/protein.fasta
```

Verify package fixture outputs:

```bash
biors package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json
```

Build model-ready input records:

```bash
biors model-input --max-length 8 examples/protein.fasta
```

## Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Latest recorded FASTA benchmark baseline:

| Dataset | Matched workload | bio-rs core mean | Biopython mean | bio-rs speedup |
|---|---|---:|---:|---:|
| Human proteome | Parse + validation | **0.041s** | 0.460s | **11.33x** |
| Human proteome | Parse + tokenization | **0.085s** | 0.459s | **5.42x** |
| 100MB+ FASTA | Parse + validation | **0.323s** | 4.134s | **12.81x** |
| 100MB+ FASTA | Parse + tokenization | **0.741s** | 4.145s | **5.59x** |
| Many short records | Parse + validation | **0.009s** | 0.058s | **6.73x** |
| Many short records | Parse + tokenization | **0.018s** | 0.059s | **3.32x** |
| Single long sequence | Parse + validation | **0.007s** | 0.035s | **4.97x** |
| Single long sequence | Parse + tokenization | **0.009s** | 0.035s | **3.76x** |

Benchmark details:

- Datasets:
  - UniProt human reference proteome (`UP000005640`, `9606`)
  - 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
  - 20,000 short 48-residue records generated from the same proteome residue stream
  - one 960,000-residue sequence generated from the same proteome residue stream
- Matched workloads:
  - pure parse
  - parse plus validation
  - parse plus tokenization
- Current best recorded raw throughput:
  - human proteome parse + validation: `282.4M residues/s`, `322.9 MB/s`
  - 100MB+ FASTA parse + validation: `319.6M residues/s`, `365.4 MB/s`
  - human proteome parse + tokenization: `135.3M residues/s`, `154.7 MB/s`
  - 100MB+ FASTA parse + tokenization: `139.1M residues/s`, `159.0 MB/s`
- Benchmark doc: [benchmarks/fasta_vs_biopython.md](benchmarks/fasta_vs_biopython.md)
- Benchmark script: [scripts/benchmark_fasta_vs_biopython.py](scripts/benchmark_fasta_vs_biopython.py)

This benchmark measures `biors-core` directly and excludes CLI startup and JSON
serialization overhead. The results are workload-specific: they are not a claim
that bio-rs is faster than Biopython for every FASTA workload or input shape.

## What works today

`biors-core` provides the Rust engine and data contracts.

`biors` provides the CLI surface.

Current capabilities:

- FASTA parsing and normalization
- shared FASTA parser/tokenizer scanner with an ASCII fast path and Unicode fallback
- buffered reader APIs for FASTA parse/validate/tokenize paths
- FASTA validation with line and record-index diagnostics
- FASTA record identifier validation
- protein-20 tokenization
- JSON vocab loading for tokenizer contracts
- positional token alignment preserved with explicit unknown-token IDs for unresolved residues
- residue warning/error reporting
- model-ready input records
- attention masks
- padding/truncation policy
- `model-input` CLI output
- model-input safety checks for unresolved residues
- explicit checked and unchecked model-input builders
- writer-based CLI success JSON serialization to reduce peak allocations for large outputs
- package manifest inspect/validate
- typed package validation issue codes
- typed package manifest enums for schema version, model format, runtime target, and tensor dtypes
- runtime bridge planning reports
- manifest-relative asset validation
- package path escape rejection for manifest and observation assets
- SHA-256 package and fixture checksum verification
- package fixture verification from observed artifact paths
- structured package fixture mismatch issue codes and first-difference reports
- committed FASTA, tokenizer, manifest, and verification fixtures
- JSON success/error envelopes
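
To make the padding/truncation and attention-mask items above concrete, here is a minimal sketch of how a fixed-length record could be built from token IDs. It is not the `biors-core` implementation: the `ModelInput` struct, its field names, the pad ID `20`, and the choice to pad and truncate on the right are all assumptions made for illustration.

```rust
/// Hypothetical record shape; the field names are assumptions, not the biors schema.
#[derive(Debug, PartialEq)]
struct ModelInput {
    input_ids: Vec<u8>,
    attention_mask: Vec<u8>,
}

/// Truncate or right-pad `tokens` to exactly `max_length` entries.
/// The mask holds 1 for real tokens and 0 for padding.
fn build_model_input(tokens: &[u8], max_length: usize, pad_id: u8) -> ModelInput {
    // Keep at most `max_length` tokens (right truncation, an assumed policy).
    let kept = tokens.len().min(max_length);
    let mut input_ids = tokens[..kept].to_vec();
    let mut attention_mask = vec![1u8; kept];

    // Fill the remainder with the pad ID and a zero mask (right padding).
    input_ids.resize(max_length, pad_id);
    attention_mask.resize(max_length, 0);

    ModelInput { input_ids, attention_mask }
}

fn main() {
    // Pad ID 20 is a hypothetical value just past the protein-20 range 0..=19.
    let record = build_model_input(&[0, 1, 2, 3], 8, 20);
    println!("{:?}", record);
}
```

A checked builder in this style would additionally refuse input that still contains unresolved residues, which is the role the safety checks listed above appear to play.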

## CLI examples

Inspect FASTA records:

```bash
cargo run -p biors -- inspect examples/protein.fasta
```

Tokenize FASTA records:

```bash
cargo run -p biors -- tokenize examples/protein.fasta
```

Tokenize a multi-record FASTA file:

```bash
cargo run -p biors -- tokenize examples/multi.fasta
```

Validate FASTA records:

```bash
cargo run -p biors -- fasta validate examples/protein.fasta
```

Emit structured JSON errors:

```bash
printf 'ACDE\n' | cargo run -p biors -- --json tokenize -
```

Build model-ready input records:

```bash
cargo run -p biors -- model-input --max-length 4 examples/protein.fasta
```

Inspect a package manifest:

```bash
cargo run -p biors -- package inspect examples/protein-package/manifest.json
```

Validate a package manifest:

```bash
cargo run -p biors -- package validate examples/protein-package/manifest.json
```

Plan a runtime bridge from a package manifest:

```bash
cargo run -p biors -- package bridge examples/protein-package/manifest.json
```

Verify package fixture observations:

```bash
cargo run -p biors -- package verify \
  examples/protein-package/manifest.json \
  examples/protein-package/observations.json
```

`package verify` expects the observations file to point at observed output artifact paths:

```json
[
  {
    "name": "tiny-protein",
    "path": "observed/tiny.output.json"
  }
]
```

## JSON contracts

Success output uses a stable envelope shape:

```json
{
  "ok": true,
  "biors_version": "0.x.y",
  "input_hash": "fnv1a64:846a502e5067bc21",
  "data": {}
}
```

FASTA-backed commands keep `input_hash` in the legacy `fnv1a64:` format for backward compatibility. Package artifacts and fixture hashes use `sha256:` in manifests and verification reports.
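
For readers unfamiliar with the `fnv1a64:` prefix, the following is a minimal sketch of the standard 64-bit FNV-1a hash, formatted in the same shape. Which bytes `biors` actually feeds into the hash (raw input or normalized records) is not stated here, so this sketch is not expected to reproduce the example value above; `format_input_hash` is a hypothetical helper name.

```rust
/// Standard 64-bit FNV-1a over a byte slice.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        // FNV-1a: XOR the byte in first, then multiply by the prime.
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical helper: render a hash as `fnv1a64:<16 lowercase hex digits>`.
fn format_input_hash(bytes: &[u8]) -> String {
    format!("fnv1a64:{:016x}", fnv1a64(bytes))
}

fn main() {
    println!("{}", format_input_hash(b">tiny\nACDE\n"));
}
```

FNV-1a is fast and dependency-free but not collision-resistant, which fits its use here as an input fingerprint while `sha256:` covers package integrity.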

`--json` error mode emits structured errors:

```json
{
  "ok": false,
  "error": {
    "code": "fasta.missing_header",
    "message": "FASTA input must start with a header line beginning with '>' at line 1",
    "location": {
      "line": 1,
      "record_index": null
    }
  }
}
```

Tokenization output is record-oriented:

```json
[
  {
    "id": "seq1",
    "length": 4,
    "alphabet": "protein-20",
    "valid": true,
    "tokens": [0, 1, 2, 3],
    "warnings": [],
    "errors": []
  }
]
```

Public contract docs:

- [Quickstart](docs/quickstart.md)
- [Professional readiness](docs/professional-readiness.md)
- [CLI contract](docs/cli-contract.md)
- [Error code registry](docs/error-codes.md)
- [1.0 contract candidates](docs/public-contract-1.0-candidates.md)
- [1.0 release candidate path](docs/release-candidate-1.0.md)
- [API and schema review](docs/api-review.md)
- [MSRV policy draft](docs/msrv.md)
- [Versioning policy](docs/versioning.md)
- [JSON schemas](schemas)
- [Citation metadata](CITATION.cff)

## Release history

Delivered:

- `0.12.3`: byte-buffered FASTA reader scanning, static residue/token lookup tables, and expanded shape-profile benchmark proof assets
- `0.12.2`: published CLI `--version` support and version-verification docs for reproducible installs
- `0.12.1`: release workflow publish-order guard, published CLI quickstart, professional-readiness audit, and summary-only FASTA inspect path
- `0.12.0`: release-candidate documentation, full workflow e2e coverage, MSRV/citation policy drafts, and release notes
- `0.11.0`: benchmark reproducibility metadata, generated benchmark report checks, and refreshed speed/memory proof assets
- `0.10.0`: fixture and verification hardening with shared byte-aware FASTA scanning, tokenizer invariants, and structured mismatch reports
- `0.9.8`: tokenization lookup and CLI JSON writer performance improvements with refreshed reader-based benchmarks
- `0.9.7`: buffered FASTA reader APIs, typed package validation issues, CLI module refactor, and explicit model-input builder safety
- `0.9.6`: FASTA identifier validation, model-input policy validation, package path escape rejection, and JSON vocab loading
- `0.9.5`: core-throughput benchmark harness, matched-workload benchmark refresh, workflow/cache tightening, and git-hook install helper
- `0.9.4`: tokenizer positional alignment preservation, FASTA single-pass tokenization/validation path, typed package manifest enums, and benchmark refresh
- `0.9.3`: release workflow fix for automatic GitHub Release creation after crates publish
- `0.9.2`: model-input safety hardening for unresolved residues and automated GitHub Release creation
- `0.9.1`: model-input CLI, checksum-backed package validation, benchmark refresh, and contract hardening
- `0.9.0`: CLI and JSON contract freeze baseline
- `0.8.1`: documentation, contribution guide, and benchmark baseline hardening
- `0.8.0`: fixture verification with `package verify`
- `0.7.0`: runtime bridge planning with `package bridge`
- `0.6.0`: package manifest inspect/validate

Next:

- first stable release: stable public contracts and runtime-facing APIs after enough real-world package validation

## Not yet

These are roadmap directions, not current capabilities:

- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows

## Development

Run checks:

```bash
scripts/check.sh
```

Run the faster local commit gate:

```bash
scripts/check-fast.sh
```

The check suite runs:

- `cargo fmt`
- shell and Python syntax checks for repo scripts
- benchmark Markdown regeneration check
- release workflow publish-order invariant check
- Rust checks
- `biors-core` `wasm32-unknown-unknown` build check
- tests
- `cargo clippy` with warnings denied

Reproduce the FASTA benchmark:

```bash
cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json
```

The benchmark script updates both `benchmarks/fasta_vs_biopython.json` and
`benchmarks/fasta_vs_biopython.md`. `scripts/check-benchmark-docs.sh` verifies
that the Markdown report still matches the JSON artifact.

Run the Rust library example:

```bash
cargo run -p biors-core --example tokenize
```

## Workspace

```txt
packages/
  rust/
    biors/       CLI
    biors-core/  Core engine + contracts

schemas/
  cli-error.v0.json
  cli-success.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  package-bridge-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  protein-package/
    models/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
```

## Protein-20 alphabet

```txt
A C D E F G H I K L M N P Q R S T V W Y
```

Token IDs follow that order, starting at `0`.
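
The mapping can be sketched as a small static lookup table. This is an illustration of the ordering rule above, not the code in `biors-core`: the names `PROTEIN_20`, `TOKEN_TABLE`, and `token_id` are hypothetical, and the sketch neither normalizes case nor assigns an unknown-token ID.

```rust
/// Protein-20 residues in documented order; a residue's token ID is its index.
const PROTEIN_20: [u8; 20] = *b"ACDEFGHIKLMNPQRSTVWY";

/// Build a 256-entry table at compile time: byte -> Some(token ID) or None.
const fn build_table() -> [Option<u8>; 256] {
    let mut table = [None; 256];
    let mut i = 0;
    while i < PROTEIN_20.len() {
        table[PROTEIN_20[i] as usize] = Some(i as u8);
        i += 1;
    }
    table
}

static TOKEN_TABLE: [Option<u8>; 256] = build_table();

/// Look up one residue byte; returns None for anything outside protein-20.
fn token_id(residue: u8) -> Option<u8> {
    TOKEN_TABLE[residue as usize]
}

fn main() {
    let ids: Vec<Option<u8>> = b"ACDE".iter().map(|&r| token_id(r)).collect();
    println!("{:?}", ids);
}
```

For the `ACDE` input used in the quickstart, this yields IDs `0, 1, 2, 3`, matching the `tokens` array in the tokenization output example.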

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for local setup, checks, and PR expectations.

## License

Dual-licensed under MIT OR Apache-2.0. If you use bio-rs in research software
or publications, cite the repository and version via [CITATION.cff](CITATION.cff).