openproteo-core 0.1.0

Shared types, traits, and mzML writer for open Rust mass-spec parsers.
Documentation
# OpenProteo Roadmap

`openproteo-core` is the shared foundation for the open vendor parsers
(`opentfraw`, `opentimstdf`, `openwraw`). Each vendor crate stays
standalone (a user can pull in `opentfraw` alone and get mzML export);
the core crate defines the shared vocabulary, the `SpectrumSource`
trait, the canonical mzML writer, an Arrow bridge, and the cross-vendor
conformance harness.

`openproteo-io` is the umbrella that ties the three vendor parsers
together behind a uniform `detect_format` + `convert_to_mzml` API and
ships the `vendor2mzml` CLI.

This file is tracked in git and ships with the crate so external
contributors can see where the project is heading. Edit freely; the
plan evolves with each phase.

Status legend: `[x]` shipped, `[~]` in progress, `[ ]` planned.

## Phase 0 - Foundation - SHIPPED

- [x] Core types, `SpectrumSource` trait, canonical mzML 1.1.0 writer

## Phase 1 - Vendor alignment - SHIPPED

- [x] opentfraw, opentimstdf, openwraw all emit mzML through
      `openproteo-core::write_mzml` / `write_indexed_mzml`

## Phase 2 - SDK consolidation - SHIPPED

- [x] **2a Waters parity** (`OpenWRaw`): polarity from `_extern.inf`,
      MS-level from `FunctionMode` (MSe split into MS1/MS2 by index),
      IMS drift pass-through via `inv_mobility_per_peak`.
- [x] **2b Arrow bridge** (`openproteo-core`, `arrow` feature):
      `spectrum_record_schema()` + `SpectrumBatchBuilder` with
      `LargeList<Float64/Float32>` peak columns and a flat precursor
      block. Default builds remain zero-dep.
- [x] **2c Conformance harness** (`openproteo-core::conformance`):
      `assert_source_invariants` checks peak-array length parity,
      TIC + base-peak consistency, MS2 precursor presence, RT
      monotonicity per native-ID stream, and index sequencing. Wired
      into all three vendor crates via `tests/conformance.rs`.
      Validated on 3047 (Thermo), 24425 (Bruker TIMS), 590 (Waters
      IMS) spectra.
- [x] **2d Umbrella + CLI** (`OpenProteo`): `openproteo-io` library
      with feature-gated `thermo / bruker / waters / all` vendor
      re-exports, runtime `detect_format()` probe (Thermo Finnigan
      magic at offset 2, Bruker `analysis.tdf+_bin` in `.d/`, Waters
      `_HEADER.TXT` in `.raw/`), `convert_to_mzml()` dispatch. CLI
      `vendor2mzml <input> <output.mzML> [--indexed]`. End-to-end
      smoke test covers all three vendors (12 MB / 564 MB / 26 MB
      mzML).

### Carry-overs

- `openproteo-io-py` (PyO3 bindings) - deferred into Phase 3.

## Phase 3 - Distribution + ergonomics (PROPOSED WORK PACKAGE)

The Rust + mzML pipeline is now complete and validated. Phase 3 is
about making it usable from outside Rust and shippable to end users.

### 3a. `openproteo-io-py` (Python bindings)

Single PyO3 module that exposes the umbrella, not three separate
vendor modules. PEP 503 distribution name `openproteo-io`
(import name `openproteo_io`).

- [ ] `openproteo_io.detect(path: str) -> VendorFormat`
- [ ] `openproteo_io.to_mzml(input: str, output: str, *, indexed: bool=True)`
- [ ] `openproteo_io.iter_spectra(path: str) -> Iterator[Spectrum]`
      yielding objects with **zero-copy NumPy views** over the m/z,
      intensity, and (optional) inverse-mobility arrays via
      `numpy::PyArray1::from_slice_bound`.
- [ ] `openproteo_io.read_arrow(path: str) -> pyarrow.RecordBatchReader`
      backed by `SpectrumBatchBuilder` (requires the `arrow` feature).
- [ ] maturin build, abi3-py39 wheels, manylinux2014.
- [ ] pytest smoke (Thermo / Bruker / Waters round-trips).

### 3b. CLI polish (SHIPPED)

`vendor2mzml` is now a clap-derive binary with `convert` + `info`
sub-commands, validation, and gzip output.

- [x] Replace ad-hoc arg parsing with `clap` derive.
- [x] `--validate` post-pass running the conformance harness on the
      source before writing (skip-write on failure, exit 3).
- [x] `--profile json|text` emitting timing + record counts to stderr.
- [x] Auto-detect output format from extension (`.mzML.gz` -> gzip via
      flate2).
- [x] `vendor2mzml info <input>` sub-command: prints vendor, instrument,
      polarity, MS-level breakdown, spectrum count, RT range; `--json`
      for machine-readable output.
- [x] `--version` reports `vendor2mzml` + `openproteo-core` versions.

### 3c. Release engineering (SHIPPED workflows; publish deferred)

GitHub Actions workflows are in place for both repos. Tag a release
with `vX.Y.Z` and the workflows attach `vendor2mzml` archives + Python
wheels to the GitHub Release. Publish to crates.io / PyPI is
intentionally deferred per Phase 3 decision.

- [ ] Publish `openproteo-core` 0.1.0 to crates.io. *(deferred)*
- [ ] Publish `openproteo-io` 0.1.0 to crates.io once the vendor
      crates' new tags are out. *(deferred)*
- [x] GitHub Actions: per-repo CI (rustfmt + clippy + test + msrv 1.75),
      release workflow uploading `vendor2mzml` binaries for linux-x64,
      linux-aarch64, macos-x64, macos-arm64, windows-x64.
- [x] Python wheel workflow (maturin, abi3-py39) covering the same
      five targets plus an sdist.
- [ ] Re-validate byte-identical mzML diff on the PXD058812 Waters
      baseline after 2a landed (sanity check). *(open)*
- [ ] CHANGELOGs synced across the five repos. *(open)*

### 3d. Documentation (SHIPPED)

`OpenProteo/docs/` is an mdBook with chapters for install, three
quickstarts (CLI / Rust / Python), the format-detection rules, the
conformance contract, the Arrow schema, per-vendor coverage notes,
and the architecture / crate-layout / pure-Rust design rationale.

- [x] `OpenProteo/docs/` mdBook: quickstart for Rust + Python +
      `vendor2mzml`, format-detection notes, conformance contract,
      design rationale.
- [x] Per-vendor chapter listing covered MassLynx / Finnigan / TDF
      features and known gaps.

## Phase 4 - Performance + scale (later)

- Criterion bench suite gated on PRs.
- Rayon intra-file ingest for indexed writes.
- Async / object-store readers (S3, GCS) behind `openproteo-core::io`.
- Streaming Arrow writes to Parquet (DataFusion compatibility).
- Lance corpus exporter via `openproteo-core::arrow` + lance-rs.

## Phase 5+ - Search-engine bridge (later)

- `ProLance` integration via one `SpectrumSource`-consuming ingester.
- Sage / Comet-rs adapter behind `openproteo-io::search`.
- Spectral library export (BLIB, MSP).

## Design invariants

1. Each vendor crate is self-contained. `openproteo-core` is a leaf
   dependency, never a runtime hub.
2. `SpectrumRecord` is the only handshake.
3. RT in seconds; mzML writes minutes at the boundary.
4. `native_id` is verbatim in the vendor's canonical format.
5. No `unsafe` anywhere in the suite (`unsafe_code = "forbid"`).