rsomics-bioenv 0.1.0

# rsomics-bioenv

BIO-ENV / BEST: find the subset of environmental variables whose standardized
Euclidean distances are maximally rank-correlated with a community distance
matrix (Clarke & Ainsworth 1993). Drop-in compatible with
`skbio.stats.distance.bioenv`.

```
rsomics-bioenv dm.tsv --env env.tsv [--columns a,b,c] [-o result.tsv]
```

`dm.tsv` is an lsmat-format community distance matrix (a blank top-left corner,
a tab-separated id header, then one `id<TAB>values…` row per sample). `env.tsv`
is a samples × variables table: a header `<id-label><TAB>var1<TAB>var2…`, then
one `sampleid<TAB>v1<TAB>v2…` row per sample. Env rows are reindexed onto the
distance-matrix ids, so the ids must match but need not be in the same order;
extra env rows are ignored. All variable values must be numeric.
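
The reindexing step can be sketched as follows. This is an illustrative sketch, not the crate's actual code: the function name `reindex_env`, its signature, and the error messages are hypothetical, and treating a distance-matrix id that is missing from the env table as an error is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parse an env table (header row, then
/// `sampleid<TAB>values...` rows) and return the variable names plus the
/// rows reordered to follow `dm_ids`.
fn reindex_env(
    env_tsv: &str,
    dm_ids: &[&str],
) -> Result<(Vec<String>, Vec<Vec<f64>>), String> {
    let mut lines = env_tsv.lines().filter(|l| !l.trim().is_empty());
    let header = lines.next().ok_or("empty env table")?;
    // The first header cell is the id label; the rest are variable names.
    let vars: Vec<String> = header.split('\t').skip(1).map(str::to_string).collect();

    let mut by_id: HashMap<&str, Vec<f64>> = HashMap::new();
    for line in lines {
        let mut cells = line.split('\t');
        let id = cells.next().ok_or("missing sample id")?;
        // Every value must parse as a number.
        let values = cells
            .map(|c| {
                c.trim()
                    .parse::<f64>()
                    .map_err(|_| format!("non-numeric value {c:?} for sample {id}"))
            })
            .collect::<Result<Vec<f64>, String>>()?;
        if values.len() != vars.len() {
            return Err(format!(
                "sample {id} has {} values, expected {}",
                values.len(),
                vars.len()
            ));
        }
        by_id.insert(id, values);
    }

    // Follow the distance-matrix order; env rows not listed there go unused.
    let rows = dm_ids
        .iter()
        .map(|id| {
            by_id
                .get(id)
                .cloned()
                .ok_or_else(|| format!("sample {id} missing from env table"))
        })
        .collect::<Result<Vec<_>, String>>()?;
    Ok((vars, rows))
}
```

Calling `reindex_env(env_text, &["S1", "S2"])` on a table whose rows appear as `S2`, `S1`, `S3` would return the `S1` row first and leave `S3` out.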

For each subset size from 1 to the number of variables, every variable subset
of that size is evaluated and the one with the highest correlation is reported.
Before the search, each variable is standardized (centered, then divided by its
sample standard deviation). For a given subset, Euclidean distances between
every pair of samples are computed over the matrix's upper triangle, and
Spearman's ρ is taken against the corresponding community distances. Output is a
TSV with one row per subset size: `size`, the best `correlation`, and the
comma-joined `vars`.
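
A minimal sketch of the standardization and distance steps, assuming the environmental data is held column-major (one `Vec<f64>` per variable). The function names `standardize` and `condensed_euclidean` are hypothetical and not taken from the crate.

```rust
/// Standardize each variable: subtract its mean, then divide by the sample
/// standard deviation (n - 1 in the denominator).
fn standardize(columns: &[Vec<f64>]) -> Vec<Vec<f64>> {
    columns
        .iter()
        .map(|col| {
            let n = col.len() as f64;
            let mean = col.iter().sum::<f64>() / n;
            let var = col.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
            let sd = var.sqrt();
            col.iter().map(|v| (v - mean) / sd).collect()
        })
        .collect()
}

/// Euclidean distances between every sample pair (i < j), restricted to the
/// variables in `subset`, laid out in upper-triangle row-major order.
fn condensed_euclidean(columns: &[Vec<f64>], subset: &[usize]) -> Vec<f64> {
    let n = columns.first().map_or(0, |c| c.len());
    let mut out = Vec::with_capacity(n * n.saturating_sub(1) / 2);
    for i in 0..n {
        for j in (i + 1)..n {
            let sq: f64 = subset
                .iter()
                .map(|&k| (columns[k][i] - columns[k][j]).powi(2))
                .sum();
            out.push(sq.sqrt());
        }
    }
    out
}
```

For a single variable with values `1, 2, 3`, standardization gives `-1, 0, 1`, and the condensed distances for the pairs (0,1), (0,2), (1,2) are `1, 2, 1`. A constant variable has zero standard deviation and would produce non-finite values in this sketch, which does not guard against that case.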

The search is exhaustive: with p variables it evaluates all 2^p - 1 non-empty
subsets, so runtime grows exponentially with the variable count. scikit-bio
carries the same warning.
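
The shape of the per-size exhaustive search can be sketched like this. The names `combinations` and `best_per_size` are hypothetical, the scoring function is passed in as a closure rather than tied to any particular correlation, and enumerating subsets in lexicographic order is an assumption about how "first subset" is defined on ties.

```rust
/// All `k`-element subsets of `0..n`, in lexicographic order.
fn combinations(n: usize, k: usize) -> Vec<Vec<usize>> {
    fn extend(
        start: usize,
        n: usize,
        k: usize,
        current: &mut Vec<usize>,
        out: &mut Vec<Vec<usize>>,
    ) {
        if current.len() == k {
            out.push(current.clone());
            return;
        }
        for i in start..n {
            current.push(i);
            extend(i + 1, n, k, current, out);
            current.pop();
        }
    }
    let mut out = Vec::new();
    extend(0, n, k, &mut Vec::new(), &mut out);
    out
}

/// For each size 1..=n, return (size, best score, best subset).
/// The strict `>` comparison keeps the first subset seen when scores tie.
fn best_per_size(
    n: usize,
    score: impl Fn(&[usize]) -> f64,
) -> Vec<(usize, f64, Vec<usize>)> {
    (1..=n)
        .map(|k| {
            let mut best: Option<(f64, Vec<usize>)> = None;
            for subset in combinations(n, k) {
                let s = score(&subset);
                if best.as_ref().map_or(true, |(b, _)| s > *b) {
                    best = Some((s, subset));
                }
            }
            let (s, subset) = best.expect("1 <= k <= n yields at least one subset");
            (k, s, subset)
        })
        .collect()
}
```

Summing `combinations(n, k).len()` over k from 1 to n gives 2^n - 1, which is 15 for four variables and 1,048,575 for twenty. This sketch materializes every subset of a given size at once for clarity; a streaming enumeration would use less memory.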

## Origin

This crate is an independent Rust reimplementation of
`skbio.stats.distance.bioenv`, informed by its BSD-3-licensed source (the
center-and-scale standardization with sample standard deviation, the
upper-triangle condensed Euclidean distances, the per-subset-size exhaustive
search, and the "first subset on a tie" rule matching `vegan::bioenv`) and by
the method's primary reference:

- Clarke, K. R. & Ainsworth, M. (1993). "A method of linking multivariate
  community structure to environmental variables." *Marine Ecology Progress
  Series* 92: 205–219. doi:10.3354/meps092205.

Spearman's ρ is computed as Pearson correlation on average-ranked distances,
matching `scipy.stats.spearmanr`; the community ranks are centered once and
reused across all subsets.
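
As an illustration of that computation, here is a minimal sketch of Spearman's ρ as Pearson correlation on average ranks. The helper names (`average_ranks`, `centered`, `pearson_centered`, `spearman`) are hypothetical and not the crate's API. In a search loop, the centered ranks of the community distances would be computed once and passed to `pearson_centered` for every subset.

```rust
/// 1-based ranks; tied values all receive the mean of the ranks they span.
fn average_ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut ranks = vec![0.0; values.len()];
    let mut start = 0;
    while start < order.len() {
        // Find the end of the run of equal values beginning at `start`.
        let mut end = start + 1;
        while end < order.len() && values[order[end]] == values[order[start]] {
            end += 1;
        }
        // Sorted positions start..end carry ranks start+1..=end; average them.
        let avg = (start + 1 + end) as f64 / 2.0;
        for &idx in &order[start..end] {
            ranks[idx] = avg;
        }
        start = end;
    }
    ranks
}

/// Subtract the mean from every element.
fn centered(values: &[f64]) -> Vec<f64> {
    let mean = values.iter().sum::<f64>() / values.len() as f64;
    values.iter().map(|v| v - mean).collect()
}

/// Pearson correlation of two vectors that are already centered.
fn pearson_centered(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}

/// Spearman's rho: Pearson correlation of the average ranks.
fn spearman(a: &[f64], b: &[f64]) -> f64 {
    pearson_centered(&centered(&average_ranks(a)), &centered(&average_ranks(b)))
}
```

With this sketch, `average_ranks(&[3.0, 1.0, 3.0, 2.0])` yields `[3.5, 1.0, 3.5, 2.0]`, and two strictly increasing sequences correlate at 1 up to floating-point rounding.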

License: MIT OR Apache-2.0.
Upstream credit: scikit-bio <https://scikit-bio.org> (BSD-3-Clause).