SERRF normalization for metabolomics data.
This crate implements Systematic Error Removal using Random Forests (SERRF), a method for correcting measurement drift and batch effects in untargeted metabolomics LC-MS datasets. A random forest is trained per compound on quality-control (QC) samples to model the relationship between systematic error and injection order, batch, and (optionally) correlated compounds. The learned model is then applied to all samples to remove the predicted drift component while preserving biological signal.
§Quick start
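    use serrf::{load_input, normalize, SerrfConfig};
    use std::fs::File;

    // The `?` operators assume an enclosing function that returns a compatible Result.
    // Read both CSVs and join them into a SerrfInput.
    let input = load_input(
        File::open("intensities.csv")?,
        File::open("metadata.csv")?,
    )?;

    // Run SERRF with the default configuration.
    let output = normalize(&input, &SerrfConfig::default())?;

    // Report how many compounds were normalized and how the median QC RSD changed.
    println!(
        "Normalized {}/{} compounds. Median QC RSD: {:.3} -> {:.3}",
        output.report.compounds_normalized,
        output.report.compounds_total,
        output.report.median_qc_train_rsd_before.unwrap_or(f64::NAN),
        output.report.median_qc_train_rsd_after.unwrap_or(f64::NAN),
    );

§Input format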
- Intensity CSV - the first column is the compound ID; each remaining column holds one sample, with the sample ID as its header. Missing values are empty cells (stored as NaN internally).
- Metadata CSV - required columns: `sample_id`, `sample_type` (`qc`, `sample`, or `validate`), `batch`, `injection_order`. Sample IDs must match the intensity columns exactly.
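To make the two layouts concrete, here is a minimal sketch using only the standard library. The file contents, the `compound_id` header name, the numeric batch labels, and all helper names are illustrative assumptions, not part of the crate. The sketch splits on commas naively (no quoted-field handling) and treats "match exactly" as set equality, which may be looser or stricter than what `load_input` enforces.

```rust
use std::collections::HashSet;

/// Illustrative intensity CSV: one row per compound, one column per sample.
/// The empty cell in the `lactate` row is a missing value.
const INTENSITIES_CSV: &str = "\
compound_id,QC1,S1,QC2
glucose,1000.5,980.2,1012.0
lactate,,455.1,470.3
";

/// Illustrative metadata CSV with the four required columns.
const METADATA_CSV: &str = "\
sample_id,sample_type,batch,injection_order
QC1,qc,1,1
S1,sample,1,2
QC2,qc,1,3
";

/// Sample IDs from the intensity header row (every column after the first).
fn intensity_sample_ids(csv: &str) -> Vec<&str> {
    csv.lines()
        .next()
        .map(|header| header.split(',').skip(1).map(str::trim).collect())
        .unwrap_or_default()
}

/// Sample IDs from the `sample_id` column of the metadata CSV.
fn metadata_sample_ids(csv: &str) -> Vec<&str> {
    let mut lines = csv.lines();
    let header = match lines.next() {
        Some(h) => h,
        None => return Vec::new(),
    };
    let idx = match header.split(',').position(|c| c.trim() == "sample_id") {
        Some(i) => i,
        None => return Vec::new(),
    };
    lines
        .filter_map(|row| row.split(',').nth(idx))
        .map(str::trim)
        .collect()
}

/// Parse one intensity cell. Empty cells become NaN; in this sketch,
/// unparseable cells also become NaN (an assumption, not documented behavior).
fn parse_cell(cell: &str) -> f64 {
    let cell = cell.trim();
    if cell.is_empty() {
        f64::NAN
    } else {
        cell.parse().unwrap_or(f64::NAN)
    }
}

/// True when both files name the same set of samples (order is ignored here).
fn ids_match(intensities_csv: &str, metadata_csv: &str) -> bool {
    let a: HashSet<&str> = intensity_sample_ids(intensities_csv).into_iter().collect();
    let b: HashSet<&str> = metadata_sample_ids(metadata_csv).into_iter().collect();
    a == b
}
```

Calling `ids_match(INTENSITIES_CSV, METADATA_CSV)` on the two constants is a cheap pre-flight check before handing real files to `load_input`.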
§Configuration
SerrfConfig controls all tunable parameters. The defaults are reasonable starting
points; the most impactful options are:
| Field | Effect |
|---|---|
| `n_trees` | Forest size; more trees = more stable, slower |
| `log_space` | Train on log-transformed intensities; recommended for log-normal data |
| `use_correlated_compounds` | Add other compound intensities as features; increases runtime |
| `qc_validation_fraction` | Fraction of QC samples held out to measure normalization quality |
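Overriding a few options while keeping the rest at their defaults is usually done with Rust's struct update syntax. The sketch below shows the pattern on a hypothetical stand-in struct. `ExampleConfig` is not the crate's `SerrfConfig`: only the field names come from the table above, while the field types, the default values, and the assumption that the fields are public are all guesses made for illustration.

```rust
/// Hypothetical stand-in for illustration only; NOT the crate's `SerrfConfig`.
/// Field names follow the documentation table; types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct ExampleConfig {
    n_trees: usize,
    log_space: bool,
    use_correlated_compounds: bool,
    qc_validation_fraction: f64,
}

impl Default for ExampleConfig {
    fn default() -> Self {
        // Placeholder values, not the crate's real defaults.
        Self {
            n_trees: 100,
            log_space: true,
            use_correlated_compounds: false,
            qc_validation_fraction: 0.2,
        }
    }
}

/// Override two fields and take every other field from `Default`.
fn slower_but_steadier() -> ExampleConfig {
    ExampleConfig {
        n_trees: 500,
        use_correlated_compounds: true,
        ..ExampleConfig::default()
    }
}
```

If `SerrfConfig` exposes public fields, the same `..SerrfConfig::default()` form should apply; if it uses a builder or setters instead, consult the struct's own page.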
§Output
normalize returns a SerrfOutput containing:
- `intensities` - normalized intensity matrix (same shape as input; skipped compounds carry raw values)
- `report` - `SerrfReport` with per-compound statistics, RSD improvement ratios, and any warnings
Structs§
- `CompoundReport` - Per-compound normalization outcome and quality metrics.
- `IntensityMatrix` - Intensity matrix with compound and sample identifiers.
- `SampleMeta` - Per-sample metadata required for normalization.
- `SerrfConfig` - Algorithm configuration passed to `normalize`.
- `SerrfInput` - Combined intensity matrix and sample metadata.
- `SerrfOutput` - Normalized intensities and a summary report produced by `normalize`.
- `SerrfReport` - Summary statistics produced by a normalization run.
Enums§
- `SampleType` - Classification of a sample's role in the experiment.
- `SerrfError` - Errors that can be produced by serrf operations.
- `SerrfWarning` - A warning emitted during normalization.
- `SkipReason` - Reason a compound was not normalized.
Functions§
- `improvement_ratio` - Computes before / after. Values > 1.0 mean improvement. Returns `None` when either side is missing or abs(after) < epsilon.
- `load_input` - Load and join an intensity CSV and a metadata CSV into a `SerrfInput`.
- `median` - Median of finite values. Returns `None` when there are no finite values.
- `normalize` - Normalize an intensity matrix with Systematic Error Removal using Random Forests (SERRF).
- `normalize_with_callback` - Like `normalize`, but invokes `on_compound` once after each compound is processed.
- `parse_intensities` - Parse an intensity CSV into an `IntensityMatrix`.
- `parse_metadata` - Parse a metadata CSV into a list of `SampleMeta` entries.
- `rsd` - RSD = sample_stddev(finite values) / abs(mean(finite values)). Uses n-1 (Bessel's correction). Returns `None` when there are fewer than 2 finite values, the mean is non-finite, or abs(mean) < epsilon.
- `write_intensities` - Write an `IntensityMatrix` as a CSV.
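
The definitions given for `median`, `rsd`, and `improvement_ratio` are precise enough to sketch in a few lines. The code below is an independent illustration of those rules, not the crate's implementation. The function names, the signatures, the `EPSILON` value, and the choice to average the two middle values for an even-length median are all assumptions.

```rust
/// Assumed tolerance; the crate's actual epsilon is not stated in these docs.
const EPSILON: f64 = 1e-12;

/// Keep only finite values (drops NaN and infinities).
fn finite(values: &[f64]) -> Vec<f64> {
    values.iter().copied().filter(|v| v.is_finite()).collect()
}

/// Median of the finite values; `None` when there are none.
/// Even-length input averages the two middle values (an assumption).
fn median_sketch(values: &[f64]) -> Option<f64> {
    let mut v = finite(values);
    if v.is_empty() {
        return None;
    }
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    })
}

/// Relative standard deviation: sample standard deviation (n-1) of the
/// finite values divided by the absolute value of their mean.
fn rsd_sketch(values: &[f64]) -> Option<f64> {
    let v = finite(values);
    if v.len() < 2 {
        return None;
    }
    let n = v.len() as f64;
    let mean = v.iter().sum::<f64>() / n;
    if !mean.is_finite() || mean.abs() < EPSILON {
        return None;
    }
    // Bessel's correction: divide the squared deviations by n - 1.
    let variance = v.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some(variance.sqrt() / mean.abs())
}

/// before / after; `None` when either side is missing or `after` is near zero.
fn improvement_ratio_sketch(before: Option<f64>, after: Option<f64>) -> Option<f64> {
    let (b, a) = (before?, after?);
    if a.abs() < EPSILON {
        return None;
    }
    Some(b / a)
}
```

For example, `rsd_sketch(&[1.0, f64::NAN, 2.0, 3.0])` ignores the NaN, finds a mean of 2.0 and a sample standard deviation of 1.0, and returns `Some(0.5)`. A QC RSD that drops from 0.5 to 0.25 gives an improvement ratio of 2.0.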