Crate serrf

SERRF normalization for metabolomics data.

This crate implements Systematic Error Removal using Random Forests (SERRF), a method for correcting measurement drift and batch effects in untargeted metabolomics LC-MS datasets. A random forest is trained per compound on quality-control (QC) samples to model the relationship between systematic error and injection order, batch, and (optionally) correlated compounds. The learned model is then applied to all samples to remove the predicted drift component while preserving biological signal.

§Quick start

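use serrf::{load_input, normalize, SerrfConfig};
use std::fs::File;

// Assumption: the crate's error type can be boxed as `dyn std::error::Error`,
// so `?` works for both I/O and crate errors.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load and join the intensity matrix with the per-sample metadata.
    let input = load_input(
        File::open("intensities.csv")?,
        File::open("metadata.csv")?,
    )?;

    // Run SERRF with the default configuration.
    let output = normalize(&input, &SerrfConfig::default())?;

    // Median QC RSD values are optional; print NaN when they are unavailable.
    println!(
        "Normalized {}/{} compounds. Median QC RSD: {:.3} -> {:.3}",
        output.report.compounds_normalized,
        output.report.compounds_total,
        output.report.median_qc_train_rsd_before.unwrap_or(f64::NAN),
        output.report.median_qc_train_rsd_after.unwrap_or(f64::NAN),
    );

    Ok(())
}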

§Input format

Intensity CSV - first column is compound ID; remaining columns are sample IDs. Missing values are empty cells (stored as NaN internally).

Metadata CSV - required columns: sample_id, sample_type (qc, sample, or validate), batch, injection_order. Sample IDs must match intensity columns exactly.
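As an illustration of the layout described above, the following standalone sketch builds two small CSV strings and checks that the metadata sample IDs match the intensity columns. It does not use the crate's own parsers. The sample data, the compound_id header name, the batch values, and the helper names (intensity_sample_ids, metadata_sample_ids, parse_cell) are all hypothetical, and the naive comma splitting assumes no quoted fields.

```rust
use std::num::ParseFloatError;

// Hypothetical example data; the first header cell name is an assumption.
const INTENSITIES: &str = "compound_id,QC1,S1,QC2\n\
C001,1000.5,,980.2\n\
C002,250.0,310.4,245.1\n";

const METADATA: &str = "sample_id,sample_type,batch,injection_order\n\
QC1,qc,1,1\n\
S1,sample,1,2\n\
QC2,qc,1,3\n";

/// Sample IDs from the intensity header row (every column after the first).
fn intensity_sample_ids(intensity_csv: &str) -> Vec<String> {
    intensity_csv
        .lines()
        .next()
        .map(|header| header.split(',').skip(1).map(|s| s.trim().to_string()).collect())
        .unwrap_or_default()
}

/// Sample IDs from the `sample_id` column of the metadata CSV.
/// Returns `None` if the column is absent or a row is too short.
fn metadata_sample_ids(metadata_csv: &str) -> Option<Vec<String>> {
    let mut lines = metadata_csv.lines();
    let header = lines.next()?;
    let idx = header.split(',').position(|c| c.trim() == "sample_id")?;
    lines
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.split(',').nth(idx).map(|s| s.trim().to_string()))
        .collect()
}

/// An empty cell is a missing value and becomes NaN; anything else must parse as a float.
fn parse_cell(cell: &str) -> Result<f64, ParseFloatError> {
    let cell = cell.trim();
    if cell.is_empty() {
        Ok(f64::NAN)
    } else {
        cell.parse::<f64>()
    }
}

fn main() {
    let from_intensities = intensity_sample_ids(INTENSITIES);
    let from_metadata = metadata_sample_ids(METADATA).expect("sample_id column present");
    println!("IDs match exactly: {}", from_intensities == from_metadata);

    // Second line is compound C001; its S1 cell is empty.
    let row: Vec<&str> = INTENSITIES.lines().nth(1).unwrap().split(',').collect();
    let s1 = parse_cell(row[2]).unwrap();
    println!("C001 / S1 is missing: {}", s1.is_nan());
}
```

This sketch compares IDs in column order for simplicity; whether the crate also requires the same ordering in both files is not stated here.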

§Configuration

SerrfConfig controls all tunable parameters. The defaults are reasonable starting points; the most impactful options are:

Field	Effect
`n_trees`	Forest size; more trees = more stable, slower
`log_space`	Train on log-transformed intensities; recommended for log-normal data
`use_correlated_compounds`	Add other compound intensities as features; increases runtime
`qc_validation_fraction`	Fraction of QC samples held out to measure normalization quality

§Output

normalize returns a SerrfOutput containing:

intensities - normalized intensity matrix (same shape as input; skipped compounds carry raw values)
report - SerrfReport with per-compound statistics, RSD improvement ratios, and any warnings

Structs§

CompoundReport: Per-compound normalization outcome and quality metrics.
IntensityMatrix: Intensity matrix with compound and sample identifiers.
SampleMeta: Per-sample metadata required for normalization.
SerrfConfig: Algorithm configuration passed to normalize.
SerrfInput: Combined intensity matrix and sample metadata.
SerrfOutput: Normalized intensities and a summary report produced by normalize.
SerrfReport: Summary statistics produced by a normalization run.

Enums§

SampleType: Classification of a sample’s role in the experiment.
SerrfError: Errors that can be produced by serrf operations.
SerrfWarning: A warning emitted during normalization.
SkipReason: Reason a compound was not normalized.

Functions§

improvement_ratio: improvement_ratio = before / after. Values > 1.0 mean improvement. Returns None when either side is missing or abs(after) < epsilon.
load_input: Load and join an intensity CSV and a metadata CSV into a SerrfInput.
median: Median of finite values. Returns None when no finite values.
normalize: Normalize an intensity matrix with Systematic Error Removal using Random Forests (SERRF).
normalize_with_callback: Like normalize, but invokes on_compound once after each compound is processed.
parse_intensities: Parse an intensity CSV into an IntensityMatrix.
parse_metadata: Parse a metadata CSV into a list of SampleMeta entries.
rsd: RSD = sample_stddev(finite values) / abs(mean(finite values)). Uses n-1 (Bessel’s correction). Returns None when fewer than 2 finite values, mean is non-finite, or abs(mean) < epsilon.
write_intensities: Write an IntensityMatrix as a CSV.
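
The definitions of median, rsd, and improvement_ratio above can be sketched as a standalone reimplementation. This is not the crate's source: the signatures, the EPSILON value, and the choice to average the two middle values for an even-length median are assumptions made for illustration.

```rust
/// Threshold below which a denominator is treated as zero.
/// Assumption: the crate's actual epsilon is not documented here.
const EPSILON: f64 = 1e-12;

/// Median of the finite values; `None` when there are no finite values.
fn median(values: &[f64]) -> Option<f64> {
    let mut finite: Vec<f64> = values.iter().copied().filter(|v| v.is_finite()).collect();
    if finite.is_empty() {
        return None;
    }
    finite.sort_by(|a, b| a.total_cmp(b));
    let mid = finite.len() / 2;
    if finite.len() % 2 == 1 {
        Some(finite[mid])
    } else {
        // Assumption: even-length input averages the two middle values.
        Some((finite[mid - 1] + finite[mid]) / 2.0)
    }
}

/// RSD = sample stddev (n - 1 denominator) / |mean|, over finite values only.
fn rsd(values: &[f64]) -> Option<f64> {
    let finite: Vec<f64> = values.iter().copied().filter(|v| v.is_finite()).collect();
    if finite.len() < 2 {
        return None;
    }
    let n = finite.len() as f64;
    let mean = finite.iter().sum::<f64>() / n;
    if !mean.is_finite() || mean.abs() < EPSILON {
        return None;
    }
    let variance = finite.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some(variance.sqrt() / mean.abs())
}

/// before / after; values above 1.0 mean the RSD dropped after normalization.
fn improvement_ratio(before: Option<f64>, after: Option<f64>) -> Option<f64> {
    match (before, after) {
        (Some(b), Some(a)) if a.abs() >= EPSILON => Some(b / a),
        _ => None,
    }
}

fn main() {
    // Hypothetical QC intensities for one compound; NaN marks a missing value.
    let before = [100.0, 110.0, f64::NAN, 90.0];
    let after = [100.0, 105.0, f64::NAN, 95.0];

    println!("median before = {:?}", median(&before));
    println!("rsd before    = {:?}", rsd(&before));
    println!("rsd after     = {:?}", rsd(&after));
    println!("improvement   = {:?}", improvement_ratio(rsd(&before), rsd(&after)));
}
```

Returning Option rather than NaN keeps "not computable" distinct from a real value, which is why the quick start calls unwrap_or(f64::NAN) only at the point of printing.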
