Crate serrf

SERRF normalization for metabolomics data.

This crate implements Systematic Error Removal using Random Forests (SERRF), a method for correcting measurement drift and batch effects in untargeted metabolomics LC-MS datasets. A random forest is trained per compound on quality-control (QC) samples to model the relationship between systematic error and injection order, batch, and (optionally) correlated compounds. The learned model is then applied to all samples to remove the predicted drift component while preserving biological signal.
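As a rough illustration of the train-on-QC, apply-to-all pattern described above, the sketch below corrects a single compound. It deliberately swaps the random forest for an ordinary least-squares line over injection order, so it is neither SERRF itself nor this crate's implementation. The function names fit_line and correct_compound, and the choice to rescale to the QC mean, are assumptions made for the example.

```rust
/// Fit y = a + b * x by ordinary least squares and return (a, b).
/// Stand-in for the per-compound random forest; NOT what the crate does.
fn fit_line(x: &[f64], y: &[f64]) -> (f64, f64) {
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let sxy: f64 = x.iter().zip(y).map(|(xi, yi)| (xi - mean_x) * (yi - mean_y)).sum();
    let sxx: f64 = x.iter().map(|xi| (xi - mean_x).powi(2)).sum();
    let slope = if sxx > 0.0 { sxy / sxx } else { 0.0 };
    (mean_y - slope * mean_x, slope)
}

/// Hypothetical helper: learn the drift from QC injections only, then divide
/// it out of every injection and rescale to the mean QC intensity.
fn correct_compound(order: &[f64], intensity: &[f64], is_qc: &[bool]) -> Vec<f64> {
    // Collect the QC injections used for training.
    let mut qc_order = Vec::new();
    let mut qc_intensity = Vec::new();
    for k in 0..order.len() {
        if is_qc[k] {
            qc_order.push(order[k]);
            qc_intensity.push(intensity[k]);
        }
    }
    // With fewer than two QC points there is nothing to fit; return raw values.
    if qc_order.len() < 2 {
        return intensity.to_vec();
    }
    let (a, b) = fit_line(&qc_order, &qc_intensity);
    let qc_mean = qc_intensity.iter().sum::<f64>() / qc_intensity.len() as f64;
    // Apply the learned drift model to all injections, QC or not.
    order
        .iter()
        .zip(intensity)
        .map(|(&o, &i)| i / (a + b * o) * qc_mean)
        .collect()
}
```

With QC intensities that rise linearly across the run, the corrected QC values collapse onto their mean, and biological samples that shared the same multiplicative drift recover a common level.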

§Quick start

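use serrf::{load_input, normalize, SerrfConfig};
use std::fs::File;

// Wrapped in main so that `?` has a Result to return into. Boxing the error
// assumes SerrfError implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load and join the intensity and metadata CSVs.
    let input = load_input(
        File::open("intensities.csv")?,
        File::open("metadata.csv")?,
    )?;

    // Run SERRF with the default configuration.
    let output = normalize(&input, &SerrfConfig::default())?;

    println!(
        "Normalized {}/{} compounds. Median QC RSD: {:.3} -> {:.3}",
        output.report.compounds_normalized,
        output.report.compounds_total,
        output.report.median_qc_train_rsd_before.unwrap_or(f64::NAN),
        output.report.median_qc_train_rsd_after.unwrap_or(f64::NAN),
    );
    Ok(())
}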

§Input format

Intensity CSV - one row per compound. The first column holds the compound ID; each remaining column holds one sample's intensities, with the sample ID as its header. Missing values are empty cells (stored as NaN internally).

Metadata CSV - required columns: sample_id, sample_type (qc, sample, or validate), batch, injection_order. Sample IDs must match intensity columns exactly.
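To make the two layouts concrete, here is a small made-up pair of files embedded as strings, with a hypothetical check that their sample IDs line up. The compound and sample names, the compound_id header, the numeric batch labels, and the metadata column order are all assumptions for illustration. The helper functions are not part of the crate, and the naive comma split ignores CSV quoting.

```rust
use std::collections::HashSet;

// Illustrative intensity CSV: one row per compound, one column per sample.
// The empty cell in the first data row is a missing value.
const INTENSITIES_CSV: &str = "\
compound_id,QC_01,S_01,S_02,QC_02
cmpd_0001,1520.4,980.1,,1498.7
cmpd_0002,310.0,275.5,290.2,305.9
";

// Illustrative metadata CSV with the four required columns.
const METADATA_CSV: &str = "\
sample_id,sample_type,batch,injection_order
QC_01,qc,1,1
S_01,sample,1,2
S_02,sample,1,3
QC_02,qc,1,4
";

/// Sample IDs taken from the intensity header (every column after the first).
fn intensity_sample_ids(csv: &str) -> Vec<&str> {
    csv.lines().next().unwrap_or("").split(',').skip(1).collect()
}

/// Sample IDs taken from the first field of each metadata row (header skipped).
fn metadata_sample_ids(csv: &str) -> Vec<&str> {
    csv.lines()
        .skip(1)
        .filter_map(|line| line.split(',').next())
        .collect()
}

/// True when both files name exactly the same set of samples.
fn ids_match(intensities: &str, metadata: &str) -> bool {
    let a: HashSet<&str> = intensity_sample_ids(intensities).into_iter().collect();
    let b: HashSet<&str> = metadata_sample_ids(metadata).into_iter().collect();
    a == b
}
```

Running a check like this before calling load_input can surface ID mismatches early, though the crate may report them through its own error type.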

§Configuration

SerrfConfig controls all tunable parameters. The defaults are reasonable starting points; the most impactful options are:

| Field | Effect |
|---|---|
| n_trees | Forest size; more trees = more stable, slower |
| log_space | Train on log-transformed intensities; recommended for log-normal data |
| use_correlated_compounds | Add other compound intensities as features; increases runtime |
| qc_validation_fraction | Fraction of QC samples held out to measure normalization quality |
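A common way to adjust a few options while keeping the remaining defaults is Rust's struct update syntax. The sketch below uses a local stand-in for SerrfConfig so that it compiles on its own. The field names come from the table above, but the field types, their visibility, and the default values shown are placeholders, not the crate's actual definitions. fast_exploratory_config is a hypothetical helper.

```rust
// Stand-in for serrf::SerrfConfig. Types and defaults are ASSUMED for the
// example; consult the SerrfConfig struct page for the real ones.
#[derive(Debug, Clone, PartialEq)]
struct SerrfConfig {
    n_trees: usize,
    log_space: bool,
    use_correlated_compounds: bool,
    qc_validation_fraction: f64,
}

impl Default for SerrfConfig {
    fn default() -> Self {
        // Placeholder values, not the crate's defaults.
        SerrfConfig {
            n_trees: 500,
            log_space: true,
            use_correlated_compounds: true,
            qc_validation_fraction: 0.2,
        }
    }
}

/// Hypothetical preset: fewer trees and no correlated-compound features,
/// trading some stability for a quicker exploratory run.
fn fast_exploratory_config() -> SerrfConfig {
    SerrfConfig {
        n_trees: 100,
        use_correlated_compounds: false,
        // Every field not named above keeps its default value.
        ..SerrfConfig::default()
    }
}
```

If the real fields are public, the same pattern should carry over by importing serrf::SerrfConfig in place of the stand-in.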

§Output

normalize returns a SerrfOutput containing:

  • intensities - normalized intensity matrix (same shape as input; skipped compounds carry raw values)
  • report - SerrfReport with per-compound statistics, RSD improvement ratios, and any warnings

§Structs

CompoundReport
Per-compound normalization outcome and quality metrics.
IntensityMatrix
Intensity matrix with compound and sample identifiers.
SampleMeta
Per-sample metadata required for normalization.
SerrfConfig
Algorithm configuration passed to normalize.
SerrfInput
Combined intensity matrix and sample metadata.
SerrfOutput
Normalized intensities and a summary report produced by normalize.
SerrfReport
Summary statistics produced by a normalization run.

§Enums

SampleType
Classification of a sample’s role in the experiment.
SerrfError
Errors that can be produced by serrf operations.
SerrfWarning
A warning emitted during normalization.
SkipReason
Reason a compound was not normalized.

§Functions

improvement_ratio
improvement_ratio = before / after. Values > 1.0 mean improvement. Returns None when either side is missing or abs(after) < epsilon.
load_input
Load and join an intensity CSV and a metadata CSV into a SerrfInput.
median
Median of finite values. Returns None when no finite values.
normalize
Normalize an intensity matrix with Systematic Error Removal using Random Forests (SERRF).
normalize_with_callback
Like normalize, but invokes on_compound once after each compound is processed.
parse_intensities
Parse an intensity CSV into an IntensityMatrix.
parse_metadata
Parse a metadata CSV into a list of SampleMeta entries.
rsd
RSD = sample_stddev(finite values) / abs(mean(finite values)). Uses n-1 (Bessel’s correction). Returns None when fewer than 2 finite values, mean is non-finite, or abs(mean) < epsilon.
write_intensities
Write an IntensityMatrix as a CSV.
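
The one-line contracts given for median, rsd, and improvement_ratio can be read as the following sketch. It is reconstructed from the descriptions above only, not taken from the crate's source. The signatures, the epsilon value, and the averaging of the two middle values for an even count are assumptions.

```rust
// Assumed threshold; the crate's actual epsilon is not stated in the docs.
const EPS: f64 = 1e-12;

/// Median of the finite values; None when there are none.
/// Averaging the two middle values for an even count is an assumption.
fn median(values: &[f64]) -> Option<f64> {
    let mut v: Vec<f64> = values.iter().copied().filter(|x| x.is_finite()).collect();
    if v.is_empty() {
        return None;
    }
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 1 { v[mid] } else { (v[mid - 1] + v[mid]) / 2.0 })
}

/// Relative standard deviation: sample stddev (n - 1) over abs(mean),
/// computed on finite values only.
fn rsd(values: &[f64]) -> Option<f64> {
    let v: Vec<f64> = values.iter().copied().filter(|x| x.is_finite()).collect();
    if v.len() < 2 {
        return None;
    }
    let n = v.len() as f64;
    let mean = v.iter().sum::<f64>() / n;
    if !mean.is_finite() || mean.abs() < EPS {
        return None;
    }
    // Bessel's correction: divide the squared deviations by n - 1.
    let variance = v.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some(variance.sqrt() / mean.abs())
}

/// before / after; values above 1.0 mean the RSD went down.
fn improvement_ratio(before: Option<f64>, after: Option<f64>) -> Option<f64> {
    let (b, a) = (before?, after?);
    if a.abs() < EPS {
        return None;
    }
    Some(b / a)
}
```

For instance, a QC RSD that drops from 0.5 to 0.25 gives an improvement ratio of 2.0, while a missing or near-zero after value gives None rather than a division by zero.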