gam 0.1.14

Generalized penalized likelihood engine
docs.rs failed to build gam-0.1.14
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
Visit the last successful build: gam-0.1.11

gam

gam is a formula-first CLI and Rust engine for penalized regression models.

The current CLI supports:

  • Standard mean models with penalized linear terms, random effects, and smooths
  • A surfaced location-scale fitting path via --predict-noise
  • Survival models via Surv(entry, exit, event)
  • An advanced Bernoulli marginal-slope workflow via --logslope-formula and --z-column
  • Prediction, HTML reports, ALO diagnostics, posterior sampling, and synthetic-data generation

The CLI is the primary interface. The Rust modules exported by the crate are used internally and can change without compatibility guarantees.

Requirements

  • Rust 1.93+ for local source builds
  • CSV input data with a header row
  • A shell that supports quoted formulas (bash/zsh examples below)

Install

Prebuilt binary

macOS, Linux, and Windows Git Bash:

curl -fsSL https://raw.githubusercontent.com/SauersML/gam/main/install.sh | bash

Build from source

cargo build --release --bin gam
./target/release/gam --help

If gam is not on your PATH, use ./target/release/gam in the examples below.

Command Overview

Command Purpose Required arguments Output
gam fit Fit a model <DATA> <FORMULA> No file unless --out is provided
gam predict Score new data <MODEL> <NEW_DATA> plus --out Prediction CSV
gam report Build a standalone HTML report <MODEL> [DATA] [OUT] [OUT] or <model-stem>.report.html in the current directory
gam diagnose Run terminal diagnostics <MODEL> <DATA> Prints a diagnostics table
gam sample Draw posterior samples <MODEL> <DATA> posterior_samples.csv by default, plus a summary CSV
gam generate Sample synthetic outcomes conditional on input rows <MODEL> <DATA> synthetic.csv by default

Aliases:

  • gam train -> gam fit
  • gam simulate -> gam generate

Inspect full options with:

gam <command> --help

Verified Quickstart

These commands were checked against the current binary and the checked-in lidar dataset.

# 1) Fit a Gaussian GAM
gam fit bench/datasets/lidar.csv \
  'logratio ~ smooth(range)' \
  --out lidar.model.json

# 2) Predict with uncertainty
gam predict lidar.model.json bench/datasets/lidar.csv \
  --out lidar.pred.csv --uncertainty

# 3) Build an HTML report
gam report lidar.model.json bench/datasets/lidar.csv
# writes: lidar.model.report.html

# 4) Generate synthetic response draws
gam generate lidar.model.json bench/datasets/lidar.csv \
  --n-draws 3 \
  --out lidar.synthetic.csv

Formula Language

gam fit expects:

response ~ term + term + ...

Response forms:

  • Standard regression/classification: y
  • Survival: Surv(entry, exit, event)

Important constraints:

  • Surv(...) currently requires exactly three columns
  • Intercept removal (0 or -1) is not supported
  • At most one link(...), one linkwiggle(...), one timewiggle(...), and one survmodel(...) may appear in a formula

Bare RHS terms

A bare column on the right-hand side is interpreted from the training schema:

  • Continuous or binary column: penalized linear term
  • Categorical column: random-effect block

Term wrappers

Linear and constrained coefficients:

  • linear(x)
  • linear(x, min=..., max=...)
  • constrain(x, min=..., max=...)
  • nonnegative(x) / nonnegative_coef(x)
  • nonpositive(x) / nonpositive_coef(x)
  • bounded(x, min=..., max=...)

bounded(...) also supports:

  • prior=none|uniform|log-jacobian|center
  • beta_a=..., beta_b=...
  • target=..., strength=...

Random effects:

  • group(x) or re(x)

Smooths:

  • smooth(...) or s(...)
  • thinplate(...), thin_plate(...), tps(...)
  • matern(...)
  • duchon(...)
  • tensor(...), interaction(...), te(...)

Formula-level configuration terms:

  • link(type=...)
  • linkwiggle(...)
  • timewiggle(...)
  • survmodel(spec=..., distribution=...)

Smooth defaults

  • smooth(x) with one variable defaults to a B-spline / P-spline style basis
  • smooth(x1, x2, ...) defaults to thin-plate
  • te(...) defaults to tensor-product B-splines

Notable smooth options:

  • B-spline: degree, knots, k, penalty_order
  • Thin-plate: centers or k
  • Matérn: centers or k, nu, length_scale
  • Duchon: centers or k, power, order, optional length_scale
  • Tensor: k / basis_dim for marginal basis size

Spatial smooths can use per-axis anisotropy:

  • Global CLI flag: --scale-dimensions
  • Per-term override: scale_dims=true or scale_dims=false

Fit Modes

1. Standard mean-only fits

gam fit train.csv 'y ~ age + smooth(bmi) + group(site)' --out model.json

Auto family resolution:

  • Binary {0,1} response -> binomial logit
  • Anything else -> gaussian identity
  • --predict-noise does not change that default; write link(type=probit) (or another explicit link) in the mean formula when you want a different binomial base link

2. Location-scale fits

Use a second formula for the scale/noise block:

gam fit train.csv 'y ~ x1 + smooth(x2)' \
  --predict-noise 'y ~ smooth(x1)' \
  --out locscale.model.json

If you want a probit-vs-probit comparison between mean-only and location-scale fits, declare the link explicitly in both formulas:

gam fit train.csv 'y ~ x1 + smooth(x2) + link(type=probit)' \
  --out probit.model.json

gam fit train.csv 'y ~ x1 + smooth(x2) + link(type=probit)' \
  --predict-noise 'y ~ smooth(x1)' \
  --out probit.locscale.model.json

The CLI exposes this path for Gaussian and binomial families, and for Surv(...) formulas it routes into the survival location-scale fitter. Runtime behavior is still uneven enough that you should treat it as experimental and verify it on your exact formula/data combination before relying on it.

3. Survival fits

Use Surv(entry, exit, event) on the left-hand side:

gam fit train.csv \
  'Surv(entry_time, exit_time, event) ~ age + smooth(bmi) + survmodel(spec=net, distribution=gaussian)' \
  --survival-likelihood transformation \
  --out survival.model.json

Current survival likelihood modes:

  • transformation
  • weibull
  • location-scale

Distributional survival fits can use a second formula for log-sigma:

gam fit train.csv \
  'Surv(entry_time, exit_time, event) ~ age + smooth(bmi) + survmodel(spec=net, distribution=gaussian)' \
  --predict-noise 'Surv(entry_time, exit_time, event) ~ smooth(age)' \
  --out survival-ls.model.json

When --predict-noise is present on a Surv(...) formula, the CLI uses the survival location-scale fit path.

Current survival-specific formula/config support:

  • survmodel(spec=net, distribution=...)
  • timewiggle(...)
  • link(...)
  • linkwiggle(...) only in supported survival sub-modes

4. Bernoulli marginal-slope fits

This is an advanced binary-response mode that adds a second formula for the log-slope surface plus an auxiliary standardized score column:

gam fit scores.csv \
  'case ~ smooth(age) + matern(pc1, pc2, pc3)' \
  --logslope-formula 'case ~ matern(pc1, pc2, pc3)' \
  --z-column prs_z \
  --out marginal.model.json

Current restrictions:

  • Response must be binary {0,1}
  • --predict-noise is not allowed
  • --firth is not allowed
  • link(...) and linkwiggle(...) are not allowed in this family or in --logslope-formula

Link Functions

Links are configured in-formula via link(type=...).

Supported type values:

  • identity
  • logit
  • probit
  • cloglog
  • sas
  • beta-logistic
  • blended(a,b,...) / mixture(a,b,...)
  • flexible(<single-link>)
  • flexible(blended(...))

Advanced link parameters:

  • rho= for blended/mixture links
  • sas_init="epsilon,log_delta"
  • beta_logistic_init="epsilon,delta"

Output and Data Semantics

Saved models

  • gam fit writes nothing unless --out is provided
  • Saved model JSON includes training schema and header metadata
  • Prediction-like commands reload new data using that saved schema
  • If a model predates current metadata requirements, refit it with the current CLI

Prediction CSV schema

Standard and Bernoulli marginal-slope models:

  • default: eta,mean
  • with --uncertainty: eta,mean,effective_se,mean_lower,mean_upper

Gaussian location-scale models, when the fit path succeeds:

  • default: eta,mean,sigma
  • with --uncertainty: eta,mean,sigma,mean_lower,mean_upper

Survival models:

  • default: eta,mean,survival_prob,risk_score,failure_prob
  • with --uncertainty: eta,mean,survival_prob,risk_score,failure_prob,effective_se,mean_lower,mean_upper

Notes:

  • In survival output, mean is the same quantity as survival_prob
  • risk_score is risk-oriented and currently tracks the linear predictor direction
  • effective_se is estimator uncertainty, not observation noise

Sampling output

gam sample writes:

  • Raw draws CSV with columns beta_0, beta_1, ...
  • A second summary CSV at <out with extension summary.csv>

Defaults when --out is omitted:

  • Draws: posterior_samples.csv
  • Summary: posterior_samples.summary.csv

Current sampling support:

  • Standard models
  • Survival models on the non-location-scale path

Not currently available for:

  • Gaussian location-scale models
  • Binomial location-scale models
  • Bernoulli marginal-slope models

Synthetic generation output

gam generate writes a numeric matrix:

  • One row per sampled dataset
  • One column per conditioning-data row
  • Column names are draw_0, draw_1, ... indexed by input row position

Defaults when --out is omitted:

  • synthetic.csv

Not currently available for:

  • Survival models
  • Bernoulli marginal-slope models

Report output

gam report <MODEL> [DATA] [OUT] writes:

  • [OUT] if provided
  • Otherwise <model-stem>.report.html in the current working directory

The report is standalone HTML. With data input it includes data-dependent diagnostics; without data input those sections are omitted.

Schema compatibility

Prediction, reporting, sampling, and generation expect the new data to match the saved training schema:

  • Column names must match
  • Column types must match
  • Unseen categorical levels are treated as errors

Current CLI Limitations

  • diagnose currently only exposes --alo
  • diagnose --alo is not supported for models containing bounded(...) coefficients
  • --predict-noise is exposed in the CLI, but current Gaussian, binomial, and survival location-scale fits still have rough edges; verify behavior on your exact workload before depending on that path
  • linkwiggle(...) belongs in the mean formula, not --predict-noise
  • Flexible links are only supported in specific binomial and survival paths
  • Some benchmark datasets in bench/datasets/ are meant for harness scenarios rather than copy-paste README demos

Development

Common local checks:

cargo fmt --all
cargo clippy --all-targets --all-features -- -A warnings -D clippy::correctness -D clippy::suspicious
cargo test --all-features -- --nocapture

Benchmark harness:

python3 bench/run_suite.py --help
python3 bench/run_suite.py

Repository layout:

  • src/: CLI, model code, fitting/inference, smooth construction, and survival machinery
  • bench/: benchmark harness, scenario configs, datasets, and comparison tooling
  • tests/: Rust integration tests plus benchmark helper tools

Lean checks for the Rust-matched .lean files under src/:

./scripts/lean-check-all.sh