# gam
`gam` is a formula-first CLI and Rust engine for generalized additive models.
It fits Gaussian, binomial, Poisson, and Gamma GLMs with smooth terms, random effects, location-scale extensions, survival likelihoods, and flexible/learnable link functions. Smoothing parameters are selected by REML or LAML. Posterior sampling uses NUTS.
Please open an issue if something doesn't work as expected, if you'd like a new feature, or if you have questions.
## What's different
- **Three-part penalty structure.** Each smooth gets separate penalties for magnitude, gradient, and curvature. Most GAM libraries use one (curvature only) or two (curvature + combined magnitude/gradient). Because the three penalties are separate, the smoother can distinguish a flat-but-offset function (large magnitude, no gradient or curvature) from a wiggly one.
- **Flexible link functions.** A spline offset from a base link (e.g. probit) lets the data correct for link misspecification while encoding the belief that the base link is approximately right. Marginal-slope models use a calibrated de-nested probit transport kernel for score-warp/link-deviation terms, not an exact nested link composition or post-hoc calibration. The same mechanism applies to survival time basis functions.
- **Surface smooths.** Thin-plate splines, Duchon radial bases with triple operator regularization, and Matérn covariance-based smooths in arbitrary dimension, with automatic knot placement.
- **Adaptive anisotropy.** Per-axis spatial anisotropy (`--scale-dimensions`) lets the model shrink or stretch each feature axis independently within a single joint smooth, instead of assuming isotropic smoothness. Matérn and hybrid Duchon optimize a global scale plus per-axis contrasts; pure Duchon optimizes the per-axis contrasts directly without introducing a global length scale.
- **Composable basis/kernel.** You can combine the kernel of one spline family with the length-scale behavior of another (e.g. Duchon kernel with Matérn-style global kappa scaling).
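To make the three-part penalty idea concrete, here is a minimal sketch that evaluates discretized magnitude, gradient, and curvature penalties on a grid. It is an illustration only, not `gam`'s implementation: the function name `penalties`, the finite-difference discretization, and the test functions are all assumptions made for this example.

```python
import math


def penalties(f, h):
    """Discretized magnitude, gradient, and curvature penalties.

    f: function values on an evenly spaced grid; h: grid spacing.
    Hypothetical helper for illustration, not part of gam.
    """
    d1 = [(f[i + 1] - f[i]) / h for i in range(len(f) - 1)]
    d2 = [(f[i + 1] - 2 * f[i] + f[i - 1]) / h**2 for i in range(1, len(f) - 1)]
    magnitude = h * sum(v * v for v in f)    # approximates the integral of f^2
    gradient = h * sum(v * v for v in d1)    # approximates the integral of f'^2
    curvature = h * sum(v * v for v in d2)   # approximates the integral of f''^2
    return magnitude, gradient, curvature


n = 101
h = 1.0 / (n - 1)
grid = [i * h for i in range(n)]

flat_offset = [2.0 for _ in grid]                    # constant, far from zero
line = [x for x in grid]                             # constant slope
wiggly = [math.sin(6 * math.pi * x) for x in grid]   # oscillating

for name, f in [("flat_offset", flat_offset), ("line", line), ("wiggly", wiggly)]:
    m, g, c = penalties(f, h)
    print(f"{name:12s} magnitude={m:.3f} gradient={g:.3f} curvature={c:.3f}")
```

The flat-but-offset function is charged only by the magnitude penalty, the straight line by magnitude and gradient (its curvature is zero up to rounding), and the wiggly function by all three. A curvature-only penalty would score the first two identically.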
## Install
### Prebuilt binary
macOS, Linux, and Windows Git Bash:
```bash
curl -fsSL https://raw.githubusercontent.com/SauersML/gam/main/install.sh | bash
```
### Build from source
Requires [Rust](https://rustup.rs/).
```bash
git clone https://github.com/SauersML/gam.git
cd gam
cargo build --release
```
The binary is at `./target/release/gam`. Add it to your `PATH` or use the full path in the examples below.
### Python package
The repo includes a mixed Rust/Python package built with PyO3 and maturin.
```bash
uv venv
source .venv/bin/activate
uv pip install maturin
maturin develop --manifest-path crates/gam-pyffi/Cargo.toml
python -c "import gamfit; print(gamfit.build_info())"
```
Formula-first usage:
```python
import gamfit
train = [
    {"y": 1.0, "x": 0.0},
    {"y": 2.0, "x": 1.0},
    {"y": 3.0, "x": 2.0},
]
model = gamfit.fit(train, "y ~ x")
pred = model.predict([{"x": 1.5}, {"x": 2.5}], interval=0.95)
summary = model.summary()
check = model.check([{"x": 1.5}])
diagnostics = model.diagnose(train)
gamfit.validate_formula(train, "y ~ x")
model.plot(train, kind="prediction")
html = model.report()
model.save("linear.gam")
```
scikit-learn usage:
```python
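from gamfit.sklearn import GAMRegressor

# Same toy rows as in the formula-first example
train = [{"y": 1.0, "x": 0.0}, {"y": 2.0, "x": 1.0}, {"y": 3.0, "x": 2.0}]
est = GAMRegressor(formula="y ~ x")
est.fit(train)
pred = est.predict([{"x": 1.5}, {"x": 2.5}])
```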
The native extension is `gamfit._rust`, while the public Python API lives under `gamfit/`.
## Quick start
```bash
# Fit a GAM with a smooth term
gam fit data.csv 'y ~ smooth(x)' --out model.json
# Predict with uncertainty intervals
gam predict model.json new_data.csv --out predictions.csv --uncertainty
# Build a standalone HTML report
gam report model.json data.csv
# Draw posterior samples
gam sample model.json data.csv --out samples.csv
# Generate synthetic response draws
gam generate model.json data.csv --n-draws 5 --out synthetic.csv
```
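If you have no dataset at hand, the following sketch writes a toy `data.csv` that matches the `y ~ smooth(x)` formula used above. It assumes that `gam` reads a CSV with a header row whose column names match the formula variables; the helper name `write_toy_csv` and the sine-plus-noise signal are made up for this example.

```python
import csv
import math
import random


def write_toy_csv(path, n=200, seed=0):
    """Write n rows of (x, y) with y = sin(x) + Gaussian noise; return the rows."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    rows = []
    for i in range(n):
        x = 10.0 * i / (n - 1)
        y = math.sin(x) + rng.gauss(0.0, 0.2)
        rows.append({"x": x, "y": y})
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["x", "y"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


rows = write_toy_csv("data.csv")
print(f"wrote {len(rows)} rows to data.csv")
```

After running it, `gam fit data.csv 'y ~ smooth(x)' --out model.json` from the block above has something to fit.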
## Commands
| Command | What it does | Usage |
| --- | --- | --- |
| `fit` | Fit a model | `gam fit <DATA> <FORMULA> [--out model.json]` |
| `predict` | Score new data | `gam predict <MODEL> <DATA> --out predictions.csv` |
| `report` | Standalone HTML report | `gam report <MODEL> [DATA] [OUT]` |
| `diagnose` | Terminal diagnostics | `gam diagnose <MODEL> <DATA>` |
| `sample` | Posterior draws (NUTS) | `gam sample <MODEL> <DATA> [--out samples.csv]` |
| `generate` | Synthetic outcomes | `gam generate <MODEL> <DATA> [--out synthetic.csv]` |
`train` is an alias for `fit`. `simulate` is an alias for `generate`.
Run `gam <command> --help` for full options.
## Formula language
```
response ~ term + term + ...
```
### Response
- Continuous, binary, count, or positive continuous: `y`
- Survival (interval-censored): `Surv(entry_time, exit_time, event)`
### Terms
**Linear and constrained coefficients:**
| Syntax | Effect |
| --- | --- |
| `x` or `linear(x)` | Penalized linear term |
| `linear(x, min=0)` | Non-negative coefficient |
| `linear(x, min=..., max=...)` | Box-constrained coefficient |
| `nonnegative(x)` | Sugar for `linear(x, min=0)` |
| `nonpositive(x)` | Sugar for `linear(x, max=0)` |
| `bounded(x, min=0, max=1)` | Exact interval transform (no ridge) |
| `bounded(x, ..., prior=uniform)` | Flat prior on bounded scale |
| `bounded(x, ..., target=0.5, strength=3)` | Informative interior prior |
**Random effects:**
| Syntax | Effect |
| --- | --- |
| `group(id)` or `re(id)` | Random intercept per level of `id` |
**Smooths:**
| Syntax | Default basis |
| --- | --- |
| `smooth(x)` or `s(x)` | P-spline (B-spline + difference penalty) |
| `smooth(x1, x2)` | Thin-plate spline |
| `thinplate(x1, x2)` or `tps(x1, x2)` | Thin-plate spline |
| `matern(x1, x2, ...)` | Matérn covariance smooth |
| `duchon(x1, x2, ...)` | Duchon radial basis with triple operator regularization (scale-free) |
| `tensor(x, z)` or `te(x, z)` | Tensor-product B-splines |
Common smooth options: `knots=`, `k=`, `centers=`, `degree=`, `penalty_order=`, `type=ps|tps|matern|duchon`. `double_penalty=true|false` applies to P-spline, thin-plate, tensor, and Matérn smooths; Duchon smooths use mass, tension, and stiffness operator penalties.
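For readers new to P-splines, the "difference penalty" in the table above is, in the standard Eilers and Marx construction, a quadratic form built from finite differences of adjacent B-spline coefficients. The sketch below builds that textbook penalty matrix in plain Python. How `penalty_order=` maps onto it inside `gam` is an assumption, and the helper names are hypothetical.

```python
def difference_matrix(k, order):
    """Matrix whose rows take order-th differences of k coefficients."""
    D = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(order):
        # Each pass replaces the rows with differences of consecutive rows.
        D = [[D[i + 1][j] - D[i][j] for j in range(k)] for i in range(len(D) - 1)]
    return D


def penalty_matrix(k, order):
    """S = D^T D, so that beta^T S beta is the sum of squared differences."""
    D = difference_matrix(k, order)
    return [[sum(row[a] * row[b] for row in D) for b in range(k)] for a in range(k)]


def quad_form(S, beta):
    k = len(beta)
    return sum(beta[a] * S[a][b] * beta[b] for a in range(k) for b in range(k))


S = penalty_matrix(5, order=2)
print(difference_matrix(5, 2)[0])              # first second-difference row
print(quad_form(S, [0.0, 1.0, 2.0, 3.0, 4.0]))  # linear coefficients
print(quad_form(S, [0.0, 1.0, 0.0, 1.0, 0.0]))  # zig-zag coefficients
```

A second-order penalty leaves linearly spaced coefficients unpenalized and charges only departures from a straight line, which is why a higher order leaves a larger unpenalized space.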
Spatial smooths support per-axis anisotropy via `scale_dims=true` or the global `--scale-dimensions` flag. For pure Duchon this stays scale-free: the optimizer updates only centered per-axis shape contrasts, not a scalar `length_scale`.
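One way to picture "centered per-axis contrasts" is as log-scale stretch factors that sum to zero, so they change the shape of the distance metric without changing its overall scale. The sketch below works under that assumption. The parameterization, the function names, and the optional `log_global_scale` argument are hypothetical and are not taken from `gam`'s source.

```python
import math


def center(contrasts):
    """Subtract the mean so the contrasts sum to zero."""
    mean = sum(contrasts) / len(contrasts)
    return [c - mean for c in contrasts]


def aniso_distance(x, c, contrasts, log_global_scale=0.0):
    """Distance between x and c after per-axis stretching.

    Axis a is multiplied by exp(log_global_scale + psi_a), where psi is the
    centered contrast vector. With log_global_scale = 0 the product of the
    axis multipliers is 1, so only the shape of the metric changes.
    """
    psi = center(contrasts)
    return math.sqrt(
        sum(
            (math.exp(log_global_scale + p) * (xa - ca)) ** 2
            for p, xa, ca in zip(psi, x, c)
        )
    )


x, c = [3.0, 4.0], [0.0, 0.0]
print(aniso_distance(x, c, [0.0, 0.0]))   # isotropic: plain Euclidean distance
print(aniso_distance(x, c, [1.0, -1.0]))  # stretch axis 0, shrink axis 1
```

In this picture, a scale-free smooth would vary only `contrasts`, while a smooth with a global length scale would vary `log_global_scale` as well.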
**Formula-level configuration:**
| Syntax | Effect |
| --- | --- |
| `link(type=logit)` | Set link function |
| `linkwiggle(internal_knots=10)` | Spline deviation from the base link |
| `timewiggle(internal_knots=8)` | Spline deviation from the time basis (survival) |
| `survmodel(spec=net, distribution=gaussian)` | Survival model configuration |
### Auto-detection
The family is inferred from the response column:
- Binary `{0, 1}`: binomial with logit link
- Everything else: Gaussian with identity link
Override with `link(type=...)` in the formula. Poisson and Gamma families are available via explicit link specification.
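The rule above can be restated as a few lines of Python. This is only a paraphrase of the documented behavior, not the engine's code: the function name `infer_family` is hypothetical, and edge cases such as missing values or an empty column are not covered by the text, so the handling here is an assumption.

```python
def infer_family(values):
    """Return (family, link) following the documented auto-detection rule."""
    values = list(values)
    # Binary {0, 1} responses get binomial/logit; everything else is Gaussian.
    if values and all(v in (0, 1) for v in values):
        return ("binomial", "logit")
    return ("gaussian", "identity")


print(infer_family([0, 1, 1, 0]))      # binary response
print(infer_family([0.3, 1.7, 2.2]))   # continuous response
print(infer_family([0, 1, 2, 5]))      # counts are not auto-detected
```

The last call shows why count and positive-continuous responses need an explicit override: under this rule they fall through to Gaussian.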
## Fit modes
### Standard
```bash
gam fit data.csv 'y ~ age + smooth(bmi) + group(site)' --out model.json
```
### Location-scale (jointly model mean and variance)
```bash
gam fit data.csv 'y ~ smooth(x1) + smooth(x2)' \
--predict-noise 'smooth(x1)' \
--out model.json
```
Works for Gaussian and binomial families. For survival formulas, `--predict-noise` routes to the survival location-scale fitter.
### Survival
```bash
gam fit data.csv \
'Surv(t0, t1, event) ~ age + smooth(bmi) + survmodel(spec=net, distribution=gaussian)' \
--survival-likelihood transformation \
--out model.json
```
Likelihood modes: `transformation`, `weibull`, `location-scale`.
Add `--predict-noise` for distributional (location-scale) survival:
```bash
gam fit data.csv \
'Surv(t0, t1, event) ~ age + smooth(bmi) + survmodel(spec=net, distribution=gaussian)' \
--predict-noise 'smooth(age)' \
--out model.json
```
### Bernoulli marginal-slope
Models `P(case | covariates, z)` where `z` is a standardized score (e.g. a polygenic risk score). The key idea: the baseline risk surface and the effect of `z` are decoupled into separate formulas. The main formula controls the population-level risk landscape (how risk varies with age, ancestry PCs, etc.), while `--logslope-formula` controls how strongly `z` modifies that risk at each point in covariate space. This decoupling lets you estimate spatially-varying effect sizes for `z` without the baseline absorbing signal that belongs to the slope, or vice versa.
```bash
gam fit data.csv \
'case ~ smooth(age) + matern(pc1, pc2, pc3)' \
--logslope-formula 'matern(pc1, pc2, pc3)' \
--z-column prs_z \
--out model.json
```
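Since `z` is described as a standardized score, a raw score usually needs to be centered and scaled before it is written to the `--z-column` column. A minimal sketch follows, assuming the population standard deviation is the intended scale. Whether `gam` expects this exact convention, or standardizes internally, is not stated here; the helper name `standardize` is hypothetical.

```python
import statistics


def standardize(scores):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("a constant score cannot be standardized")
    return [(s - mean) / sd for s in scores]


raw_prs = [0.8, 1.9, 1.1, 2.6, 1.4]  # placeholder values, not real data
prs_z = standardize(raw_prs)
print(statistics.fmean(prs_z), statistics.pstdev(prs_z))
```

The resulting values would go into the `prs_z` column of `data.csv` in the example above.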
## Link functions
Set via `link(type=...)` in the formula.
| Link | Syntax |
| --- | --- |
| Identity | `link(type=identity)` |
| Logit | `link(type=logit)` |
| Probit | `link(type=probit)` |
| Complementary log-log | `link(type=cloglog)` |
| SAS (sinh-arcsinh) | `link(type=sas)` |
| Beta-logistic | `link(type=beta-logistic)` |
| Blended mixture | `link(type=blended(logit, probit))` |
| Flexible (data-driven) | `link(type=flexible(logit))` |
Flexible links add a spline offset to the base link, letting the data correct for link misspecification.
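For reference, the three standard binary-response links in the table map a linear predictor to a probability as sketched below. These are the textbook definitions, written with the Python standard library. The SAS, beta-logistic, blended, and flexible links are specific to `gam` and are not reproduced here.

```python
import math


def inv_logit(eta):
    """Logistic CDF, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def inv_probit(eta):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta):
    """Inverse complementary log-log: 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


for eta in (-2.0, 0.0, 2.0):
    print(
        f"eta={eta:+.1f}  logit={inv_logit(eta):.4f}  "
        f"probit={inv_probit(eta):.4f}  cloglog={inv_cloglog(eta):.4f}"
    )
```

Logit and probit are symmetric around `eta = 0`, where both give 0.5, while cloglog is asymmetric. That difference in shape is the kind of misspecification a flexible link is meant to absorb.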
## Prediction output
| Model type | Default columns | With `--uncertainty` |
| --- | --- | --- |
| Standard / binomial | `eta, mean` | `+ effective_se, mean_lower, mean_upper` |
| Gaussian location-scale | `eta, mean, sigma` | `+ mean_lower, mean_upper` |
| Survival | `eta, mean, survival_prob, risk_score, failure_prob` | `+ effective_se, mean_lower, mean_upper` |
## Other outputs
**`gam report`** writes a standalone HTML file with model summary, smooth plots, and diagnostics. Pass training data for data-dependent diagnostics.
**`gam sample`** writes posterior draws (`beta_0, beta_1, ...`) and a summary CSV. Uses NUTS (No-U-Turn Sampler).
**`gam generate`** writes a matrix of synthetic outcomes (rows = draws, columns = data rows).
**`gam diagnose`** prints terminal diagnostics. Supports `--alo` for approximate leave-one-out.
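Because the `gam generate` matrix is oriented with draws as rows and data rows as columns, per-observation summaries are computed down each column. The sketch below shows that orientation on a small in-memory matrix. The numbers are placeholders, and how the CSV itself is laid out (header row, column names) is not assumed; the helper name `per_row_summary` is hypothetical.

```python
import statistics

# Placeholder matrix: 2 draws (rows) for 3 data rows (columns).
draws = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 5.0],
]


def per_row_summary(draws):
    """Mean and spread of the synthetic outcomes for each data row."""
    columns = list(zip(*draws))  # transpose: one tuple per data row
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in columns]


for i, (mean, sd) in enumerate(per_row_summary(draws)):
    print(f"data row {i}: mean={mean:.2f} sd={sd:.2f}")
```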
## Development
```bash
cargo fmt --all
cargo clippy --all-targets --all-features -- -A warnings -D clippy::correctness -D clippy::suspicious
cargo test --all-features
```
Benchmark suite:
```bash
python3 bench/run_suite.py --help
python3 bench/run_suite.py
```
Layout:
- `src/` -- CLI, fitting engine, inference, smooth construction, survival machinery
- `bench/` -- benchmark harness, scenario configs, datasets, comparison tooling
- `tests/` -- integration tests