gam 0.1.14

Generalized penalized likelihood engine
# gam

`gam` is a formula-first CLI and Rust engine for penalized regression models.

The current CLI supports:

- Standard mean models with penalized linear terms, random effects, and smooths
- A location-scale fitting path exposed via `--predict-noise`
- Survival models via `Surv(entry, exit, event)`
- An advanced Bernoulli marginal-slope workflow via `--logslope-formula` and `--z-column`
- Prediction, HTML reports, ALO diagnostics, posterior sampling, and synthetic-data generation

The CLI is the primary interface. The Rust modules exported by the crate are used internally and can change without compatibility guarantees.

## Requirements

- Rust `1.93+` for local source builds
- CSV input data with a header row
- A shell that supports quoted formulas (`bash`/`zsh` examples below)

## Install

### Prebuilt binary

macOS, Linux, and Windows Git Bash:

```bash
curl -fsSL https://raw.githubusercontent.com/SauersML/gam/main/install.sh | bash
```

### Build from source

```bash
cargo build --release --bin gam
./target/release/gam --help
```

If `gam` is not on your `PATH`, use `./target/release/gam` in the examples below.

## Command Overview

| Command | Purpose | Required arguments | Output |
| --- | --- | --- | --- |
| `gam fit` | Fit a model | `<DATA> <FORMULA>` | No file unless `--out` is provided |
| `gam predict` | Score new data | `<MODEL> <NEW_DATA>` plus `--out` | Prediction CSV |
| `gam report` | Build a standalone HTML report | `<MODEL> [DATA] [OUT]` | `[OUT]` or `<model-stem>.report.html` in the current directory |
| `gam diagnose` | Run terminal diagnostics | `<MODEL> <DATA>` | Prints a diagnostics table |
| `gam sample` | Draw posterior samples | `<MODEL> <DATA>` | `posterior_samples.csv` by default, plus a summary CSV |
| `gam generate` | Sample synthetic outcomes conditional on input rows | `<MODEL> <DATA>` | `synthetic.csv` by default |

Aliases:

- `gam train` -> `gam fit`
- `gam simulate` -> `gam generate`

Inspect full options with:

```bash
gam <command> --help
```

## Verified Quickstart

These commands were checked against the current binary and the checked-in `lidar` dataset.

```bash
# 1) Fit a Gaussian GAM
gam fit bench/datasets/lidar.csv \
  'logratio ~ smooth(range)' \
  --out lidar.model.json

# 2) Predict with uncertainty
gam predict lidar.model.json bench/datasets/lidar.csv \
  --out lidar.pred.csv --uncertainty

# 3) Build an HTML report
gam report lidar.model.json bench/datasets/lidar.csv
# writes: lidar.model.report.html

# 4) Generate synthetic response draws
gam generate lidar.model.json bench/datasets/lidar.csv \
  --n-draws 3 \
  --out lidar.synthetic.csv
```

## Formula Language

`gam fit` expects:

```text
response ~ term + term + ...
```

Response forms:

- Standard regression/classification: `y`
- Survival: `Surv(entry, exit, event)`

Important constraints:

- `Surv(...)` currently requires exactly three columns
- Intercept removal (`0` or `-1`) is not supported
- At most one `link(...)`, one `linkwiggle(...)`, one `timewiggle(...)`, and one `survmodel(...)` may appear in a formula
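
As a rough illustration of the grammar above, here is a minimal Python sketch that splits a formula into its response and top-level terms. This is illustrative only; the real parser in `gam` also validates wrappers and enforces the constraints listed here.

```python
def split_formula(formula: str):
    """Split a gam-style formula into (response, terms).

    Naive sketch: splits the RHS on '+' at parenthesis depth zero,
    so wrapped terms like smooth(x1, x2) stay intact.
    """
    lhs, rhs = formula.split("~", 1)
    depth, term, terms = 0, "", []
    for ch in rhs:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "+" and depth == 0:
            terms.append(term.strip())
            term = ""
        else:
            term += ch
    terms.append(term.strip())
    return lhs.strip(), terms

print(split_formula("y ~ age + smooth(bmi) + group(site)"))
# ('y', ['age', 'smooth(bmi)', 'group(site)'])
```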

### Bare RHS terms

A bare column on the right-hand side is interpreted from the training schema:

- Continuous or binary column: penalized linear term
- Categorical column: random-effect block

### Term wrappers

Linear and constrained coefficients:

- `linear(x)`
- `linear(x, min=..., max=...)`
- `constrain(x, min=..., max=...)`
- `nonnegative(x)` / `nonnegative_coef(x)`
- `nonpositive(x)` / `nonpositive_coef(x)`
- `bounded(x, min=..., max=...)`

`bounded(...)` also supports:

- `prior=none|uniform|log-jacobian|center`
- `beta_a=..., beta_b=...`
- `target=..., strength=...`

Random effects:

- `group(x)` or `re(x)`

Smooths:

- `smooth(...)` or `s(...)`
- `thinplate(...)`, `thin_plate(...)`, `tps(...)`
- `matern(...)`
- `duchon(...)`
- `tensor(...)`, `interaction(...)`, `te(...)`

Formula-level configuration terms:

- `link(type=...)`
- `linkwiggle(...)`
- `timewiggle(...)`
- `survmodel(spec=..., distribution=...)`

### Smooth defaults

- `smooth(x)` with one variable defaults to a B-spline / P-spline style basis
- `smooth(x1, x2, ...)` defaults to thin-plate
- `te(...)` defaults to tensor-product B-splines

Notable smooth options:

- B-spline: `degree`, `knots`, `k`, `penalty_order`
- Thin-plate: `centers` or `k`
- Matérn: `centers` or `k`, `nu`, `length_scale`
- Duchon: `centers` or `k`, `power`, `order`, optional `length_scale`
- Tensor: `k` / `basis_dim` for marginal basis size

Spatial smooths can use per-axis anisotropy:

- Global CLI flag: `--scale-dimensions`
- Per-term override: `scale_dims=true` or `scale_dims=false`
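
The `penalty_order` option on B-spline smooths corresponds to a difference penalty on adjacent basis coefficients. The standard P-spline construction can be sketched in Python as below; the engine's exact penalty assembly may differ.

```python
def diff_matrix(k: int, order: int):
    """Order-`order` difference matrix D with shape (k - order, k).

    A P-spline penalty on k B-spline coefficients is S = D^T D;
    this mirrors what `penalty_order` controls in smooth(x, ...),
    though gam's internal construction is not guaranteed to match.
    """
    # Start from the identity and difference adjacent rows repeatedly.
    D = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(order):
        D = [[D[i + 1][j] - D[i][j] for j in range(len(D[0]))]
             for i in range(len(D) - 1)]
    return D

# Second differences of 5 coefficients: 3 rows of [1, -2, 1] patterns
print(diff_matrix(5, 2)[0])  # [1.0, -2.0, 1.0, 0.0, 0.0]
```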

## Fit Modes

### 1. Standard mean-only fits

```bash
gam fit train.csv 'y ~ age + smooth(bmi) + group(site)' --out model.json
```

Auto family resolution:

- Binary `{0,1}` response -> binomial logit
- Anything else -> gaussian identity
- `--predict-noise` does not change that default; write `link(type=probit)` (or another explicit link) in the mean formula when you want a different binomial base link
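
The documented default can be sketched as a small Python resolver. This is illustrative only; the real logic lives inside `gam fit`.

```python
def resolve_family(response_values):
    """Binary {0,1} response -> binomial/logit; anything else ->
    gaussian/identity, per the documented auto resolution."""
    vals = set(response_values)
    if vals and vals <= {0, 1}:
        return ("binomial", "logit")
    return ("gaussian", "identity")

print(resolve_family([0, 1, 1, 0]))  # ('binomial', 'logit')
print(resolve_family([0.3, 1.7]))    # ('gaussian', 'identity')
```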

### 2. Location-scale fits

Use a second formula for the scale/noise block:

```bash
gam fit train.csv 'y ~ x1 + smooth(x2)' \
  --predict-noise 'y ~ smooth(x1)' \
  --out locscale.model.json
```

If you want a probit-vs-probit comparison between mean-only and location-scale
fits, declare the link explicitly in both formulas:

```bash
gam fit train.csv 'y ~ x1 + smooth(x2) + link(type=probit)' \
  --out probit.model.json

gam fit train.csv 'y ~ x1 + smooth(x2) + link(type=probit)' \
  --predict-noise 'y ~ smooth(x1)' \
  --out probit.locscale.model.json
```

The CLI exposes this path for Gaussian and binomial families; for `Surv(...)` formulas it routes into the survival location-scale fitter. Runtime behavior is still uneven, so treat the path as experimental and verify it on your exact formula/data combination before relying on it.

### 3. Survival fits

Use `Surv(entry, exit, event)` on the left-hand side:

```bash
gam fit train.csv \
  'Surv(entry_time, exit_time, event) ~ age + smooth(bmi) + survmodel(spec=net, distribution=gaussian)' \
  --survival-likelihood transformation \
  --out survival.model.json
```

Current survival likelihood modes:

- `transformation`
- `weibull`
- `location-scale`
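
For orientation, the distribution underlying the `weibull` mode has the textbook survival function `S(t) = exp(-(t/scale)^shape)`. A minimal Python sketch follows; gam's internal parameterization may differ.

```python
import math

def weibull_survival(t: float, shape: float, scale: float) -> float:
    """Standard Weibull survival function S(t) = exp(-(t/scale)^shape).
    Shown only to illustrate the distribution behind the `weibull`
    likelihood mode, not gam's exact parameterization."""
    return math.exp(-((t / scale) ** shape))

print(round(weibull_survival(1.0, 1.0, 1.0), 4))  # exp(-1) = 0.3679
```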

Distributional survival fits can use a second formula for log-sigma:

```bash
gam fit train.csv \
  'Surv(entry_time, exit_time, event) ~ age + smooth(bmi) + survmodel(spec=net, distribution=gaussian)' \
  --predict-noise 'Surv(entry_time, exit_time, event) ~ smooth(age)' \
  --out survival-ls.model.json
```

When `--predict-noise` is present on a `Surv(...)` formula, the CLI uses the survival `location-scale` fit path.

Current survival-specific formula/config support:

- `survmodel(spec=net, distribution=...)`
- `timewiggle(...)`
- `link(...)`
- `linkwiggle(...)` only in supported survival sub-modes

### 4. Bernoulli marginal-slope fits

This is an advanced binary-response mode that adds a second formula for the log-slope surface plus an auxiliary standardized score column:

```bash
gam fit scores.csv \
  'case ~ smooth(age) + matern(pc1, pc2, pc3)' \
  --logslope-formula 'case ~ matern(pc1, pc2, pc3)' \
  --z-column prs_z \
  --out marginal.model.json
```

Current restrictions:

- Response must be binary `{0,1}`
- `--predict-noise` is not allowed
- `--firth` is not allowed
- `link(...)` and `linkwiggle(...)` are not allowed in this family or in `--logslope-formula`

## Link Functions

Links are configured in-formula via `link(type=...)`.

Supported `type` values:

- `identity`
- `logit`
- `probit`
- `cloglog`
- `sas`
- `beta-logistic`
- `blended(a,b,...)` / `mixture(a,b,...)`
- `flexible(<single-link>)`
- `flexible(blended(...))`

Advanced link parameters:

- `rho=` for blended/mixture links
- `sas_init="epsilon,log_delta"`
- `beta_logistic_init="epsilon,delta"`
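
The simple link types map the linear predictor to the mean in the usual way. A Python sketch of the inverse links is below; `sas`, `beta-logistic`, and `blended`/`flexible` links take extra parameters and are omitted.

```python
import math

def apply_link_inverse(link: str, eta: float) -> float:
    """Inverse link mu = g^{-1}(eta) for the simple link types only."""
    if link == "identity":
        return eta
    if link == "logit":
        return 1.0 / (1.0 + math.exp(-eta))
    if link == "probit":
        # Standard normal CDF via the error function
        return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))
    if link == "cloglog":
        return 1.0 - math.exp(-math.exp(eta))
    raise ValueError(f"unsupported link: {link}")

print(round(apply_link_inverse("logit", 0.0), 3))   # 0.5
print(round(apply_link_inverse("probit", 0.0), 3))  # 0.5
```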

## Output and Data Semantics

### Saved models

- `gam fit` writes nothing unless `--out` is provided
- Saved model JSON includes training schema and header metadata
- Prediction-like commands reload new data using that saved schema
- If a model predates current metadata requirements, refit it with the current CLI

### Prediction CSV schema

Standard and Bernoulli marginal-slope models:

- default: `eta,mean`
- with `--uncertainty`: `eta,mean,effective_se,mean_lower,mean_upper`

Gaussian location-scale models, when the fit path succeeds:

- default: `eta,mean,sigma`
- with `--uncertainty`: `eta,mean,sigma,mean_lower,mean_upper`

Survival models:

- default: `eta,mean,survival_prob,risk_score,failure_prob`
- with `--uncertainty`: `eta,mean,survival_prob,risk_score,failure_prob,effective_se,mean_lower,mean_upper`

Notes:

- In survival output, `mean` is the same quantity as `survival_prob`
- `risk_score` is risk-oriented and currently tracks the linear predictor direction
- `effective_se` is estimator uncertainty, not observation noise
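
Downstream tooling can consume these columns directly. A Python sketch that reads a hypothetical `--uncertainty` prediction CSV: the column names match the schema above, but the values here are made up.

```python
import csv
import io

# Hypothetical --uncertainty output for a standard model.
pred_csv = """eta,mean,effective_se,mean_lower,mean_upper
-0.12,0.47,0.05,0.37,0.57
1.30,0.79,0.04,0.71,0.87
"""

rows = list(csv.DictReader(io.StringIO(pred_csv)))
# Interval width from the lower/upper columns
widths = [float(r["mean_upper"]) - float(r["mean_lower"]) for r in rows]
print([round(w, 2) for w in widths])  # [0.2, 0.16]
```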

### Sampling output

`gam sample` writes:

- A raw draws CSV with columns `beta_0`, `beta_1`, ...
- A second summary CSV, named by replacing the draws file's extension with `summary.csv`

Defaults when `--out` is omitted:

- Draws: `posterior_samples.csv`
- Summary: `posterior_samples.summary.csv`
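
A Python sketch that summarizes a draws CSV in this layout: the values are hypothetical, and gam's own summary CSV may use different columns.

```python
import csv
import io
import statistics

# Hypothetical draws in the gam sample layout: one column per coefficient.
draws_csv = """beta_0,beta_1
0.10,1.9
0.30,2.1
0.20,2.0
"""

cols = {}
for row in csv.DictReader(io.StringIO(draws_csv)):
    for name, val in row.items():
        cols.setdefault(name, []).append(float(val))

# Per-coefficient posterior mean and sd, similar in spirit to the
# summary CSV gam writes alongside the draws.
summary = {name: (statistics.mean(v), statistics.stdev(v))
           for name, v in cols.items()}
print(round(summary["beta_0"][0], 2))  # 0.2
```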

Current sampling support:

- Standard models
- Survival models on the non-location-scale path

Not currently available for:

- Gaussian location-scale models
- Binomial location-scale models
- Bernoulli marginal-slope models

### Synthetic generation output

`gam generate` writes a numeric matrix:

- One row per sampled dataset
- One column per conditioning-data row
- Column names are `draw_0`, `draw_1`, ... indexed by input row position
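
A Python sketch that pivots this wide layout into long `(dataset, input_row, value)` triples; the values here are hypothetical.

```python
import csv
import io

# Hypothetical gam generate output: one row per sampled dataset, one
# column per conditioning-data row (draw_i names index input row position).
synthetic_csv = """draw_0,draw_1,draw_2
0.5,1.2,0.9
0.4,1.1,1.0
"""

rows = list(csv.DictReader(io.StringIO(synthetic_csv)))
# Pivot to long form: (dataset_index, input_row_index, value)
long = [(d, int(col.split("_")[1]), float(val))
        for d, row in enumerate(rows)
        for col, val in row.items()]
print(long[0])  # (0, 0, 0.5)
```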

Defaults when `--out` is omitted:

- `synthetic.csv`

Not currently available for:

- Survival models
- Bernoulli marginal-slope models

### Report output

`gam report <MODEL> [DATA] [OUT]` writes:

- `[OUT]` if provided
- Otherwise `<model-stem>.report.html` in the current working directory

The report is standalone HTML. With data input it includes data-dependent diagnostics; without data input those sections are omitted.

### Schema compatibility

Prediction, reporting, sampling, and generation expect the new data to match the saved training schema:

- Column names must match
- Column types must match
- Unseen categorical levels are treated as errors
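
These rules can be sketched as a small Python check. This is illustrative only; the real checks run inside gam against the saved model JSON, whose layout is not shown here.

```python
def check_schema(saved: dict, new_header: list, new_levels: dict) -> list:
    """Report schema mismatches per the documented rules: missing
    columns and unseen categorical levels are errors."""
    errors = []
    for col, kind in saved["columns"].items():
        if col not in new_header:
            errors.append(f"missing column: {col}")
        elif kind == "categorical":
            unseen = set(new_levels.get(col, [])) - set(saved["levels"][col])
            if unseen:
                errors.append(f"unseen levels in {col}: {sorted(unseen)}")
    return errors

# Hypothetical saved schema: one continuous column, one categorical.
saved = {"columns": {"age": "continuous", "site": "categorical"},
         "levels": {"site": ["a", "b"]}}
print(check_schema(saved, ["age", "site"], {"site": ["a", "c"]}))
# ["unseen levels in site: ['c']"]
```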

## Current CLI Limitations

- `diagnose` currently only exposes `--alo`
- `diagnose --alo` is not supported for models containing `bounded(...)` coefficients
- `--predict-noise` is exposed in the CLI, but current Gaussian, binomial, and survival location-scale fits still have rough edges; verify behavior on your exact workload before depending on that path
- `linkwiggle(...)` belongs in the mean formula, not `--predict-noise`
- Flexible links are only supported in specific binomial and survival paths
- Some benchmark datasets in `bench/datasets/` are meant for harness scenarios rather than copy-paste README demos

## Development

Common local checks:

```bash
cargo fmt --all
cargo clippy --all-targets --all-features -- -A warnings -D clippy::correctness -D clippy::suspicious
cargo test --all-features -- --nocapture
```

Benchmark harness:

```bash
python3 bench/run_suite.py --help
python3 bench/run_suite.py
```

Repository layout:

- `src/`: CLI, model code, fitting/inference, smooth construction, and survival machinery
- `bench/`: benchmark harness, scenario configs, datasets, and comparison tooling
- `tests/`: Rust integration tests plus benchmark helper tools

Lean checks for the Rust-matched `.lean` files under `src/`:

```bash
./scripts/lean-check-all.sh
```