# opentslm
A Rust re-implementation of [OpenTSLM](https://github.com/StanfordBDHG/OpenTSLM) using
**[Burn 0.20](https://burn.dev/) + WGPU/Metal** for the trainable encoder and
**[llama-cpp-4](https://crates.io/crates/llama-cpp-4)** for the frozen LLM backbone.
No Python required — datasets, training, inference, and metric plotting all run from a
single Rust binary.
Default model: **Qwen3-4B Q4\_K\_M** (~2.5 GB) from
[unsloth/Qwen3-4B-GGUF](https://huggingface.co/unsloth/Qwen3-4B-GGUF) — fits inside
10 GB RAM.
---
## Architecture
```
Time series ──► TransformerCnnEncoder ──► mean-pool ──► LogitBiasHead
                (Burn / WGPU, trainable)                Linear(enc→vocab)
                                                              │
                                                additive bias │
                                                              ▼
Text prompt ──► llama.cpp (GGUF, frozen) ──► base logits ──► adjusted logits
                                                              │
                                                              ▼
                                                  cross-entropy loss
```
**Frozen backbone** — Qwen3 GGUF loaded via llama-cpp-4. Any quant level works;
Q4\_K\_M is the default.
**Trainable components** (Burn/WGPU):
- `TransformerCnnEncoder` — conv-patch front-end + 6-layer transformer
- `LogitBiasHead` (`Linear(enc_dim=128 → n_vocab)`) — projects mean-pooled embeddings
to a per-vocabulary additive logit bias; near-zero init so training starts from the
frozen LLM's unmodified distribution
**Training signal** — for each sample:
1. Single-pass encoder → pool → head → **bias** `[n_vocab]` ← differentiable
2. llama.cpp forward over `[prompt ‖ answer]` → **base logits** at answer positions ← constant
3. `adjusted = base_logits + bias` → cross-entropy vs answer tokens
4. Gradients flow only through encoder + logit-head; Qwen3 is never modified
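Steps 1–3 amount to adding one learned vector to every answer-position logit and scoring with cross-entropy. A minimal sketch of that arithmetic in plain Rust (illustrative only: `step_loss` is a hypothetical helper, and the real `compute_loss` keeps `bias` as a Burn tensor so gradients reach the encoder):

```rust
/// Cross-entropy over answer positions with a per-sample additive logit bias.
/// `bias`:          [n_vocab], output of LogitBiasHead (differentiable in Burn)
/// `base_logits`:   [answer_len][n_vocab], frozen llama.cpp output (constants)
/// `answer_tokens`: target token ids at the answer positions
fn step_loss(bias: &[f32], base_logits: &[Vec<f32>], answer_tokens: &[usize]) -> f32 {
    let mut loss = 0.0f32;
    for (logits, &target) in base_logits.iter().zip(answer_tokens) {
        // adjusted = base_logits + bias (the only differentiable term is `bias`)
        let adjusted: Vec<f32> = logits.iter().zip(bias).map(|(l, b)| l + b).collect();
        // numerically stable log-sum-exp for the cross-entropy denominator
        let max = adjusted.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let log_z = adjusted.iter().map(|x| (x - max).exp()).sum::<f32>().ln() + max;
        loss += log_z - adjusted[target];
    }
    loss / answer_tokens.len() as f32 // mean over answer positions
}
```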
---
## File layout
```
src/
├── main.rs                 CLI (train / eval / infer / download-data)
├── config.rs               Hyper-parameters and default paths
├── data/
│   ├── batch.rs            Sample struct, collate, padding
│   ├── downloader.rs       Built-in wearable dataset downloader [feature: download]
│   ├── tsqa.rs             TSQA multiple-choice QA loader
│   ├── m4.rs               M4 captioning loader
│   ├── har.rs              HAR chain-of-thought loader
│   ├── sleep.rs            SleepEDF CoT loader
│   └── ecg.rs              ECG QA CoT loader
├── model/
│   ├── encoder.rs          TransformerCnnEncoder (Burn, trainable)
│   ├── projector.rs        MlpProjector (reserved for future stages)
│   └── llm/
│       ├── llama_cpp.rs    LlamaCppBackend — GGUF loader, tokeniser,
│       │                   answer_logits(), generate()
│       └── opentslm_sp.rs  OpenTslmSp — encode_to_logit_bias(),
│                           compute_loss(), compute_loss_and_metrics(),
│                           generate()
└── training/
    ├── mod.rs
    ├── curriculum.rs       Five-stage curriculum trainer
    └── metrics.rs          EpochMetrics, StageMetrics, SVG/PNG plotting

./figures/                  ← written at runtime, one sub-dir per stage
└── <stage>/
    ├── metrics.csv
    ├── loss_curve.svg
    ├── perplexity.svg
    ├── accuracy.svg
    ├── macro_recall.svg
    └── index.html          (self-contained: 2×2 SVG grid + data table)
```
---
## Curriculum stages
| Stage | Task | Data source |
|---|---|---|
| `stage1_mcq` | Multiple-choice QA on time series | WISDM-W wrist accel → MCQ |
| `stage2_captioning` | Time-series captioning | WISDM-W wrist accel → captions |
| `stage3_cot` | HAR chain-of-thought | WISDM-W wrist accel + gyro |
| `stage4_sleep_cot` | Sleep-stage CoT | SleepEDF Fpz-Cz EEG |
| `stage5_ecg_cot` | ECG arrhythmia CoT | Synthetic 2-lead ECG |
Each stage loads the best checkpoint from the previous stage, trains with AdamW +
linear warm-up + early stopping, writes `test_predictions.jsonl`, and saves metric
plots to `./figures/<stage>/`.
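The per-stage schedule is linear warm-up over the first `WARMUP_FRAC` of optimiser steps. A minimal sketch of that ramp (assumptions: `lr_at` is a hypothetical helper, and the rate stays flat after warm-up since no decay is described here):

```rust
/// Linear LR warm-up over the first `warmup_frac` of total steps.
fn lr_at(step: usize, total_steps: usize, base_lr: f64, warmup_frac: f64) -> f64 {
    let warmup_steps = ((total_steps as f64) * warmup_frac).ceil().max(1.0) as usize;
    if step < warmup_steps {
        // ramp 0 → base_lr across the warm-up window
        base_lr * (step + 1) as f64 / warmup_steps as f64
    } else {
        base_lr // constant afterwards (no decay described in this README)
    }
}
```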
---
## Results — Qwen3-4B Q4\_K\_M, 3 epochs, 2 000 train samples/stage

Each colour is one stage (blue → green → orange → purple → red).
Dashed vertical lines mark stage boundaries on the shared global-epoch axis.
Accuracy and recall panels start from stage 3 — stages 1–2 use free-form
generation targets (MCQ options / captions), so argmax accuracy is not meaningful.
| Stage | Val loss | Val perplexity | Token accuracy | Macro recall |
|---|---|---|---|---|
| stage1\_mcq | 0.5017 | 1.65 | — | — |
| stage2\_captioning | 1.3232 | 3.76 | — | — |
| stage3\_cot | 1.9117 | 6.76 | 52.4 % | 44.8 % |
| stage4\_sleep\_cot | 1.8791 | 6.55 | 51.6 % | 49.1 % |
| stage5\_ecg\_cot | 1.3461 | 3.84 | 64.2 % | 62.1 % |
To regenerate the chart after a new run:
```bash
python3 plot_overview.py
#   --figures figures/    alternative figures root
#   --out path.png        alternative output path
```
Per-stage interactive HTML reports (with per-epoch tables) live in
`figures/<stage>/index.html`.
---
## Quick start
### 1 — Install Rust
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### 2 — Run
```bash
cd opentslm
cargo run --release -- train
```
On first run the binary:
1. **Downloads the model** — Qwen3-4B Q4\_K\_M (~2.5 GB) into
`~/.cache/huggingface/hub/`
2. **Downloads training data** — wearable sensor datasets from HuggingFace into
`data/`
3. **Trains all five curriculum stages** — checkpoints in `results/`, plots in
`./figures/`
Both downloads are skipped automatically on subsequent runs (hf-hub cache).
### Other commands
```bash
# Train specific stages only
cargo run --release -- train --stages stage1_mcq stage2_captioning
# Evaluate a stage
cargo run --release -- eval --stage stage3_cot
# Ask a question about a time series (no data needed)
cargo run --release -- infer \
    --series "0.1,0.2,0.4,0.8,0.7,0.5,0.3,0.1" \
    --prompt "What activity does this wrist sensor show?"
# Download datasets without training
cargo run --release -- download-data
cargo run --release -- download-data --limit 500 # quick smoke-test
cargo run --release -- download-data --only har # single source
```
### Model options
| Model file | Size | Notes |
|---|---|---|
| `Qwen3-0.6B-Q4_K_M.gguf` | ~0.4 GB | fastest |
| `Qwen3-1.7B-Q4_K_M.gguf` | ~1.1 GB | lightweight |
| **`Qwen3-4B-Q4_K_M.gguf`** | **~2.5 GB** | **default** |
| `Qwen3-8B-Q4_K_M.gguf` | ~5.0 GB | best quality, fits 10 GB |
```bash
# Pre-download a specific model to the HF cache
huggingface-cli download unsloth/Qwen3-1.7B-GGUF Qwen3-1.7B-Q4_K_M.gguf
# Use a local GGUF file
cargo run --release -- train --model /path/to/model.gguf
```
Set `$HF_TOKEN` to avoid rate limits or to access gated repos:
```bash
export HF_TOKEN=hf_...
```
---
## Metric plots
After each stage finishes, four SVG charts and one PNG summary are written to
`./figures/<stage>/`:
| File | Contents |
|---|---|
| `loss_curve.svg` | Train loss (blue) + val loss (red) per epoch |
| `perplexity.svg` | Val perplexity (`exp(val_loss)`) per epoch |
| `accuracy.svg` | Val token accuracy — fraction of answer positions where argmax = target |
| `macro_recall.svg` | Val macro-averaged recall over all answer token classes |
| `index.html` | Self-contained page: 2 × 2 SVG grid + per-epoch data table |
| `metrics.csv` | Raw numbers: `epoch,train_loss,val_loss,val_perplexity,val_accuracy,val_macro_recall` |
**Token accuracy** is computed by comparing `argmax(adjusted_logits)` to the target
token at each answer position — a strong signal for MCQ stages where the correct
answer is a single token (A/B/C/D).
**Macro recall** averages per-class recall over every unique target token seen in the
validation set. For MCQ this gives recall across the four answer options; for
captioning/CoT stages it reflects coverage over the vocabulary subset actually used.
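A minimal sketch of that macro-recall computation in plain Rust (illustrative, with `macro_recall` as a hypothetical helper; `metrics.rs` operates on tensors rather than id slices):

```rust
use std::collections::HashMap;

/// Macro recall: per-class recall averaged over every unique target token.
fn macro_recall(predictions: &[usize], targets: &[usize]) -> f64 {
    // class id -> (correct predictions, occurrences as target)
    let mut per_class: HashMap<usize, (usize, usize)> = HashMap::new();
    for (&pred, &target) in predictions.iter().zip(targets) {
        let entry = per_class.entry(target).or_insert((0, 0));
        entry.1 += 1;
        if pred == target {
            entry.0 += 1;
        }
    }
    if per_class.is_empty() {
        return 0.0;
    }
    // unweighted mean of per-class recalls: rare tokens count as much as common ones
    let sum: f64 = per_class.values().map(|&(hit, n)| hit as f64 / n as f64).sum();
    sum / per_class.len() as f64
}
```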
---
## Cargo features
| Feature | Default | Description |
|---|---|---|
| `download` | ✓ | Built-in dataset downloader (hf-hub + parquet). Disable with `--no-default-features` to skip the Parquet dep. |
| `verbose` | ✗ | Print all cubecl autotune / cache log lines. By default these are capped at `WARN` so the training progress bar isn't buried under hundreds of "Tuning MatmulAutotuneKey …" messages. |
```bash
# Suppress dataset downloader
cargo build --release --no-default-features
# Show full cubecl autotune output
cargo run --release --features verbose -- train
# Override log level at runtime without recompiling
RUST_LOG=debug cargo run --release -- train
# Custom level for specific target
RUST_LOG=info,cubecl_runtime=debug cargo run --release -- train
```
`$RUST_LOG` is always honoured and takes priority over both `--log-level` and the
`verbose` feature flag.
---
## Troubleshooting
| Symptom | Fix |
|---|---|
| Model download fails | Check internet / set `$HF_TOKEN` |
| Rate-limited by HuggingFace | Set `$HF_TOKEN` (free account removes limits) |
| Custom cache location | Set `$HF_HOME` (e.g. `export HF_HOME=/data/hf`) |
| OOM during training | Add `--batch-size 1` |
| Want full dataset | Set `MAX_TRAIN_SAMPLES = usize::MAX` in `config.rs` |
| Want more epochs | Increase `NUM_EPOCHS` in `config.rs` |
| Too many autotune log lines | Already suppressed by default; build with `--features verbose` to restore |
| Figures directory missing | Created automatically on first `train` run |
---
## Memory footprint (Apple M3 Max, 36 GB unified)
| Component | Size | Notes |
|---|---|---|
| Model weights | ~2.5 GB | Metal, all layers offloaded via llama.cpp |
| KV cache (2 048 ctx) | ~0.3 GB | grows with sequence length |
| Encoder + logit-head | ~0.1 GB | Burn/WGPU |
| AdamW optimiser state | ~0.2 GB | encoder params only |
| Activations (batch=4) | ~0.2 GB | peak during backward pass |
| **Total** | **~3.3 GB** | well inside 10 GB |
Qwen3-8B Q4\_K\_M raises the total to ~6 GB.
---
## Results layout
```
results/
└── Qwen3_4B_Q4_K_M/
    └── OpenTSLMSP/
        ├── stage1_mcq/
        │   ├── checkpoints/
        │   │   ├── best_model.json        ← epoch + val_loss metadata
        │   │   └── loss_history.txt
        │   └── results/
        │       └── test_predictions.jsonl
        ├── stage2_captioning/ …
        └── …

./figures/
├── stage1_mcq/
│   ├── metrics.csv
│   ├── loss_curve.svg
│   ├── perplexity.svg
│   ├── accuracy.svg
│   ├── macro_recall.svg
│   └── index.html             (self-contained: 2×2 SVG grid + data table)
├── stage2_captioning/ …
└── …
```
---
## Hyperparameters (`src/config.rs`)
| Constant | Default | Meaning |
|---|---|---|
| `PATCH_SIZE` | 4 | conv stride and kernel size |
| `ENCODER_OUTPUT_DIM` | 128 | encoder embedding dimension |
| `ENCODER_NUM_HEADS` | 8 | transformer attention heads |
| `ENCODER_NUM_LAYERS` | 6 | transformer depth |
| `ENCODER_FF_DIM` | 1 024 | transformer feed-forward dim |
| `ENCODER_MAX_PATCHES` | 1 024 | max sequence patches |
| `LR_ENCODER` | 2 × 10⁻⁴ | AdamW learning rate |
| `WEIGHT_DECAY` | 0.01 | AdamW weight decay |
| `GRAD_CLIP_NORM` | 1.0 | gradient norm clip threshold |
| `WARMUP_FRAC` | 0.10 | fraction of steps for LR warm-up |
| `NUM_EPOCHS` | 3 | epochs per stage |
| `EARLY_STOP_PAT` | 3 | early-stopping patience |
| `BATCH_SIZE` | 4 | samples per gradient step |
| `MAX_TRAIN_SAMPLES` | 2 000 | per-split cap (`usize::MAX` for full run) |
| `N_GPU_LAYERS` | 999 | llama.cpp layers on GPU (all) |
| `CTX_SIZE` | 2 048 | KV-cache context window |
| `MAX_EVAL_TOKENS` | 64 | max tokens generated per sample during test evaluation |
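These are plain `const` items, so a tweak is a one-line edit plus recompile. A hypothetical sketch of their shape (names and values from the table; the actual file may group them differently):

```rust
// src/config.rs (sketch; values from the table above)
pub const ENCODER_OUTPUT_DIM: usize = 128;
pub const LR_ENCODER: f64 = 2e-4;
pub const BATCH_SIZE: usize = 4;
pub const NUM_EPOCHS: usize = 3;
pub const MAX_TRAIN_SAMPLES: usize = 2_000; // usize::MAX for the full dataset
```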
---
## Design decisions
### Burn 0.20 with fusion disabled
Burn 0.20 ships with `burn-wgpu/fusion` on by default, which enables
`burn-cubecl-fusion` — a lazy kernel-fusion backend built on `burn-ir`. On Metal,
this backend has a `Handle::NotInit` panic that fires during the backward pass
when weight tensors are shared across multiple graph nodes (e.g. the transformer
parameters appearing in every encoder call).
Fix: `default-features = false` on the `burn` dependency prevents the top-level
crate from re-enabling `burn-wgpu?/default`. The `autotune` feature is kept
(it is orthogonal to fusion and improves GPU kernel selection) while `fusion` is
left out entirely. `burn-ir` and `burn-cubecl-fusion` do not appear in the
dependency graph.
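A sketch of the corresponding dependency stanza, assuming Burn's published feature names (check the repo's actual `Cargo.toml` for the authoritative list):

```toml
[dependencies]
# default-features = false keeps burn from re-enabling burn-wgpu's `fusion`
burn = { version = "0.20", default-features = false, features = [
    "wgpu",     # Metal/WGPU backend
    "autodiff", # gradients for the trainable encoder + logit head
    "autotune", # GPU kernel selection; orthogonal to fusion
] }
# note: no "fusion" feature, so burn-ir / burn-cubecl-fusion stay out of the graph
```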
### Single-pass batched encoder
The encoder is called **once per batch** across all time series from all samples.
An earlier per-sample loop called `encoder.forward()` once per sample and used
`Tensor::stack` to combine the results. Even with fusion disabled, combining
tensors from N separate forward passes via `stack` can confuse burn-ir's handle
lifecycle. The single-pass design allocates one flat `[total_ts, T]` tensor,
runs the encoder once, then averages per-sample with `Tensor::cat` on slices —
all handles live in the same graph session.
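A minimal sketch of that pooling step (Burn-flavoured, with `pool_per_sample` as a hypothetical helper; the real encoder code may slice differently):

```rust
use burn::prelude::*;

/// `embeddings`: [total_ts, enc_dim], encoder output for every series in the batch.
/// `counts[i]`:  how many series belong to sample i.
/// Returns [batch, enc_dim]: per-sample means, all in one graph session.
fn pool_per_sample<B: Backend>(embeddings: Tensor<B, 2>, counts: &[usize]) -> Tensor<B, 2> {
    let mut offset = 0;
    let mut pooled = Vec::with_capacity(counts.len());
    for &n in counts {
        // mean over this sample's rows; mean_dim keeps dim 0 with size 1
        pooled.push(embeddings.clone().slice([offset..offset + n]).mean_dim(0));
        offset += n;
    }
    Tensor::cat(pooled, 0)
}
```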
### Why llama-cpp-4 instead of a pure-Burn LLM?
A 4B-parameter model in f32 needs ~16 GB before any activations. GGUF Q4\_K\_M
cuts that to ~2.5 GB. llama.cpp also provides the exact tokeniser, Metal
offloading on Apple Silicon, and efficient KV-cache management.
### Why additive logit bias?
Embedding injection requires splitting the LLM's forward pass at the embedding
layer — not possible via the llama-cpp-4 C API. Additive logit bias achieves the
same conditioning effect: the encoder boosts answer tokens and suppresses
irrelevant vocabulary, with gradients flowing only through the bias head.
### Why near-zero init for the logit head?
`Linear(128 → 151936)` with Kaiming uniform produces biases of magnitude ~10 on
the first forward pass. Added to the LLM's calibrated logits (typically ±5–15),
this immediately corrupts the distribution and produces NaN loss within a few
batches. `Normal(mean=0, std=0.01)` starts the bias at ~0 — matching the frozen
LLM's unmodified distribution — and lets it grow only as gradient signal accumulates.
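With Burn's `nn` API, the fix is one initializer override on the head's `LinearConfig`; a sketch under the assumption that the head is a plain `Linear` (`logit_bias_head` is a hypothetical constructor):

```rust
use burn::nn::{Initializer, Linear, LinearConfig};
use burn::prelude::*;

/// Near-zero init: the head starts as (almost) a no-op on the LLM's logits.
fn logit_bias_head<B: Backend>(n_vocab: usize, device: &B::Device) -> Linear<B> {
    LinearConfig::new(128, n_vocab) // ENCODER_OUTPUT_DIM → vocabulary size
        .with_initializer(Initializer::Normal { mean: 0.0, std: 0.01 })
        .init(device)
}
```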
### Why hf-hub instead of a custom downloader?
`ApiRepo::download()` handles HTTP, ETags, on-disk caching, progress bars, and
retries. The `parquet` crate (with `snap` for Snappy decompression) reads
HuggingFace's `refs/convert/parquet` shards row by row.
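The model fetch reduces to a few lines against hf-hub's blocking API; a sketch using the repo and file names from this README (error handling elided, `fetch_model` hypothetical):

```rust
use hf_hub::api::sync::Api;

/// Resolve the default GGUF into ~/.cache/huggingface/hub/ (honours $HF_HOME).
fn fetch_model() -> Result<std::path::PathBuf, Box<dyn std::error::Error>> {
    let repo = Api::new()?.model("unsloth/Qwen3-4B-GGUF".to_string());
    Ok(repo.download("Qwen3-4B-Q4_K_M.gguf")?)
}
```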
### Logging
`tracing-subscriber` with `EnvFilter` is used throughout. Without the `verbose`
feature, `cubecl_runtime::tune` is capped at `WARN` so the progress bar is
readable. `RUST_LOG` always takes precedence over both `--log-level` and the
feature flag.
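That precedence is the standard `EnvFilter` pattern; a minimal sketch assuming tracing-subscriber's `env-filter` feature is enabled (an illustrative `init_logging`, not `main.rs` verbatim):

```rust
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // $RUST_LOG wins when set; otherwise cap cubecl's autotune chatter at WARN
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,cubecl_runtime=warn"));
    tracing_subscriber::fmt().with_env_filter(filter).init();
}
```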
---
## Python → Rust mapping
| Python (OpenTSLM) | Rust (this repo) |
|---|---|
| `TransformerCNNEncoder` | `src/model/encoder.rs` |
| `OpenTSLMSP.compute_loss` | `src/model/llm/opentslm_sp.rs` |
| `OpenTSLMSP.generate` | `src/model/llm/opentslm_sp.rs` |
| LLM forward pass | `src/model/llm/llama_cpp.rs` via llama-cpp-4 |
| Tokeniser | built into llama-cpp-4 / GGUF file |
| Dataset download scripts | `src/data/downloader.rs` (hf-hub + parquet) |
| `TSQADataset` | `src/data/tsqa.rs` |
| `M4QADataset` | `src/data/m4.rs` |
| `HARCoTQADataset` | `src/data/har.rs` |
| `SleepEDFCoTQADataset` | `src/data/sleep.rs` |
| `ECGQACoTQADataset` | `src/data/ecg.rs` |
| `CurriculumTrainer` | `src/training/curriculum.rs` |
| Metric plotting | `src/training/metrics.rs` (plotters, no Python) |
| `curriculum_learning.py` CLI | `src/main.rs` |
## License
MIT
## Citations
Please cite this work if you found it helpful.
```bibtex
@software{kosmyna2026opentslm,
  author = {Kosmyna, Nataliya},
  title  = {opentslm: Pure-Rust implementation of Open Time-Series Language Models (OpenTSLM)},
  year   = {2026},
  url    = {https://github.com/nataliyakosmyna/opentslm}
}
```

Original paper:

```bibtex
@misc{langer2026opentslmtimeserieslanguagemodels,
  title         = {OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data},
  author        = {Patrick Langer and Thomas Kaar and Max Rosenblattl and Maxwell A. Xu and Winnie Chow and Martin Maritsch and Robert Jakob and Ning Wang and Juncheng Liu and Aradhana Verma and Brian Han and Daniel Seung Kim and Henry Chubb and Scott Ceresnak and Aydin Zahedivash and Alexander Tarlochan Singh Sandhu and Fatima Rodriguez and Daniel McDuff and Elgar Fleisch and Oliver Aalami and Filipe Barata and Paul Schmiedmayer},
  year          = {2026},
  eprint        = {2510.02410},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2510.02410}
}
```