opentslm
A Rust re-implementation of OpenTSLM using Burn 0.20 + WGPU/Metal for the trainable encoder and llama-cpp-4 for the frozen LLM backbone.
No Python required — datasets, training, inference, and metric plotting all run from a single Rust binary.
Default model: Qwen3-4B Q4_K_M (~2.5 GB) from unsloth/Qwen3-4B-GGUF — fits inside 10 GB RAM.
Architecture
```
Time series ──► TransformerCnnEncoder ──► mean-pool ──► LogitBiasHead
                (Burn / WGPU, trainable)                Linear(enc→vocab)
                                                              │
                                                additive bias ▼
Text prompt ──► llama.cpp (GGUF, frozen) ──► base logits ──► adjusted logits
                                                              │
                                                              ▼
                                                    cross-entropy loss
```
Frozen backbone — Qwen3 GGUF loaded via llama-cpp-4. Any quant level works; Q4_K_M is the default.
Trainable components (Burn/WGPU):
- TransformerCnnEncoder — conv-patch front-end + 6-layer transformer
- LogitBiasHead (Linear(enc_dim=128 → n_vocab)) — projects mean-pooled embeddings to a per-vocabulary-token additive logit bias; near-zero init so training starts from the frozen LLM's unmodified distribution
Training signal — for each sample:
- Single-pass encoder → pool → head → bias [n_vocab] ← differentiable
- llama.cpp forward over [prompt ‖ answer] → base logits at answer positions ← constant
- adjusted = base_logits + bias → cross-entropy vs answer tokens (sketched below)
- Gradients flow only through encoder + logit-head; Qwen3 is never modified
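The bias-plus-cross-entropy step looks roughly like the following. This is a minimal sketch with hypothetical names, assuming the llama.cpp base logits at the answer positions have already been copied into a constant Burn tensor; the real implementation lives in src/model/llm/opentslm_sp.rs.

```rust
use burn::tensor::{activation::log_softmax, backend::Backend, Int, Tensor};

/// bias        – [n_vocab] additive bias from the trainable logit head (differentiable)
/// base_logits – [n_answer, n_vocab] frozen-LLM logits at the answer positions (constant)
/// targets     – [n_answer] gold answer token ids
fn answer_loss<B: Backend>(
    bias: Tensor<B, 1>,
    base_logits: Tensor<B, 2>,
    targets: Tensor<B, 1, Int>,
) -> Tensor<B, 1> {
    let [n_answer, _n_vocab] = base_logits.dims();
    // adjusted = base_logits + bias, with the bias broadcast over all answer positions
    let adjusted = base_logits + bias.unsqueeze::<2>();
    // token-level cross-entropy: -mean log p(target) over the answer positions
    let log_probs = log_softmax(adjusted, 1);
    let picked = log_probs.gather(1, targets.reshape([n_answer, 1]));
    picked.mean().neg()
}
```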
File layout
src/
├── main.rs CLI (train / eval / infer / download-data)
├── config.rs Hyper-parameters and default paths
├── data/
│ ├── batch.rs Sample struct, collate, padding
│ ├── downloader.rs Built-in wearable dataset downloader [feature: download]
│ ├── tsqa.rs TSQA multiple-choice QA loader
│ ├── m4.rs M4 captioning loader
│ ├── har.rs HAR chain-of-thought loader
│ ├── sleep.rs SleepEDF CoT loader
│ └── ecg.rs ECG QA CoT loader
├── model/
│ ├── encoder.rs TransformerCnnEncoder (Burn, trainable)
│ ├── projector.rs MlpProjector (reserved for future stages)
│ └── llm/
│ ├── llama_cpp.rs LlamaCppBackend — GGUF loader, tokeniser,
│ │ answer_logits(), generate()
│ └── opentslm_sp.rs OpenTslmSp — encode_to_logit_bias(),
│ compute_loss(), compute_loss_and_metrics(),
│ generate()
└── training/
├── mod.rs
├── curriculum.rs Five-stage curriculum trainer
└── metrics.rs EpochMetrics, StageMetrics, SVG/PNG plotting
./figures/ ← written at runtime, one sub-dir per stage
└── <stage>/
├── metrics.csv
├── loss_curve.svg
├── perplexity.svg
├── accuracy.svg
├── macro_recall.svg
└── index.html (self-contained: 2×2 SVG grid + data table)
Curriculum stages
| Stage | Task | Data source |
|---|---|---|
| `stage1_mcq` | Multiple-choice QA on time series | WISDM-W wrist accel → MCQ |
| `stage2_captioning` | Time-series captioning | WISDM-W wrist accel → captions |
| `stage3_cot` | HAR chain-of-thought | WISDM-W wrist accel + gyro |
| `stage4_sleep_cot` | Sleep-stage CoT | SleepEDF Fpz-Cz EEG |
| `stage5_ecg_cot` | ECG arrhythmia CoT | Synthetic 2-lead ECG |
Each stage loads the best checkpoint from the previous stage, trains with AdamW +
linear warm-up + early stopping, writes test_predictions.jsonl, and saves metric
plots to ./figures/<stage>/.
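The linear warm-up is simple enough to spell out. A plain-Rust sketch of the schedule (hypothetical helper; the real trainer in src/training/curriculum.rs may implement it differently):

```rust
/// Learning rate at a given global step: linear ramp over the first
/// `warmup_frac` of training, then constant at `base_lr`.
fn lr_at_step(step: usize, total_steps: usize, base_lr: f64, warmup_frac: f64) -> f64 {
    let warmup_steps = ((total_steps as f64) * warmup_frac).ceil().max(1.0) as usize;
    if step < warmup_steps {
        base_lr * (step + 1) as f64 / warmup_steps as f64
    } else {
        base_lr
    }
}
```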
Results — Qwen3-4B Q4_K_M, 3 epochs, 2 000 train samples/stage

Each colour is one stage (blue → green → orange → purple → red). Dashed vertical lines mark stage boundaries on the shared global-epoch axis. Accuracy and recall panels start from stage 3 — stages 1–2 use free-form generation targets (MCQ options / captions), so argmax accuracy is not meaningful.
| Stage | Val loss | Perplexity | Token accuracy | Macro recall |
|---|---|---|---|---|
| stage1_mcq | 0.5017 | 1.65 | — | — |
| stage2_captioning | 1.3232 | 3.76 | — | — |
| stage3_cot | 1.9117 | 6.76 | 52.4 % | 44.8 % |
| stage4_sleep_cot | 1.8791 | 6.55 | 51.6 % | 49.1 % |
| stage5_ecg_cot | 1.3461 | 3.84 | 64.2 % | 62.1 % |
To regenerate the chart after a new run:
# --figures figures/ alternative figures root
# --out path.png alternative output path
Per-stage interactive HTML reports (with per-epoch tables) live in
figures/<stage>/index.html.
Quick start
1 — Install Rust
2 — Run
On first run the binary:
- Downloads the model — Qwen3-4B Q4_K_M (~2.5 GB) into ~/.cache/huggingface/hub/
- Downloads training data — wearable sensor datasets from HuggingFace into data/
- Trains all five curriculum stages — checkpoints in results/, plots in ./figures/
Both downloads are skipped automatically on subsequent runs (hf-hub cache).
Other commands
# Train specific stages only
# Evaluate a stage
# Ask a question about a time series (no data needed)
# Download datasets without training
Model options
| File | Size | Notes |
|---|---|---|
| `Qwen3-0.6B-Q4_K_M.gguf` | ~0.4 GB | fastest |
| `Qwen3-1.7B-Q4_K_M.gguf` | ~1.1 GB | lightweight |
| `Qwen3-4B-Q4_K_M.gguf` | ~2.5 GB | default |
| `Qwen3-8B-Q4_K_M.gguf` | ~5.0 GB | best quality, fits 10 GB |
# Pre-download a specific model to the HF cache
# Use a local GGUF file
Set $HF_TOKEN to avoid rate limits or access gated repos:
Metric plots
After each stage finishes, four SVG charts, a metrics.csv, and a self-contained index.html are written to ./figures/<stage>/:
| File | Content |
|---|---|
| `loss_curve.svg` | Train loss (blue) + val loss (red) per epoch |
| `perplexity.svg` | Val perplexity (exp(val_loss)) per epoch |
| `accuracy.svg` | Val token accuracy — fraction of answer positions where argmax = target |
| `macro_recall.svg` | Val macro-averaged recall over all answer token classes |
| `index.html` | Self-contained page: 2 × 2 SVG grid + per-epoch data table |
| `metrics.csv` | Raw numbers: epoch,train_loss,val_loss,val_perplexity,val_accuracy,val_macro_recall |
Token accuracy is computed by comparing argmax(adjusted_logits) to the target
token at each answer position — a strong signal for MCQ stages where the correct
answer is a single token (A/B/C/D).
Macro recall averages per-class recall over every unique target token seen in the validation set. For MCQ this gives recall across the four answer options; for captioning/CoT stages it reflects coverage over the vocabulary subset actually used.
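Both metrics reduce to simple counting over (predicted, target) token pairs. A sketch in plain Rust (hypothetical function names; the real code is in src/training/metrics.rs):

```rust
use std::collections::HashMap;

/// `preds[i]` = argmax(adjusted_logits) at answer position i, `targets[i]` = gold token id.
fn token_accuracy(preds: &[u32], targets: &[u32]) -> f64 {
    let hits = preds.iter().zip(targets).filter(|(p, t)| p == t).count();
    hits as f64 / targets.len() as f64
}

/// Per-class recall averaged over every unique target token seen in the split.
fn macro_recall(preds: &[u32], targets: &[u32]) -> f64 {
    let mut total: HashMap<u32, usize> = HashMap::new(); // occurrences of each target class
    let mut hit: HashMap<u32, usize> = HashMap::new();   // correct predictions per class
    for (&p, &t) in preds.iter().zip(targets) {
        *total.entry(t).or_default() += 1;
        if p == t {
            *hit.entry(t).or_default() += 1;
        }
    }
    let sum: f64 = total
        .iter()
        .map(|(class, &n)| *hit.get(class).unwrap_or(&0) as f64 / n as f64)
        .sum();
    sum / total.len() as f64
}
```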
Cargo features
| Feature | Default | Description |
|---|---|---|
| `download` | ✓ | Built-in dataset downloader (hf-hub + parquet). Disable with `--no-default-features` to skip the Parquet dep. |
| `verbose` | ✗ | Print all cubecl autotune / cache log lines. By default these are capped at WARN so the training progress bar isn't buried under hundreds of "Tuning MatmulAutotuneKey …" messages. |
# Suppress dataset downloader
# Show full cubecl autotune output
# Override log level at runtime without recompiling
RUST_LOG=debug
# Custom level for specific target
RUST_LOG=info,cubecl_runtime=debug
$RUST_LOG is always honoured and takes priority over both --log-level and the
verbose feature flag.
Troubleshooting
| Symptom | Fix |
|---|---|
| Model download fails | Check internet / set $HF_TOKEN |
| Rate-limited by HuggingFace | Set $HF_TOKEN (free account removes limits) |
| Custom cache location | Set $HF_HOME (e.g. export HF_HOME=/data/hf) |
| OOM during training | Add --batch-size 1 |
| Want full dataset | Set MAX_TRAIN_SAMPLES = usize::MAX in config.rs |
| Want more epochs | Increase NUM_EPOCHS in config.rs |
| Too many autotune log lines | Already suppressed by default; build with --features verbose to restore |
| Figures directory missing | Created automatically on first train run |
Memory footprint (Apple M3 Max, 36 GB unified)
| Component | Qwen3-4B Q4_K_M | Notes |
|---|---|---|
| Model weights | ~2.5 GB | Metal, all layers offloaded via llama.cpp |
| KV cache (2 048 ctx) | ~0.3 GB | grows with sequence length |
| Encoder + logit-head | ~0.1 GB | Burn/WGPU |
| AdamW optimiser state | ~0.2 GB | encoder params only |
| Activations (batch=4) | ~0.2 GB | peak during backward pass |
| Total | ~3.3 GB | well inside 10 GB |
Qwen3-8B Q4_K_M raises the total to ~6 GB.
Results layout
results/
└── Qwen3_4B_Q4_K_M/
└── OpenTSLMSP/
├── stage1_mcq/
│ ├── checkpoints/
│ │ ├── best_model.json ← epoch + val_loss metadata
│ │ └── loss_history.txt
│ └── results/
│ └── test_predictions.jsonl
├── stage2_captioning/ …
└── …
./figures/
├── stage1_mcq/
│ ├── metrics.csv
│ ├── loss_curve.svg
│ ├── perplexity.svg
│ ├── accuracy.svg
│ ├── macro_recall.svg
│ └── index.html (self-contained: 2×2 SVG grid + data table)
├── stage2_captioning/ …
└── …
Hyperparameters (src/config.rs)
| Parameter | Default | Notes |
|---|---|---|
| `PATCH_SIZE` | 4 | conv stride and kernel size |
| `ENCODER_OUTPUT_DIM` | 128 | encoder embedding dimension |
| `ENCODER_NUM_HEADS` | 8 | transformer attention heads |
| `ENCODER_NUM_LAYERS` | 6 | transformer depth |
| `ENCODER_FF_DIM` | 1 024 | transformer feed-forward dim |
| `ENCODER_MAX_PATCHES` | 1 024 | max sequence patches |
| `LR_ENCODER` | 2 × 10⁻⁴ | AdamW learning rate |
| `WEIGHT_DECAY` | 0.01 | AdamW weight decay |
| `GRAD_CLIP_NORM` | 1.0 | gradient norm clip threshold |
| `WARMUP_FRAC` | 0.10 | fraction of steps for LR warm-up |
| `NUM_EPOCHS` | 3 | epochs per stage |
| `EARLY_STOP_PAT` | 3 | early-stopping patience |
| `BATCH_SIZE` | 4 | samples per gradient step |
| `MAX_TRAIN_SAMPLES` | 2 000 | per-split cap (usize::MAX for full run) |
| `N_GPU_LAYERS` | 999 | llama.cpp layers on GPU (all) |
| `CTX_SIZE` | 2 048 | KV-cache context window |
| `MAX_EVAL_TOKENS` | 64 | max tokens generated per sample during test evaluation |
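These are plain constants in src/config.rs; an excerpt-style sketch of how a few of them look (types assumed here, values from the table above):

```rust
// Sketch only — values from the table above, types assumed.
pub const PATCH_SIZE: usize = 4;
pub const ENCODER_OUTPUT_DIM: usize = 128;
pub const LR_ENCODER: f64 = 2e-4;
pub const NUM_EPOCHS: usize = 3;
pub const BATCH_SIZE: usize = 4;
pub const MAX_TRAIN_SAMPLES: usize = 2_000; // set to usize::MAX for a full run
```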
Design decisions
Burn 0.20 with fusion disabled
Burn 0.20 ships with burn-wgpu/fusion on by default, which enables
burn-cubecl-fusion — a lazy kernel-fusion backend built on burn-ir. On Metal,
this backend has a Handle::NotInit panic that fires during the backward pass
when weight tensors are shared across multiple graph nodes (e.g. the transformer
parameters appearing in every encoder call).
Fix: default-features = false on the burn dependency prevents the top-level
crate from re-enabling burn-wgpu?/default. The autotune feature is kept
(it is orthogonal to fusion and improves GPU kernel selection) while fusion is
left out entirely. burn-ir and burn-cubecl-fusion do not appear in the
dependency graph.
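A dependency stanza of that shape might look like the following — a sketch only, with the feature list assumed rather than copied from the project's Cargo.toml:

```toml
[dependencies]
# default-features = false keeps burn-wgpu's fusion feature (and burn-cubecl-fusion) out;
# autotune is re-enabled explicitly because it is orthogonal to fusion.
burn = { version = "0.20", default-features = false, features = ["train", "wgpu", "autotune"] }
```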
Single-pass batched encoder
The encoder is called once per batch across all time series from all samples.
An earlier per-sample loop called encoder.forward() once per sample and used
Tensor::stack to combine the results. Even with fusion disabled, combining
tensors from N separate forward passes via stack can confuse burn-ir's handle
lifecycle. The single-pass design allocates one flat [total_ts, T] tensor,
runs the encoder once, then averages per-sample with Tensor::cat on slices —
all handles live in the same graph session.
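In code, the single-pass design has roughly this shape (a sketch with assumed names; the closure stands in for TransformerCnnEncoder::forward):

```rust
use burn::tensor::{backend::Backend, Tensor};

/// `all_series` – one flat [total_ts, T] tensor holding every time series from every sample
/// `counts`     – how many time series belong to each sample
/// Returns one mean-pooled [enc_dim] embedding per sample, all in the same graph session.
fn pooled_embeddings<B: Backend>(
    all_series: Tensor<B, 2>,
    counts: &[usize],
    encoder: impl Fn(Tensor<B, 2>) -> Tensor<B, 2>, // stand-in for the encoder forward pass
) -> Vec<Tensor<B, 1>> {
    let embeddings = encoder(all_series); // single forward pass: [total_ts, enc_dim]
    let mut pooled = Vec::with_capacity(counts.len());
    let mut offset = 0;
    for &n in counts {
        // slice this sample's rows out of the shared graph and average them
        let rows = embeddings.clone().slice([offset..offset + n]);
        pooled.push(rows.mean_dim(0).squeeze(0)); // [1, enc_dim] → [enc_dim]
        offset += n;
    }
    pooled
}
```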
Why llama-cpp-4 instead of a pure-Burn LLM?
A 4B-parameter model in f32 needs ~16 GB before any activations. GGUF Q4_K_M cuts that to ~2.5 GB. llama.cpp also provides the exact tokeniser, Metal offloading on Apple Silicon, and efficient KV-cache management.
Why additive logit bias?
Embedding injection requires splitting the LLM's forward pass at the embedding layer — not possible via the llama-cpp-4 C API. Additive logit bias achieves the same conditioning effect: the encoder boosts answer tokens and suppresses irrelevant vocabulary, with gradients flowing only through the bias head.
Why near-zero init for the logit head?
Linear(128 → 151936) with Kaiming uniform produces biases of magnitude ~10 on
the first forward pass. Added to the LLM's calibrated logits (typically ±5–15),
this immediately corrupts the distribution and produces NaN loss within a few
batches. Normal(mean=0, std=0.01) starts the bias at ~0 — matching the frozen
LLM's unmodified distribution — and lets it grow only as gradient signal accumulates.
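With Burn's nn module this is a one-line change to the layer config. A sketch, assuming the current LinearConfig / Initializer API:

```rust
use burn::nn::{Initializer, Linear, LinearConfig};
use burn::tensor::backend::Backend;

/// Logit-bias head with near-zero weights, so the initial bias ≈ 0.
fn logit_bias_head<B: Backend>(enc_dim: usize, n_vocab: usize, device: &B::Device) -> Linear<B> {
    LinearConfig::new(enc_dim, n_vocab)
        .with_initializer(Initializer::Normal { mean: 0.0, std: 0.01 })
        .init(device)
}
```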
Why hf-hub instead of a custom downloader?
ApiRepo::download() handles HTTP, ETags, on-disk caching, progress bars, and
retries. The parquet crate (with snap for Snappy decompression) reads
HuggingFace's refs/convert/parquet shards row by row.
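For reference, fetching the default GGUF with hf-hub's sync API takes only a few lines (a sketch; the repo and file names are the defaults quoted in this README):

```rust
use hf_hub::api::sync::Api;

/// Returns the local path of the cached GGUF, downloading it on first use.
fn fetch_default_model() -> Result<std::path::PathBuf, Box<dyn std::error::Error>> {
    let repo = Api::new()?.model("unsloth/Qwen3-4B-GGUF".to_string());
    // Cached under ~/.cache/huggingface/hub/ (or $HF_HOME), reused on later runs.
    Ok(repo.get("Qwen3-4B-Q4_K_M.gguf")?)
}
```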
Logging
tracing-subscriber with EnvFilter is used throughout. Without the verbose
feature, cubecl_runtime::tune is capped at WARN so the progress bar is
readable. RUST_LOG always takes precedence over both --log-level and the
feature flag.
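The filter setup amounts to a few lines of tracing-subscriber. A sketch (the exact directives are defined in the real code):

```rust
use tracing_subscriber::EnvFilter;

fn init_logging(verbose: bool) {
    // RUST_LOG, if set, wins; otherwise cap cubecl_runtime::tune at WARN
    // unless the `verbose` feature asked for everything.
    let default = if verbose { "info" } else { "info,cubecl_runtime::tune=warn" };
    let filter =
        EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new(default));
    tracing_subscriber::fmt().with_env_filter(filter).init();
}
```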
Python → Rust mapping
| Python (PyTorch) | Rust |
|---|---|
| `TransformerCNNEncoder` | src/model/encoder.rs |
| `OpenTSLMSP.compute_loss` | src/model/llm/opentslm_sp.rs |
| `OpenTSLMSP.generate` | src/model/llm/opentslm_sp.rs |
| LLM forward pass | src/model/llm/llama_cpp.rs via llama-cpp-4 |
| Tokeniser | built into llama-cpp-4 / GGUF file |
| Dataset download scripts | src/data/downloader.rs (hf-hub + parquet) |
| `TSQADataset` | src/data/tsqa.rs |
| `M4QADataset` | src/data/m4.rs |
| `HARCoTQADataset` | src/data/har.rs |
| `SleepEDFCoTQADataset` | src/data/sleep.rs |
| `ECGQACoTQADataset` | src/data/ecg.rs |
| `CurriculumTrainer` | src/training/curriculum.rs |
| Metric plotting | src/training/metrics.rs (plotters, no Python) |
| `curriculum_learning.py` CLI | src/main.rs |
License
MIT
Citations
Please cite this work if you found it helpful.
@software{kosmyna2026opentslm,
author = {Kosmyna, Nataliya},
title = {opentslm: Pure-Rust implementation of Open Time-Series Language Models (OpenTSLM)},
year = {2026},
url = {https://github.com/nataliyakosmyna/opentslm}
}
Original paper:
@misc{langer2026opentslmtimeserieslanguagemodels,
title={OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data},
author={Patrick Langer and Thomas Kaar and Max Rosenblattl and Maxwell A. Xu and Winnie Chow and Martin Maritsch and Robert Jakob and Ning Wang and Juncheng Liu and Aradhana Verma and Brian Han and Daniel Seung Kim and Henry Chubb and Scott Ceresnak and Aydin Zahedivash and Alexander Tarlochan Singh Sandhu and Fatima Rodriguez and Daniel McDuff and Elgar Fleisch and Oliver Aalami and Filipe Barata and Paul Schmiedmayer},
year={2026},
eprint={2510.02410},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.02410},
}