opentslm

A Rust re-implementation of OpenTSLM using Burn 0.20 + WGPU/Metal for the trainable encoder and llama-cpp-4 for the frozen LLM backbone.

No Python required — datasets, training, inference, and metric plotting all run from a single Rust binary.

Default model: Qwen3-4B Q4_K_M (~2.5 GB) from unsloth/Qwen3-4B-GGUF — fits inside 10 GB RAM.


Architecture

Time series  ──►  TransformerCnnEncoder  ──►  mean-pool  ──►  LogitBiasHead
                  (Burn / WGPU, trainable)                    Linear(enc→vocab)
                                                                     │
                                                              additive bias ▼
Text prompt  ──►  llama.cpp (GGUF, frozen)  ──►  base logits  ──►  adjusted logits
                                                                          │
                                                                cross-entropy loss ▼

Frozen backbone — Qwen3 GGUF loaded via llama-cpp-4. Any quant level works; Q4_K_M is the default.

Trainable components (Burn/WGPU):

  • TransformerCnnEncoder — conv-patch front-end + 6-layer transformer
  • LogitBiasHead (Linear(enc_dim=128 → n_vocab)) — projects mean-pooled embeddings to a per-vocabulary additive logit bias; near-zero init so training starts from the frozen LLM's unmodified distribution

Training signal — for each sample:

  1. Single-pass encoder → pool → head → bias [n_vocab] ← differentiable
  2. llama.cpp forward over [prompt ‖ answer] → base logits at answer positions ← constant
  3. adjusted = base_logits + bias → cross-entropy vs answer tokens
  4. Gradients flow only through encoder + logit-head; Qwen3 is never modified
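
For concreteness, a dependency-free sketch of step 3 on plain slices (names are illustrative; the real code operates on Burn tensors over the full n_vocab):

/// Sketch: add the encoder's bias to the frozen LLM's logits and take
/// cross-entropy against the answer tokens (mean NLL per position).
fn cross_entropy_with_bias(
    base_logits: &[Vec<f32>], // [answer_len][n_vocab], from llama.cpp (constant)
    bias: &[f32],             // [n_vocab], from the encoder head (differentiable)
    targets: &[usize],        // answer token ids
) -> f32 {
    let mut nll = 0.0;
    for (logits, &t) in base_logits.iter().zip(targets) {
        let adjusted: Vec<f32> = logits.iter().zip(bias).map(|(l, b)| l + b).collect();
        // numerically stable log-softmax at the target index
        let max = adjusted.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let log_z = adjusted.iter().map(|x| (x - max).exp()).sum::<f32>().ln() + max;
        nll += log_z - adjusted[t];
    }
    nll / targets.len() as f32
}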

File layout

src/
├── main.rs                        CLI (train / eval / infer / download-data)
├── config.rs                      Hyper-parameters and default paths
├── data/
│   ├── batch.rs                   Sample struct, collate, padding
│   ├── downloader.rs              Built-in wearable dataset downloader  [feature: download]
│   ├── tsqa.rs                    TSQA multiple-choice QA loader
│   ├── m4.rs                      M4 captioning loader
│   ├── har.rs                     HAR chain-of-thought loader
│   ├── sleep.rs                   SleepEDF CoT loader
│   └── ecg.rs                     ECG QA CoT loader
├── model/
│   ├── encoder.rs                 TransformerCnnEncoder (Burn, trainable)
│   ├── projector.rs               MlpProjector (reserved for future stages)
│   └── llm/
│       ├── llama_cpp.rs           LlamaCppBackend — GGUF loader, tokeniser,
│       │                          answer_logits(), generate()
│       └── opentslm_sp.rs         OpenTslmSp — encode_to_logit_bias(),
│                                  compute_loss(), compute_loss_and_metrics(),
│                                  generate()
└── training/
    ├── mod.rs
    ├── curriculum.rs              Five-stage curriculum trainer
    └── metrics.rs                 EpochMetrics, StageMetrics, SVG/PNG plotting
./figures/                           ← written at runtime, one sub-dir per stage
└── <stage>/
    ├── metrics.csv
    ├── loss_curve.svg
    ├── perplexity.svg
    ├── accuracy.svg
    ├── macro_recall.svg
    └── index.html                 (self-contained: 2×2 SVG grid + data table)

Curriculum stages

Stage               Task                                Data source
stage1_mcq          Multiple-choice QA on time series   WISDM-W wrist accel → MCQ
stage2_captioning   Time-series captioning              WISDM-W wrist accel → captions
stage3_cot          HAR chain-of-thought                WISDM-W wrist accel + gyro
stage4_sleep_cot    Sleep-stage CoT                     SleepEDF Fpz-Cz EEG
stage5_ecg_cot      ECG arrhythmia CoT                  Synthetic 2-lead ECG

Each stage loads the best checkpoint from the previous stage, trains with AdamW + linear warm-up + early stopping, writes test_predictions.jsonl, and saves metric plots to ./figures/<stage>/.
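
The warm-up can be illustrated as follows (the helper name is hypothetical; see WARMUP_FRAC and LR_ENCODER in the hyperparameter table below):

// Hypothetical linear warm-up: ramp from ~0 to base_lr over the first
// WARMUP_FRAC of optimiser steps, then hold constant.
fn lr_at(step: usize, total_steps: usize, base_lr: f64, warmup_frac: f64) -> f64 {
    let warmup_steps = (total_steps as f64 * warmup_frac).max(1.0);
    base_lr * (((step + 1) as f64 / warmup_steps).min(1.0))
}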


Results — Qwen3-4B Q4_K_M, 3 epochs, 2 000 train samples/stage

Curriculum overview

Each colour is one stage (blue → green → orange → purple → red). Dashed vertical lines mark stage boundaries on the shared global-epoch axis. Accuracy and recall panels start from stage 3 — stages 1–2 use free-form generation targets (MCQ options / captions), so argmax accuracy is not meaningful.

Stage               Val loss   Perplexity   Token accuracy   Macro recall
stage1_mcq          0.5017     1.65         –                –
stage2_captioning   1.3232     3.76         –                –
stage3_cot          1.9117     6.76         52.4 %           44.8 %
stage4_sleep_cot    1.8791     6.55         51.6 %           49.1 %
stage5_ecg_cot      1.3461     3.84         64.2 %           62.1 %

To regenerate the chart after a new run:

python3 plot_overview.py
# --figures figures/   alternative figures root
# --out     path.png   alternative output path

Per-stage interactive HTML reports (with per-epoch tables) live in figures/<stage>/index.html.


Quick start

1 — Install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

2 — Run

cd opentslm
cargo run --release -- train

On first run the binary:

  1. Downloads the model — Qwen3-4B Q4_K_M (~2.5 GB) into ~/.cache/huggingface/hub/
  2. Downloads training data — wearable sensor datasets from HuggingFace into data/
  3. Trains all five curriculum stages — checkpoints in results/, plots in ./figures/

Both downloads are skipped automatically on subsequent runs (hf-hub cache).

Other commands

# Train specific stages only
cargo run --release -- train --stages stage1_mcq stage2_captioning

# Evaluate a stage
cargo run --release -- eval --stage stage3_cot

# Ask a question about a time series (no data needed)
cargo run --release -- infer \
    --series "0.1,0.2,0.4,0.8,0.7,0.5,0.3,0.1" \
    --prompt "What activity does this wrist sensor show?"

# Download datasets without training
cargo run --release -- download-data
cargo run --release -- download-data --limit 500   # quick smoke-test
cargo run --release -- download-data --only har    # single source

Model options

File                     Size      Notes
Qwen3-0.6B-Q4_K_M.gguf   ~0.4 GB   fastest
Qwen3-1.7B-Q4_K_M.gguf   ~1.1 GB   lightweight
Qwen3-4B-Q4_K_M.gguf     ~2.5 GB   default
Qwen3-8B-Q4_K_M.gguf     ~5.0 GB   best quality, fits 10 GB

# Pre-download a specific model to the HF cache
huggingface-cli download unsloth/Qwen3-1.7B-GGUF Qwen3-1.7B-Q4_K_M.gguf

# Use a local GGUF file
cargo run --release -- train --model /path/to/model.gguf

Set $HF_TOKEN to avoid rate limits or access gated repos:

export HF_TOKEN=hf_...

Metric plots

After each stage finishes, four SVG charts and one PNG summary are written to ./figures/<stage>/:

File               Content
loss_curve.svg     Train loss (blue) + val loss (red) per epoch
perplexity.svg     Val perplexity (exp(val_loss)) per epoch
accuracy.svg       Val token accuracy — fraction of answer positions where argmax = target
macro_recall.svg   Val macro-averaged recall over all answer token classes
index.html         Self-contained page: 2 × 2 SVG grid + per-epoch data table
metrics.csv        Raw numbers: epoch,train_loss,val_loss,val_perplexity,val_accuracy,val_macro_recall

Token accuracy is computed by comparing argmax(adjusted_logits) to the target token at each answer position — a strong signal for MCQ stages where the correct answer is a single token (A/B/C/D).

Macro recall averages per-class recall over every unique target token seen in the validation set. For MCQ this gives recall across the four answer options; for captioning/CoT stages it reflects coverage over the vocabulary subset actually used.
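
A dependency-free sketch of both metrics (the helper name is hypothetical; the real implementation lives in src/training/metrics.rs):

use std::collections::HashMap;

/// Sketch: token accuracy and macro recall over all answer positions.
/// `preds` holds argmax(adjusted_logits) per position, `targets` the answer tokens.
fn accuracy_and_macro_recall(preds: &[usize], targets: &[usize]) -> (f64, f64) {
    let correct = preds.iter().zip(targets).filter(|(p, t)| p == t).count();
    let accuracy = correct as f64 / targets.len() as f64;

    // per-class (hits, total) counts over every unique target token
    let mut counts: HashMap<usize, (usize, usize)> = HashMap::new();
    for (&p, &t) in preds.iter().zip(targets) {
        let e = counts.entry(t).or_insert((0, 0));
        e.1 += 1;
        if p == t {
            e.0 += 1;
        }
    }
    let macro_recall = counts
        .values()
        .map(|&(hits, total)| hits as f64 / total as f64)
        .sum::<f64>()
        / counts.len() as f64;
    (accuracy, macro_recall)
}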


Cargo features

Feature    Default   Description
download   on        Built-in dataset downloader (hf-hub + parquet). Disable with --no-default-features to skip the Parquet dep.
verbose    off       Print all cubecl autotune / cache log lines. By default these are capped at WARN so the training progress bar isn't buried under hundreds of "Tuning MatmulAutotuneKey …" messages.

# Build without the dataset downloader feature
cargo build --release --no-default-features

# Show full cubecl autotune output
cargo run --release --features verbose -- train

# Override log level at runtime without recompiling
RUST_LOG=debug cargo run --release -- train

# Custom level for specific target
RUST_LOG=info,cubecl_runtime=debug cargo run --release -- train

$RUST_LOG is always honoured and takes priority over both --log-level and the verbose feature flag.


Troubleshooting

Symptom                       Fix
Model download fails          Check internet / set $HF_TOKEN
Rate-limited by HuggingFace   Set $HF_TOKEN (free account removes limits)
Custom cache location         Set $HF_HOME (e.g. export HF_HOME=/data/hf)
OOM during training           Add --batch-size 1
Want full dataset             Set MAX_TRAIN_SAMPLES = usize::MAX in config.rs
Want more epochs              Increase NUM_EPOCHS in config.rs
Too many autotune log lines   Already suppressed by default; build with --features verbose to restore
Figures directory missing     Created automatically on first train run

Memory footprint (Apple M3 Max, 36 GB unified)

Component               Qwen3-4B Q4_K_M   Notes
Model weights           ~2.5 GB           Metal, all layers offloaded via llama.cpp
KV cache (2 048 ctx)    ~0.3 GB           grows with sequence length
Encoder + logit-head    ~0.1 GB           Burn/WGPU
AdamW optimiser state   ~0.2 GB           encoder params only
Activations (batch=4)   ~0.2 GB           peak during backward pass
Total                   ~3.3 GB           well inside 10 GB

Qwen3-8B Q4_K_M raises the total to ~6 GB.


Results layout

results/
└── Qwen3_4B_Q4_K_M/
    └── OpenTSLMSP/
        ├── stage1_mcq/
        │   ├── checkpoints/
        │   │   ├── best_model.json      ← epoch + val_loss metadata
        │   │   └── loss_history.txt
        │   └── results/
        │       └── test_predictions.jsonl
        ├── stage2_captioning/  …
        └── …

./figures/
├── stage1_mcq/
│   ├── metrics.csv
│   ├── loss_curve.svg
│   ├── perplexity.svg
│   ├── accuracy.svg
│   ├── macro_recall.svg
│   └── index.html                 (self-contained: 2×2 SVG grid + data table)
├── stage2_captioning/  …
└── …

Hyperparameters (src/config.rs)

Parameter             Default    Notes
PATCH_SIZE            4          conv stride and kernel size
ENCODER_OUTPUT_DIM    128        encoder embedding dimension
ENCODER_NUM_HEADS     8          transformer attention heads
ENCODER_NUM_LAYERS    6          transformer depth
ENCODER_FF_DIM        1 024      transformer feed-forward dim
ENCODER_MAX_PATCHES   1 024      max sequence patches
LR_ENCODER            2 × 10⁻⁴   AdamW learning rate
WEIGHT_DECAY          0.01       AdamW weight decay
GRAD_CLIP_NORM        1.0        gradient norm clip threshold
WARMUP_FRAC           0.10       fraction of steps for LR warm-up
NUM_EPOCHS            3          epochs per stage
EARLY_STOP_PAT        3          early-stopping patience
BATCH_SIZE            4          samples per gradient step
MAX_TRAIN_SAMPLES     2 000      per-split cap (usize::MAX for full run)
N_GPU_LAYERS          999        llama.cpp layers on GPU (all)
CTX_SIZE              2 048      KV-cache context window
MAX_EVAL_TOKENS       64         max tokens generated per sample during test evaluation
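
These map onto plain constants; a hypothetical excerpt of src/config.rs (the actual file may organise them differently):

// Hypothetical excerpt; values mirror the table above.
pub const PATCH_SIZE: usize = 4;
pub const ENCODER_OUTPUT_DIM: usize = 128;
pub const LR_ENCODER: f64 = 2e-4;
pub const WEIGHT_DECAY: f64 = 0.01;
pub const NUM_EPOCHS: usize = 3;
pub const BATCH_SIZE: usize = 4;
pub const MAX_TRAIN_SAMPLES: usize = 2_000; // set to usize::MAX for a full run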

Design decisions

Burn 0.20 with fusion disabled

Burn 0.20 ships with burn-wgpu/fusion on by default, which enables burn-cubecl-fusion — a lazy kernel-fusion backend built on burn-ir. On Metal, this backend has a Handle::NotInit panic that fires during the backward pass when weight tensors are shared across multiple graph nodes (e.g. the transformer parameters appearing in every encoder call).

Fix: default-features = false on the burn dependency prevents the top-level crate from re-enabling burn-wgpu?/default. The autotune feature is kept (it is orthogonal to fusion and improves GPU kernel selection) while fusion is left out entirely. burn-ir and burn-cubecl-fusion do not appear in the dependency graph.
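
In Cargo.toml terms this looks roughly like the following (the exact feature list is an assumption; check the real manifest):

[dependencies]
# fusion stays off; autotune is re-enabled explicitly (sketch only)
burn = { version = "0.20", default-features = false, features = ["train", "wgpu", "autotune"] }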

Single-pass batched encoder

The encoder is called once per batch across all time series from all samples. An earlier per-sample loop called encoder.forward() once per sample and used Tensor::stack to combine the results. Even with fusion disabled, combining tensors from N separate forward passes via stack can confuse burn-ir's handle lifecycle. The single-pass design allocates one flat [total_ts, T] tensor, runs the encoder once, then averages per-sample with Tensor::cat on slices — all handles live in the same graph session.
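
A Burn-style sketch of the single-pass design (the encoder's actual forward signature in encoder.rs is an assumption):

use burn::tensor::{backend::Backend, Tensor};
use std::ops::Range;

// All time series from all samples are rows of one [total_ts, T] tensor;
// the encoder runs once, then per-sample means are taken over row ranges.
fn encode_batch<B: Backend>(
    encoder: &TransformerCnnEncoder<B>,
    flat: Tensor<B, 2>,          // [total_ts, T]
    per_sample: &[Range<usize>], // row range owned by each sample
) -> Tensor<B, 2> {              // [batch, enc_dim]
    let enc = encoder.forward(flat); // one forward pass, one graph
    let pooled: Vec<Tensor<B, 2>> = per_sample
        .iter()
        .map(|r| enc.clone().slice([r.clone()]).mean_dim(0)) // [1, enc_dim]
        .collect();
    Tensor::cat(pooled, 0) // all handles stay in the same session
}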

Why llama-cpp-4 instead of a pure-Burn LLM?

A 4B-parameter model in f32 needs ~16 GB before any activations. GGUF Q4_K_M cuts that to ~2.5 GB. llama.cpp also provides the exact tokeniser, Metal offloading on Apple Silicon, and efficient KV-cache management.

Why additive logit bias?

Embedding injection requires splitting the LLM's forward pass at the embedding layer — not possible via the llama-cpp-4 C API. Additive logit bias achieves the same conditioning effect: the encoder boosts answer tokens and suppresses irrelevant vocabulary, with gradients flowing only through the bias head.

Why near-zero init for the logit head?

Linear(128 → 151936) with Kaiming uniform produces biases of magnitude ~10 on the first forward pass. Added to the LLM's calibrated logits (typically ±5–15), this immediately corrupts the distribution and produces NaN loss within a few batches. Normal(mean=0, std=0.01) starts the bias at ~0 — matching the frozen LLM's unmodified distribution — and lets it grow only as gradient signal accumulates.
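
A sketch of this init using Burn's Initializer (assuming the standard LinearConfig API; the real code in opentslm_sp.rs may differ):

use burn::nn::{Initializer, Linear, LinearConfig};
use burn::tensor::backend::Backend;

// Near-zero init: Normal(0, 0.01) instead of the Kaiming-uniform default,
// so the head contributes ~0 bias on the first forward pass.
fn logit_bias_head<B: Backend>(device: &B::Device, n_vocab: usize) -> Linear<B> {
    LinearConfig::new(128, n_vocab)
        .with_initializer(Initializer::Normal { mean: 0.0, std: 0.01 })
        .init(device)
}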

Why hf-hub instead of a custom downloader?

ApiRepo::download() handles HTTP, ETags, on-disk caching, progress bars, and retries. The parquet crate (with snap for Snappy decompression) reads HuggingFace's refs/convert/parquet shards row by row.
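
For illustration, resolving the default GGUF through hf-hub's blocking API looks roughly like this (a sketch; the binary's actual wiring may differ):

use hf_hub::api::sync::{Api, ApiError};
use std::path::PathBuf;

// Checks the local HF cache first (respecting $HF_HOME) and only
// downloads when the file is missing.
fn fetch_default_model() -> Result<PathBuf, ApiError> {
    let api = Api::new()?;
    let repo = api.model("unsloth/Qwen3-4B-GGUF".to_string());
    repo.get("Qwen3-4B-Q4_K_M.gguf")
}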

Logging

tracing-subscriber with EnvFilter is used throughout. Without the verbose feature, cubecl_runtime::tune is capped at WARN so the progress bar is readable. RUST_LOG always takes precedence over both --log-level and the feature flag.
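
A sketch of that setup (assuming tracing-subscriber's env-filter feature; CLI flag wiring omitted):

use tracing_subscriber::EnvFilter;

fn init_logging() {
    // $RUST_LOG wins when set; otherwise cap cubecl_runtime::tune at WARN
    // so autotune messages don't bury the progress bar.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,cubecl_runtime::tune=warn"));
    tracing_subscriber::fmt().with_env_filter(filter).init();
}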


Python → Rust mapping

Python (PyTorch)             Rust
TransformerCNNEncoder        src/model/encoder.rs
OpenTSLMSP.compute_loss      src/model/llm/opentslm_sp.rs
OpenTSLMSP.generate          src/model/llm/opentslm_sp.rs
LLM forward pass             src/model/llm/llama_cpp.rs via llama-cpp-4
Tokeniser                    built into llama-cpp-4 / GGUF file
Dataset download scripts     src/data/downloader.rs (hf-hub + parquet)
TSQADataset                  src/data/tsqa.rs
M4QADataset                  src/data/m4.rs
HARCoTQADataset              src/data/har.rs
SleepEDFCoTQADataset         src/data/sleep.rs
ECGQACoTQADataset            src/data/ecg.rs
CurriculumTrainer            src/training/curriculum.rs
Metric plotting              src/training/metrics.rs (plotters, no Python)
curriculum_learning.py CLI   src/main.rs

License

MIT

Citations

Please cite this work if you found it helpful.

@software{kosmyna2026opentslm,
  author = {Kosmyna, Nataliya},
  title  = {opentslm: Pure-Rust implementation of Open Time-Series Language Models (OpenTSLM)},
  year   = {2026},
  url    = {https://github.com/nataliyakosmyna/opentslm}
}

Original paper:

@misc{langer2026opentslmtimeserieslanguagemodels,
  author        = {Patrick Langer and Thomas Kaar and Max Rosenblattl and Maxwell A. Xu and Winnie Chow and Martin Maritsch and Robert Jakob and Ning Wang and Juncheng Liu and Aradhana Verma and Brian Han and Daniel Seung Kim and Henry Chubb and Scott Ceresnak and Aydin Zahedivash and Alexander Tarlochan Singh Sandhu and Fatima Rodriguez and Daniel McDuff and Elgar Fleisch and Oliver Aalami and Filipe Barata and Paul Schmiedmayer},
  title         = {OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data},
  year          = {2026},
  eprint        = {2510.02410},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2510.02410}
}