opentslm
A Rust re-implementation of OpenTSLM using Burn 0.20 + WGPU/Metal for the trainable encoder and llama-cpp-4 for the frozen LLM backbone.
No Python required — datasets, training, inference, and metric plotting all run from a single Rust binary.
Default model: Qwen3-4B Q4_K_M (~2.5 GB) from unsloth/Qwen3-4B-GGUF — fits inside 10 GB RAM.
Architecture
```
Time series ──► TransformerCnnEncoder ──► mean-pool ──► LogitBiasHead
                (Burn / WGPU, trainable)                Linear(enc→vocab)
                                                              │
                                                additive bias ▼
Text prompt ──► llama.cpp (GGUF, frozen) ──► base logits ──► adjusted logits
                                                              │
                                                              ▼
                                                    cross-entropy loss
```
Frozen backbone — Qwen3 GGUF loaded via llama-cpp-4. Any quant level works; Q4_K_M is the default.
Trainable components (Burn/WGPU):
- TransformerCnnEncoder — conv-patch front-end + 6-layer transformer
- LogitBiasHead (Linear(enc_dim=128 → n_vocab)) — projects mean-pooled embeddings to a per-vocabulary-token additive logit bias; near-zero init so training starts from the frozen LLM's unmodified distribution
Training signal — for each sample:
- Single-pass encoder → pool → head → bias [n_vocab] ← differentiable
- llama.cpp forward over [prompt ‖ answer] → base logits at answer positions ← constant
- adjusted = base_logits + bias → cross-entropy vs answer tokens (sketched below)
- Gradients flow only through encoder + logit-head; Qwen3 is never modified
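The bias-plus-cross-entropy step looks roughly like the following. This is a minimal sketch with hypothetical names, assuming the llama.cpp base logits at the answer positions have already been copied into a constant Burn tensor; the real implementation lives in src/model/llm/opentslm_sp.rs.

```rust
use burn::tensor::{activation::log_softmax, backend::Backend, Int, Tensor};

/// bias        – [n_vocab] additive bias from the trainable logit head (differentiable)
/// base_logits – [n_answer, n_vocab] frozen-LLM logits at the answer positions (constant)
/// targets     – [n_answer] gold answer token ids
fn answer_loss<B: Backend>(
    bias: Tensor<B, 1>,
    base_logits: Tensor<B, 2>,
    targets: Tensor<B, 1, Int>,
) -> Tensor<B, 1> {
    let [n_answer, _n_vocab] = base_logits.dims();
    // adjusted = base_logits + bias, with the bias broadcast over all answer positions
    let adjusted = base_logits + bias.unsqueeze::<2>();
    // token-level cross-entropy: -mean log p(target) over the answer positions
    let log_probs = log_softmax(adjusted, 1);
    let picked = log_probs.gather(1, targets.reshape([n_answer, 1]));
    picked.mean().neg()
}
```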
File layout
src/
├── main.rs CLI (train / eval / infer / download-data)
├── config.rs Hyper-parameters and default paths
├── data/
│ ├── batch.rs Sample struct, collate, padding
│ ├── downloader.rs Built-in wearable dataset downloader [feature: download]
│ ├── tsqa.rs TSQA multiple-choice QA loader
│ ├── m4.rs M4 captioning loader
│ ├── har.rs HAR chain-of-thought loader
│ ├── sleep.rs SleepEDF CoT loader
│ └── ecg.rs ECG QA CoT loader
├── model/
│ ├── encoder.rs TransformerCnnEncoder (Burn, trainable)
│ ├── projector.rs MlpProjector (reserved for future stages)
│ └── llm/
│ ├── llama_cpp.rs LlamaCppBackend — GGUF loader, tokeniser,
│ │ answer_logits(), generate()
│ └── opentslm_sp.rs OpenTslmSp — encode_to_logit_bias(),
│ compute_loss(), compute_loss_and_metrics(),
│ generate()
└── training/
├── mod.rs
├── curriculum.rs Five-stage curriculum trainer
└── metrics.rs EpochMetrics, StageMetrics, SVG/PNG plotting
./figures/ ← written at runtime, one sub-dir per stage
└── <stage>/
├── metrics.csv
├── loss_curve.svg
├── perplexity.svg
├── accuracy.svg
├── macro_recall.svg
└── index.html (self-contained: 2×2 SVG grid + data table)
Curriculum stages
| Stage | Task | Data source |
|---|---|---|
| `stage1_mcq` | Multiple-choice QA on time series | WISDM-W wrist accel → MCQ |
| `stage2_captioning` | Time-series captioning | WISDM-W wrist accel → captions |
| `stage3_cot` | HAR chain-of-thought | WISDM-W wrist accel + gyro |
| `stage4_sleep_cot` | Sleep-stage CoT | SleepEDF Fpz-Cz EEG |
| `stage5_ecg_cot` | ECG arrhythmia CoT | Synthetic 2-lead ECG |
Each stage loads the best checkpoint from the previous stage, trains with AdamW +
linear warm-up + early stopping, writes test_predictions.jsonl, and saves metric
plots to ./figures/<stage>/.
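The linear warm-up is simple enough to spell out. A plain-Rust sketch of the schedule (hypothetical helper; the real trainer in src/training/curriculum.rs may implement it differently):

```rust
/// Learning rate at a given global step: linear ramp over the first
/// `warmup_frac` of training, then constant at `base_lr`.
fn lr_at_step(step: usize, total_steps: usize, base_lr: f64, warmup_frac: f64) -> f64 {
    let warmup_steps = ((total_steps as f64) * warmup_frac).ceil().max(1.0) as usize;
    if step < warmup_steps {
        base_lr * (step + 1) as f64 / warmup_steps as f64
    } else {
        base_lr
    }
}
```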
Results — Qwen3-4B Q4_K_M, 3 epochs, 2 000 train samples/stage

Each colour is one stage (blue → green → orange → purple → red). Dashed vertical lines mark stage boundaries on the shared global-epoch axis. Accuracy and recall panels start from stage 3 — stages 1–2 use free-form generation targets (MCQ options / captions), so argmax accuracy is not meaningful.
| Stage | Val loss | Perplexity | Token accuracy | Macro recall |
|---|---|---|---|---|
| stage1_mcq | 0.5017 | 1.65 | — | — |
| stage2_captioning | 1.3232 | 3.76 | — | — |
| stage3_cot | 1.9117 | 6.76 | 52.4 % | 44.8 % |
| stage4_sleep_cot | 1.8791 | 6.55 | 51.6 % | 49.1 % |
| stage5_ecg_cot | 1.3461 | 3.84 | 64.2 % | 62.1 % |
To regenerate the chart after a new run:
# --figures figures/ alternative figures root
# --out path.png alternative output path
Per-stage interactive HTML reports (with per-epoch tables) live in
figures/<stage>/index.html.
Quick start
1 — Install Rust
2 — Run
On first run the binary:
- Downloads the model — Qwen3-4B Q4_K_M (~2.5 GB) into ~/.cache/huggingface/hub/
- Downloads training data — wearable sensor datasets from HuggingFace into data/
- Trains all five curriculum stages — checkpoints in results/, plots in ./figures/
Both downloads are skipped automatically on subsequent runs (hf-hub cache).
Other commands
# Train specific stages only
# Evaluate a stage
# Ask a question about a time series (no data needed)
# Download datasets without training
Model options
| File | Size | Notes |
|---|---|---|
| `Qwen3-0.6B-Q4_K_M.gguf` | ~0.4 GB | fastest |
| `Qwen3-1.7B-Q4_K_M.gguf` | ~1.1 GB | lightweight |
| `Qwen3-4B-Q4_K_M.gguf` | ~2.5 GB | default |
| `Qwen3-8B-Q4_K_M.gguf` | ~5.0 GB | best quality, fits 10 GB |
# Pre-download a specific model to the HF cache
# Use a local GGUF file
Set $HF_TOKEN to avoid rate limits or access gated repos:
Metric plots
After each stage finishes, four SVG charts, a metrics.csv, and a self-contained index.html are written to ./figures/<stage>/:
| File | Content |
|---|---|
| `loss_curve.svg` | Train loss (blue) + val loss (red) per epoch |
| `perplexity.svg` | Val perplexity (exp(val_loss)) per epoch |
| `accuracy.svg` | Val token accuracy — fraction of answer positions where argmax = target |
| `macro_recall.svg` | Val macro-averaged recall over all answer token classes |
| `index.html` | Self-contained page: 2 × 2 SVG grid + per-epoch data table |
| `metrics.csv` | Raw numbers: epoch,train_loss,val_loss,val_perplexity,val_accuracy,val_macro_recall |
Token accuracy is computed by comparing argmax(adjusted_logits) to the target
token at each answer position — a strong signal for MCQ stages where the correct
answer is a single token (A/B/C/D).
Macro recall averages per-class recall over every unique target token seen in the validation set. For MCQ this gives recall across the four answer options; for captioning/CoT stages it reflects coverage over the vocabulary subset actually used.
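Both metrics reduce to simple counting over (predicted, target) token pairs. A sketch in plain Rust (hypothetical function names; the real code is in src/training/metrics.rs):

```rust
use std::collections::HashMap;

/// `preds[i]` = argmax(adjusted_logits) at answer position i, `targets[i]` = gold token id.
fn token_accuracy(preds: &[u32], targets: &[u32]) -> f64 {
    let hits = preds.iter().zip(targets).filter(|(p, t)| p == t).count();
    hits as f64 / targets.len() as f64
}

/// Per-class recall averaged over every unique target token seen in the split.
fn macro_recall(preds: &[u32], targets: &[u32]) -> f64 {
    let mut total: HashMap<u32, usize> = HashMap::new(); // occurrences of each target class
    let mut hit: HashMap<u32, usize> = HashMap::new();   // correct predictions per class
    for (&p, &t) in preds.iter().zip(targets) {
        *total.entry(t).or_default() += 1;
        if p == t {
            *hit.entry(t).or_default() += 1;
        }
    }
    let sum: f64 = total
        .iter()
        .map(|(class, &n)| *hit.get(class).unwrap_or(&0) as f64 / n as f64)
        .sum();
    sum / total.len() as f64
}
```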
Cargo features
| Feature | Default | Description |
|---|---|---|
| `download` | ✓ | Built-in dataset downloader (hf-hub + parquet). Disable with `--no-default-features` to skip the Parquet dep. |
| `verbose` | ✗ | Print all cubecl autotune / cache log lines. By default these are capped at WARN so the training progress bar isn't buried under hundreds of "Tuning MatmulAutotuneKey …" messages. |
# Suppress dataset downloader
# Show full cubecl autotune output
# Override log level at runtime without recompiling
RUST_LOG=debug
# Custom level for specific target
RUST_LOG=info,cubecl_runtime=debug
$RUST_LOG is always honoured and takes priority over both --log-level and the
verbose feature flag.
Troubleshooting
| Symptom | Fix |
|---|---|
| Model download fails | Check internet / set $HF_TOKEN |
| Rate-limited by HuggingFace | Set $HF_TOKEN (free account removes limits) |
| Custom cache location | Set $HF_HOME (e.g. export HF_HOME=/data/hf) |
| OOM during training | Add --batch-size 1 |
| Want full dataset | Set MAX_TRAIN_SAMPLES = usize::MAX in config.rs |
| Want more epochs | Increase NUM_EPOCHS in config.rs |
| Too many autotune log lines | Already suppressed by default; build with --features verbose to restore |
| Figures directory missing | Created automatically on first train run |
Memory footprint (Apple M3 Max, 36 GB unified)
| Component | Qwen3-4B Q4_K_M | Notes |
|---|---|---|
| Model weights | ~2.5 GB | Metal, all layers offloaded via llama.cpp |
| KV cache (2 048 ctx) | ~0.3 GB | grows with sequence length |
| Encoder + logit-head | ~0.1 GB | Burn/WGPU |
| AdamW optimiser state | ~0.2 GB | encoder params only |
| Activations (batch=4) | ~0.2 GB | peak during backward pass |
| Total | ~3.3 GB | well inside 10 GB |
Qwen3-8B Q4_K_M raises the total to ~6 GB.
Results layout
results/
└── Qwen3_4B_Q4_K_M/
└── OpenTSLMSP/
├── stage1_mcq/
│ ├── checkpoints/
│ │ ├── best_model.json ← epoch + val_loss metadata
│ │ └── loss_history.txt
│ └── results/
│ └── test_predictions.jsonl
├── stage2_captioning/ …
└── …
./figures/
├── stage1_mcq/
│ ├── metrics.csv
│ ├── loss_curve.svg
│ ├── perplexity.svg
│ ├── accuracy.svg
│ ├── macro_recall.svg
│ └── index.html (self-contained: 2×2 SVG grid + data table)
├── stage2_captioning/ …
└── …
Hyperparameters (src/config.rs)
| Parameter | Default | Notes |
|---|---|---|
| `PATCH_SIZE` | 4 | conv stride and kernel size |
| `ENCODER_OUTPUT_DIM` | 128 | encoder embedding dimension |
| `ENCODER_NUM_HEADS` | 8 | transformer attention heads |
| `ENCODER_NUM_LAYERS` | 6 | transformer depth |
| `ENCODER_FF_DIM` | 1 024 | transformer feed-forward dim |
| `ENCODER_MAX_PATCHES` | 1 024 | max sequence patches |
| `LR_ENCODER` | 2 × 10⁻⁴ | AdamW learning rate |
| `WEIGHT_DECAY` | 0.01 | AdamW weight decay |
| `GRAD_CLIP_NORM` | 1.0 | gradient norm clip threshold |
| `WARMUP_FRAC` | 0.10 | fraction of steps for LR warm-up |
| `NUM_EPOCHS` | 3 | epochs per stage |
| `EARLY_STOP_PAT` | 3 | early-stopping patience |
| `BATCH_SIZE` | 4 | samples per gradient step |
| `MAX_TRAIN_SAMPLES` | 2 000 | per-split cap (usize::MAX for full run) |
| `N_GPU_LAYERS` | 999 | llama.cpp layers on GPU (all) |
| `CTX_SIZE` | 2 048 | KV-cache context window |
| `MAX_EVAL_TOKENS` | 64 | max tokens generated per sample during test evaluation |
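These are plain constants in src/config.rs; an excerpt-style sketch of how a few of them look (types assumed here, values from the table above):

```rust
// Sketch only — values from the table above, types assumed.
pub const PATCH_SIZE: usize = 4;
pub const ENCODER_OUTPUT_DIM: usize = 128;
pub const LR_ENCODER: f64 = 2e-4;
pub const NUM_EPOCHS: usize = 3;
pub const BATCH_SIZE: usize = 4;
pub const MAX_TRAIN_SAMPLES: usize = 2_000; // set to usize::MAX for a full run
```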
Design decisions
Burn 0.20 with fusion disabled
Burn 0.20 ships with burn-wgpu/fusion on by default, which enables
burn-cubecl-fusion — a lazy kernel-fusion backend built on burn-ir. On Metal,
this backend has a Handle::NotInit panic that fires during the backward pass
when weight tensors are shared across multiple graph nodes (e.g. the transformer
parameters appearing in every encoder call).
Fix: default-features = false on the burn dependency prevents the top-level
crate from re-enabling burn-wgpu?/default. The autotune feature is kept
(it is orthogonal to fusion and improves GPU kernel selection) while fusion is
left out entirely. burn-ir and burn-cubecl-fusion do not appear in the
dependency graph.
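A dependency stanza of that shape might look like the following — a sketch only, with the feature list assumed rather than copied from the project's Cargo.toml:

```toml
[dependencies]
# default-features = false keeps burn-wgpu's fusion feature (and burn-cubecl-fusion) out;
# autotune is re-enabled explicitly because it is orthogonal to fusion.
burn = { version = "0.20", default-features = false, features = ["train", "wgpu", "autotune"] }
```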
Single-pass batched encoder
The encoder is called once per batch across all time series from all samples.
An earlier per-sample loop called encoder.forward() once per sample and used
Tensor::stack to combine the results. Even with fusion disabled, combining
tensors from N separate forward passes via stack can confuse burn-ir's handle
lifecycle. The single-pass design allocates one flat [total_ts, T] tensor,
runs the encoder once, then averages per-sample with Tensor::cat on slices —
all handles live in the same graph session.
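In code, the single-pass design has roughly this shape (a sketch with assumed names; the closure stands in for TransformerCnnEncoder::forward):

```rust
use burn::tensor::{backend::Backend, Tensor};

/// `all_series` – one flat [total_ts, T] tensor holding every time series from every sample
/// `counts`     – how many time series belong to each sample
/// Returns one mean-pooled [enc_dim] embedding per sample, all in the same graph session.
fn pooled_embeddings<B: Backend>(
    all_series: Tensor<B, 2>,
    counts: &[usize],
    encoder: impl Fn(Tensor<B, 2>) -> Tensor<B, 2>, // stand-in for the encoder forward pass
) -> Vec<Tensor<B, 1>> {
    let embeddings = encoder(all_series); // single forward pass: [total_ts, enc_dim]
    let mut pooled = Vec::with_capacity(counts.len());
    let mut offset = 0;
    for &n in counts {
        // slice this sample's rows out of the shared graph and average them
        let rows = embeddings.clone().slice([offset..offset + n]);
        pooled.push(rows.mean_dim(0).squeeze(0)); // [1, enc_dim] → [enc_dim]
        offset += n;
    }
    pooled
}
```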
Why llama-cpp-4 instead of a pure-Burn LLM?
A 4B-parameter model in f32 needs ~16 GB before any activations. GGUF Q4_K_M cuts that to ~2.5 GB. llama.cpp also provides the exact tokeniser, Metal offloading on Apple Silicon, and efficient KV-cache management.
Why additive logit bias?
Embedding injection requires splitting the LLM's forward pass at the embedding layer — not possible via the llama-cpp-4 C API. Additive logit bias achieves the same conditioning effect: the encoder boosts answer tokens and suppresses irrelevant vocabulary, with gradients flowing only through the bias head.
Why near-zero init for the logit head?
Linear(128 → 151936) with Kaiming uniform produces biases of magnitude ~10 on
the first forward pass. Added to the LLM's calibrated logits (typically ±5–15),
this immediately corrupts the distribution and produces NaN loss within a few
batches. Normal(mean=0, std=0.01) starts the bias at ~0 — matching the frozen
LLM's unmodified distribution — and lets it grow only as gradient signal accumulates.
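With Burn's nn module this is a one-line change to the layer config. A sketch, assuming the current LinearConfig / Initializer API:

```rust
use burn::nn::{Initializer, Linear, LinearConfig};
use burn::tensor::backend::Backend;

/// Logit-bias head with near-zero weights, so the initial bias ≈ 0.
fn logit_bias_head<B: Backend>(enc_dim: usize, n_vocab: usize, device: &B::Device) -> Linear<B> {
    LinearConfig::new(enc_dim, n_vocab)
        .with_initializer(Initializer::Normal { mean: 0.0, std: 0.01 })
        .init(device)
}
```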
Why hf-hub instead of a custom downloader?
ApiRepo::download() handles HTTP, ETags, on-disk caching, progress bars, and
retries. The parquet crate (with snap for Snappy decompression) reads
HuggingFace's refs/convert/parquet shards row by row.
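For reference, fetching the default GGUF with hf-hub's sync API takes only a few lines (a sketch; the repo and file names are the defaults quoted in this README):

```rust
use hf_hub::api::sync::Api;

/// Returns the local path of the cached GGUF, downloading it on first use.
fn fetch_default_model() -> Result<std::path::PathBuf, Box<dyn std::error::Error>> {
    let repo = Api::new()?.model("unsloth/Qwen3-4B-GGUF".to_string());
    // Cached under ~/.cache/huggingface/hub/ (or $HF_HOME), reused on later runs.
    Ok(repo.get("Qwen3-4B-Q4_K_M.gguf")?)
}
```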
Logging
tracing-subscriber with EnvFilter is used throughout. Without the verbose
feature, cubecl_runtime::tune is capped at WARN so the progress bar is
readable. RUST_LOG always takes precedence over both --log-level and the
feature flag.
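The filter setup amounts to a few lines of tracing-subscriber. A sketch (the exact directives are defined in the real code):

```rust
use tracing_subscriber::EnvFilter;

fn init_logging(verbose: bool) {
    // RUST_LOG, if set, wins; otherwise cap cubecl_runtime::tune at WARN
    // unless the `verbose` feature asked for everything.
    let default = if verbose { "info" } else { "info,cubecl_runtime::tune=warn" };
    let filter =
        EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new(default));
    tracing_subscriber::fmt().with_env_filter(filter).init();
}
```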
Python → Rust mapping
| Python (PyTorch) | Rust |
|---|---|
| `TransformerCNNEncoder` | src/model/encoder.rs |
| `OpenTSLMSP.compute_loss` | src/model/llm/opentslm_sp.rs |
| `OpenTSLMSP.generate` | src/model/llm/opentslm_sp.rs |
| LLM forward pass | src/model/llm/llama_cpp.rs via llama-cpp-4 |
| Tokeniser | built into llama-cpp-4 / GGUF file |
| Dataset download scripts | src/data/downloader.rs (hf-hub + parquet) |
| `TSQADataset` | src/data/tsqa.rs |
| `M4QADataset` | src/data/m4.rs |
| `HARCoTQADataset` | src/data/har.rs |
| `SleepEDFCoTQADataset` | src/data/sleep.rs |
| `ECGQACoTQADataset` | src/data/ecg.rs |
| `CurriculumTrainer` | src/training/curriculum.rs |
| Metric plotting | src/training/metrics.rs (plotters, no Python) |
| `curriculum_learning.py` CLI | src/main.rs |
License
MIT
Citations
Please cite this work if you found it helpful.
@software{kosmyna2026opentslm,
author = {Kosmyna, Nataliya},
title = {opentslm: Pure-Rust implementation of Open Time-Series Language Models (OpenTSLM)},
year = {2026},
url = {https://github.com/nataliyakosmyna/opentslm}
}
Original paper:
@misc{langer2026opentslmtimeserieslanguagemodels,
title={OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data},
author={Patrick Langer and Thomas Kaar and Max Rosenblattl and Maxwell A. Xu and Winnie Chow and Martin Maritsch and Robert Jakob and Ning Wang and Juncheng Liu and Aradhana Verma and Brian Han and Daniel Seung Kim and Henry Chubb and Scott Ceresnak and Aydin Zahedivash and Alexander Tarlochan Singh Sandhu and Fatima Rodriguez and Daniel McDuff and Elgar Fleisch and Oliver Aalami and Filipe Barata and Paul Schmiedmayer},
year={2026},
eprint={2510.02410},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.02410},
}