# multiscreen-rs
A Rust implementation of the Multiscreen neural language model — training and inference — powered by Burn.
- CPU by default (Burn Flex with runtime SIMD detection: SSE, AVX, AVX2, AVX-512, NEON)
- CUDA GPU via `--features cuda` — runs natively on NVIDIA GPUs
- No built-in tokenizer — encode/decode with your own and pass `Vec<u32>` token IDs directly
- Streaming inference — token-by-token output like ChatGPT
- Chat-aware training — loss masking prevents the model from learning role labels (`system:`, `user:`, `assistant:`)
- Sampling-based generation — temperature, top-k, and repetition penalty for coherent output
- Eval-only mode — run evaluation + report on an existing checkpoint without retraining
## Installation

```toml
[dependencies]
multiscreen-rs = "0.2"
```
### CUDA GPU Support

```toml
[dependencies]
multiscreen-rs = { version = "0.2", features = ["cuda"] }
```
## Quick Start — Training
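A minimal training sketch. `auto_device`, `ParameterBudget`, and the `Vec<u32>` token-ID convention are documented; the `Trainer` type, `MultiscreenModel::new` constructor, and `load_token_ids` helper are illustrative assumptions, not the crate's verbatim API:

```rust
use multiscreen_rs::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Auto-select the best available backend (CPU SIMD or CUDA).
    let device = auto_device()?;

    // ~10M parameters is the default budget.
    // `MultiscreenModel::new` and `Trainer` are assumed names.
    let model = MultiscreenModel::new(ParameterBudget::Params10M, &device);
    let mut trainer = Trainer::new(model, 2e-4); // lr matches the CLI default

    // Tokenize with your own tokenizer; the crate takes raw `Vec<u32>` IDs.
    let sequences: Vec<Vec<u32>> = load_token_ids()?; // hypothetical helper

    for step in 0..10_000 {
        let loss = trainer.step(&sequences)?;
        if step % 100 == 0 {
            println!("step {step}  loss {loss:.4}");
        }
    }
    Ok(())
}
```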
## Chat-Aware Training (Loss Masking)

When training on chat data with role labels (e.g. `system: ...`, `user: ...`, `assistant: ...`), use `train_on_chat_sequences` to mask loss on prompt tokens. The model learns to generate only the assistant's response.
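A sketch of chat-aware training. `train_on_chat_sequences` is the documented entry point; the `Trainer` setup, the `(prompt, response)` pair shape, and the `load_chat_token_ids` helper are illustrative assumptions:

```rust
use multiscreen_rs::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = auto_device()?;
    let model = MultiscreenModel::new(ParameterBudget::Params10M, &device); // assumed constructor
    let mut trainer = Trainer::new(model, 2e-4); // assumed trainer type

    // Chat turns tokenized to IDs; the (prompt, response) pair shape is an
    // assumption. Only `train_on_chat_sequences` itself is documented.
    let chats: Vec<(Vec<u32>, Vec<u32>)> = load_chat_token_ids()?; // hypothetical helper

    // Loss is masked to 0.0 on system/user prompt tokens and 1.0 on assistant
    // tokens, so the model never learns to emit role labels.
    trainer.train_on_chat_sequences(&chats)?;
    Ok(())
}
```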
## Quick Start — Inference
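A minimal inference sketch. `ChatModel` is the documented loader (see Architecture below); the `load` and `generate` signatures and the `encode` tokenizer call are assumptions:

```rust
use multiscreen_rs::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = auto_device()?;
    // `ChatModel` handles checkpoint loading; the path argument is assumed.
    let model = ChatModel::load("runs/my-model/checkpoints", &device)?;

    // Bring your own tokenizer: token IDs in, token IDs out.
    let prompt_ids: Vec<u32> = encode("Hello!"); // hypothetical tokenizer call
    let output_ids: Vec<u32> = model.generate(&prompt_ids, 128)?; // up to 128 new tokens
    println!("{output_ids:?}"); // decode with your own tokenizer
    Ok(())
}
```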
### Streaming (token by token, like ChatGPT)
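Streaming inference is a documented feature; the callback-style `generate_streaming` method and the `encode`/`decode_token` helpers below are assumed names:

```rust
use multiscreen_rs::*;
use std::io::Write;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = auto_device()?;
    let model = ChatModel::load("runs/my-model/checkpoints", &device)?;
    let prompt_ids: Vec<u32> = encode("Tell me a story."); // hypothetical tokenizer call

    // Assumed callback-style API: each new token ID is handed back as soon as
    // it is sampled, so output can be printed incrementally.
    model.generate_streaming(&prompt_ids, 128, |token_id: u32| {
        print!("{}", decode_token(token_id)); // hypothetical decode helper
        std::io::stdout().flush().ok();
    })?;
    Ok(())
}
```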
### Custom Sampling (temperature, top-k, repetition penalty)

For custom sampling strategies, use `predict_logits` to get raw logits and apply your own sampling.
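A sketch of standard top-k sampling with temperature, using the `rand` crate; `predict_logits` is the documented name, but its exact signature is an assumption:

```rust
use multiscreen_rs::*;

/// Top-k sampling with temperature (must be > 0; treat 0 as plain argmax).
/// `predict_logits` is documented; its `Vec<f32>` return shape is assumed.
fn sample_next(model: &ChatModel, ids: &[u32], temperature: f32, top_k: usize) -> u32 {
    let logits: Vec<f32> = model.predict_logits(ids);

    // Scale by temperature and keep only the k highest-scoring tokens.
    let mut cand: Vec<(u32, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i as u32, l / temperature))
        .collect();
    cand.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    cand.truncate(top_k);

    // Softmax over the survivors, then draw one token by inverse CDF.
    let max = cand[0].1;
    let weights: Vec<f32> = cand.iter().map(|&(_, l)| (l - max).exp()).collect();
    let total: f32 = weights.iter().sum();
    let mut r = rand::random::<f32>() * total;
    for (&(id, _), &w) in cand.iter().zip(&weights) {
        r -= w;
        if r <= 0.0 {
            return id;
        }
    }
    cand[0].0 // floating-point fallback
}
```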
## Examples
The crate ships with two self-contained examples that use SentencePiece tokenization.
### Train a Model

The flag values below come from the Training CLI Options table; the example binary name `train` is an assumption:
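```sh
# CPU: Train 10M params, 10k steps
cargo run --release --example train -- --budget 10m --steps 10000

# CUDA: Train 50M params, 30k steps on GPU
cargo run --release --features cuda --example train -- --budget 50m --steps 30000

# Quick test: 1M params, 500 steps
cargo run --release --example train -- --budget 1m --steps 500

# Train with your own data
cargo run --release --example train -- --train-dir path/to/data --run-dir runs/my-model
```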
### Eval-Only Mode (skip training)
If training completed but evaluation/report failed (e.g. crashed mid-way), you can re-run just the evaluation and report generation without retraining:
Assuming the same `train` example binary as above:
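```sh
# Load existing checkpoint, evaluate, and write report
cargo run --release --example train -- --run-dir runs/my-model --eval-only
```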
This loads `latest.mpk` + `latest.json` from the checkpoint directory, runs validation/test evaluation, measures inference latency, generates sample output, and writes the full `report.json` + `report.md`.
### Chat with a Trained Model

The flags below come from the Chat CLI Options table; the example binary name `chat` is an assumption:
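```sh
# Interactive mode (streaming word-by-word output)
cargo run --release --example chat -- --run-dir runs/my-model

# One-shot prompt
cargo run --release --example chat -- --run-dir runs/my-model --prompt "Hello!"

# With custom sampling parameters
cargo run --release --example chat -- --temperature 0.6 --top-k 20 --repetition-penalty 1.3

# With a custom system prompt
cargo run --release --example chat -- --system-prompt "You are a helpful assistant."
```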
### Chat CLI Options

| Option | Default | Description |
|---|---|---|
| `--run-dir` | `runs/my-model` | Directory with `checkpoints/` |
| `--checkpoint` | auto (`latest.mpk`) | Specific checkpoint path |
| `--prompt` | none (interactive) | One-shot prompt, skips interactive mode |
| `--max-new-tokens` | 128 | Max tokens to generate per response |
| `--temperature` | 0.8 | Sampling temperature (0 = greedy, 1.0 = normal, >1 = more random) |
| `--top-k` | 40 | Only consider the top K most likely tokens |
| `--repetition-penalty` | 1.2 | Penalizes repeated tokens (>1.0 = stronger penalty) |
| `--system-prompt` | หมิว character | Custom system prompt |
### Generate a Loss Plot

The bundled script name `plot_loss.py` below is an assumption:
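```sh
# Requires Python + matplotlib + numpy
python plot_loss.py runs/my-model/loss.csv
```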
### Training CLI Options

| Option | Default | Description |
|---|---|---|
| `--train-dir` | `examples/data` | Directory with `tokenizer.model` + `.txt`/`.jsonl` files |
| `--run-dir` | `runs/my-model` | Output directory (checkpoints, reports, loss CSV) |
| `--budget` | `10m` | Parameter budget: `1m`, `5m`, `10m`, `50m`, `100m` |
| `--steps` | 10000 | Total optimizer steps |
| `--batch-size` | 4 | Batch size |
| `--seq-len` | 128 | Sequence length |
| `--lr` | 0.0002 | Learning rate |
| `--val-split` | 0.1 | Fraction of data for validation |
| `--log-interval` | 100 | Print loss every N steps |
| `--latency-tokens` | 20 | Tokens to generate for latency benchmark |
| `--eval-only` | false | Skip training — load existing checkpoint and only run evaluation + report |
## Training Reports

Every training run produces a complete report in `--run-dir`:
```
runs/my-model/
├── checkpoints/
│   ├── config.json    # Model architecture config
│   ├── latest.mpk     # Trained weights
│   └── latest.json    # Run metadata
├── tokenizer.model    # Copy of the tokenizer
├── loss.csv           # Per-step loss values (step,loss)
├── report.json        # Machine-readable full report
└── report.md          # Human-readable training report
```
The report includes:
- Configuration — budget, parameter count, seq len, batch size, learning rate
- Training — duration, throughput (steps/s), final loss, best loss
- Validation — loss, perplexity, next-token accuracy
- Test — loss, perplexity, next-token accuracy
- Inference — average latency per token, total generation time
All files under `runs/` are excluded from git via `.gitignore`.
### Loss Plot

The loss CSV can be plotted with the bundled Python script (name assumed, as above):
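```sh
python plot_loss.py runs/my-model/loss.csv
```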
This generates `loss_plot.png` in the same directory.
## Evaluation Metrics
The model is automatically evaluated on held-out data after training:
| Metric | Description |
|---|---|
| Loss | Average cross-entropy loss |
| Perplexity | `exp(loss)` — lower is better |
| Accuracy | Fraction of tokens where `argmax(logits) == target` |
Data is split 80/10/10 (train/val/test) by default. Override with `--val-split`.
## Architecture

### Inference-Only Backend

`ChatModel` loads the model with the Autodiff backend (required for parameter loading), then calls `.valid()` to strip the autodiff wrapper. This produces a `MultiscreenModel<DefaultBackend>` (sketched below) that:
- Uses no gradient tracking during inference
- Does not build computation graphs → no VRAM leak
- Is safe for autoregressive generation loops that run many forward passes
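In sketch form, assuming a `MultiscreenModel::load` loader name (`.valid()` is Burn's real `AutodiffModule` API):

```rust
use multiscreen_rs::*;
use burn::backend::Autodiff;
use burn::module::AutodiffModule;

fn load_for_inference(
    path: &str,
) -> Result<MultiscreenModel<DefaultBackend>, Box<dyn std::error::Error>> {
    let device = auto_device()?;
    // 1. Load with the Autodiff backend, which parameter loading requires.
    //    `MultiscreenModel::load` is an assumed loader name.
    let train_model: MultiscreenModel<Autodiff<DefaultBackend>> =
        MultiscreenModel::load(path, &device)?;
    // 2. `.valid()` (Burn's AutodiffModule API) strips the wrapper: no gradient
    //    tracking, no computation-graph buildup, safe for long generation loops.
    Ok(train_model.valid())
}
```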
### Loss Masking

When training on chat data, role labels like `system:`, `user:`, `assistant:` should not be learned as generation targets. `TrainingWindows::from_chat_sequences` creates a binary loss mask:

- `loss_mask = 0.0` for prompt tokens (system + user) — the model sees them but doesn't learn to generate them
- `loss_mask = 1.0` for response tokens (assistant content) — the model learns to generate these
This prevents the model from outputting role labels during inference.
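A toy illustration of the masking arithmetic; `chat_loss_mask` and `masked_loss` are illustrative helpers, not crate API:

```rust
// Build a mask like `TrainingWindows::from_chat_sequences` does: 0.0 over the
// prompt span, 1.0 over the assistant response span.
fn chat_loss_mask(prompt_len: usize, response_len: usize) -> Vec<f32> {
    let mut mask = vec![0.0; prompt_len];
    mask.extend(std::iter::repeat(1.0).take(response_len));
    mask
}

// Per-token cross-entropy multiplied by the mask and averaged over the
// masked-in positions: prompt tokens contribute nothing to the gradient.
fn masked_loss(token_losses: &[f32], mask: &[f32]) -> f32 {
    let total: f32 = token_losses.iter().zip(mask).map(|(l, m)| l * m).sum();
    let count: f32 = mask.iter().sum();
    total / count.max(1.0)
}
```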
### Device Selection

```rust
use multiscreen_rs::*;

// Recommended: auto-select best available backend
let device = auto_device()?;

// Explicit CPU (only without --features cuda)
let device = cpu()?;

// Explicit CUDA GPU (requires "cuda" feature)
let device = cuda()?;
```
### Parameter Budgets

Choose a model size with `ParameterBudget` — presets range from 1M to 100M parameters:

```rust
ParameterBudget::Params1M    // ~1.2M
ParameterBudget::Params5M    // ~5.5M
ParameterBudget::Params10M   // ~10.5M (default)
ParameterBudget::Params50M   // ~52.1M
ParameterBudget::Params100M  // ~104.6M
```
## Bundled Data

`examples/data/` contains everything needed to run the examples out of the box:

- `tokenizer.model` — SentencePiece model (5487-token vocab)
- `sample_chat.jsonl` — 100,000 chat lines in OpenAI message format for training
The training example supports two data formats:
- `.txt` — one sample per line
- `.jsonl` — each line is a JSON object:
  - `{"text": "your text here"}` — raw text format
  - `{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}` — OpenAI chat format
JSONL chat-format data is automatically detected and trained with loss masking.
## Contributing

Contributions are welcome! Keep patches focused, maintain the default CPU/Flex path, and run the usual checks (the exact command set below is an assumption):
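```sh
cargo fmt --all -- --check
cargo clippy --all-targets
cargo test
```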
## License
MIT · Copyright (c) 2026 multiscreen-rs contributors.