# TransXLab
The training architect: validate and design LLM fine-tuning configs before you spend a dollar on GPU time.
3.3 MB single binary. Zero Python dependencies. Catches the mistakes that cost you $665 and a weekend.
Landing Page | Blog: The $665 Postmortem | TransXform (training supervisor)
## Install

```sh
cargo install transxlab
```
## Quick Start

```sh
# Full pipeline: preflight + design + data strategy
transxlab full --config config.yaml

# Validate environment and config only
# Design architecture from a HuggingFace model
# Postmortem on a failed run
```
## What It Does
TransXLab runs a three-level pipeline over your training config:
| Level | Stage | Catches |
|---|---|---|
| 1 | Preflight | Bad env, missing files, GPU/VRAM mismatches, config errors |
| 2 | Design | Wrong architecture choices, unsafe hyperparameters, cost blowouts |
| 3 | Data Strategy | Quality gaps, contamination risk, diversity issues |
Under the hood: 20 failure-mode signatures, 25 hyperparameter rules, and cloud cost estimation across 7 GPU tiers and 4 providers (RunPod, Lambda, AWS, Vast.ai).
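At its core, a cost estimate is projected GPU-hours multiplied by each provider's hourly rate, reported as a min-max range. A minimal sketch of that arithmetic, assuming made-up rates and a hypothetical 80 GPU-hour run (none of these numbers are TransXLab's actual tables):

```rust
// Illustrative cost-estimation sketch. The providers are the four named in
// the docs, but every rate and the GpuOffer shape are placeholders.
struct GpuOffer {
    provider: &'static str,
    usd_per_hour: f64,
}

/// Cheapest and most expensive total cost for a given number of GPU-hours.
fn estimate_range(gpu_hours: f64, offers: &[GpuOffer]) -> (f64, f64) {
    let costs: Vec<f64> = offers.iter().map(|o| o.usd_per_hour * gpu_hours).collect();
    let min = costs.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = costs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    (min, max)
}

fn main() {
    // Hypothetical A100 80GB hourly rates.
    let offers = [
        GpuOffer { provider: "RunPod", usd_per_hour: 1.9 },
        GpuOffer { provider: "Lambda", usd_per_hour: 1.8 },
        GpuOffer { provider: "AWS", usd_per_hour: 4.1 },
        GpuOffer { provider: "Vast.ai", usd_per_hour: 1.2 },
    ];
    for o in &offers {
        println!("{:>8}: ${:.2}", o.provider, o.usd_per_hour * 80.0);
    }
    let (lo, hi) = estimate_range(80.0, &offers);
    println!("Estimated: ${lo:.0}-${hi:.0}");
}
```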
HuggingFace Hub integration -- pass a model ID and TransXLab auto-detects architecture, parameter count, and recommended PEFT config.
Config generation for HF Trainer, Axolotl, LLaMA-Factory, and PEFT -- validated configs you can hand straight to your training framework.
CI/CD gating with `--fail-on warn|fail` and `--json` for machine-readable output.
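For CI gating, the idea is to run the full pipeline on every change to the training config and let a non-zero exit block the merge. An illustrative GitHub Actions job; only the `transxlab` flags come from the docs above, while the workflow scaffolding and config path are placeholders:

```yaml
# Hypothetical CI job -- adapt the trigger and config path to your repo.
jobs:
  validate-training-config:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo install transxlab
      # --fail-on warn makes warnings (not just failures) break the build;
      # --json writes a machine-readable report for later inspection.
      - run: transxlab full --config train.yaml --fail-on warn --json > report.json
```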
## Example: Catching the AC-v2 Disaster
The config that started this project -- a Flan-T5-XL creative-generation run that burned $665 before anyone noticed the problems:
```console
$ transxlab full --config examples/ac_v2_config.yaml

TransXLab v0.1.0 -- Full Pipeline
==================================

[PREFLIGHT] 6 checks passed, 0 warnings, 0 failures

[DESIGN]
  FAIL  lr=1e-4 exceeds safe threshold for full fine-tuning (max 5e-5)
  WARN  full fine-tuning on 3B params -- consider LoRA/QLoRA to cut VRAM 60-80%
  WARN  no diversity loss for creative-generation task

[DATA STRATEGY]
  WARN  diversity_loss_weight=0.0 with task_type=creative_generation
        --> high mode-collapse risk

[COST]
  Estimated: $142-$310 across providers (A100 80GB recommended)

Result: 2 failures, 2 warnings -- BLOCKED
```
Every issue flagged here went undetected in the real AC-v2 run. TransXLab exists so it doesn't happen again.
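The learning-rate check in the output above is representative of how a hyperparameter rule works: a predicate over the config plus a severity. A minimal sketch of one such rule, assuming only the 5e-5 threshold from the output (the `Verdict` type and function names are illustrative, not TransXLab's internals):

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail(String),
}

/// Illustrative rule: full fine-tuning caps the learning rate at 5e-5
/// (the threshold flagged in the AC-v2 run above). PEFT runs are exempt.
fn check_learning_rate(lr: f64, full_finetune: bool) -> Verdict {
    const MAX_FULL_FT_LR: f64 = 5e-5;
    if full_finetune && lr > MAX_FULL_FT_LR {
        Verdict::Fail(format!(
            "lr={lr:e} exceeds safe threshold for full fine-tuning (max {MAX_FULL_FT_LR:e})"
        ))
    } else {
        Verdict::Pass
    }
}

fn main() {
    // The AC-v2 config: lr=1e-4 with full fine-tuning -> blocked.
    println!("{:?}", check_learning_rate(1e-4, true));
    // The same lr on a LoRA run passes.
    println!("{:?}", check_learning_rate(1e-4, false));
}
```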
## Companion: TransXform
TransXLab validates before training. TransXform supervises during training -- detecting loss anomalies, checkpoint corruption, and resource exhaustion in real time.
## License
MIT