# hermes-llm
Train Large Language Models from scratch in Rust using Candle.
## Features
- **GPT-style Transformer Architecture** with RoPE positional embeddings
- **BPE Tokenizer Training** using HuggingFace tokenizers
- **Multiple Model Configurations**: tiny, GPT-2 small/medium/large, LLaMA-style
- **Training Infrastructure**: AdamW optimizer, gradient clipping, checkpointing
- **Text Generation**: temperature sampling, top-k sampling
- **Backend Support**: CPU, CUDA, Metal (Apple Silicon), Accelerate
## Installation

Build commands below assume feature names matching the comments (`cuda`, `metal`, `accelerate`); check `Cargo.toml` for the exact feature flags.

```bash
# CPU only (default)
cargo build --release

# With CUDA support
cargo build --release --features cuda

# With Metal support (macOS)
cargo build --release --features metal

# With Accelerate (macOS)
cargo build --release --features accelerate
```
## Usage
### Train a tokenizer
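An illustrative invocation (the `train-tokenizer` subcommand name and `--vocab-size` flag are assumptions; check `hermes-llm --help` for exact names):

```bash
hermes-llm train-tokenizer --data corpus.txt --vocab-size 32000 --output tokenizer.json
```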
### Train a model
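An illustrative invocation (the `train` subcommand name is an assumption; the flags are documented under Training Options below):

```bash
hermes-llm train --data corpus.txt --tokenizer tokenizer.json --model tiny --epochs 1
```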
### Generate text
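An illustrative invocation (the `generate` subcommand and the `--prompt`, `--temperature`, and `--top-k` flag names are assumptions, based on the sampling features listed above):

```bash
hermes-llm generate --checkpoint checkpoints/model.safetensors --tokenizer tokenizer.json \
  --prompt "Once upon a time" --temperature 0.8 --top-k 40
```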
### Show model info
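For example (this subcommand is referenced under Model Configurations below):

```bash
hermes-llm info --model gpt2-small
```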
## Multi-GPU Training (NCCL)
For distributed training, just add `--num-gpus`. For example (assuming a `train` subcommand and an `nccl` feature flag):

```bash
# Build with NCCL support
cargo build --release --features cuda,nccl

# Single GPU
hermes-llm train --data corpus.txt --tokenizer tokenizer.json --num-gpus 1

# 4 GPUs (automatically uses NCCL)
hermes-llm train --data corpus.txt --tokenizer tokenizer.json --num-gpus 4
```
## Training Options
| Option | Default | Description |
|---|---|---|
| `--data` | (stdin) | Training data file |
| `--tokenizer` | required | Tokenizer file path |
| `--model` | `tiny` | Model preset |
| `--num-gpus` | 1 | Number of GPUs (>1 enables NCCL) |
| `--batch-size` | 32 | Batch size per GPU |
| `--grad-accum` | 1 | Gradient accumulation steps |
| `--epochs` | 1 | Training epochs |
| `--lr` | 3e-4 | Learning rate |
| `--output` | `checkpoints` | Output directory |
### Effective Batch Size

```
effective_batch = batch_size × grad_accum × num_gpus
```

Example: `--batch-size 32 --grad-accum 4 --num-gpus 4` gives an effective batch size of 512.
## Fine-tuning
Continue training from a pre-trained checkpoint:
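For example (illustrative; assumes fine-tuning runs through the `train` subcommand with the options listed below):

```bash
hermes-llm train --data domain.txt --tokenizer tokenizer.json \
  --checkpoint checkpoints/model.safetensors --lr 1e-5
```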
### Fine-tuning Options
| Option | Description |
|---|---|
| `--checkpoint` | Path to pre-trained weights (`.safetensors`) |
| `--freeze-layers` | Number of layers to freeze from the bottom (default: 0) |
| `--lr` | Use a lower learning rate for fine-tuning (e.g., `1e-5`) |
### Freezing Layers

Freeze the early layers to preserve general knowledge while adapting the top layers:
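An illustrative command (freezing 8 of 16 layers is just a sample choice; subcommand name is an assumption):

```bash
hermes-llm train --data domain.txt --tokenizer tokenizer.json \
  --checkpoint checkpoints/model.safetensors --freeze-layers 8 --lr 1e-5
```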
## Direct Preference Optimization (DPO)
Align your model to human preferences without a separate reward model:
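An illustrative invocation (the `dpo` subcommand name is an assumption; the options are documented below):

```bash
hermes-llm dpo --checkpoint checkpoints/model.safetensors --config checkpoints/config.json \
  --data prefs.jsonl --beta 0.1
```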
### Preference Data Format

A JSONL file with `prompt`, `chosen`, and `rejected` fields:
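An illustrative pair (contents are placeholders):

```jsonl
{"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I don't know."}
{"prompt": "Summarize RoPE in one sentence.", "chosen": "RoPE encodes token positions by rotating query and key vectors.", "rejected": "RoPE is a kind of rope."}
```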
### DPO Options
| Option | Default | Description |
|---|---|---|
| `--checkpoint` | required | SFT model to start from |
| `--config` | required | Model config JSON |
| `--data` | required | Preference pairs (JSONL) |
| `--beta` | 0.1 | KL divergence penalty |
| `--lr` | 5e-7 | Learning rate (very low for DPO) |
| `--max-len` | 512 | Max sequence length |
| `--output` | `checkpoints-dpo` | Output directory |
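For reference, the per-pair DPO objective in the standard formulation (a sketch of the technique, not necessarily this crate's exact code) scores the policy's preference margin against the frozen reference model, with `beta` controlling the implicit KL penalty:

```rust
/// DPO loss for one preference pair (illustrative sketch).
/// Inputs are summed log-probabilities of each response under the
/// trained policy and the frozen reference model.
fn dpo_loss(
    policy_chosen: f32,
    policy_rejected: f32,
    ref_chosen: f32,
    ref_rejected: f32,
    beta: f32,
) -> f32 {
    // Margin: how much more the policy prefers "chosen" over "rejected",
    // relative to the reference model.
    let margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected);
    let logits = beta * margin;
    // loss = -log(sigmoid(beta * margin))
    -(1.0 / (1.0 + (-logits).exp())).ln()
}
```

A larger `--beta` penalizes drifting from the reference model more strongly; the very low default learning rate keeps updates stable.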
## Model Configurations
| Config | Layers | Hidden | Heads | Params (32K vocab) |
|---|---|---|---|---|
| nano | 2 | 64 | 2 | ~4M |
| tiny | 4 | 128 | 4 | ~9M |
| gpt2-small | 12 | 768 | 12 | ~124M |
| gpt2-medium | 24 | 1024 | 16 | ~355M |
| gpt2-large | 36 | 1280 | 20 | ~774M |
| llama-small | 16 | 1024 | 16 | ~268M |
**Note:** Parameter count depends heavily on vocab size. Run `hermes-llm info --model <name>` for exact counts.
## Architecture
The model implements a modern transformer architecture:
- **Embeddings**: Token embeddings only (no learned position embeddings; positions come from RoPE)
- **Attention**: Multi-head self-attention with RoPE (Rotary Position Embedding)
- **Normalization**: RMSNorm (pre-normalization)
- **FFN**: SwiGLU activation for LLaMA-style, GELU for GPT-style
- **Output**: Language modeling head with weights tied to the token embeddings
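As a concrete illustration of the normalization step, RMSNorm rescales activations by their root-mean-square instead of subtracting a mean and dividing by a variance. A minimal dependency-free sketch (not the Candle-based implementation used in the crate):

```rust
/// RMSNorm sketch: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
/// Illustrative only; the crate uses Candle tensors instead of slices.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    // Mean of squared activations across the hidden dimension.
    let mean_sq: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    // Normalize, then apply the learned per-channel gain.
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}
```

Applied before each attention and FFN block (pre-normalization), this keeps activation scale stable without the mean-centering cost of LayerNorm.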
## Library Usage
The snippet below is reconstructed from the crate's types; module paths and exact signatures are assumptions and may differ from the published API.

```rust
use candle_core::Device;
use hermes_llm::config::ModelConfig;
use hermes_llm::data::{DataLoader, Dataset};
use hermes_llm::tokenizer::Tokenizer;
use hermes_llm::training::{Trainer, TrainingConfig};

fn main() -> anyhow::Result<()> {
    // Load or train tokenizer
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Create model config
    let mut config = ModelConfig::tiny();
    config.vocab_size = tokenizer.vocab_size();

    // Load dataset
    let dataset = Dataset::from_file("data.txt", &tokenizer)?;
    let mut loader = DataLoader::new(dataset, 32, 256); // batch size, seq len

    // Create trainer
    let device = Device::Cpu;
    let training_config = TrainingConfig::default();
    let mut trainer = Trainer::new(config, training_config, &device)?;

    // Train
    trainer.train(&mut loader)?;
    Ok(())
}
```
## References
- Attention Is All You Need
- Language Models are Unsupervised Multitask Learners (GPT-2)
- LLaMA: Open and Efficient Foundation Language Models
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- Candle ML Framework
## License
MIT