# Roadmap
This project follows the book **"Build a Large Language Model (From Scratch)"** by Sebastian Raschka as its primary guide, with the implementation written in Rust.
---
## Goal 1 — Tokenization
Implement and benchmark several byte pair encoding (BPE) variants before moving on to Goal 2. The aim is to understand the tradeoffs in tokenizer design and to carry a solid, tested implementation forward.
### Milestones
- [ ] Naive BPE — character-level pair merging, no special tokens
- [ ] BPE with special tokens (`<|endoftext|>`, `<|unk|>`)
- [ ] BPE with pre-tokenization regex (GPT-2 style split before byte encoding)
- [ ] Byte-level BPE (no true unknowns, full UTF-8 coverage)
- [ ] Benchmark all variants against each other (see `docs/benchmarks/tokenizing/`)
- [ ] Pick one implementation to carry forward into training
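
As a starting point for the first milestone, here is a minimal sketch of naive character-level BPE training. It is not the project's actual implementation: the function names (`count_pairs`, `merge_pair`, `train_naive_bpe`), the use of `u32` token ids, the choice to start merged ids at `0x110000`, and the tie-breaking rule are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn count_pairs(tokens: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` with `new_id`,
/// scanning left to right.
fn merge_pair(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

/// Learn at most `num_merges` merges from `text`.
/// Returns the final token sequence and the ordered merge list.
/// Assumption: initial ids are Unicode scalar values (character-level),
/// and no special tokens are handled.
fn train_naive_bpe(text: &str, num_merges: usize) -> (Vec<u32>, Vec<((u32, u32), u32)>) {
    let mut tokens: Vec<u32> = text.chars().map(|c| c as u32).collect();
    let mut merges = Vec::new();
    // Assumption: merged ids start just above the largest Unicode scalar
    // value (0x10FFFF), so they cannot collide with character ids.
    let mut next_id: u32 = 0x11_0000;

    for _ in 0..num_merges {
        let counts = count_pairs(&tokens);
        // Pick the most frequent pair; on a tie, prefer the smallest pair
        // so the result does not depend on HashMap iteration order.
        let best = counts
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .map(|(pair, count)| (*pair, *count));
        let Some((pair, count)) = best else { break };
        if count < 2 {
            break; // nothing left that repeats
        }
        tokens = merge_pair(&tokens, pair, next_id);
        merges.push((pair, next_id));
        next_id += 1;
    }
    (tokens, merges)
}
```

Calling `train_naive_bpe("aaabdaaabac", 10)` on this sketch stops after three merges, because no pair occurs more than once after that, and leaves five tokens. Recounting all pairs on every iteration is deliberately simple and slow, which makes it a reasonable baseline for the benchmark milestone. A byte-level variant could plausibly reuse the same loop by starting from `text.bytes()` instead of `text.chars()`.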
---
## Goal 2 — Attention Mechanism
_Placeholder — to be defined after Goal 1 is complete._
---
## Goal 3 — GPT Model Architecture
_Placeholder — to be defined after Goal 2 is complete._
---
## Goal 4 — Pre-training
_Placeholder — to be defined after Goal 3 is complete._
---
## Goal 5 — Fine-tuning & Alignment
_Placeholder — to be defined after Goal 4 is complete._