# LARGE
**L**ightweight **A**rchitecture for **R**unning **G**enerative **E**ngines

An educational, from-scratch LLM inference engine written in Rust.
Loads a [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B-GGUF) model in GGUF format and generates text on CPU — no GPU, no frameworks, no magic.
## Features
- **GGUF parser** — reads the model file format with memory-mapped tensor access (zero-copy, 639 MB stays on disk)
- **BPE tokenizer** — GPT-2 byte-level BPE with Qwen2 pre-tokenization, loaded from GGUF metadata
- **Full transformer forward pass** — 28-layer decoder with Grouped Query Attention, QK-Norm, RoPE, and SwiGLU FFN
- **KV cache** — autoregressive generation with cached key/value states
- **Q8_0 dequantization** — fused quantized dot product with no intermediate buffers (see the sketch after this list)
- **Sampling** — temperature scaling and top-p (nucleus) sampling
- **Streaming output** — tokens printed as they're generated
- **Minimal dependencies** — only `memmap2`, `byteorder`, and `regex`
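
To make the fused Q8_0 path concrete, here is a minimal sketch of the idea (not the exact code in `tensor.rs`). Each Q8_0 block is a little-endian f16 scale followed by 32 signed 8-bit quants, and the dot product folds dequantization into the accumulation loop, so no intermediate f32 buffer is ever allocated:

```rust
/// Number of quantized values per Q8_0 block.
const QK8_0: usize = 32;

/// Convert an IEEE-754 half-precision bit pattern to f32.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let out = match exp {
        0 => {
            // Zero or subnormal: value = frac * 2^-24.
            let mag = frac as f32 * (-24.0f32).exp2();
            return if sign == 1 { -mag } else { mag };
        }
        0x1f => (sign << 31) | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(out)
}

/// Dot product of one Q8_0-quantized weight row (raw bytes, e.g. straight
/// from the mmap) with an f32 activation vector. Dequantization happens
/// once per 32-value block, inside the accumulation loop.
fn dot_q8_0(row: &[u8], x: &[f32]) -> f32 {
    const BLOCK_BYTES: usize = 2 + QK8_0; // f16 scale + 32 × i8
    let mut acc = 0.0f32;
    for (b, chunk) in x.chunks_exact(QK8_0).enumerate() {
        let block = &row[b * BLOCK_BYTES..(b + 1) * BLOCK_BYTES];
        let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        let mut block_sum = 0.0f32;
        for (i, &xi) in chunk.iter().enumerate() {
            block_sum += (block[2 + i] as i8) as f32 * xi;
        }
        acc += scale * block_sum;
    }
    acc
}
```

A Q8_0 mat-vec is then just `dot_q8_0` once per output row, which is why the engine never needs a dequantized copy of the weights; [docs/quantization.md](docs/quantization.md) covers the format in full.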
## Quick Start
```bash
# Build
cargo build --release

# Download the model (~639 MB)
mkdir -p models
wget -O models/Qwen3-0.6B-Q8_0.gguf \
  "https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf"

# Run inference
cargo run --release -- "What is the capital of France?"
```
### Example Output
```
LARGE — Qwen3-0.6B Inference Engine
====================================

Loading model from models/Qwen3-0.6B-Q8_0.gguf...
Building tokenizer...
Initializing model...
Ready! (28 layers, 151936 vocab, loaded in 0.23s)

Prompt: "What is the capital of France?" (15 tokens after chat formatting)
Prefill: 15 tokens in 5.99s (2.5 tok/s)

<think>
Okay, the user is asking about the capital of France. Let me think.
I know that France's capital is Paris.
</think>

The capital of France is **Paris**.

Generated: 151 tokens in 48.81s (3.1 tok/s)
```
## Project Structure
```
large/
├── README.md # This file
├── CLAUDE.md # Project conventions & model architecture reference
├── Cargo.toml # 3 dependencies: memmap2, byteorder, regex
├── models/ # Model files (git-ignored)
│ └── Qwen3-0.6B-Q8_0.gguf
├── src/
│ ├── main.rs # CLI — tokenize, prefill, generate, stream output
│ ├── lib.rs # Module declarations
│ ├── gguf.rs # GGUF file parser (header, metadata, tensor index, mmap)
│ ├── tensor.rs # f16 conversion, Q8_0 dequant, mat-vec, RMSNorm, RoPE, softmax
│ ├── tokenizer.rs # GPT-2 byte-level BPE with Qwen2 pre-tokenization
│ ├── model.rs # Qwen3 transformer (GQA, QK-Norm, SwiGLU, KV cache)
│ └── sampler.rs # Temperature + top-p sampling, xorshift64 PRNG
└── docs/
├── architecture.md # Inference pipeline walkthrough
├── tokenizer.md # How the BPE tokenizer works
└── quantization.md # GGUF format and Q8_0 quantization
```
## Architecture
The model processes one token at a time through this pipeline:
```
token ID
│
▼
Embedding Lookup (Q8_0 → f32, 151936 × 1024)
│
▼
28 × Transformer Block
│ ┌─────────────────────────────────────────────────┐
│ │ RMSNorm → Q/K/V Projections (Q8_0 mat-vec) │
│ │ QK-Norm (per-head RMSNorm on Q and K) │
│ │ RoPE (θ = 1,000,000) │
│ │ GQA: 16 query heads, 8 KV heads (2:1 ratio) │
│ │ + Residual │
│ │ │
│ │ RMSNorm → SwiGLU FFN (gate/up/down, Q8_0) │
│ │ + Residual │
│ └─────────────────────────────────────────────────┘
│
▼
Final RMSNorm
│
▼
LM Head (tied embeddings, Q8_0 mat-vec → 151936 logits)
│
▼
Temperature + Top-p Sampling → next token
```
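
RMSNorm appears three times in the diagram (pre-attention, pre-FFN, and the final norm), and QK-Norm is the same operation applied to each per-head slice of Q and K. Here is a minimal sketch of the standard formula, not necessarily the exact code in `tensor.rs`:

```rust
/// RMSNorm: y_i = w_i * x_i / sqrt(mean(x^2) + eps).
/// QK-Norm is the same function called on each head slice of Q and K.
fn rms_norm(x: &mut [f32], weight: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for (xi, wi) in x.iter_mut().zip(weight.iter()) {
        *xi *= inv_rms * wi;
    }
}
```

Unlike LayerNorm there is no mean subtraction and no bias, which is part of why it is cheap enough to run per head. Qwen-family configs typically use `eps = 1e-6`, but the authoritative value comes from the GGUF metadata.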
See [docs/architecture.md](docs/architecture.md) for a detailed walkthrough.
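
The final stage of the diagram, temperature + top-p sampling, is small enough to sketch in full. This is a hedged illustration of the standard nucleus algorithm rather than a copy of `sampler.rs`: scale the logits, softmax, sort descending, keep the smallest prefix whose mass reaches `top_p`, and draw from the renormalized prefix. `rng01` stands in for a uniform draw from the engine's xorshift64 PRNG, and `temperature > 0` is assumed:

```rust
/// Pick the next token id from raw logits using temperature scaling and
/// top-p (nucleus) filtering. `rng01` is a uniform sample in [0, 1).
fn sample_top_p(logits: &[f32], temperature: f32, top_p: f32, rng01: f32) -> usize {
    // Temperature-scaled softmax; subtracting the max keeps exp() stable.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, ((l - max) / temperature).exp()))
        .collect();
    let sum: f32 = probs.iter().map(|&(_, p)| p).sum();
    for entry in probs.iter_mut() {
        entry.1 /= sum;
    }

    // Nucleus: keep the most probable tokens until their mass reaches top_p.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut cum = 0.0f32;
    let mut cutoff = probs.len();
    for (k, &(_, p)) in probs.iter().enumerate() {
        cum += p;
        if cum >= top_p {
            cutoff = k + 1;
            break;
        }
    }

    // Draw from the truncated distribution, renormalized by its mass.
    let mut r = rng01 * cum;
    for &(id, p) in &probs[..cutoff] {
        if r < p {
            return id;
        }
        r -= p;
    }
    probs[cutoff - 1].0 // float-rounding fallback
}
```

With `top_p = 1.0` this reduces to plain temperature sampling, and as the temperature approaches zero the distribution sharpens toward greedy decoding.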
## Running Tests
```bash
cargo test # 37 tests (unit + integration with model file)
cargo clippy # Lint (0 warnings)
cargo doc --open # Build the API documentation and open it in a browser
```
## Learn More
This project is built for learning. Start with the source code — every public function
has doc comments explaining *what* it does and *why*. Then dive into the docs:
- [docs/architecture.md](docs/architecture.md) — How the inference pipeline works end-to-end
- [docs/tokenizer.md](docs/tokenizer.md) — How text becomes tokens (and back)
- [docs/quantization.md](docs/quantization.md) — GGUF format and Q8_0 quantization math
- [CLAUDE.md](CLAUDE.md) — Model parameters, coding conventions, and design decisions
## Resources
- [Qwen3-0.6B Model Card](https://huggingface.co/Qwen/Qwen3-0.6B)
- [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [llama.cpp](https://github.com/ggerganov/llama.cpp) — Reference C++ implementation
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)