# LARGE

**L**ightweight **A**rchitecture for **R**unning **G**enerative **E**ngines

An educational, from-scratch LLM inference engine written in Rust. It loads a Qwen3-0.6B model in GGUF format and generates text on the CPU — no GPU, no frameworks, no magic.
## Features
- GGUF parser — reads the model file format with memory-mapped tensor access (zero-copy, 639 MB stays on disk)
- BPE tokenizer — GPT-2 byte-level BPE with Qwen2 pre-tokenization, loaded from GGUF metadata
- Full transformer forward pass — 28-layer decoder with Grouped Query Attention, QK-Norm, RoPE, and SwiGLU FFN
- KV cache — autoregressive generation with cached key/value states
- Q8_0 dequantization — fused quantized dot product (no intermediate buffers)
- Sampling — temperature scaling and top-p (nucleus) sampling
- Streaming output — tokens printed as they're generated
- Minimal dependencies — only `memmap2`, `byteorder`, and `regex`
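The fused Q8_0 path can be illustrated with a small sketch. This is a toy stand-in, not the engine's actual `tensor.rs` API: each Q8_0 block stores 32 signed 8-bit quants plus a per-block scale, and the dot product applies the scale once per block instead of dequantizing into an intermediate buffer.

```rust
/// Toy stand-in for a Q8_0 block: 32 signed 8-bit quants plus a
/// per-block scale (stored as f16 in the real GGUF format).
struct BlockQ80 {
    scale: f32,
    quants: [i8; 32],
}

/// Fused dequantize-and-dot: the scale is applied once per block,
/// so no intermediate f32 buffer is ever materialized.
fn dot_q8_0(blocks: &[BlockQ80], x: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for (i, block) in blocks.iter().enumerate() {
        let mut block_sum = 0.0f32;
        for (j, &q) in block.quants.iter().enumerate() {
            block_sum += q as f32 * x[i * 32 + j];
        }
        sum += block.scale * block_sum;
    }
    sum
}

fn main() {
    // 32 quants of value 2, scale 0.5, against a vector of ones.
    let block = BlockQ80 { scale: 0.5, quants: [2; 32] };
    let x = vec![1.0f32; 32];
    println!("{}", dot_q8_0(&[block], &x));
}
```

Deferring the multiply by `scale` to once per block is what makes the kernel "fused": the same arithmetic as dequantize-then-dot, with one pass over memory.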
## Quick Start

```sh
# Build
cargo build --release

# Download the model (~639 MB) and place it at models/Qwen3-0.6B-Q8_0.gguf

# Run inference
cargo run --release
```
## Example Output

```text
LARGE — Qwen3-0.6B Inference Engine
====================================
Loading model from models/Qwen3-0.6B-Q8_0.gguf...
Building tokenizer...
Initializing model...
Ready! (28 layers, 151936 vocab, loaded in 0.23s)

Prompt: "What is the capital of France?" (15 tokens after chat formatting)
Prefill: 15 tokens in 5.99s (2.5 tok/s)

<think>
Okay, the user is asking about the capital of France. Let me think.
I know that France's capital is Paris.
</think>

The capital of France is **Paris**.

Generated: 151 tokens in 48.81s (3.1 tok/s)
```
## Project Structure

```text
large/
├── README.md          # This file
├── CLAUDE.md          # Project conventions & model architecture reference
├── Cargo.toml         # 3 dependencies: memmap2, byteorder, regex
├── models/            # Model files (git-ignored)
│   └── Qwen3-0.6B-Q8_0.gguf
├── src/
│   ├── main.rs        # CLI — tokenize, prefill, generate, stream output
│   ├── lib.rs         # Module declarations
│   ├── gguf.rs        # GGUF file parser (header, metadata, tensor index, mmap)
│   ├── tensor.rs      # f16 conversion, Q8_0 dequant, mat-vec, RMSNorm, RoPE, softmax
│   ├── tokenizer.rs   # GPT-2 byte-level BPE with Qwen2 pre-tokenization
│   ├── model.rs       # Qwen3 transformer (GQA, QK-Norm, SwiGLU, KV cache)
│   └── sampler.rs     # Temperature + top-p sampling, xorshift64 PRNG
└── docs/
    ├── architecture.md   # Inference pipeline walkthrough
    ├── tokenizer.md      # How the BPE tokenizer works
    └── quantization.md   # GGUF format and Q8_0 quantization
```
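As a taste of what `src/gguf.rs` deals with, here is a minimal sketch of parsing the fixed GGUF header (magic, version, tensor count, metadata KV count) following the layout in the GGUF specification. The struct and function names are illustrative; the real parser goes on to walk the metadata and tensor index before memory-mapping the tensor data.

```rust
use std::convert::TryInto;

/// The fixed-size fields at the start of every GGUF file.
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Parse the 24-byte GGUF header: 4-byte magic "GGUF", then
/// little-endian version (u32), tensor count (u64), and KV count (u64).
fn parse_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err("file too small for GGUF header".into());
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("bad magic: not a GGUF file".into());
    }
    let u32_at = |o: usize| u32::from_le_bytes(bytes[o..o + 4].try_into().unwrap());
    let u64_at = |o: usize| u64::from_le_bytes(bytes[o..o + 8].try_into().unwrap());
    Ok(GgufHeader {
        version: u32_at(4),
        tensor_count: u64_at(8),
        metadata_kv_count: u64_at(16),
    })
}

fn main() {
    // Build a synthetic header in memory (illustrative values).
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"GGUF");
    bytes.extend_from_slice(&3u32.to_le_bytes());   // version
    bytes.extend_from_slice(&310u64.to_le_bytes()); // tensor count
    bytes.extend_from_slice(&25u64.to_le_bytes());  // metadata KV count
    let h = parse_header(&bytes).unwrap();
    println!("v{} tensors={} kv={}", h.version, h.tensor_count, h.metadata_kv_count);
}
```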
## Architecture

The model processes one token at a time through this pipeline:

```text
token ID
   │
   ▼
Embedding Lookup (Q8_0 → f32, 151936 × 1024)
   │
   ▼
28 × Transformer Block
   │  ┌─────────────────────────────────────────────┐
   │  │ RMSNorm → Q/K/V Projections (Q8_0 mat-vec)  │
   │  │ QK-Norm (per-head RMSNorm on Q and K)       │
   │  │ RoPE (θ = 1,000,000)                        │
   │  │ GQA: 16 query heads, 8 KV heads (2:1 ratio) │
   │  │ + Residual                                  │
   │  │                                             │
   │  │ RMSNorm → SwiGLU FFN (gate/up/down, Q8_0)   │
   │  │ + Residual                                  │
   │  └─────────────────────────────────────────────┘
   │
   ▼
Final RMSNorm
   │
   ▼
LM Head (tied embeddings, Q8_0 mat-vec → 151936 logits)
   │
   ▼
Temperature + Top-p Sampling → next token
```

See `docs/architecture.md` for a detailed walkthrough.
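The FFN half of each transformer block above can be sketched in a few lines, with plain f32 mat-vecs standing in for the Q8_0 kernels. Function names and signatures here are illustrative, not the engine's `model.rs` API.

```rust
/// RMSNorm: scale x to unit root-mean-square, then apply learned weights.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (ms + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}

/// Row-major matrix-vector product (f32 stand-in for the Q8_0 kernel).
fn mat_vec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

/// SiLU activation, the "S" in SwiGLU.
fn silu(v: f32) -> f32 { v / (1.0 + (-v).exp()) }

/// FFN sub-block: RMSNorm → down(silu(gate(x)) * up(x)) → + residual.
fn ffn_block(
    x: &[f32],
    norm_w: &[f32],
    w_gate: &[Vec<f32>],
    w_up: &[Vec<f32>],
    w_down: &[Vec<f32>],
) -> Vec<f32> {
    let normed = rms_norm(x, norm_w, 1e-6);
    let gate = mat_vec(w_gate, &normed);
    let up = mat_vec(w_up, &normed);
    let hidden: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    let down = mat_vec(w_down, &hidden);
    // Residual connection: add the block input back onto its output.
    x.iter().zip(&down).map(|(a, b)| a + b).collect()
}

fn main() {
    // Toy 2-dimensional example with identity weight matrices.
    let id = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let out = ffn_block(&[1.0, 2.0], &[1.0, 1.0], &id, &id, &id);
    println!("{:.3} {:.3}", out[0], out[1]);
}
```

The attention half of the block follows the same RMSNorm-transform-residual shape, with the Q/K/V projections, QK-Norm, RoPE, and GQA replacing the gate/up/down projections.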
## Running Tests

```sh
cargo test
```
## Learn More

This project is built for learning. Start with the source code — every public function has doc comments explaining what it does and why. Then dive into the docs:

- `docs/architecture.md` — How the inference pipeline works end to end
- `docs/tokenizer.md` — How text becomes tokens (and back)
- `docs/quantization.md` — GGUF format and Q8_0 quantization math
- `CLAUDE.md` — Model parameters, coding conventions, and design decisions
## Resources
- Qwen3-0.6B Model Card
- GGUF Specification
- llama.cpp — Reference C++ implementation
- The Illustrated Transformer