# LARGE

**L**ightweight **A**rchitecture for **R**unning **G**enerative **E**ngines

An educational, from-scratch LLM inference engine written in Rust. It loads a Qwen3-0.6B model in GGUF format and generates text on the CPU — no GPU, no frameworks, no magic.
## Features
- GGUF parser — reads the model file format with memory-mapped tensor access (zero-copy, 639 MB stays on disk)
- BPE tokenizer — GPT-2 byte-level BPE with Qwen2 pre-tokenization, loaded from GGUF metadata
- Full transformer forward pass — 28-layer decoder with Grouped Query Attention, QK-Norm, RoPE, and SwiGLU FFN
- KV cache — autoregressive generation with cached key/value states
- Q8_0 dequantization — fused quantized dot product (no intermediate buffers)
- Sampling — temperature scaling and top-p (nucleus) sampling
- Streaming output — tokens printed as they're generated
- Minimal dependencies — only `memmap2`, `byteorder`, and `regex`
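The fused Q8_0 path can be illustrated with a small sketch. This is a toy stand-in, not the engine's actual `tensor.rs` API: each Q8_0 block stores 32 signed 8-bit quants plus a per-block scale, and the dot product applies the scale once per block instead of dequantizing into an intermediate buffer.

```rust
/// Toy stand-in for a Q8_0 block: 32 signed 8-bit quants plus a
/// per-block scale (stored as f16 in the real GGUF format).
struct BlockQ80 {
    scale: f32,
    quants: [i8; 32],
}

/// Fused dequantize-and-dot: the scale is applied once per block,
/// so no intermediate f32 buffer is ever materialized.
fn dot_q8_0(blocks: &[BlockQ80], x: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for (i, block) in blocks.iter().enumerate() {
        let mut block_sum = 0.0f32;
        for (j, &q) in block.quants.iter().enumerate() {
            block_sum += q as f32 * x[i * 32 + j];
        }
        sum += block.scale * block_sum;
    }
    sum
}

fn main() {
    // 32 quants of value 2, scale 0.5, against a vector of ones.
    let block = BlockQ80 { scale: 0.5, quants: [2; 32] };
    let x = vec![1.0f32; 32];
    println!("{}", dot_q8_0(&[block], &x));
}
```

Deferring the multiply by `scale` to once per block is what makes the kernel "fused": the same arithmetic as dequantize-then-dot, with one pass over memory.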
## Quick Start

```sh
# Build
cargo build --release

# Download the model (~639 MB) and place it at models/Qwen3-0.6B-Q8_0.gguf

# Run inference
cargo run --release
```
## Example Output

```text
LARGE — Qwen3-0.6B Inference Engine
====================================
Loading model from models/Qwen3-0.6B-Q8_0.gguf...
Building tokenizer...
Initializing model...
Ready! (28 layers, 151936 vocab, loaded in 0.23s)

Prompt: "What is the capital of France?" (15 tokens after chat formatting)
Prefill: 15 tokens in 5.99s (2.5 tok/s)

<think>
Okay, the user is asking about the capital of France. Let me think.
I know that France's capital is Paris.
</think>

The capital of France is **Paris**.

Generated: 151 tokens in 48.81s (3.1 tok/s)
```
## Project Structure

```text
large/
├── README.md          # This file
├── CLAUDE.md          # Project conventions & model architecture reference
├── Cargo.toml         # 3 dependencies: memmap2, byteorder, regex
├── models/            # Model files (git-ignored)
│   └── Qwen3-0.6B-Q8_0.gguf
├── src/
│   ├── main.rs        # CLI — tokenize, prefill, generate, stream output
│   ├── lib.rs         # Module declarations
│   ├── gguf.rs        # GGUF file parser (header, metadata, tensor index, mmap)
│   ├── tensor.rs      # f16 conversion, Q8_0 dequant, mat-vec, RMSNorm, RoPE, softmax
│   ├── tokenizer.rs   # GPT-2 byte-level BPE with Qwen2 pre-tokenization
│   ├── model.rs       # Qwen3 transformer (GQA, QK-Norm, SwiGLU, KV cache)
│   └── sampler.rs     # Temperature + top-p sampling, xorshift64 PRNG
└── docs/
    ├── architecture.md   # Inference pipeline walkthrough
    ├── tokenizer.md      # How the BPE tokenizer works
    └── quantization.md   # GGUF format and Q8_0 quantization
```
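As a taste of what `src/gguf.rs` deals with, here is a minimal sketch of parsing the fixed GGUF header (magic, version, tensor count, metadata KV count) following the layout in the GGUF specification. The struct and function names are illustrative; the real parser goes on to walk the metadata and tensor index before memory-mapping the tensor data.

```rust
use std::convert::TryInto;

/// The fixed-size fields at the start of every GGUF file.
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Parse the 24-byte GGUF header: 4-byte magic "GGUF", then
/// little-endian version (u32), tensor count (u64), and KV count (u64).
fn parse_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err("file too small for GGUF header".into());
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("bad magic: not a GGUF file".into());
    }
    let u32_at = |o: usize| u32::from_le_bytes(bytes[o..o + 4].try_into().unwrap());
    let u64_at = |o: usize| u64::from_le_bytes(bytes[o..o + 8].try_into().unwrap());
    Ok(GgufHeader {
        version: u32_at(4),
        tensor_count: u64_at(8),
        metadata_kv_count: u64_at(16),
    })
}

fn main() {
    // Build a synthetic header in memory (illustrative values).
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"GGUF");
    bytes.extend_from_slice(&3u32.to_le_bytes());   // version
    bytes.extend_from_slice(&310u64.to_le_bytes()); // tensor count
    bytes.extend_from_slice(&25u64.to_le_bytes());  // metadata KV count
    let h = parse_header(&bytes).unwrap();
    println!("v{} tensors={} kv={}", h.version, h.tensor_count, h.metadata_kv_count);
}
```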
## Architecture

The model processes one token at a time through this pipeline:

```text
token ID
   │
   ▼
Embedding Lookup (Q8_0 → f32, 151936 × 1024)
   │
   ▼
28 × Transformer Block
   │  ┌─────────────────────────────────────────────┐
   │  │ RMSNorm → Q/K/V Projections (Q8_0 mat-vec)  │
   │  │ QK-Norm (per-head RMSNorm on Q and K)       │
   │  │ RoPE (θ = 1,000,000)                        │
   │  │ GQA: 16 query heads, 8 KV heads (2:1 ratio) │
   │  │ + Residual                                  │
   │  │                                             │
   │  │ RMSNorm → SwiGLU FFN (gate/up/down, Q8_0)   │
   │  │ + Residual                                  │
   │  └─────────────────────────────────────────────┘
   │
   ▼
Final RMSNorm
   │
   ▼
LM Head (tied embeddings, Q8_0 mat-vec → 151936 logits)
   │
   ▼
Temperature + Top-p Sampling → next token
```

See `docs/architecture.md` for a detailed walkthrough.
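The FFN half of each transformer block above can be sketched in a few lines, with plain f32 mat-vecs standing in for the Q8_0 kernels. Function names and signatures here are illustrative, not the engine's `model.rs` API.

```rust
/// RMSNorm: scale x to unit root-mean-square, then apply learned weights.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (ms + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}

/// Row-major matrix-vector product (f32 stand-in for the Q8_0 kernel).
fn mat_vec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

/// SiLU activation, the "S" in SwiGLU.
fn silu(v: f32) -> f32 { v / (1.0 + (-v).exp()) }

/// FFN sub-block: RMSNorm → down(silu(gate(x)) * up(x)) → + residual.
fn ffn_block(
    x: &[f32],
    norm_w: &[f32],
    w_gate: &[Vec<f32>],
    w_up: &[Vec<f32>],
    w_down: &[Vec<f32>],
) -> Vec<f32> {
    let normed = rms_norm(x, norm_w, 1e-6);
    let gate = mat_vec(w_gate, &normed);
    let up = mat_vec(w_up, &normed);
    let hidden: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    let down = mat_vec(w_down, &hidden);
    // Residual connection: add the block input back onto its output.
    x.iter().zip(&down).map(|(a, b)| a + b).collect()
}

fn main() {
    // Toy 2-dimensional example with identity weight matrices.
    let id = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let out = ffn_block(&[1.0, 2.0], &[1.0, 1.0], &id, &id, &id);
    println!("{:.3} {:.3}", out[0], out[1]);
}
```

The attention half of the block follows the same RMSNorm-transform-residual shape, with the Q/K/V projections, QK-Norm, RoPE, and GQA replacing the gate/up/down projections.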
## Running Tests

```sh
cargo test
```
## Learn More

This project is built for learning. Start with the source code — every public function has doc comments explaining what it does and why. Then dive into the docs:

- `docs/architecture.md` — How the inference pipeline works end to end
- `docs/tokenizer.md` — How text becomes tokens (and back)
- `docs/quantization.md` — GGUF format and Q8_0 quantization math
- `CLAUDE.md` — Model parameters, coding conventions, and design decisions
## Resources
- Qwen3-0.6B Model Card
- GGUF Specification
- llama.cpp — Reference C++ implementation
- The Illustrated Transformer