# LARGE

**L**ightweight **A**rchitecture for **R**unning **G**enerative **E**ngines

An educational, from-scratch LLM inference engine written in Rust.
Loads a [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B-GGUF) model in GGUF format and generates text on CPU — no GPU, no frameworks, no magic.

## Features

- **GGUF parser** — reads the model file format with memory-mapped tensor access (zero-copy, 639 MB stays on disk)
- **BPE tokenizer** — GPT-2 byte-level BPE with Qwen2 pre-tokenization, loaded from GGUF metadata
- **Full transformer forward pass** — 28-layer decoder with Grouped Query Attention, QK-Norm, RoPE, and SwiGLU FFN
- **KV cache** — autoregressive generation with cached key/value states
- **Q8_0 dequantization** — fused quantized dot product (no intermediate buffers)
- **Sampling** — temperature scaling and top-p (nucleus) sampling
- **Streaming output** — tokens printed as they're generated
- **Minimal dependencies** — only `memmap2`, `byteorder`, and `regex`
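The fused Q8_0 dot product mentioned above can be sketched in a few lines. This is a simplified illustration, not the code in `tensor.rs`: the real GGUF layout stores each block's scale as an f16, whereas here it is already an f32, and the real kernel reads blocks straight from the memory-mapped file.

```rust
/// One Q8_0 block: 32 int8 weights sharing a single scale.
/// (In the actual GGUF layout the scale is stored as f16; this sketch
/// assumes it has already been converted to f32.)
struct BlockQ8_0 {
    scale: f32,
    quants: [i8; 32],
}

/// Fused dequantize + dot product: multiply-accumulate each quantized
/// weight against the input activations, applying the scale once per
/// block, so no intermediate f32 weight buffer is ever materialized.
fn dot_q8_0(blocks: &[BlockQ8_0], x: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for (block, xs) in blocks.iter().zip(x.chunks_exact(32)) {
        let mut block_sum = 0.0f32;
        for (&q, &xi) in block.quants.iter().zip(xs) {
            block_sum += q as f32 * xi;
        }
        sum += block.scale * block_sum; // dequantize once per block
    }
    sum
}

fn main() {
    // Toy example: one block of weights all equal to 2, scale 0.5,
    // dotted with an all-ones input: 0.5 * (2 * 1.0) * 32 = 32.0.
    let block = BlockQ8_0 { scale: 0.5, quants: [2i8; 32] };
    let x = vec![1.0f32; 32];
    println!("{}", dot_q8_0(&[block], &x));
}
```

Keeping the multiply-accumulate inside the block loop and applying the scale once per block is what makes the kernel "fused": the weights go from i8 to f32 only inside the accumulator.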

## Quick Start

```bash
# Build
cargo build --release

# Download the model (~639 MB)
mkdir -p models
wget -O models/Qwen3-0.6B-Q8_0.gguf \
  "https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf"

# Run inference
cargo run --release -- "What is the capital of France?"
```

### Example Output

```
LARGE — Qwen3-0.6B Inference Engine
====================================
Loading model from models/Qwen3-0.6B-Q8_0.gguf...
Building tokenizer...
Initializing model...
Ready! (28 layers, 151936 vocab, loaded in 0.23s)

Prompt: "What is the capital of France?"  (15 tokens after chat formatting)

Prefill: 15 tokens in 5.99s (2.5 tok/s)
<think>
Okay, the user is asking about the capital of France. Let me think.
I know that France's capital is Paris.
</think>

The capital of France is **Paris**.

Generated: 151 tokens in 48.81s (3.1 tok/s)
```

## Project Structure

```
large/
├── README.md              # This file
├── CLAUDE.md              # Project conventions & model architecture reference
├── Cargo.toml             # 3 dependencies: memmap2, byteorder, regex
├── models/                # Model files (git-ignored)
│   └── Qwen3-0.6B-Q8_0.gguf
├── src/
│   ├── main.rs            # CLI — tokenize, prefill, generate, stream output
│   ├── lib.rs             # Module declarations
│   ├── gguf.rs            # GGUF file parser (header, metadata, tensor index, mmap)
│   ├── tensor.rs          # f16 conversion, Q8_0 dequant, mat-vec, RMSNorm, RoPE, softmax
│   ├── tokenizer.rs       # GPT-2 byte-level BPE with Qwen2 pre-tokenization
│   ├── model.rs           # Qwen3 transformer (GQA, QK-Norm, SwiGLU, KV cache)
│   └── sampler.rs         # Temperature + top-p sampling, xorshift64 PRNG
└── docs/
    ├── architecture.md    # Inference pipeline walkthrough
    ├── tokenizer.md       # How the BPE tokenizer works
    └── quantization.md    # GGUF format and Q8_0 quantization
```

## Architecture

The model processes one token at a time through this pipeline:

```
token ID
  │
  ▼
Embedding Lookup (Q8_0 → f32, 151936 × 1024)
  │
  ▼
 28 × Transformer Block
  │  ┌─────────────────────────────────────────────────┐
  │  │ RMSNorm → Q/K/V Projections (Q8_0 mat-vec)     │
  │  │ QK-Norm (per-head RMSNorm on Q and K)           │
  │  │ RoPE (θ = 1,000,000)                            │
  │  │ GQA: 16 query heads, 8 KV heads (2:1 ratio)    │
  │  │ + Residual                                      │
  │  │                                                 │
  │  │ RMSNorm → SwiGLU FFN (gate/up/down, Q8_0)      │
  │  │ + Residual                                      │
  │  └─────────────────────────────────────────────────┘
  │
  ▼
Final RMSNorm
  │
  ▼
LM Head (tied embeddings, Q8_0 mat-vec → 151936 logits)
  │
  ▼
Temperature + Top-p Sampling → next token
```
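The RoPE step in the diagram rotates pairs of query/key components by position-dependent angles derived from the base frequency θ = 1,000,000. A minimal sketch of that rotation, assuming the "split halves" pairing convention used by llama.cpp-style implementations (the exact layout in `tensor.rs` may differ):

```rust
/// Apply rotary position embeddings (RoPE) in place to one head's
/// query or key vector. Component pairs (x[i], x[i + half]) are
/// rotated by angle pos * theta^(-2i / dim).
fn rope(x: &mut [f32], pos: usize, theta: f32) {
    let dim = x.len();
    let half = dim / 2;
    for i in 0..half {
        let freq = theta.powf(-2.0 * i as f32 / dim as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}

fn main() {
    // At position 0 every angle is zero, so the vector is unchanged.
    let mut x = vec![1.0f32, 2.0, 3.0, 4.0];
    rope(&mut x, 0, 1_000_000.0);
    println!("{:?}", x);
}
```

Because the angles grow with position, each token's query and key encode where the token sits in the sequence, which is what lets attention remain position-aware without separate positional embeddings.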

See [docs/architecture.md](docs/architecture.md) for a detailed walkthrough.
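The final sampling stage can be sketched as follows. This is an illustration of the temperature + top-p strategy, not the exact code in `src/sampler.rs` (which draws its random number from a xorshift64 PRNG; here the draw is passed in to keep the example deterministic):

```rust
/// Temperature + top-p (nucleus) sampling over raw logits.
/// `rand01` is a uniform draw in [0, 1); the real sampler produces it
/// with a xorshift64 PRNG.
fn sample_top_p(logits: &[f32], temperature: f32, top_p: f32, rand01: f32) -> usize {
    // Temperature scaling, then a numerically stable softmax.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .enumerate()
        .collect();
    let sum: f32 = probs.iter().map(|&(_, p)| p).sum();
    for p in probs.iter_mut() {
        p.1 /= sum;
    }

    // Keep the smallest set of tokens whose cumulative mass >= top_p.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut cum = 0.0;
    let mut cutoff = probs.len();
    for (i, &(_, p)) in probs.iter().enumerate() {
        cum += p;
        if cum >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    probs.truncate(cutoff);

    // Draw from the truncated distribution (renormalized implicitly
    // by scaling the random draw to the remaining mass).
    let mass: f32 = probs.iter().map(|&(_, p)| p).sum();
    let mut r = rand01 * mass;
    for &(id, p) in &probs {
        r -= p;
        if r <= 0.0 {
            return id;
        }
    }
    probs.last().unwrap().0
}

fn main() {
    // One dominant logit with a tight nucleus: only the top token survives.
    let logits = [0.0f32, 10.0, 0.0];
    println!("{}", sample_top_p(&logits, 1.0, 0.5, 0.9));
}
```

Lower temperatures sharpen the distribution before the nucleus cut, so temperature and top-p interact: both reduce how often low-probability tokens are sampled.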

## Running Tests

```bash
cargo test                    # 37 tests (unit + integration with model file)
cargo clippy                  # Lint (0 warnings)
cargo doc --open              # Generated API documentation
```

## Learn More

This project is built for learning. Start with the source code — every public function
has doc comments explaining *what* it does and *why*. Then dive into the docs:

- [docs/architecture.md](docs/architecture.md) — How the inference pipeline works end-to-end
- [docs/tokenizer.md](docs/tokenizer.md) — How text becomes tokens (and back)
- [docs/quantization.md](docs/quantization.md) — GGUF format and Q8_0 quantization math
- [CLAUDE.md](CLAUDE.md) — Model parameters, coding conventions, and design decisions

## Resources

- [Qwen3-0.6B Model Card](https://huggingface.co/Qwen/Qwen3-0.6B)
- [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [llama.cpp](https://github.com/ggerganov/llama.cpp) — Reference C++ implementation
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)