# Realizar ⚡

> **Pure Rust Model Serving - Built from Scratch**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-1.75%2B-blue.svg)](https://www.rust-lang.org)
[![Benchmark](https://img.shields.io/badge/vs_PyTorch-9.6x_faster-brightgreen.svg)](benches/comparative/BENCHMARK_RESULTS.md)

**Realizar** - Production ML inference engine built **100% from scratch** in pure Rust.

## 📊 Benchmark: 9.6x Faster Than PyTorch

<p align="center">
  <img src="docs/assets/benchmark-comparison.svg" alt="Aprender vs PyTorch Benchmark" width="600">
</p>

For **CPU-only, single-request inference** (AWS Lambda, edge, embedded):

| Metric | Aprender (Rust) | PyTorch (Python) | Winner |
|--------|-----------------|------------------|--------|
| **Inference Latency (p50)** | 0.52 µs | 5.00 µs | **9.6x faster** |
| **Throughput** | 1,898,614/sec | 195,754/sec | **9.7x higher** |
| **Cold Start** | ~5 ms | ~500 ms+ | **100x faster** |
| **Package Size** | ~5 MB | ~500 MB+ | **100x smaller** |
| **Lambda Memory** | 128 MB | 512 MB+ | **4x less** |

**Statistical Validation:** p < 0.001, Cohen's d = 5.19 (large effect), 10,000 iterations

<details>
<summary>📈 Reproduce the benchmark</summary>

```bash
# Run Aprender benchmark
cargo run --example mnist_apr_benchmark --release --features aprender-serve

# Run PyTorch benchmark
cd benches/comparative
uv sync
uv run mnist_benchmark.py

# Generate comparison report
uv run compare_mnist.py
```

See [BENCHMARK_RESULTS.md](benches/comparative/BENCHMARK_RESULTS.md) for full methodology.

</details>

### Why 9.6x Faster?

```
PyTorch (5.00 µs):
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Python  │ Bridge  │ Checks  │ COMPUTE │ Alloc   │ Return  │
│ interp  │ FFI     │ dispatch│ (real)  │ tensor  │ to Py   │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
                              ↑ Only 10% is actual work

Aprender (0.52 µs):
┌───────────┬───┐
│  COMPUTE  │ret│  ← 77% is actual work
└───────────┴───┘
```

**Bottom line:** For `.apr` models on Lambda/edge, Aprender eliminates Python entirely: faster, smaller, cheaper.

## AWS Lambda: 53,000x Faster Cold Start

<p align="center">
  <img src="docs/assets/lambda-apr-vs-pytorch.svg" alt="Lambda APR vs PyTorch" width="600">
</p>

For serverless deployment, the `.apr` format **dominates** PyTorch:

| Metric | .apr (Rust) | PyTorch | Improvement |
|--------|-------------|---------|-------------|
| **Cold Start** | 15µs | 800ms | **53,000x faster** |
| **Inference** | 0.6µs | 5.0µs | **8.5x faster** |
| **Binary Size** | 3.2KB | >100MB | **30,000x smaller** |
| **Lambda Memory** | 128MB | 512MB+ | **4x less** |

### 100% Reproducible Lambda Deployment

The model file is **checked into git** for byte-for-byte reproducibility:

```bash
# Model is already in the repo
ls -la models/mnist_784x2.apr  # 3,248 bytes

# Build Lambda binary (uses checked-in model)
make lambda-build

# Package for AWS
make lambda-package

# Run locally
make lambda-bench
```

See the [Lambda MNIST Benchmark](book/book/lambda/mnist-benchmark.html) chapter for full details.

<details>
<summary>📣 Share on LinkedIn</summary>

Copy-paste for LinkedIn:

---

**We benchmarked Rust vs Python for ML inference. The results: 9.6x faster.**

For CPU-only, single-request inference (AWS Lambda, edge devices):

- Latency: 0.52µs (Rust) vs 5.0µs (Python) – **9.6x faster**
- Cold start: 5ms vs 500ms+ – **100x faster**
- Package: 5MB vs 500MB – **100x smaller**
- Lambda RAM: 128MB vs 512MB – **4x less**

Why? Python's interpreter + FFI bridge overhead dominates small operations. 90% of PyTorch inference time is overhead, only 10% is actual compute.

Statistically validated: p < 0.001, Cohen's d = 5.19, 10,000 iterations, 100-point QA checklist.

Full methodology + reproducible benchmark: github.com/paiml/realizar

#MachineLearning #Rust #Python #AWS #Lambda #Performance #MLOps

---

</details>

## 🚀 Quick Start

```bash
# Build the binary
cargo build --release

# Start the inference server (demo mode)
./target/release/realizar serve --demo --port 8080

# Test the API
curl http://127.0.0.1:8080/health
curl -X POST http://127.0.0.1:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}'
curl -X POST http://127.0.0.1:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 10, "strategy": "greedy"}'

# View help
./target/release/realizar --help
./target/release/realizar serve --help
```

## βš™οΈ Feature Flags

Realizar supports modular compilation through feature flags:

```toml
[dependencies]
realizar = { version = "0.2", default-features = false, features = ["minimal"] }
```

**Available Features:**
- `default` = `["server", "cli", "gpu"]` - Full functionality
- `minimal` = `[]` - Core inference engine only (no server, no CLI)
- `server` - REST API server (requires axum, tokio)
- `cli` - Command-line interface (requires clap)
- `gpu` - GPU acceleration via Trueno
- `full` - Alias for all features

**Examples:**

```bash
# Core inference library only (minimal dependencies)
cargo build --no-default-features --features minimal

# Server without CLI
cargo build --no-default-features --features server,gpu

# Everything enabled
cargo build --features full
```
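
In code, each flag maps to `#[cfg(feature = "...")]` gates at module or item level, so the `minimal` build never compiles the server or CLI paths. A minimal sketch of that pattern, assuming illustrative module names rather than the crate's actual layout:

```rust
// lib.rs: core inference is always compiled; server/CLI/GPU code is gated.
pub mod tensor; // core, available even with `minimal`

#[cfg(feature = "server")]
pub mod server; // pulls in axum + tokio only when requested

#[cfg(feature = "cli")]
pub mod cli; // pulls in clap only when requested

// Backend selection can be gated the same way.
#[cfg(feature = "gpu")]
pub fn backend_name() -> &'static str {
    "trueno-gpu"
}

#[cfg(not(feature = "gpu"))]
pub fn backend_name() -> &'static str {
    "trueno-simd"
}
```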

## 🎯 Philosophy

**Total Control, Zero Compromise**

Build everything ourselves except HTTP infrastructure:
- ✅ **Transformer architecture** - Our code, Trueno-backed
- ✅ **Quantization** - Q4_0, Q8_0, Q4_K from scratch
- ✅ **Model parsing** - GGUF, safetensors native readers
- ✅ **Token encoding** - BPE, SentencePiece in pure Rust
- ✅ **Inference engine** - Every optimization under our control
- 🔧 **HTTP server** - axum (swappable via trait)

## 🚀 Target API

```rust
use realizar::{Model, Server};

// Load model (our loader, our format parsing)
let model = Model::from_gguf("models/llama-3.2-1b.gguf")?;

// Serve (swappable server backend)
Server::new(model)
    .with_gpu()
    .serve("0.0.0.0:8080")?;
```

```bash
# CLI
realizar serve --model llama-3.2-1b.gguf --port 8080

# REST API
curl -X POST http://localhost:8080/generate \
  -d '{"prompt": "Hello", "max_tokens": 100}'

# Metrics (Prometheus format)
curl http://localhost:8080/metrics
```

## πŸ—οΈ Architecture

```
┌────────────────────────────────────┐
│  HTTP Server (Swappable)           │
│  - axum (default, trait-based)     │
│  - hyper (future)                  │
│  - actix-web (future)              │
└────────────┬───────────────────────┘
             ↓
┌────────────────────────────────────┐
│  Inference Engine (FROM SCRATCH)   │
│  - Transformer (our code)          │
│  - Attention (Trueno-backed)       │
│  - Quantization (our algorithms)   │
│  - KV cache (our management)       │
└────────────┬───────────────────────┘
             ↓
┌────────────────────────────────────┐
│  Model Loader (FROM SCRATCH)       │
│  - GGUF parser (pure Rust)         │
│  - Safetensors reader (pure Rust)  │
└────────────┬───────────────────────┘
             ↓
┌────────────────────────────────────┐
│  Trueno (Compute Primitives)       │
│  - Matrix ops (SIMD/GPU)           │
│  - Vector ops (AVX2/NEON)          │
└────────────────────────────────────┘
```

## 📦 Dependencies (Minimal)

```toml
[dependencies]
# OUR ecosystem - we control these
trueno = { path = "../trueno" }  # SIMD/GPU compute primitives

# HTTP server ONLY (swappable via trait)
axum = "0.7"
tokio = { version = "1", features = ["rt-multi-thread"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Serialization (for API only, not ML)
serde = { version = "1", features = ["derive"] }
serde_json = "1"

# That's it. NO candle, NO llama-cpp-rs, NO hf-hub
```

## 🔧 What We Build from Scratch

### 1. Model Formats (Pure Rust Parsers)
- **GGUF** - Ollama/llama.cpp format
- **Safetensors** - HuggingFace format
- No external dependencies, complete control
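
The GGUF file begins with a small fixed preamble (magic bytes, format version, tensor count, metadata key/value count) before the metadata and tensor-info sections. Here is a hedged sketch of reading just that preamble for the GGUF v2/v3 layout; the function name and error handling are illustrative, not the crate's actual parser API:

```rust
/// Reads the fixed GGUF preamble: magic "GGUF", u32 version,
/// u64 tensor count, u64 metadata KV count (all little-endian, GGUF v2/v3).
fn read_gguf_preamble(bytes: &[u8]) -> Result<(u32, u64, u64), String> {
    if bytes.len() < 24 {
        return Err("file too small for a GGUF header".to_string());
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("missing GGUF magic bytes".to_string());
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(bytes[16..24].try_into().unwrap());
    Ok((version, tensor_count, metadata_kv_count))
}
```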

### 2. Transformer Architecture
```rust
pub struct Transformer {
    layers: Vec<TransformerLayer>,
    config: ModelConfig,
}

impl Transformer {
    pub fn forward(&self, tokens: &[u32]) -> Tensor {
        // Our implementation, Trueno ops
        let mut x = self.embed(tokens);
        for layer in &self.layers {
            x = layer.forward(x);  // We write this
        }
        self.lm_head(x)
    }
}
```

### 3. Attention Mechanism
```rust
pub fn attention(
    q: &Tensor,  // Trueno tensor
    k: &Tensor,
    v: &Tensor,
) -> Tensor {
    // Our attention implementation
    // Uses Trueno for matrix ops (SIMD/GPU)
    let scores = q.matmul(&k.transpose());
    let weights = scores.softmax();
    weights.matmul(v)
}
```

### 4. Quantization
```rust
pub mod quantize {
    // Q4_0 - 4-bit quantization
    pub fn q4_0(weights: &[f32]) -> (Vec<u8>, Vec<f32>) { }

    // Q8_0 - 8-bit quantization
    pub fn q8_0(weights: &[f32]) -> (Vec<i8>, Vec<f32>) { }

    // Q4_K - k-quant 4-bit
    pub fn q4_k(weights: &[f32]) -> Vec<u8> { }

    // Dequantization for inference
    pub fn dequantize(data: &[u8], qtype: QuantType) -> Vec<f32> { }
}
```
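
For a concrete picture of the 8-bit scheme, here is a hedged sketch of block-wise Q8_0-style quantization: 32 values per block, one f32 scale per block. The crate's actual block layout and function signatures may differ.

```rust
/// Values per quantization block (Q8_0-style).
const BLOCK: usize = 32;

/// Quantize: each block stores one scale and 32 signed bytes.
fn q8_0_quantize(weights: &[f32]) -> (Vec<i8>, Vec<f32>) {
    let mut quants = Vec::with_capacity(weights.len());
    let mut scales = Vec::new();
    for block in weights.chunks(BLOCK) {
        // The scale maps the largest magnitude in the block onto the i8 range.
        let max_abs = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
        let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
        scales.push(scale);
        quants.extend(block.iter().map(|&x| (x / scale).round() as i8));
    }
    (quants, scales)
}

/// Dequantize: multiply each byte by its block's scale.
fn q8_0_dequantize(quants: &[i8], scales: &[f32]) -> Vec<f32> {
    quants
        .chunks(BLOCK)
        .zip(scales)
        .flat_map(|(block, &scale)| block.iter().map(move |&q| q as f32 * scale))
        .collect()
}
```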

### 5. Token Encoding
```rust
pub struct Tokenizer {
    vocab: HashMap<String, u32>,
    merges: Vec<(String, String)>,
}

impl Tokenizer {
    // BPE encoding (from scratch)
    pub fn encode(&self, text: &str) -> Vec<u32> { }

    // Decoding
    pub fn decode(&self, tokens: &[u32]) -> String { }
}
```
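
The encode path of BPE is a greedy merge loop driven by the learned merge list, where earlier merges have higher priority. A simplified, hedged sketch built around the same `vocab`/`merges` shapes as the struct above; real tokenizers also add pre-tokenization, byte fallback, and special-token handling:

```rust
use std::collections::HashMap;

fn bpe_encode(text: &str, vocab: &HashMap<String, u32>, merges: &[(String, String)]) -> Vec<u32> {
    // Start from individual characters.
    let mut symbols: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the earliest (highest-priority) merge rule.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| {
                merges
                    .iter()
                    .position(|(a, b)| *a == w[0] && *b == w[1])
                    .map(|rank| (rank, i))
            })
            .min();
        let Some((_, i)) = best else { break };
        // Merge the pair into one symbol and continue.
        let merged = format!("{}{}", symbols[i], symbols[i + 1]);
        symbols[i] = merged;
        symbols.remove(i + 1);
    }
    // Map symbols to vocabulary IDs, skipping anything out of vocabulary.
    symbols.iter().filter_map(|s| vocab.get(s).copied()).collect()
}
```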

### 6. KV Cache
```rust
pub struct KVCache {
    keys: Vec<Tensor>,    // Trueno tensors
    values: Vec<Tensor>,
}

impl KVCache {
    // Efficient cache management
    pub fn update(&mut self, layer: usize, k: Tensor, v: Tensor) { }
    pub fn get(&self, layer: usize) -> (&Tensor, &Tensor) { }
}
```
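
The core idea: keys and values for already-processed tokens are stored per layer, and only the newest token's projections are appended each decoding step, so attention never recomputes past positions. A hedged sketch with plain vectors standing in for Trueno tensors:

```rust
/// Illustrative per-layer KV cache (the crate's version stores Trueno tensors).
struct SimpleKvCache {
    // One growable buffer of (key, value) rows per transformer layer.
    layers: Vec<Vec<(Vec<f32>, Vec<f32>)>>,
}

impl SimpleKvCache {
    fn new(num_layers: usize) -> Self {
        Self { layers: vec![Vec::new(); num_layers] }
    }

    /// Append the key/value projections for the newest token at one layer.
    fn update(&mut self, layer: usize, k: Vec<f32>, v: Vec<f32>) {
        self.layers[layer].push((k, v));
    }

    /// All cached (key, value) rows for a layer, in position order.
    fn get(&self, layer: usize) -> &[(Vec<f32>, Vec<f32>)] {
        &self.layers[layer]
    }
}
```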

## 🔌 Swappable HTTP Server

```rust
// HTTP server trait (axum is default, can swap)
pub trait HttpServer {
    fn serve(&self, addr: &str) -> Result<()>;
}

// Default: axum
pub struct AxumServer { /* ... */ }
impl HttpServer for AxumServer { /* ... */ }

// Future: hyper, actix-web, custom
pub struct HyperServer { /* ... */ }
impl HttpServer for HyperServer { /* ... */ }

// Usage
let server = Server::new(model)
    .with_backend(AxumServer::new())  // or HyperServer
    .serve("0.0.0.0:8080")?;
```

## 💡 Examples

Realizar includes **6 comprehensive examples** demonstrating all major features:

### 1. End-to-End Inference (`inference.rs`)
Complete text generation pipeline with model initialization, forward pass, and multiple sampling strategies (greedy, top-k, top-p).

```bash
cargo run --example inference
```
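
As a rough illustration of the sampling strategies the example walks through, here is a hedged sketch of greedy and top-k selection over a logit vector (not the example's exact code; top-p works the same way but truncates by cumulative probability instead of count):

```rust
/// Greedy sampling: pick the token with the highest logit.
fn greedy(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i as u32)
        .unwrap()
}

/// Top-k: keep only the k highest logits; a sampler then draws from
/// the softmax over just these candidates.
fn top_k_candidates(logits: &[f32], k: usize) -> Vec<(u32, f32)> {
    let mut indexed: Vec<(u32, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i as u32, l))
        .collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.truncate(k);
    indexed
}
```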

### 2. HTTP API Server (`api_server.rs`)
Deploy Realizar as a REST API service with demo model, handling tokenization and generation requests.

```bash
cargo run --example api_server
# Server runs at http://127.0.0.1:3000
# Test: curl http://127.0.0.1:3000/health
```

### 3. Tokenization (`tokenization.rs`)
Compare different tokenization strategies: Basic (vocabulary-based), BPE (Byte Pair Encoding), and SentencePiece.

```bash
cargo run --example tokenization
```

### 4. SafeTensors Loading (`safetensors_loading.rs`)
Load and inspect SafeTensors files (aprender compatibility), extract tensor data, interoperate with aprender-trained models.

```bash
cargo run --example safetensors_loading
```

### 5. Model Caching (`model_cache.rs`)
Demonstrate ModelCache for efficient model reuse with LRU eviction, metrics tracking, and config-based cache keys.

```bash
cargo run --example model_cache
```
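
The eviction policy is the classic LRU idea: a hit moves an entry to the front, and inserting into a full cache drops the least-recently-used entry. A generic, hedged sketch of that mechanism (not the crate's `ModelCache` API):

```rust
use std::collections::VecDeque;

/// Minimal LRU cache sketch: front of the deque = most recently used.
struct LruCache<K: PartialEq, V> {
    capacity: usize,
    entries: VecDeque<(K, V)>,
}

impl<K: PartialEq, V> LruCache<K, V> {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::new() }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        // A hit is moved to the front so it is marked most recently used.
        let pos = self.entries.iter().position(|(k, _)| k == key)?;
        let entry = self.entries.remove(pos).unwrap();
        self.entries.push_front(entry);
        self.entries.front().map(|(_, v)| v)
    }

    fn put(&mut self, key: K, value: V) {
        if let Some(pos) = self.entries.iter().position(|(k, _)| *k == key) {
            self.entries.remove(pos);
        } else if self.entries.len() == self.capacity {
            self.entries.pop_back(); // evict least recently used
        }
        self.entries.push_front((key, value));
    }
}
```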

### 6. GGUF Format Loading (`gguf_loading.rs`)
Load and inspect GGUF files (llama.cpp/Ollama format), parse headers and metadata, extract tensor data with dequantization support.

```bash
cargo run --example gguf_loading
```

See [`examples/README.md`](examples/README.md) for detailed documentation.

## ⚡ Reproducible Benchmarks

Realizar provides **scientifically rigorous, reproducible benchmarks** following [MLPerf™ Inference](https://mlcommons.org/benchmarks/inference/) methodology. All benchmarks use [Criterion.rs](https://bheisler.github.io/criterion.rs/book/) for statistical analysis with 95% confidence intervals.

### Quick Start

```bash
# Run all Realizar benchmarks
cargo bench

# Run comparative benchmarks (Realizar vs PyTorch)
make bench-comparative

# CLI benchmark commands
./target/release/realizar bench --list
./target/release/realizar bench tensor_ops
./target/release/realizar viz --samples 100
```

### Benchmark Suites

| Suite | Command | Description |
|-------|---------|-------------|
| `tensor_ops` | `cargo bench --bench tensor_ops` | Tensor creation, shape access, indexing |
| `inference` | `cargo bench --bench inference` | End-to-end token generation |
| `cache` | `cargo bench --bench cache` | KV cache hit/miss, eviction |
| `tokenizer` | `cargo bench --bench tokenizer` | BPE/SentencePiece encode/decode |
| `quantize` | `cargo bench --bench quantize` | Q4_0/Q8_0 dequantization |
| `comparative` | `cargo bench --bench comparative` | MNIST, CIFAR-10, Iris vs PyTorch |

### Reproducing Results

**Prerequisites:**

```bash
# Rust toolchain
rustup default stable
rustup update

# Python environment (uv)
curl -LsSf https://astral.sh/uv/install.sh | sh

# PyTorch dependencies
cd benches/comparative
uv sync
```

**Hardware Requirements:**

- CPU: x86_64 with AVX2 or ARM64 with NEON
- RAM: 8GB minimum
- Recommended: Disable CPU frequency scaling for stable measurements

```bash
# Linux: Set performance governor
sudo cpupower frequency-set --governor performance
```

**Step-by-Step Reproduction:**

```bash
# 1. Clone and build
git clone https://github.com/paiml/realizar.git
cd realizar
cargo build --release

# 2. Run Realizar benchmarks
cargo bench --bench tensor_ops
cargo bench --bench cache
cargo bench --bench comparative

# 3. Run PyTorch baseline (requires uv)
cd benches/comparative
uv sync
uv run pytorch_baseline.py --all --output pytorch_results.json

# 4. Generate comparison report
uv run run_comparison.py --output comparison_report.md

# 5. View HTML reports
open target/criterion/report/index.html
```

### Datasets

Benchmarks use canonical ML datasets via [Alimentar](https://github.com/paiml/alimentar) for PyTorch parity:

| Dataset | Dimensions | Classes | Features |
|---------|------------|---------|----------|
| **MNIST** | 28×28×1 | 10 | 784 |
| **CIFAR-10** | 32×32×3 | 10 | 3,072 |
| **Fashion-MNIST** | 28×28×1 | 10 | 784 |
| **Iris** | Tabular | 3 | 4 |

### Comparative Framework Testing

We benchmark against PyTorch under equivalent conditions:

| Setting | Value |
|---------|-------|
| Threads | 1 (single-threaded) |
| Batch sizes | 1, 8, 32 |
| Device | CPU only |
| Warm-up | 50 iterations |
| Measurement | 1000 iterations |

**Run comparative benchmarks:**

```bash
# Full comparison (Makefile)
make bench-comparative

# Manual execution
cargo bench --bench comparative
uv run benches/comparative/pytorch_baseline.py --all
uv run benches/comparative/run_comparison.py
```

### Performance Results

**Realizar (v0.2.1) - Intel Core i7, Linux 6.8:**

| Benchmark | Batch | Latency (p50) | Throughput |
|-----------|-------|---------------|------------|
| MNIST inference | 1 | 780 ns | 1.28M samples/s |
| MNIST inference | 32 | 23.8 µs | 1.34M samples/s |
| CIFAR-10 inference | 1 | 1.58 µs | 633K samples/s |
| CIFAR-10 inference | 32 | 49.8 µs | 642K samples/s |
| Iris inference | 32 | 210 ns | 152M samples/s |
| Tensor creation (10) | - | 18 ns | - |
| Tensor creation (10K) | - | 643 ns | - |
| Cache hit | - | 39 ns | - |

### Statistical Methodology

- **Warm-up phase**: Stabilize CPU caches and branch predictors
- **Sample collection**: 100 samples per benchmark (Criterion default)
- **Confidence intervals**: 95% CI reported as [lower, mean, upper]
- **Regression detection**: Automatic comparison against baseline
- **Effect size**: Cohen's d for practical significance

```
tensor_creation/10      time:   [17.887 ns 17.966 ns 18.043 ns]
                                 ^         ^         ^
                              lower      mean      upper
                              bound    estimate    bound
```
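
The effect size quoted above (Cohen's d) is the difference of sample means divided by the pooled standard deviation. A minimal sketch of the formula, not the benchmark harness's actual code:

```rust
/// Cohen's d for two samples: (mean_a - mean_b) / pooled standard deviation.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    let var = |xs: &[f64], m: f64| {
        xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64
    };
    let (ma, mb) = (mean(a), mean(b));
    let (na, nb) = (a.len() as f64, b.len() as f64);
    // Pooled std dev weights each sample's variance by its degrees of freedom.
    let pooled = (((na - 1.0) * var(a, ma) + (nb - 1.0) * var(b, mb)) / (na + nb - 2.0)).sqrt();
    (ma - mb) / pooled
}
```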

### Visualization

```bash
# Terminal visualization
./target/release/realizar viz

# Output includes:
# - Sparklines (trend visualization)
# - ASCII histograms (distribution shape)
# - Statistical summary (mean, std_dev, p50/p95/p99)
# - Multi-benchmark comparison tables
```

### References

1. MLPerf™ Inference Benchmark Suite. MLCommons. https://mlcommons.org/benchmarks/inference/
2. Criterion.rs: Statistics-driven Microbenchmarking. https://bheisler.github.io/criterion.rs/book/
3. Box, G. E. P., Hunter, J. S., & Hunter, W. G. (2005). *Statistics for Experimenters*. Wiley.
4. Fleming, P. J., & Wallace, J. J. (1986). How not to lie with statistics. *CACM*, 29(3), 218-221.

## 📊 Roadmap

### Phase 1: Core Inference (Weeks 1-8) ✅ COMPLETE

**Build from scratch:**
- ✅ GGUF parser (binary format reader)
- ✅ Safetensors parser (zero-copy reader)
- ✅ Transformer architecture (attention, FFN, LayerNorm, RoPE)
- ✅ Quantization (Q4_0, Q8_0, dequantization)
- ✅ Tokenizer (BPE, SentencePiece)
- ✅ KV cache management
- ✅ Inference engine (generation loop, greedy/top-k/top-p)
- ✅ HTTP server with axum (REST API)
- ✅ CLI: `realizar serve --demo` (model loading in Phase 2)
- ✅ 260 tests (211 unit + 42 property + 7 integration), 94.61% coverage

**Success criteria:**
- ✅ GGUF and Safetensors parsers working
- ✅ Quantization working (Q4_0, Q8_0)
- ✅ REST API with /health, /tokenize, /generate
- ✅ GPU acceleration via Trueno
- ✅ Zero external ML dependencies
- ✅ TDG Score: 93.9/100 (A)

### Phase 2: Optimization (Weeks 9-16) ✅ COMPLETE

- ✅ Advanced quantization (Q4_K, Q5_K, Q6_K)
- ✅ Flash Attention (memory-efficient block-wise computation)
- ✅ Batch inference
- ✅ Streaming responses (SSE)
- ✅ Model caching/warming
- ✅ Benchmarks vs llama.cpp

### Phase 3: Advanced Models (Weeks 17-24)

- ✅ Multi-query attention (MQA)
- ✅ Grouped-query attention (GQA)
- ✅ RoPE position embeddings
- ✅ ALiBi position embeddings
- [ ] Vision models (LLaVA, Qwen-VL)

### Phase 4: Production (Weeks 25-32) ✅ COMPLETE

- ✅ Multi-model serving (ModelRegistry with concurrent access)
- ✅ Request batching (batch tokenize & generate endpoints)
- ✅ Monitoring/metrics (Prometheus-compatible /metrics endpoint)
- ✅ Docker + GPU support (Dockerfile, docker-compose, Kubernetes, AWS ECS)
- ✅ Load testing (Rust-based load test client, 7 scenarios, performance targets)

## πŸ› οΈ Development

```bash
# Build
cargo build --release

# Test
cargo test

# Quality gates
make quality-gates

# Run (when implemented)
cargo run --release -- serve --model llama-3.2-1b.gguf --port 8080
```

## 📚 Documentation

Comprehensive documentation is available as an mdBook:

```bash
# Build and view the book
make book

# Build only
make book-build

# Live reload (for writing docs)
make book-serve

# Open in browser
make book-open
```

The book covers:
- **Core Architecture** - Design philosophy, Trueno integration, feature flags
- **Model Formats** - GGUF and Safetensors parsing from scratch
- **Quantization** - Q4_0, Q8_0, and K-quant algorithms
- **Transformer Architecture** - Attention, RoPE, FFN, KV cache implementation
- **Tokenization** - BPE and SentencePiece without external libraries
- **REST API & CLI** - Production HTTP server and command-line interface
- **GPU Acceleration** - Trueno SIMD/GPU dispatch
- **EXTREME TDD** - Property-based testing, mutation testing methodology
- **Development Phases** - Phase 1-4 roadmap and implementation details

**Note:** Book structure is validated in `make quality-gates` to ensure documentation stays in sync with code.

## 🎓 Learning Resources

We're building everything from scratch. Key papers:
- **[11] TensorFlow** - Model serving architecture
- **[12] PyTorch** - Imperative ML framework design
- **[13] NumPy** - N-dimensional array design
- **[18] BLAS** - Linear algebra API design
- **[19] Strassen** - Fast matrix multiplication
- **[20] Kahan** - Numerical stability

Full spec: [docs/specifications/pure-rust-ml-library-research-spec.md](docs/specifications/pure-rust-ml-library-research-spec.md)

## 🔒 Security

- **Pure Rust** - Memory safe by design
- **Zero unsafe** in public API
- **Minimal deps** - axum + tokio only for HTTP
- `cargo audit` pre-commit
- `cargo-deny` license checks

## 🤝 Contributing

1. Fork repo
2. EXTREME TDD (tests first)
3. `make quality-gates` passes
4. All commits on `master`

## 📄 License

MIT License - see [LICENSE](LICENSE)

## πŸ™ Acknowledgments

- **[Trueno](https://github.com/paiml/trueno)** - SIMD/GPU compute primitives (our ecosystem)
- **[Aprender](https://github.com/paiml/aprender)** - ML algorithms (Phase 2+)
- **[Renacer](https://github.com/paiml/renacer)** - Profiling
- **[paiml-mcp-agent-toolkit](https://github.com/paiml/paiml-mcp-agent-toolkit)** - Quality gates
- **[bashrs](https://github.com/paiml/bashrs)** - Script enforcement

Developed by [Pragmatic AI Labs](https://paiml.com)

---

**Built from SCRATCH with EXTREME TDD** 🦀⚡