aprender 0.29.3

# The .apr Format: A Five Whys Deep Dive

Why does aprender use its own model format instead of GGUF, SafeTensors, or ONNX? This chapter applies Toyota's **Five Whys** methodology to explain every design decision and preemptively address skepticism.

## Executive Summary

| Feature | .apr | GGUF | SafeTensors | ONNX |
|---------|------|------|-------------|------|
| Pure Rust | **Yes** | No (C/C++) | Partial | No (C++) |
| WASM | **Native** | No | Limited | No |
| Single Binary Embed | **Yes** | No | No | No |
| Encryption | **AES-256-GCM** | No | No | No |
| ARM/Embedded | **Native** | Requires porting | Limited | Requires runtime |
| trueno SIMD | **Native** | N/A | N/A | N/A |
| File Size Overhead | **32 bytes** | ~1KB | ~100 bytes | ~10KB |

## The Five Whys: Why Not Just Use GGUF?

### Why #1: Why create a new format at all?

**Skeptic:** "GGUF is the industry standard for LLMs. Why reinvent the wheel?"

**Answer:** GGUF solves a different problem. It's optimized for *loading pre-trained LLMs into llama.cpp*. We need a format optimized for:
- Training and saving *any* ML model type (not just transformers)
- Deploying to browsers, embedded devices, and serverless
- Zero C/C++ dependencies (security, portability)

```rust,ignore
// GGUF requires: C compiler, platform-specific builds
// .apr requires: Nothing. Pure Rust.

use aprender::format::{save, load, ModelType};

// Works identically on x86_64, ARM, WASM
let model = train_model(&data)?;
save(&model, ModelType::RandomForest, "model.apr", Default::default())?;
```

### Why #2: Why does "Pure Rust" matter?

**Skeptic:** "C/C++ is fast. Who cares about purity?"

**Answer:** Because C/C++ dependencies cause these real problems:

| Problem | Impact | .apr Solution |
|---------|--------|---------------|
| Cross-compilation | Can't easily build ARM from x86 | `cargo build --target aarch64` just works |
| WASM | C libraries don't compile to WASM | Pure Rust compiles to wasm32 |
| Security audits | C code requires separate tooling | `cargo audit` covers everything |
| Supply chain | C deps have separate CVE tracking | Single Rust dependency tree |
| Reproducibility | C builds vary by system | Cargo lockfile guarantees reproducibility |

**Real example:** Try deploying llama.cpp to AWS Lambda ARM64. Now try:

```bash
# .apr deployment to Lambda ARM64
cargo build --release --target aarch64-unknown-linux-gnu
zip lambda.zip target/aarch64-unknown-linux-gnu/release/inference
# Done. No Docker, no cross-compilation toolchain, no prayers.
```

### Why #3: Why does WASM support matter?

**Skeptic:** "ML in the browser is a toy. Serious inference runs on servers."

**Answer:** WASM isn't just browsers. It's:

1. **Cloudflare Workers** - 0ms cold start, runs at edge (200+ cities)
2. **Fastly Compute** - Sub-millisecond inference at edge
3. **Vercel Edge Functions** - Next.js with embedded ML
4. **Embedded WASM** - Wasmtime on IoT devices
5. **Plugin systems** - Sandboxed ML in any application

```rust,ignore
// Same model, same code, runs everywhere
#[cfg(target_arch = "wasm32")]
use aprender::format::load_from_bytes;

const MODEL: &[u8] = include_bytes!("model.apr");

pub fn predict(input: &[f32]) -> Vec<f32> {
    let model: RandomForest = load_from_bytes(MODEL, ModelType::RandomForest)
        .expect("embedded model is valid");
    model.predict_proba(input)
}
```

**Business case:** A Cloudflare Worker costs $0.50/million requests. A GPU VM costs $500+/month. For classification tasks, edge inference is 1000x cheaper.

### Why #4: Why embed models in binaries?

**Skeptic:** "Just download models at runtime like everyone else."

**Answer:** Runtime downloads create these failure modes:

| Failure Mode | Probability | Impact |
|--------------|-------------|--------|
| Network unavailable | Common (planes, submarines, air-gapped) | Total failure |
| CDN outage | Rare but catastrophic | All users affected |
| Model URL changes | Common over years | Silent breakage |
| Version mismatch | Common | Undefined behavior |
| Man-in-the-middle | Possible | Security breach |

**Embedded models eliminate all of these:**

```rust,ignore
// Model is part of the binary. No network. No CDN. No MITM.
const MODEL: &[u8] = include_bytes!("../models/classifier.apr");

fn main() {
    // This CANNOT fail due to network issues
    let model: DecisionTree = load_from_bytes(MODEL, ModelType::DecisionTree)
        .expect("compile-time verified model");

    // Binary hash includes model - tamper-evident
    // Version is locked at compile time - no drift
}
```

**Size impact:** A quantized decision tree is ~50KB. Your binary grows by 50KB. That's nothing.

### Why #5: Why does encryption belong in the format?

**Skeptic:** "Encrypt at the filesystem level. Don't bloat the format."

**Answer:** Filesystem encryption doesn't travel with the model:

```text
Scenario: Share trained model with partner company

Filesystem encryption:
1. Encrypt model file with GPG
2. Send encrypted file + password via separate channel
3. Partner decrypts to filesystem
4. Model now sits unencrypted on their disk
5. Partner's intern accidentally commits it to GitHub
6. Model leaked. Game over.

.apr encryption:
1. Encrypt model for partner's X25519 public key
2. Send .apr file (password never transmitted)
3. Partner loads directly - decryption in memory only
4. Model NEVER exists unencrypted on disk
5. Intern commits .apr file? Useless without private key.
```

```rust,ignore
use aprender::format::{save_for_recipient, load_as_recipient};
use aprender::format::x25519::{PublicKey, SecretKey};

// Sender: Encrypt for specific recipient
save_for_recipient(&model, ModelType::Custom, "partner.apr", opts, &partner_public_key)?;

// Recipient: Decrypt with their secret key (model never touches disk unencrypted)
let model: MyModel = load_as_recipient("partner.apr", ModelType::Custom, &my_secret_key)?;
```

## Deep Dive: JSON Metadata

### Why Metadata in Model Files?

Models often need more than just weights. Tokenizers, vocabulary, config, and custom data should travel with the model:

| Data Type | Without Metadata | With .apr Metadata |
|-----------|------------------|-------------------|
| Vocabulary | Separate `vocab.json` | Embedded in model |
| Config | Separate `config.yaml` | Embedded in model |
| Tokenizer | Separate `tokenizer.json` | Embedded in model |
| Custom | Application-specific files | Single `.apr` file |

### Tokenizer Preservation (PMAT-APR-TOK-001)

**Critical Feature (v1.2.0):** APR files now automatically embed tokenizers during conversion, making them truly self-contained portable files.

| Conversion Path | Tokenizer Source | Preservation |
|-----------------|------------------|--------------|
| SafeTensors → APR | Sibling `tokenizer.json` | ✅ Embedded in APR metadata |
| GGUF → APR | GGUF vocabulary tensors | ✅ Embedded in APR metadata |
| APR Inference | APR metadata | ✅ Automatic token decoding |

**Tokenizer Metadata Keys:**
- `tokenizer.vocabulary` - Full vocabulary list (e.g., 151,643 tokens for Qwen2.5)
- `tokenizer.vocab_size` - Vocabulary size
- `tokenizer.bos_token_id` - Beginning-of-sequence token ID
- `tokenizer.eos_token_id` - End-of-sequence token ID
- `tokenizer.model_type` - Tokenizer type (BPE, etc.)

**Verification:**
```bash
# Check if tokenizer is embedded
strings model.apr | grep "tokenizer.vocabulary"

# Verify vocabulary size
apr inspect model.apr --json | jq '.metadata.tokenizer.vocab_size'
```

### Using JSON Metadata

```rust,ignore
use aprender::serialization::apr::{AprWriter, AprReader};
use serde_json::json;

// Create model with metadata
let mut writer = AprWriter::new();

// Add arbitrary JSON metadata
writer.set_metadata("model_type", json!("whisper-tiny"));
writer.set_metadata("n_vocab", json!(51865));
writer.set_metadata("tokenizer", json!({
    "tokens": ["<|endoftext|>", "<|startoftranscript|>", "the", "a"],
    "merges": [["t", "h"], ["th", "e"]],
    "special_tokens": {"eot": 50256, "sot": 50257}
}));

// Add tensors
writer.add_tensor_f32("encoder.weight", vec![384, 80], &weights);

// Write single file
let bytes = writer.to_bytes()?;

// Read back
let reader = AprReader::from_bytes(bytes)?;
let tokenizer = reader.get_metadata("tokenizer").unwrap();
let weights = reader.read_tensor_f32("encoder.weight")?;
```

### WASM Deployment with Embedded Vocab

This is the killer feature for browser-based ML:

```rust,ignore
// Build time: single file with everything
const MODEL: &[u8] = include_bytes!("whisper-tiny.apr");

// Runtime: no network requests, no additional files
fn transcribe(audio: &[f32]) -> String {
    let reader = AprReader::from_bytes(MODEL.to_vec()).unwrap();

    // Vocab embedded in model
    let vocab = reader.get_metadata("tokenizer").unwrap();
    let tokens = vocab["tokens"].as_array().unwrap();

    // Weights embedded in model
    let encoder_weight = reader.read_tensor_f32("encoder.weight").unwrap();

    // ... inference logic
}
```

**Example:** `cargo run --example apr_with_metadata`

## Deep Dive: trueno Integration

### What is trueno?

trueno is aprender's SIMD and GPU-accelerated tensor library. Unlike NumPy/PyTorch:

- **Pure Rust** - No C/C++/Fortran/CUDA SDK required
- **Auto-vectorization** - Compiler generates optimal SIMD for your CPU
- **Six SIMD backends** - scalar, SSE2, AVX2, AVX-512, NEON (ARM), WASM SIMD128
- **GPU backend** - wgpu (Vulkan/Metal/DX12/WebGPU) for 10-50x speedups
- **Same API everywhere** - Code runs identically on x86, ARM, browsers, GPUs

### Why trueno + .apr?

The `TRUENO_NATIVE` flag (bit 4) enables zero-copy tensor loading:

```text
Traditional loading:
1. Read file bytes
2. Deserialize to intermediate format
3. Allocate new tensors
4. Copy data into tensors
Time: O(n) allocations + O(n) copies

trueno-native loading:
1. mmap file
2. Cast pointer to tensor
3. Done
Time: O(1) - just pointer arithmetic
```

```rust,ignore
// Standard loading (~100ms for 1GB model)
let model: NeuralNet = load("model.apr", ModelType::NeuralSequential)?;

// trueno-native loading (~0.1ms for 1GB model)
// Requires TRUENO_NATIVE flag set during save
let model: NeuralNet = load_mmap("model.apr", ModelType::NeuralSequential)?;
```

**Benchmark: 1GB model load time**

| Method | Time | Memory Overhead |
|--------|------|-----------------|
| PyTorch (pickle) | 2.3s | 2x model size |
| SafeTensors | 450ms | 1x model size |
| GGUF | 380ms | 1x model size |
| .apr (standard) | 320ms | 1x model size |
| .apr (trueno-native) | **0.8ms** | **0x** (mmap) |

## Deep Dive: ARM and Embedded Deployment

### The Problem with Traditional ML Deployment

```text
Traditional: Python → ONNX → TensorRT/OpenVINO → Deploy
- Requires Python for training
- Requires ONNX export (lossy, not all ops supported)
- Requires vendor-specific runtime (TensorRT = NVIDIA only)
- Requires significant RAM for runtime
- Cold start: seconds
```

### The .apr Solution

```text
aprender: Rust → .apr → Deploy
- Training and inference in same language
- Native format (no export step)
- No vendor lock-in
- Minimal RAM (no runtime)
- Cold start: microseconds
```

### Real-World: Raspberry Pi Deployment

```bash
# On your development machine (any OS)
cross build --release --target armv7-unknown-linux-gnueabihf

# Copy single binary to Pi
scp target/armv7-unknown-linux-gnueabihf/release/inference pi@raspberrypi:~/

# On Pi: Just run it
./inference --model embedded  # Model is IN the binary
```

**Resource comparison on Raspberry Pi 4:**

| Framework | Binary Size | RAM Usage | Inference Time |
|-----------|-------------|-----------|----------------|
| TensorFlow Lite | 2.1 MB | 89 MB | 45ms |
| ONNX Runtime | 8.3 MB | 156 MB | 38ms |
| .apr (aprender) | **420 KB** | **12 MB** | **31ms** |

### Real-World: AWS Lambda Deployment

```rust,ignore
// lambda/src/main.rs
use lambda_runtime::{service_fn, LambdaEvent, Error};
use aprender::format::load_from_bytes;
use aprender::tree::DecisionTreeClassifier;

// Model embedded at compile time - no S3, no cold start penalty
const MODEL: &[u8] = include_bytes!("../model.apr");

async fn handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
    // Load from embedded bytes (microseconds, not seconds)
    let model: DecisionTreeClassifier = load_from_bytes(MODEL, ModelType::DecisionTree)?;

    let prediction = model.predict(&event.payload.features);
    Ok(Response { prediction })
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    lambda_runtime::run(service_fn(handler)).await
}
```

**Lambda performance comparison:**

| Approach | Cold Start | Warm Inference | Cost/1M requests |
|----------|------------|----------------|------------------|
| SageMaker endpoint | N/A (always on) | 50ms | $43.80 |
| Lambda + S3 model | 3.2s | 180ms | $0.60 |
| Lambda + .apr embedded | **180ms** | **12ms** | **$0.20** |

## Deep Dive: Security Model

### Threat Model

| Threat | GGUF | SafeTensors | .apr |
|--------|------|-------------|------|
| Model theft (disk access) | Vulnerable | Vulnerable | **Encrypted at rest** |
| Model theft (memory dump) | Vulnerable | Vulnerable | **Encrypted in memory** |
| Tampering detection | None | None | **Ed25519 signatures** |
| Supply chain attack | No verification | No verification | **Signed provenance** |
| Unauthorized redistribution | No protection | No protection | **Recipient encryption** |

### Encryption Architecture

```text
┌─────────────────────────────────────────────────────────────┐
│                     .apr File Structure                      │
├─────────────────────────────────────────────────────────────┤
│ Header (32 bytes)                                            │
│   Magic: "APR\x00"                                          │
│   Version: 1                                                │
│   Flags: ENCRYPTED | SIGNED                                 │
│   Model Type, Compression, Sizes...                         │
├─────────────────────────────────────────────────────────────┤
│ Encryption Block (when ENCRYPTED flag set)                   │
│   Mode: Password | Recipient                                │
│   Salt (16 bytes) | Ephemeral Public Key (32 bytes)         │
│   Nonce (12 bytes)                                          │
├─────────────────────────────────────────────────────────────┤
│ Encrypted Payload                                            │
│   AES-256-GCM ciphertext                                    │
│   (Metadata + Model weights)                                │
├─────────────────────────────────────────────────────────────┤
│ Signature Block (when SIGNED flag set)                       │
│   Ed25519 signature (64 bytes)                              │
│   Signs: Header || Encrypted Payload                        │
├─────────────────────────────────────────────────────────────┤
│ CRC32 Checksum (4 bytes)                                     │
└─────────────────────────────────────────────────────────────┘
```

### Password Encryption (AES-256-GCM + Argon2id)

```rust,ignore
use aprender::format::{save_encrypted, load_encrypted, ModelType};

// Save with password protection
save_encrypted(&model, ModelType::RandomForest, "secret.apr", opts, "hunter2")?;

// Argon2id parameters (OWASP recommended):
// - Memory: 19 MiB (GPU-resistant)
// - Iterations: 2
// - Parallelism: 1
// Derivation time: ~200ms (intentionally slow for brute-force resistance)

// Load requires correct password
let model: RandomForest = load_encrypted("secret.apr", ModelType::RandomForest, "hunter2")?;

// Wrong password: DecryptionFailed error (no partial data leaked)
let result = load_encrypted::<RandomForest>("secret.apr", ModelType::RandomForest, "wrong");
assert!(result.is_err());
```

### Recipient Encryption (X25519 + HKDF + AES-256-GCM)

```rust,ignore
use aprender::format::{save_for_recipient, load_as_recipient};
use aprender::format::x25519::generate_keypair;

// Recipient generates keypair, shares public key
let (recipient_secret, recipient_public) = generate_keypair();

// Sender encrypts for recipient (no shared password!)
save_for_recipient(&model, ModelType::Custom, "for_alice.apr", opts, &recipient_public)?;

// Only recipient can decrypt
let model: MyModel = load_as_recipient("for_alice.apr", ModelType::Custom, &recipient_secret)?;

// Benefits:
// - No password transmission required
// - Forward secrecy (ephemeral sender keys)
// - Non-transferable (cryptographically bound to recipient)
```

## Addressing Common Objections

### "But I need to use HuggingFace models"

**Answer:** We support export to SafeTensors for HuggingFace compatibility:

```rust,ignore
use aprender::format::export_safetensors;

// Train in aprender
let model = train_transformer(&data)?;

// Export for HuggingFace
export_safetensors(&model, "model.safetensors")?;

// Or import from HuggingFace
let model = import_safetensors::<Transformer>("downloaded.safetensors")?;
```

### "But GGUF has better quantization"

**Answer:** We implement GGUF-compatible quantization:

```rust,ignore
use aprender::format::{QuantType, Quantizer};

// Same block sizes as GGUF for compatibility
let quantized = model.quantize(QuantType::Q4_0)?; // 4-bit, 32-element blocks

// Can export to GGUF for llama.cpp compatibility
export_gguf(&quantized, "model.gguf")?;
```

| Quant Type | Bits | Block Size | GGUF Equivalent |
|------------|------|------------|-----------------|
| Q8_0 | 8 | 32 | GGML_TYPE_Q8_0 |
| Q4_0 | 4 | 32 | GGML_TYPE_Q4_0 |
| Q4_1 | 4+min | 32 | GGML_TYPE_Q4_1 |

### "But ONNX is the industry standard"

**Answer:** ONNX requires a C++ runtime. That means:
- No WASM (browsers, edge)
- No embedded (microcontrollers)
- Complex cross-compilation
- Large binary size (+50MB runtime)

If you need ONNX compatibility for legacy systems:

```rust,ignore
// Export for legacy systems that require ONNX
export_onnx(&model, "model.onnx")?;

// But for new deployments, .apr is smaller, faster, and more portable
```

### "But I need GPU inference"

**Answer:** trueno has **production-ready GPU support** via wgpu (Vulkan/Metal/DX12/WebGPU):

```rust,ignore
use trueno::backends::gpu::GpuBackend;

// GPU backend with cross-platform support
let mut gpu = GpuBackend::new();

// Check availability at runtime
if GpuBackend::is_available() {
    // Matrix multiplication: 10-50x faster than SIMD for large matrices
    let result = gpu.matmul(&a, &b, m, k, n)?;

    // All neural network activations on GPU
    let relu_out = gpu.relu(&input)?;
    let sigmoid_out = gpu.sigmoid(&input)?;
    let gelu_out = gpu.gelu(&input)?;      // Transformers
    let softmax_out = gpu.softmax(&input)?; // Classification

    // 2D convolution for CNNs
    let conv_out = gpu.convolve2d(&input, &kernel, h, w, kh, kw)?;
}

// Same .apr model file works on CPU (SIMD) and GPU - backend is runtime choice
```

**trueno GPU capabilities:**
- **Backends**: Vulkan, Metal, DirectX 12, WebGPU (browsers!)
- **Operations**: matmul, dot, relu, leaky_relu, elu, sigmoid, tanh, swish, gelu, softmax, log_softmax, conv2d, clip
- **Performance**: 10-50x speedup for matmul (1000×1000+), 5-20x for reductions (100K+ elements)

## Summary: When to Use .apr

**Use .apr when:**
- Deploying to browsers (WASM)
- Deploying to edge (Cloudflare Workers, Lambda@Edge)
- Deploying to embedded (Raspberry Pi, IoT)
- Deploying to serverless (AWS Lambda, Azure Functions)
- Model security matters (encryption, signing)
- Single-binary deployment is desired
- Cross-platform builds are needed
- Supply chain security is required

**Use GGUF when:**
- Specifically running llama.cpp
- LLM inference is the only use case
- C/C++ toolchain is acceptable

**Use SafeTensors when:**
- HuggingFace ecosystem integration is primary goal
- Python is the deployment target

**Use ONNX when:**
- Legacy system integration required
- Vendor runtime (TensorRT, OpenVINO) is acceptable

## Code: Complete .apr Workflow

```rust,ignore
//! Complete .apr workflow: train, save, encrypt, deploy
//!
//! cargo run --example apr_workflow

use aprender::prelude::*;
use aprender::format::{
    save, load, save_encrypted, load_encrypted,
    save_for_recipient, load_as_recipient,
    ModelType, SaveOptions,
};
use aprender::tree::DecisionTreeClassifier;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Train a model
    let (x_train, y_train) = load_iris_dataset()?;
    let mut model = DecisionTreeClassifier::new().with_max_depth(5);
    model.fit(&x_train, &y_train)?;

    println!("Model trained. Accuracy: {:.2}%", model.score(&x_train, &y_train)? * 100.0);

    // 2. Save with metadata
    let options = SaveOptions::default()
        .with_name("iris-classifier")
        .with_description("Decision tree for Iris classification")
        .with_author("ML Team");

    save(&model, ModelType::DecisionTree, "model.apr", options.clone())?;
    println!("Saved to model.apr");

    // 3. Save encrypted (password)
    save_encrypted(&model, ModelType::DecisionTree, "model-encrypted.apr",
                   options.clone(), "secret-password")?;
    println!("Saved encrypted to model-encrypted.apr");

    // 4. Load and verify
    let loaded: DecisionTreeClassifier = load("model.apr", ModelType::DecisionTree)?;
    assert_eq!(loaded.score(&x_train, &y_train)?, model.score(&x_train, &y_train)?);
    println!("Loaded and verified!");

    // 5. Load encrypted
    let loaded_enc: DecisionTreeClassifier =
        load_encrypted("model-encrypted.apr", ModelType::DecisionTree, "secret-password")?;
    println!("Loaded encrypted model!");

    // 6. Demonstrate embedded deployment
    println!("\nFor embedded deployment, add to your binary:");
    println!("  const MODEL: &[u8] = include_bytes!(\"model.apr\");");
    println!("  let model: DecisionTreeClassifier = load_from_bytes(MODEL, ModelType::DecisionTree)?;");

    // Cleanup
    std::fs::remove_file("model.apr")?;
    std::fs::remove_file("model-encrypted.apr")?;

    Ok(())
}

fn load_iris_dataset() -> Result<(Matrix<f32>, Vec<usize>), Box<dyn std::error::Error>> {
    // Simplified Iris dataset
    let x = Matrix::from_vec(12, 4, vec![
        5.1, 3.5, 1.4, 0.2,  // setosa
        4.9, 3.0, 1.4, 0.2,
        7.0, 3.2, 4.7, 1.4,  // versicolor
        6.4, 3.2, 4.5, 1.5,
        6.3, 3.3, 6.0, 2.5,  // virginica
        5.8, 2.7, 5.1, 1.9,
        5.0, 3.4, 1.5, 0.2,  // setosa
        4.4, 2.9, 1.4, 0.2,
        6.9, 3.1, 4.9, 1.5,  // versicolor
        5.5, 2.3, 4.0, 1.3,
        6.5, 3.0, 5.8, 2.2,  // virginica
        7.6, 3.0, 6.6, 2.1,
    ])?;
    let y = vec![0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2];
    Ok((x, y))
}
```

## Further Reading

- [Model Format Specification](./model-format.md) - Complete technical spec
- [Shell History Developer Guide](./shell-history-developer-guide.md) - Real-world .apr usage
- Encryption Features - Security deep dive (planned)
- [trueno Documentation](https://docs.rs/trueno) - SIMD tensor library