# NNUE Explained — Through the noru Codebase
This document explains how NNUE works using noru's actual source code as examples.
No prior neural network knowledge required.
---
## 1. What is NNUE?
NNUE stands for **Efficiently Updatable Neural Network**. It's a small neural network designed to be:
- **Extremely fast** — evaluated millions of times per second during game tree search
- **Incrementally updatable** — when a piece moves, only a small part of the network needs recalculation
It was invented for Shogi (Japanese chess), adopted by Stockfish (chess), and noru makes it available for **any** two-player board game.
### The Core Idea
In a game engine, you need to evaluate positions: "How good is this board state for the current player?"
Traditional approach: hand-coded rules (material count, piece positions, patterns).
NNUE approach: a neural network learns this evaluation from data.
The trick is that NNUE is structured so that most of the computation can be **reused** between consecutive positions in a search tree.
---
## 2. Network Architecture
Here's what the network looks like:
```
┌──────────────────────────────────────────────────────┐
│                     SPARSE INPUT                      │
│  Active feature indices: [0, 42, 100, 350]            │
│  (out of feature_size possible features)              │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                 FEATURE TRANSFORMER                   │
│                                                       │
│  For each active feature index, look up its weight    │
│  row and add it to the accumulator.                   │
│                                                       │
│  accumulator = bias + Σ weights[feature_i]            │
│                                                       │
│  This is done TWICE — once for each perspective:      │
│    • STM (Side To Move — current player's view)       │
│    • NSTM (Non-Side To Move — opponent's view)        │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│             ACTIVATION (CReLU or SCReLU)              │
│                                                       │
│  CReLU: clamp(x, 0, 1) — simple clipping              │
│  SCReLU: clamp(x, 0, 1)² — squaring after clip        │
│                                                       │
│  Then concatenate both perspectives:                  │
│  [STM activated | NSTM activated]                     │
│  Size: accumulator_size × 2                           │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                    HIDDEN LAYERS                      │
│                                                       │
│  One or more dense layers, each followed by CReLU.    │
│                                                       │
│  hidden_sizes: &[64]          → one layer of 64       │
│  hidden_sizes: &[256, 32, 32] → three layers          │
│                                                       │
│  Each layer: output = CReLU(weight × input + bias)    │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                 OUTPUT (single value)                 │
│                                                       │
│  eval = dot(last_hidden, output_weights) + bias       │
│                                                       │
│  This number represents: "How good is this position   │
│  for the side to move?"                               │
└──────────────────────────────────────────────────────┘
```
In noru, this entire architecture is configured with one struct:
```rust
// from src/config.rs
pub struct NnueConfig {
    pub feature_size: usize,            // how many possible features your game has
    pub accumulator_size: usize,        // width of the accumulator (per perspective)
    pub hidden_sizes: &'static [usize], // hidden layer sizes
    pub activation: Activation,         // CReLU or SCReLU (first layer only)
}
```
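For example, a 768-feature game with a 256-wide accumulator and a single 64-neuron hidden layer could be configured like this. This is a sketch: the field values are illustrative, and the `Activation::SCReLU` variant name is an assumption, so check the enum in `src/config.rs`:
```rust
// Hypothetical configuration; the variant name Activation::SCReLU is
// assumed rather than taken from noru's source.
let config = NnueConfig {
    feature_size: 768,              // e.g. 2 colors × 6 piece types × 64 squares
    accumulator_size: 256,          // per-perspective accumulator width
    hidden_sizes: &[64],            // one hidden layer of 64 neurons
    activation: Activation::SCReLU, // SCReLU on the first layer, CReLU deeper
};
```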
---
## 3. Sparse Features — How Games Become Numbers
NNUE doesn't take a dense vector like `[0.0, 0.0, 1.0, 0.0, ...]`. Instead, it takes a **list of active feature indices**:
```
Features: [0, 42, 100, 350]
```
This means "feature 0 is active, feature 42 is active, feature 100 is active, feature 350 is active. All others are inactive."
### Why sparse?
In a chess position, there are ~30 pieces on a 64-square board. If your feature set encodes "piece X on square Y", you might have 768 possible features, but only ~30 are active at any time. Passing 30 indices is much cheaper than passing 768 floats.
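As a concrete (hypothetical) example of such a feature set, a chess-style "piece on square" scheme can flatten color, piece type, and square into one index. This is a sketch of the idea, not code from noru:
```rust
// Hypothetical "piece on square" encoding: 2 colors × 6 piece types ×
// 64 squares = 768 possible features, of which ~30 are active per position.
fn feature_index(color: usize, piece: usize, square: usize) -> usize {
    color * 6 * 64 + piece * 64 + square // 0..768
}

fn main() {
    // Piece and square numbering here is made up for illustration.
    let active_features = vec![
        feature_index(0, 1, 6),  // white knight (piece 1) on g1 (square 6)
        feature_index(0, 5, 4),  // white king on e1
        feature_index(1, 5, 60), // black king on e8
    ];
    assert_eq!(active_features, vec![70, 324, 764]);
}
```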
### Why two perspectives?
The same board position looks different depending on who's moving. In chess, having a rook on the 7th rank is great if it's YOUR rook, bad if it's your opponent's.
NNUE evaluates from **both** perspectives simultaneously:
- **STM features**: what the current player sees
- **NSTM features**: what the opponent sees
```rust
// from src/trainer.rs
pub struct TrainingSample {
    pub stm_features: Vec<usize>,  // current player's active features
    pub nstm_features: Vec<usize>, // opponent's active features
    pub target: f32,               // desired evaluation (0.0 = loss, 1.0 = win)
}
```
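Putting the two together, one training position becomes one `TrainingSample` with the same board encoded twice. The indices below are made up for illustration, and how you mirror a position into the opponent's view is entirely game-specific:
```rust
// Hypothetical sample (assumes TrainingSample is in scope). If White is to
// move, stm_features encode White's view and nstm_features Black's view,
// e.g. obtained by mirroring squares and swapping piece colors.
let sample = TrainingSample {
    stm_features: vec![70, 324, 764],  // current player's active features
    nstm_features: vec![6, 260, 700],  // same position, opponent's view
    target: 1.0,                       // the side to move went on to win
};
```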
---
## 4. The Accumulator — NNUE's Key Innovation
This is what makes NNUE special. The accumulator is simply:
```
accumulator = bias + Σ feature_weights[active_feature]
```
It's a vector (size = `accumulator_size`) that sums up the weight rows of all active features.
```rust
// from src/network.rs — Accumulator::refresh()
pub fn refresh(&mut self, weights: &NnueWeights, stm_features: &[usize], nstm_features: &[usize]) {
    self.stm.copy_from_slice(&weights.feature_bias);  // start from bias
    self.nstm.copy_from_slice(&weights.feature_bias);
    for &feat in stm_features {
        simd::vec_add_i16(&mut self.stm, &weights.feature_weights[feat]); // add each feature's row
    }
    for &feat in nstm_features {
        simd::vec_add_i16(&mut self.nstm, &weights.feature_weights[feat]);
    }
}
```
### Why is this fast?
In a game tree, when you make a move, typically only **1-2 features change** (one piece moves = one feature removed, one added). Instead of recomputing the entire accumulator, you just:
```
accumulator += weights[new_feature]
accumulator -= weights[old_feature]
```
This costs **O(changed_features × accumulator_size)**, usually just two vector operations, instead of the **O(active_features × accumulator_size)** cost of a full refresh.
```rust
// from src/network.rs — incremental update
fn apply_delta(acc: &mut [i16], weights: &NnueWeights, delta: &FeatureDelta) {
    for i in 0..delta.num_removed {
        simd::vec_sub_i16(acc, &weights.feature_weights[delta.removed[i]]);
    }
    for i in 0..delta.num_added {
        simd::vec_add_i16(acc, &weights.feature_weights[delta.added[i]]);
    }
}
```
### Concrete Example
Say `accumulator_size = 4` and we have 3 active features:
```
bias           = [10, 20, 30, 40]

weight[feat_0] = [ 1,  2,  3,  4]
weight[feat_5] = [ 5, -1,  0,  2]
weight[feat_9] = [-2,  3,  1, -1]

accumulator = [10, 20, 30, 40]   ← start from bias
            + [ 1,  2,  3,  4]   ← add feat_0
            + [ 5, -1,  0,  2]   ← add feat_5
            + [-2,  3,  1, -1]   ← add feat_9
            = [14, 24, 34, 45]
```
Now if feat_5 is removed and feat_7 is added:
```
accumulator = [14, 24, 34, 45]
            - [ 5, -1,  0,  2]   ← remove feat_5
            + [ 3,  0,  2,  1]   ← add feat_7
            = [12, 25, 36, 44]
```
Only 2 vector operations instead of rebuilding from scratch!
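The same arithmetic as a tiny self-contained sketch (plain fixed-size arrays rather than noru's `Accumulator`), reproducing the numbers above:
```rust
fn add(acc: &mut [i16; 4], row: &[i16; 4]) {
    for i in 0..4 { acc[i] += row[i]; }
}
fn sub(acc: &mut [i16; 4], row: &[i16; 4]) {
    for i in 0..4 { acc[i] -= row[i]; }
}

fn main() {
    let bias = [10, 20, 30, 40];
    let w0 = [1, 2, 3, 4];
    let w5 = [5, -1, 0, 2];
    let w9 = [-2, 3, 1, -1];
    let w7 = [3, 0, 2, 1];

    // Full refresh: bias + all active feature rows.
    let mut acc = bias;
    add(&mut acc, &w0);
    add(&mut acc, &w5);
    add(&mut acc, &w9);
    assert_eq!(acc, [14, 24, 34, 45]);

    // Incremental update: feat_5 leaves, feat_7 arrives.
    sub(&mut acc, &w5);
    add(&mut acc, &w7);
    assert_eq!(acc, [12, 25, 36, 44]);
}
```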
---
## 5. Forward Pass — Computing the Evaluation
After the accumulator is ready, the rest of the network is a standard feedforward neural network.
### Training forward pass (FP32)
```rust
// from src/trainer.rs — simplified

// 1. Apply activation to concatenated accumulator
//    CReLU:  clamp(x, 0, 1)
//    SCReLU: clamp(x, 0, 1)²
let acc_activated = [crelu(stm), crelu(nstm)];  // size: accumulator_size × 2

// 2. Hidden layers (each: linear transform + CReLU)
for each hidden layer k:
    raw[j]       = bias[k][j] + Σ(input[i] * weight[k][i][j])
    activated[j] = clamp(raw[j], 0, 1)

// 3. Output
output  = bias + Σ(last_hidden[j] * output_weight[j])
sigmoid = 1 / (1 + exp(-output))                // converts to probability [0, 1]
```
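Here is the same sequence of steps as a self-contained f32 sketch with toy sizes: a 2-wide accumulator per perspective, one 3-neuron hidden layer, and CReLU throughout. It uses plain arrays and made-up weights, not noru's trainer types:
```rust
fn crelu(x: f32) -> f32 { x.clamp(0.0, 1.0) }

// Toy forward pass: activate, concatenate, one hidden layer, scalar output.
fn forward(stm_acc: &[f32; 2], nstm_acc: &[f32; 2]) -> f32 {
    // 1. Activate and concatenate both perspectives → 4 inputs.
    let input: Vec<f32> = stm_acc.iter().chain(nstm_acc).map(|&x| crelu(x)).collect();

    // 2. One hidden layer, stored input-major as weight[input][output], then CReLU.
    let w_hidden: [[f32; 3]; 4] = [
        [0.5, -0.2, 0.1],
        [0.3, 0.4, -0.5],
        [0.2, 0.1, 0.6],
        [-0.1, 0.2, 0.3],
    ];
    let b_hidden: [f32; 3] = [0.05, -0.1, 0.0];
    let mut hidden = [0.0f32; 3];
    for j in 0..3 {
        let mut raw = b_hidden[j];
        for i in 0..4 {
            raw += input[i] * w_hidden[i][j];
        }
        hidden[j] = crelu(raw);
    }

    // 3. Output neuron, then sigmoid → win probability.
    let w_out: [f32; 3] = [0.7, -0.3, 0.5];
    let b_out: f32 = 0.1;
    let output: f32 = b_out + hidden.iter().zip(&w_out).map(|(h, w)| h * w).sum::<f32>();
    1.0 / (1.0 + (-output).exp())
}

fn main() {
    let p = forward(&[0.8, 0.2], &[0.1, 0.9]);
    println!("predicted win probability: {p:.3}");
}
```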
### Inference forward pass (i16 quantized)
The same computation, but using integer arithmetic for speed:
```rust
// from src/network.rs — forward()

// 1. ClippedReLU on accumulator (clamp to [0, 127])
simd::vec_clipped_relu(&mut prev[..acc_size], &acc.stm);
simd::vec_clipped_relu(&mut prev[acc_size..], &acc.nstm);

// 2. Hidden layers using SIMD dot products
for each hidden layer k:
    for each output neuron j:
        sum     = bias * ACTIVATION_SCALE + simd::dot_i16_i32(input, weight_row)
        next[j] = clipped_relu(sum / ACTIVATION_SCALE)

// 3. Output
output = (bias * OUTPUT_SCALE + dot(hidden, output_weights)) / OUTPUT_SCALE
```
### Why two versions?
- **FP32 (training)**: Full precision, needed for gradients to flow correctly during backpropagation
- **i16 (inference)**: ~4× faster, good enough for evaluation. The small rounding errors don't matter in practice.
---
## 6. Activation Functions
### CReLU (Clipped ReLU)
```
CReLU(x) = clamp(x, 0, max)
 max ┤             ┌─────────
     │           /
     │        /
   0 ┼──────┘
     └──────┴──────┴──────── x
            0      max
```
Simple: anything below 0 becomes 0, anything above max stays at max. This prevents values from exploding.
### SCReLU (Squared Clipped ReLU)
```
SCReLU(x) = clamp(x, 0, max)²
max² ┤             ╭─────────
     │            ╱
     │           ╱
     │         ╱
   0 ┼──────╯
     └──────┴──────┴──────── x
            0      max
```
The squaring gives the network more expressive power near zero (gentle curve instead of sharp corner). This helps learning converge better — Stockfish gained significant Elo by switching from CReLU to SCReLU.
In noru, **SCReLU is only applied to the first layer** (accumulator output). Deeper hidden layers always use CReLU. This follows the Stockfish pattern — applying SCReLU to narrow deep layers causes numerical issues in i16 quantized inference.
```rust
// from src/quant.rs
pub fn screlu_f32(val: f32, max: f32) -> f32 {
    let clamped = val.max(0.0).min(max);
    clamped * clamped
}
```
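For intuition, here is how the two activations compare at a few points. The `crelu_f32` below is just the obvious un-squared counterpart of `screlu_f32`, written inline for this sketch:
```rust
fn crelu_f32(val: f32, max: f32) -> f32 {
    val.max(0.0).min(max)
}
fn screlu_f32(val: f32, max: f32) -> f32 {
    let clamped = val.max(0.0).min(max);
    clamped * clamped
}

fn main() {
    for x in [-0.5, 0.1, 0.5, 0.9, 1.5] {
        // Near zero, SCReLU is much smaller (0.1 → 0.01); past max, both saturate.
        println!("x = {x:>4}: CReLU = {:.2}, SCReLU = {:.2}",
                 crelu_f32(x, 1.0), screlu_f32(x, 1.0));
    }
}
```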
---
## 7. Backpropagation — How the Network Learns
Training adjusts the weights so that the network's output gets closer to the target value.
### The Chain Rule
Backpropagation computes "how much does each weight contribute to the error?" by working backwards through the network:
```
Error at output
→ How much did each output weight contribute?
→ How much did each hidden neuron contribute?
→ How much did each hidden weight contribute?
→ How much did each accumulator value contribute?
→ How much did each feature weight contribute?
```
### Loss Functions
noru supports two loss functions:
**BCE (Binary Cross-Entropy)** — when the target is a win probability [0, 1]:
```rust
// Gradient at output = sigmoid(output) - target
pub fn backward(&self, sample, fwd, grad) {
    let d_output = fwd.sigmoid - sample.target;
    // ... propagate backwards
}
```
**MSE (Mean Squared Error)** — when the target is a raw score:
```rust
// Gradient at output = output - target
pub fn backward_mse(&self, sample, fwd, grad) {
    let d_output = fwd.output - sample.target;
    // ... propagate backwards
}
```
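A quick numeric sketch of the two output gradients, independent of noru's trainer types:
```rust
fn main() {
    // BCE: the network's raw output is squashed by a sigmoid first.
    let raw_output = 0.8_f32;
    let target = 1.0_f32; // the game was a win for the side to move
    let sigmoid = 1.0 / (1.0 + (-raw_output).exp());
    let d_output_bce = sigmoid - target; // ≈ -0.31: the update pushes the output up
    println!("BCE gradient at output: {d_output_bce:.3}");

    // MSE: the raw output is compared to the target score directly.
    let d_output_mse = raw_output - target; // = -0.2
    println!("MSE gradient at output: {d_output_mse:.3}");
}
```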
### Activation Derivatives
Gradients can only flow through activations that are in the "active" region:
- **CReLU derivative**: 1 if 0 < x < max, else 0 (gradient is killed at the boundaries)
- **SCReLU derivative**: 2x if 0 < x < max, else 0 (gradient scales with the input value)
```rust
// from src/quant.rs
pub fn crelu_grad_f32(val: f32, max: f32) -> f32 {
    if val > 0.0 && val < max { 1.0 } else { 0.0 }
}

pub fn screlu_grad_f32(val: f32, max: f32) -> f32 {
    if val > 0.0 && val < max { 2.0 * val } else { 0.0 }
}
```
### Sparse Feature Gradient
A key optimization: feature weights only receive gradients for **active features**. If feature 42 wasn't in the input, its weight row gets zero gradient — no computation needed.
```rust
// from src/trainer.rs — backward_inner() (simplified)
// Only update weights for features that were actually active
for &feat in &sample.stm_features {
    for i in 0..acc {
        grad.ft_weight[feat][i] += d_acc[i];
    }
}
```
---
## 8. Adam Optimizer
After computing gradients, Adam updates the weights:
```rust
// from src/trainer.rs — adam_step()
fn adam_step(param, grad, m, v, lr, bc1, bc2) {
    m = 0.9 * m + 0.1 * grad;            // momentum (smoothed gradient)
    v = 0.999 * v + 0.001 * grad²;       // velocity (smoothed squared gradient)
    m_hat = m / bc1;                     // bias correction
    v_hat = v / bc2;
    param -= lr * m_hat / (√v_hat + ε);  // update
}
```
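For reference, the same update as a self-contained f32 function with the constants from the pseudocode above. This is a sketch: the epsilon value is assumed, and noru's real adam_step operates on its own weight and moment buffers:
```rust
// One Adam step for a single parameter. bc1/bc2 are the bias-correction
// terms 1 - beta1^t and 1 - beta2^t, precomputed per batch.
fn adam_step(param: &mut f32, grad: f32, m: &mut f32, v: &mut f32,
             lr: f32, bc1: f32, bc2: f32) {
    const EPS: f32 = 1e-8; // assumed value, not taken from noru
    *m = 0.9 * *m + 0.1 * grad;            // momentum (smoothed gradient)
    *v = 0.999 * *v + 0.001 * grad * grad; // velocity (smoothed squared gradient)
    let m_hat = *m / bc1;                  // bias correction
    let v_hat = *v / bc2;
    *param -= lr * m_hat / (v_hat.sqrt() + EPS);
}

fn main() {
    let (mut param, mut m, mut v) = (0.5_f32, 0.0, 0.0);
    // First step: bc1 = 1 - 0.9¹ = 0.1, bc2 = 1 - 0.999¹ = 0.001.
    adam_step(&mut param, 0.2, &mut m, &mut v, 0.001, 0.1, 0.001);
    println!("param after one step: {param:.4}"); // moved opposite the gradient
}
```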
Why Adam over plain SGD?
- **Momentum (m)**: smooths out noisy gradients, prevents oscillation
- **Velocity (v)**: adapts the learning rate per parameter by dividing by √v_hat, so steps stay well-scaled whether a weight's gradients are large or tiny
- **Bias correction**: compensates for the zero-initialization of m and v in early steps
---
## 9. Quantization — FP32 to i16
After training, weights are converted from f32 to i16 for fast inference:
```rust
// from src/trainer.rs — quantize()
let scale = WEIGHT_SCALE as f32; // 64
// Each f32 weight is multiplied by 64 and rounded to i16
weights.feature_weights[feat][i] = (row[i] * scale).round() as i16;
```
### Why 64?
A weight of `0.015625` (1/64) becomes `1` in i16. This gives us precision of ~0.016 per step, which is sufficient for evaluation. The scaling factors in inference (`ACTIVATION_SCALE = 256`, `OUTPUT_SCALE = 16`) compensate so the final result is correct despite integer rounding.
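A quick sanity check on that arithmetic, in plain Rust rather than noru's quantize():
```rust
fn main() {
    let scale = 64.0_f32; // WEIGHT_SCALE
    for w in [0.015625_f32, 0.5, -0.3, 0.011] {
        let q = (w * scale).round() as i16;
        let back = q as f32 / scale;
        // 0.015625 → 1, 0.5 → 32, -0.3 → -19 (≈ -0.297), 0.011 → 1 (≈ 0.016)
        println!("{w:>9} → i16 {q:>3} → back to {back:.4}");
    }
}
```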
### Layout Transposition
An important detail: during quantization, hidden layer weights are **transposed** from input-major (good for training) to output-major (good for SIMD inference):
```
Training:  weights[input_idx][output_idx]              — iterate over inputs for backprop
Inference: weights[output_idx * in_size + input_idx]   — contiguous row per output neuron for dot product
```
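A minimal sketch of that transposition with toy sizes (noru performs it inside quantize()):
```rust
// Transpose a 2-input × 3-output layer from input-major (training) to
// output-major (inference), so each output neuron's weights become one
// contiguous row for the SIMD dot product.
fn main() {
    let in_size = 2;
    let out_size = 3;
    let training: [[f32; 3]; 2] = [
        [1.0, 2.0, 3.0], // weights leaving input 0
        [4.0, 5.0, 6.0], // weights leaving input 1
    ];

    let mut inference = vec![0.0f32; in_size * out_size];
    for o in 0..out_size {
        for i in 0..in_size {
            inference[o * in_size + i] = training[i][o];
        }
    }
    // Output neuron 0's row is [1, 4], neuron 1's is [2, 5], neuron 2's is [3, 6].
    assert_eq!(inference, vec![1.0, 4.0, 2.0, 5.0, 3.0, 6.0]);
}
```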
---
## 10. SIMD — Hardware Acceleration
SIMD (Single Instruction, Multiple Data) processes multiple values in one CPU instruction:
```
Scalar:  a[0]*b[0], a[1]*b[1], a[2]*b[2], ...   (one at a time)
AVX2:    a[0..16] * b[0..16]                    (16 multiplications in ONE instruction)
NEON:    a[0..8]  * b[0..8]                     (8 multiplications in ONE instruction)
```
noru accelerates five operations with SIMD:

| Function | Operation | Used for |
|---|---|---|
| `vec_add_i16` | Saturating vector add | Accumulator refresh/update |
| `vec_sub_i16` | Saturating vector sub | Accumulator update (remove features) |
| `vec_clipped_relu` | Clamp to [0, 127] | Activation after accumulator |
| `dot_i16_i32` | Dot product → i32 | Hidden layer forward (CReLU) |
| `dot_screlu_i64` | Squared dot → i64 | Hidden layer forward (SCReLU) |
The dispatch is automatic:
```rust
// from src/simd/mod.rs
pub fn vec_add_i16(acc: &mut [i16], w: &[i16]) {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        unsafe { avx2::vec_add_i16(acc, w) }; return;
    }
    #[cfg(target_arch = "aarch64")]
    { unsafe { neon::vec_add_i16(acc, w) }; return; }
    scalar::vec_add_i16(acc, w); // fallback
}
```
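The scalar fallback defines the semantics the SIMD paths must reproduce. A minimal sketch of what a saturating i16 add can look like (illustrative, not copied from noru's src/simd/scalar.rs):
```rust
// Scalar reference: element-wise saturating add of a weight row into the
// accumulator. AVX2/NEON do the same thing 16 or 8 lanes at a time.
pub fn vec_add_i16(acc: &mut [i16], w: &[i16]) {
    debug_assert_eq!(acc.len(), w.len());
    for (a, &b) in acc.iter_mut().zip(w) {
        *a = a.saturating_add(b);
    }
}

fn main() {
    let mut acc = vec![100i16, -5, 32_000];
    vec_add_i16(&mut acc, &[1, 2, 1_000]);
    assert_eq!(acc, vec![101, -3, 32_767]); // last lane saturates at i16::MAX
}
```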
---
## 11. Binary Model Format
Trained models are saved in noru's v2 binary format:
```
┌──────────────────────────────────┐
│              Header              │
│  magic: "NORU"      (4 bytes)    │
│  version: 2         (4 bytes)    │
│  feature_size       (4 bytes)    │
│  accumulator_size   (4 bytes)    │
│  num_hidden_layers  (4 bytes)    │
│  hidden_sizes[...]  (4 each)     │
│  activation         (1 byte)     │
├──────────────────────────────────┤
│  Feature weights (i16)           │
│  Feature bias (i16)              │
├──────────────────────────────────┤
│  Hidden layer 0 weights + bias   │
│  Hidden layer 1 weights + bias   │
│  ...                             │
├──────────────────────────────────┤
│  Output weights + bias (i16)     │
└──────────────────────────────────┘
```
The header includes the full network configuration, so a model file is self-describing:
```rust
// Loading auto-detects the format
let weights = NnueWeights::load_from_bytes(&data, None)?; // reads config from header
```
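If you ever need to inspect a model file by hand, the header can be walked field by field. This sketch assumes the multi-byte fields are little-endian, which the layout above does not state, and it skips bounds checking:
```rust
// Hypothetical reader for the header above; field order follows the diagram.
fn read_u32(data: &[u8], off: &mut usize) -> u32 {
    let v = u32::from_le_bytes(data[*off..*off + 4].try_into().unwrap());
    *off += 4;
    v
}

fn print_header(data: &[u8]) {
    assert_eq!(&data[0..4], b"NORU", "not a noru model file");
    let mut off = 4;
    let version = read_u32(data, &mut off);
    let feature_size = read_u32(data, &mut off);
    let accumulator_size = read_u32(data, &mut off);
    let num_hidden_layers = read_u32(data, &mut off);
    let hidden_sizes: Vec<u32> = (0..num_hidden_layers)
        .map(|_| read_u32(data, &mut off))
        .collect();
    let activation = data[off]; // 1 byte; the weight sections follow
    println!(
        "v{version}: {feature_size} features, acc {accumulator_size}, \
         hidden {hidden_sizes:?}, activation {activation}"
    );
}
```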
---
## 12. Putting It All Together
### Training Pipeline
```
1. Design features for your game
2. Generate training data (self-play, expert games, or distillation from a heuristic)
3. Create NnueConfig with desired architecture
4. TrainableWeights::init_random()
5. Loop:
   a. forward() → get prediction
   b. backward() → get gradients
   c. adam_update() → adjust weights
6. quantize() → convert to i16
7. save_to_bytes() → write model file
```
### Inference Pipeline (in a game engine)
```
1. load_from_bytes() → load model
2. Accumulator::new() → initialize from bias
3. At root position: refresh() with all active features
4. For each move in search tree:
   a. update_incremental() → fast accumulator update
   b. forward() → get evaluation score
   c. Use score in alpha-beta / minimax
   d. After undoing the move: update_incremental_undo()
```
### Why NNUE Beats Handcrafted Evaluation
- **Learns patterns** that humans can't easily encode in rules
- **Adapts to any game** — just change the features and retrain
- **Fast enough for deep search** — the incremental update + SIMD makes it practical at millions of evaluations per second