# NNUE Explained — Through the noru Codebase
This document explains how NNUE works using noru's actual source code as examples.
No prior neural network knowledge required.
---
## 1. What is NNUE?
NNUE stands for **Efficiently Updatable Neural Network**. It's a small neural network designed to be:
- **Extremely fast** — evaluated millions of times per second during game tree search
- **Incrementally updatable** — when a piece moves, only a small part of the network needs recalculation
It was invented for Shogi (Japanese chess), adopted by Stockfish (chess), and noru makes it available for **any** two-player board game.
### The Core Idea
In a game engine, you need to evaluate positions: "How good is this board state for the current player?"
Traditional approach: hand-coded rules (material count, piece positions, patterns).
NNUE approach: a neural network learns this evaluation from data.
The trick is that NNUE is structured so that most of the computation can be **reused** between consecutive positions in a search tree.
---
## 2. Network Architecture
Here's what the network looks like:
```
┌──────────────────────────────────────────────────────┐
│                     SPARSE INPUT                      │
│  Active feature indices: [0, 42, 100, 350]            │
│  (out of feature_size possible features)              │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                 FEATURE TRANSFORMER                   │
│                                                       │
│  For each active feature index, look up its weight    │
│  row and add it to the accumulator.                   │
│                                                       │
│  accumulator = bias + Σ weights[feature_i]            │
│                                                       │
│  This is done TWICE — once for each perspective:      │
│    • STM (Side To Move — current player's view)       │
│    • NSTM (Non-Side To Move — opponent's view)        │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│             ACTIVATION (CReLU or SCReLU)              │
│                                                       │
│  CReLU: clamp(x, 0, 1) — simple clipping              │
│  SCReLU: clamp(x, 0, 1)² — squaring after clip        │
│                                                       │
│  Then concatenate both perspectives:                  │
│  [STM activated | NSTM activated]                     │
│  Size: accumulator_size × 2                           │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                    HIDDEN LAYERS                      │
│                                                       │
│  One or more dense layers, each followed by CReLU.    │
│                                                       │
│  hidden_sizes: &[64]          → one layer of 64       │
│  hidden_sizes: &[256, 32, 32] → three layers          │
│                                                       │
│  Each layer: output = CReLU(weight × input + bias)    │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                 OUTPUT (single value)                 │
│                                                       │
│  eval = dot(last_hidden, output_weights) + bias       │
│                                                       │
│  This number represents: "How good is this position   │
│  for the side to move?"                               │
└──────────────────────────────────────────────────────┘
```
In noru, this entire architecture is configured with one struct:
```rust
// from src/config.rs
pub struct NnueConfig {
    pub feature_size: usize,            // how many possible features your game has
    pub accumulator_size: usize,        // width of the accumulator (per perspective)
    pub hidden_sizes: &'static [usize], // hidden layer sizes
    pub activation: Activation,         // CReLU or SCReLU (first layer only)
}
```
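For example, a 768-feature game with a 256-wide accumulator and a single 64-neuron hidden layer could be configured like this. This is a sketch: the field values are illustrative, and the `Activation::SCReLU` variant name is an assumption, so check the enum in `src/config.rs`:
```rust
// Hypothetical configuration; the variant name Activation::SCReLU is
// assumed rather than taken from noru's source.
let config = NnueConfig {
    feature_size: 768,              // e.g. 2 colors × 6 piece types × 64 squares
    accumulator_size: 256,          // per-perspective accumulator width
    hidden_sizes: &[64],            // one hidden layer of 64 neurons
    activation: Activation::SCReLU, // SCReLU on the first layer, CReLU deeper
};
```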
---
## 3. Sparse Features — How Games Become Numbers
NNUE doesn't take a dense vector like `[0.0, 0.0, 1.0, 0.0, ...]`. Instead, it takes a **list of active feature indices**:
```
Features: [0, 42, 100, 350]
```
This means "feature 0 is active, feature 42 is active, feature 100 is active, feature 350 is active. All others are inactive."
### Why sparse?
In a chess position, there are ~30 pieces on a 64-square board. If your feature set encodes "piece X on square Y", you might have 768 possible features, but only ~30 are active at any time. Passing 30 indices is much cheaper than passing 768 floats.
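As a concrete (hypothetical) example of such a feature set, a chess-style "piece on square" scheme can flatten color, piece type, and square into one index. This is a sketch of the idea, not code from noru:
```rust
// Hypothetical "piece on square" encoding: 2 colors × 6 piece types ×
// 64 squares = 768 possible features, of which ~30 are active per position.
fn feature_index(color: usize, piece: usize, square: usize) -> usize {
    color * 6 * 64 + piece * 64 + square // 0..768
}

fn main() {
    // Piece and square numbering here is made up for illustration.
    let active_features = vec![
        feature_index(0, 1, 6),  // white knight (piece 1) on g1 (square 6)
        feature_index(0, 5, 4),  // white king on e1
        feature_index(1, 5, 60), // black king on e8
    ];
    assert_eq!(active_features, vec![70, 324, 764]);
}
```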
### Why two perspectives?
The same board position looks different depending on who's moving. In chess, having a rook on the 7th rank is great if it's YOUR rook, bad if it's your opponent's.
NNUE evaluates from **both** perspectives simultaneously:
- **STM features**: what the current player sees
- **NSTM features**: what the opponent sees
```rust
// from src/trainer.rs
pub struct TrainingSample {
    pub stm_features: Vec<usize>,  // current player's active features
    pub nstm_features: Vec<usize>, // opponent's active features
    pub target: f32,               // desired evaluation (0.0 = loss, 1.0 = win)
}
```
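Putting the two together, one training position becomes one `TrainingSample` with the same board encoded twice. The indices below are made up for illustration, and how you mirror a position into the opponent's view is entirely game-specific:
```rust
// Hypothetical sample (assumes TrainingSample is in scope). If White is to
// move, stm_features encode White's view and nstm_features Black's view,
// e.g. obtained by mirroring squares and swapping piece colors.
let sample = TrainingSample {
    stm_features: vec![70, 324, 764],  // current player's active features
    nstm_features: vec![6, 260, 700],  // same position, opponent's view
    target: 1.0,                       // the side to move went on to win
};
```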
---
## 4. The Accumulator — NNUE's Key Innovation
This is what makes NNUE special. The accumulator is simply:
```
accumulator = bias + Σ feature_weights[active_feature]
```
It's a vector (size = `accumulator_size`) that sums up the weight rows of all active features.
```rust
// from src/network.rs — Accumulator::refresh()
pub fn refresh(&mut self, weights: &NnueWeights, stm_features: &[usize], nstm_features: &[usize]) {
    self.stm.copy_from_slice(&weights.feature_bias);  // start from bias
    self.nstm.copy_from_slice(&weights.feature_bias);
    for &feat in stm_features {
        simd::vec_add_i16(&mut self.stm, &weights.feature_weights[feat]); // add each feature's row
    }
    for &feat in nstm_features {
        simd::vec_add_i16(&mut self.nstm, &weights.feature_weights[feat]);
    }
}
```
### Why is this fast?
In a game tree, when you make a move, typically only **1-2 features change** (one piece moves = one feature removed, one added). Instead of recomputing the entire accumulator, you just:
```
accumulator += weights[new_feature]
accumulator -= weights[old_feature]
```
This costs **O(changed_features × accumulator_size)**, usually just two vector operations, instead of the **O(active_features × accumulator_size)** cost of a full refresh.
```rust
// from src/network.rs — incremental update
fn apply_delta(acc: &mut [i16], weights: &NnueWeights, delta: &FeatureDelta) {
    for i in 0..delta.num_removed {
        simd::vec_sub_i16(acc, &weights.feature_weights[delta.removed[i]]);
    }
    for i in 0..delta.num_added {
        simd::vec_add_i16(acc, &weights.feature_weights[delta.added[i]]);
    }
}
```
### Concrete Example
Say `accumulator_size = 4` and we have 3 active features:
```
bias           = [10, 20, 30, 40]

weight[feat_0] = [ 1,  2,  3,  4]
weight[feat_5] = [ 5, -1,  0,  2]
weight[feat_9] = [-2,  3,  1, -1]

accumulator = [10, 20, 30, 40]   ← start from bias
            + [ 1,  2,  3,  4]   ← add feat_0
            + [ 5, -1,  0,  2]   ← add feat_5
            + [-2,  3,  1, -1]   ← add feat_9
            = [14, 24, 34, 45]
```
Now if feat_5 is removed and feat_7 is added:
```
accumulator = [14, 24, 34, 45]
            - [ 5, -1,  0,  2]   ← remove feat_5
            + [ 3,  0,  2,  1]   ← add feat_7
            = [12, 25, 36, 44]
```
Only 2 vector operations instead of rebuilding from scratch!
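The same arithmetic as a tiny self-contained sketch (plain fixed-size arrays rather than noru's `Accumulator`), reproducing the numbers above:
```rust
fn add(acc: &mut [i16; 4], row: &[i16; 4]) {
    for i in 0..4 { acc[i] += row[i]; }
}
fn sub(acc: &mut [i16; 4], row: &[i16; 4]) {
    for i in 0..4 { acc[i] -= row[i]; }
}

fn main() {
    let bias = [10, 20, 30, 40];
    let w0 = [1, 2, 3, 4];
    let w5 = [5, -1, 0, 2];
    let w9 = [-2, 3, 1, -1];
    let w7 = [3, 0, 2, 1];

    // Full refresh: bias + all active feature rows.
    let mut acc = bias;
    add(&mut acc, &w0);
    add(&mut acc, &w5);
    add(&mut acc, &w9);
    assert_eq!(acc, [14, 24, 34, 45]);

    // Incremental update: feat_5 leaves, feat_7 arrives.
    sub(&mut acc, &w5);
    add(&mut acc, &w7);
    assert_eq!(acc, [12, 25, 36, 44]);
}
```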
---
## 5. Forward Pass — Computing the Evaluation
After the accumulator is ready, the rest of the network is a standard feedforward neural network.
### Training forward pass (FP32)
```rust
// from src/trainer.rs — simplified

// 1. Apply activation to concatenated accumulator
//    CReLU:  clamp(x, 0, 1)
//    SCReLU: clamp(x, 0, 1)²
let acc_activated = [crelu(stm), crelu(nstm)];  // size: accumulator_size × 2

// 2. Hidden layers (each: linear transform + CReLU)
for each hidden layer k:
    raw[j]       = bias[k][j] + Σ(input[i] * weight[k][i][j])
    activated[j] = clamp(raw[j], 0, 1)

// 3. Output
output  = bias + Σ(last_hidden[j] * output_weight[j])
sigmoid = 1 / (1 + exp(-output))                // converts to probability [0, 1]
```
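Here is the same sequence of steps as a self-contained f32 sketch with toy sizes: a 2-wide accumulator per perspective, one 3-neuron hidden layer, and CReLU throughout. It uses plain arrays and made-up weights, not noru's trainer types:
```rust
fn crelu(x: f32) -> f32 { x.clamp(0.0, 1.0) }

// Toy forward pass: activate, concatenate, one hidden layer, scalar output.
fn forward(stm_acc: &[f32; 2], nstm_acc: &[f32; 2]) -> f32 {
    // 1. Activate and concatenate both perspectives → 4 inputs.
    let input: Vec<f32> = stm_acc.iter().chain(nstm_acc).map(|&x| crelu(x)).collect();

    // 2. One hidden layer, stored input-major as weight[input][output], then CReLU.
    let w_hidden: [[f32; 3]; 4] = [
        [0.5, -0.2, 0.1],
        [0.3, 0.4, -0.5],
        [0.2, 0.1, 0.6],
        [-0.1, 0.2, 0.3],
    ];
    let b_hidden: [f32; 3] = [0.05, -0.1, 0.0];
    let mut hidden = [0.0f32; 3];
    for j in 0..3 {
        let mut raw = b_hidden[j];
        for i in 0..4 {
            raw += input[i] * w_hidden[i][j];
        }
        hidden[j] = crelu(raw);
    }

    // 3. Output neuron, then sigmoid → win probability.
    let w_out: [f32; 3] = [0.7, -0.3, 0.5];
    let b_out: f32 = 0.1;
    let output: f32 = b_out + hidden.iter().zip(&w_out).map(|(h, w)| h * w).sum::<f32>();
    1.0 / (1.0 + (-output).exp())
}

fn main() {
    let p = forward(&[0.8, 0.2], &[0.1, 0.9]);
    println!("predicted win probability: {p:.3}");
}
```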
### Inference forward pass (i16 quantized)
The same computation, but using integer arithmetic for speed:
```rust
// from src/network.rs — forward()

// 1. ClippedReLU on accumulator (clamp to [0, 127])
simd::vec_clipped_relu(&mut prev[..acc_size], &acc.stm);
simd::vec_clipped_relu(&mut prev[acc_size..], &acc.nstm);

// 2. Hidden layers using SIMD dot products
for each hidden layer k:
    for each output neuron j:
        sum     = bias * ACTIVATION_SCALE + simd::dot_i16_i32(input, weight_row)
        next[j] = clipped_relu(sum / ACTIVATION_SCALE)

// 3. Output
output = (bias * OUTPUT_SCALE + dot(hidden, output_weights)) / OUTPUT_SCALE
```
### Why two versions?
- **FP32 (training)**: Full precision, needed for gradients to flow correctly during backpropagation
- **i16 (inference)**: ~4× faster, good enough for evaluation. The small rounding errors don't matter in practice.
---
## 6. Activation Functions
### CReLU (Clipped ReLU)
```
CReLU(x) = clamp(x, 0, max)
 max ┤             ┌─────────
     │           /
     │        /
   0 ┼──────┘
     └──────┴──────┴──────── x
            0      max
```
Simple: anything below 0 becomes 0, anything above max stays at max. This prevents values from exploding.
### SCReLU (Squared Clipped ReLU)
```
SCReLU(x) = clamp(x, 0, max)²
max² ┤             ╭─────────
     │            ╱
     │           ╱
     │         ╱
   0 ┼──────╯
     └──────┴──────┴──────── x
            0      max
```
The squaring gives the network more expressive power near zero (gentle curve instead of sharp corner). This helps learning converge better — Stockfish gained significant Elo by switching from CReLU to SCReLU.
In noru, **SCReLU is only applied to the first layer** (accumulator output). Deeper hidden layers always use CReLU. This follows the Stockfish pattern — applying SCReLU to narrow deep layers causes numerical issues in i16 quantized inference.
```rust
// from src/quant.rs
pub fn screlu_f32(val: f32, max: f32) -> f32 {
    let clamped = val.max(0.0).min(max);
    clamped * clamped
}
```
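For intuition, here is how the two activations compare at a few points. The `crelu_f32` below is just the obvious un-squared counterpart of `screlu_f32`, written inline for this sketch:
```rust
fn crelu_f32(val: f32, max: f32) -> f32 {
    val.max(0.0).min(max)
}
fn screlu_f32(val: f32, max: f32) -> f32 {
    let clamped = val.max(0.0).min(max);
    clamped * clamped
}

fn main() {
    for x in [-0.5, 0.1, 0.5, 0.9, 1.5] {
        // Near zero, SCReLU is much smaller (0.1 → 0.01); past max, both saturate.
        println!("x = {x:>4}: CReLU = {:.2}, SCReLU = {:.2}",
                 crelu_f32(x, 1.0), screlu_f32(x, 1.0));
    }
}
```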
---
## 7. Backpropagation — How the Network Learns
Training adjusts the weights so that the network's output gets closer to the target value.
### The Chain Rule
Backpropagation computes "how much does each weight contribute to the error?" by working backwards through the network:
```
Error at output
→ How much did each output weight contribute?
→ How much did each hidden neuron contribute?
→ How much did each hidden weight contribute?
→ How much did each accumulator value contribute?
→ How much did each feature weight contribute?
```
### Loss Functions
noru supports two loss functions:
**BCE (Binary Cross-Entropy)** — when the target is a win probability [0, 1]:
```rust
// Gradient at output = sigmoid(output) - target
pub fn backward(&self, sample, fwd, grad) {
    let d_output = fwd.sigmoid - sample.target;
    // ... propagate backwards
}
```
**MSE (Mean Squared Error)** — when the target is a raw score:
```rust
// Gradient at output = output - target
pub fn backward_mse(&self, sample, fwd, grad) {
    let d_output = fwd.output - sample.target;
    // ... propagate backwards
}
```
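A quick numeric sketch of the two output gradients, independent of noru's trainer types:
```rust
fn main() {
    // BCE: the network's raw output is squashed by a sigmoid first.
    let raw_output = 0.8_f32;
    let target = 1.0_f32; // the game was a win for the side to move
    let sigmoid = 1.0 / (1.0 + (-raw_output).exp());
    let d_output_bce = sigmoid - target; // ≈ -0.31: the update pushes the output up
    println!("BCE gradient at output: {d_output_bce:.3}");

    // MSE: the raw output is compared to the target score directly.
    let d_output_mse = raw_output - target; // = -0.2
    println!("MSE gradient at output: {d_output_mse:.3}");
}
```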
### Activation Derivatives
Gradients can only flow through activations that are in the "active" region:
- **CReLU derivative**: 1 if 0 < x < max, else 0 (gradient is killed at the boundaries)
- **SCReLU derivative**: 2x if 0 < x < max, else 0 (gradient scales with the input value)
```rust
// from src/quant.rs
pub fn crelu_grad_f32(val: f32, max: f32) -> f32 {
    if val > 0.0 && val < max { 1.0 } else { 0.0 }
}

pub fn screlu_grad_f32(val: f32, max: f32) -> f32 {
    if val > 0.0 && val < max { 2.0 * val } else { 0.0 }
}
```
### Sparse Feature Gradient
A key optimization: feature weights only receive gradients for **active features**. If feature 42 wasn't in the input, its weight row gets zero gradient — no computation needed.
```rust
// from src/trainer.rs — backward_inner() (simplified)
// Only update weights for features that were actually active
for &feat in &sample.stm_features {
    for i in 0..acc {
        grad.ft_weight[feat][i] += d_acc[i];
    }
}
```
---
## 8. Adam Optimizer
After computing gradients, Adam updates the weights:
```rust
// from src/trainer.rs — adam_step()
fn adam_step(param, grad, m, v, lr, bc1, bc2) {
    m = 0.9 * m + 0.1 * grad;            // momentum (smoothed gradient)
    v = 0.999 * v + 0.001 * grad²;       // velocity (smoothed squared gradient)
    m_hat = m / bc1;                     // bias correction
    v_hat = v / bc2;
    param -= lr * m_hat / (√v_hat + ε);  // update
}
```
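For reference, the same update as a self-contained f32 function with the constants from the pseudocode above. This is a sketch: the epsilon value is assumed, and noru's real adam_step operates on its own weight and moment buffers:
```rust
// One Adam step for a single parameter. bc1/bc2 are the bias-correction
// terms 1 - beta1^t and 1 - beta2^t, precomputed per batch.
fn adam_step(param: &mut f32, grad: f32, m: &mut f32, v: &mut f32,
             lr: f32, bc1: f32, bc2: f32) {
    const EPS: f32 = 1e-8; // assumed value, not taken from noru
    *m = 0.9 * *m + 0.1 * grad;            // momentum (smoothed gradient)
    *v = 0.999 * *v + 0.001 * grad * grad; // velocity (smoothed squared gradient)
    let m_hat = *m / bc1;                  // bias correction
    let v_hat = *v / bc2;
    *param -= lr * m_hat / (v_hat.sqrt() + EPS);
}

fn main() {
    let (mut param, mut m, mut v) = (0.5_f32, 0.0, 0.0);
    // First step: bc1 = 1 - 0.9¹ = 0.1, bc2 = 1 - 0.999¹ = 0.001.
    adam_step(&mut param, 0.2, &mut m, &mut v, 0.001, 0.1, 0.001);
    println!("param after one step: {param:.4}"); // moved opposite the gradient
}
```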
Why Adam over plain SGD?
- **Momentum (m)**: smooths out noisy gradients, prevents oscillation
- **Velocity (v)**: adapts the learning rate per parameter by dividing by √v_hat, so steps stay well-scaled whether a weight's gradients are large or tiny
- **Bias correction**: compensates for the zero-initialization of m and v in early steps
---
## 9. Quantization — FP32 to i16
After training, weights are converted from f32 to i16 for fast inference:
```rust
// from src/trainer.rs — quantize()
let scale = WEIGHT_SCALE as f32; // 64
// Each f32 weight is multiplied by 64 and rounded to i16
weights.feature_weights[feat][i] = (row[i] * scale).round() as i16;
```
### Why 64?
A weight of `0.015625` (1/64) becomes `1` in i16. This gives us precision of ~0.016 per step, which is sufficient for evaluation. The scaling factors in inference (`ACTIVATION_SCALE = 256`, `OUTPUT_SCALE = 16`) compensate so the final result is correct despite integer rounding.
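A quick sanity check on that arithmetic, in plain Rust rather than noru's quantize():
```rust
fn main() {
    let scale = 64.0_f32; // WEIGHT_SCALE
    for w in [0.015625_f32, 0.5, -0.3, 0.011] {
        let q = (w * scale).round() as i16;
        let back = q as f32 / scale;
        // 0.015625 → 1, 0.5 → 32, -0.3 → -19 (≈ -0.297), 0.011 → 1 (≈ 0.016)
        println!("{w:>9} → i16 {q:>3} → back to {back:.4}");
    }
}
```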
### Layout Transposition
An important detail: during quantization, hidden layer weights are **transposed** from input-major (good for training) to output-major (good for SIMD inference):
```
Training:  weights[input_idx][output_idx]              — iterate over inputs for backprop
Inference: weights[output_idx * in_size + input_idx]   — contiguous row per output neuron for dot product
```
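A minimal sketch of that transposition with toy sizes (noru performs it inside quantize()):
```rust
// Transpose a 2-input × 3-output layer from input-major (training) to
// output-major (inference), so each output neuron's weights become one
// contiguous row for the SIMD dot product.
fn main() {
    let in_size = 2;
    let out_size = 3;
    let training: [[f32; 3]; 2] = [
        [1.0, 2.0, 3.0], // weights leaving input 0
        [4.0, 5.0, 6.0], // weights leaving input 1
    ];

    let mut inference = vec![0.0f32; in_size * out_size];
    for o in 0..out_size {
        for i in 0..in_size {
            inference[o * in_size + i] = training[i][o];
        }
    }
    // Output neuron 0's row is [1, 4], neuron 1's is [2, 5], neuron 2's is [3, 6].
    assert_eq!(inference, vec![1.0, 4.0, 2.0, 5.0, 3.0, 6.0]);
}
```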
---
## 10. SIMD — Hardware Acceleration
SIMD (Single Instruction, Multiple Data) processes multiple values in one CPU instruction:
```
Scalar:  a[0]*b[0], a[1]*b[1], a[2]*b[2], ...   (one at a time)
AVX2:    a[0..16] * b[0..16]                    (16 multiplications in ONE instruction)
NEON:    a[0..8]  * b[0..8]                     (8 multiplications in ONE instruction)
```
noru accelerates five operations with SIMD:

| Function | Operation | Used for |
|---|---|---|
| `vec_add_i16` | Saturating vector add | Accumulator refresh/update |
| `vec_sub_i16` | Saturating vector sub | Accumulator update (remove features) |
| `vec_clipped_relu` | Clamp to [0, 127] | Activation after accumulator |
| `dot_i16_i32` | Dot product → i32 | Hidden layer forward (CReLU) |
| `dot_screlu_i64` | Squared dot → i64 | Hidden layer forward (SCReLU) |
The dispatch is automatic:
```rust
// from src/simd/mod.rs
pub fn vec_add_i16(acc: &mut [i16], w: &[i16]) {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        unsafe { avx2::vec_add_i16(acc, w) }; return;
    }
    #[cfg(target_arch = "aarch64")]
    { unsafe { neon::vec_add_i16(acc, w) }; return; }
    scalar::vec_add_i16(acc, w); // fallback
}
```
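The scalar fallback defines the semantics the SIMD paths must reproduce. A minimal sketch of what a saturating i16 add can look like (illustrative, not copied from noru's src/simd/scalar.rs):
```rust
// Scalar reference: element-wise saturating add of a weight row into the
// accumulator. AVX2/NEON do the same thing 16 or 8 lanes at a time.
pub fn vec_add_i16(acc: &mut [i16], w: &[i16]) {
    debug_assert_eq!(acc.len(), w.len());
    for (a, &b) in acc.iter_mut().zip(w) {
        *a = a.saturating_add(b);
    }
}

fn main() {
    let mut acc = vec![100i16, -5, 32_000];
    vec_add_i16(&mut acc, &[1, 2, 1_000]);
    assert_eq!(acc, vec![101, -3, 32_767]); // last lane saturates at i16::MAX
}
```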
---
## 11. Binary Model Format
Trained models are saved in noru's v2 binary format:
```
┌──────────────────────────────────┐
│              Header              │
│  magic: "NORU"      (4 bytes)    │
│  version: 2         (4 bytes)    │
│  feature_size       (4 bytes)    │
│  accumulator_size   (4 bytes)    │
│  num_hidden_layers  (4 bytes)    │
│  hidden_sizes[...]  (4 each)     │
│  activation         (1 byte)     │
├──────────────────────────────────┤
│  Feature weights (i16)           │
│  Feature bias (i16)              │
├──────────────────────────────────┤
│  Hidden layer 0 weights + bias   │
│  Hidden layer 1 weights + bias   │
│  ...                             │
├──────────────────────────────────┤
│  Output weights + bias (i16)     │
└──────────────────────────────────┘
```
The header includes the full network configuration, so a model file is self-describing:
```rust
// Loading auto-detects the format
let weights = NnueWeights::load_from_bytes(&data, None)?; // reads config from header
```
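If you ever need to inspect a model file by hand, the header can be walked field by field. This sketch assumes the multi-byte fields are little-endian, which the layout above does not state, and it skips bounds checking:
```rust
// Hypothetical reader for the header above; field order follows the diagram.
fn read_u32(data: &[u8], off: &mut usize) -> u32 {
    let v = u32::from_le_bytes(data[*off..*off + 4].try_into().unwrap());
    *off += 4;
    v
}

fn print_header(data: &[u8]) {
    assert_eq!(&data[0..4], b"NORU", "not a noru model file");
    let mut off = 4;
    let version = read_u32(data, &mut off);
    let feature_size = read_u32(data, &mut off);
    let accumulator_size = read_u32(data, &mut off);
    let num_hidden_layers = read_u32(data, &mut off);
    let hidden_sizes: Vec<u32> = (0..num_hidden_layers)
        .map(|_| read_u32(data, &mut off))
        .collect();
    let activation = data[off]; // 1 byte; the weight sections follow
    println!(
        "v{version}: {feature_size} features, acc {accumulator_size}, \
         hidden {hidden_sizes:?}, activation {activation}"
    );
}
```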
---
## 12. Putting It All Together
### Training Pipeline
```
1. Design features for your game
2. Generate training data (self-play, expert games, or distillation from a heuristic)
3. Create NnueConfig with desired architecture
4. TrainableWeights::init_random()
5. Loop:
   a. forward() → get prediction
   b. backward() → get gradients
   c. adam_update() → adjust weights
6. quantize() → convert to i16
7. save_to_bytes() → write model file
```
### Inference Pipeline (in a game engine)
```
1. load_from_bytes() → load model
2. Accumulator::new() → initialize from bias
3. At root position: refresh() with all active features
4. For each move in search tree:
   a. update_incremental() → fast accumulator update
   b. forward() → get evaluation score
   c. Use score in alpha-beta / minimax
   d. After undoing the move: update_incremental_undo()
```
### Why NNUE Beats Handcrafted Evaluation
- **Learns patterns** that humans can't easily encode in rules
- **Adapts to any game** — just change the features and retrain
- **Fast enough for deep search** — the incremental update + SIMD makes it practical at millions of evaluations per second