# NNUE Explained — Through the noru Codebase

This document explains how NNUE works using noru's actual source code as examples.
No prior neural network knowledge required.

---

## 1. What is NNUE?

NNUE stands for **Efficiently Updatable Neural Network**. It's a small neural network designed to be:

- **Extremely fast** — evaluated millions of times per second during game tree search
- **Incrementally updatable** — when a piece moves, only a small part of the network needs recalculation

It was invented for Shogi (Japanese chess) and later adopted by Stockfish for chess; noru makes it available for **any** two-player board game.

### The Core Idea

In a game engine, you need to evaluate positions: "How good is this board state for the current player?"

Traditional approach: hand-coded rules (material count, piece positions, patterns).
NNUE approach: a neural network learns this evaluation from data.

The trick is that NNUE is structured so that most of the computation can be **reused** between consecutive positions in a search tree.

---

## 2. Network Architecture

Here's what the network looks like:

```
┌─────────────────────────────────────────────────────┐
│                   SPARSE INPUT                       │
│  Active feature indices: [0, 42, 100, 350]          │
│  (out of feature_size possible features)             │
└────────────────────┬────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│              FEATURE TRANSFORMER                     │
│                                                      │
│  For each active feature index, look up its weight   │
│  row and add it to the accumulator.                  │
│                                                      │
│  accumulator = bias + Σ weights[feature_i]           │
│                                                      │
│  This is done TWICE — once for each perspective:     │
│  • STM  (Side To Move — current player's view)       │
│  • NSTM (Non-Side To Move — opponent's view)         │
└────────────────────┬────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│              ACTIVATION (CReLU or SCReLU)            │
│                                                      │
│  CReLU:  clamp(x, 0, 1)     — simple clipping       │
│  SCReLU: clamp(x, 0, 1)²    — squaring after clip   │
│                                                      │
│  Then concatenate both perspectives:                 │
│  [STM activated | NSTM activated]                    │
│  Size: accumulator_size × 2                          │
└────────────────────┬────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│              HIDDEN LAYERS                           │
│                                                      │
│  One or more dense layers, each followed by CReLU.   │
│                                                      │
│  hidden_sizes: &[64]         → one layer of 64       │
│  hidden_sizes: &[256, 32, 32] → three layers         │
│                                                      │
│  Each layer: output = CReLU(weight × input + bias)   │
└────────────────────┬────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│              OUTPUT (single value)                    │
│                                                      │
│  eval = dot(last_hidden, output_weights) + bias      │
│                                                      │
│  This number represents: "How good is this position  │
│  for the side to move?"                              │
└─────────────────────────────────────────────────────┘
```

In noru, this entire architecture is configured with one struct:

```rust
// from src/config.rs

pub struct NnueConfig {
    pub feature_size: usize,          // how many possible features your game has
    pub accumulator_size: usize,      // width of the accumulator (per perspective)
    pub hidden_sizes: &'static [usize], // hidden layer sizes
    pub activation: Activation,       // CReLU or SCReLU (first layer only)
}
```
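
For example, a chess-sized network (768 piece-square features, a 256-wide accumulator, one hidden layer of 64) could be described like this. The numbers are illustrative, and the `Activation` variant name is assumed from the field's comment above:

```rust
let config = NnueConfig {
    feature_size: 768,              // e.g. 12 piece types × 64 squares
    accumulator_size: 256,          // per perspective; concatenated input is 512 wide
    hidden_sizes: &[64],            // one hidden layer of 64 neurons
    activation: Activation::SCReLU, // variant name assumed
};
```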

---

## 3. Sparse Features — How Games Become Numbers

NNUE doesn't take a dense vector like `[0.0, 0.0, 1.0, 0.0, ...]`. Instead, it takes a **list of active feature indices**:

```
Features: [0, 42, 100, 350]
```

This means "feature 0 is active, feature 42 is active, feature 100 is active, feature 350 is active. All others are inactive."

### Why sparse?

In a chess position, there are ~30 pieces on a 64-square board. If your feature set encodes "piece X on square Y", you might have 768 possible features, but only ~30 are active at any time. Passing 30 indices is much cheaper than passing 768 floats.
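
Feature design is entirely up to you; a chess-style encoding might be a sketch like:

```rust
// Hypothetical "piece type × square" scheme: 12 piece types × 64 squares = 768 features.
fn feature_index(piece_type: usize, square: usize) -> usize {
    piece_type * 64 + square
}

// Only pieces actually on the board produce indices, so a position
// yields ~30 active features out of the 768 possible ones.
```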

### Why two perspectives?

The same board position looks different depending on who's moving. In chess, having a rook on the 7th rank is great if it's YOUR rook, bad if it's your opponent's.

NNUE evaluates from **both** perspectives simultaneously:
- **STM features**: what the current player sees
- **NSTM features**: what the opponent sees

```rust
// from src/trainer.rs

pub struct TrainingSample {
    pub stm_features: Vec<usize>,   // current player's active features
    pub nstm_features: Vec<usize>,  // opponent's active features
    pub target: f32,                 // desired evaluation (0.0 = loss, 1.0 = win)
}
```
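
A single sample then carries one position seen from both sides, plus the result to learn. The values here are hypothetical:

```rust
let sample = TrainingSample {
    stm_features: vec![0, 42, 100, 350],  // the position as the side to move sees it
    nstm_features: vec![7, 63, 120, 410], // the same position, mirrored for the opponent
    target: 1.0,                          // the side to move went on to win
};
```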

---

## 4. The Accumulator — NNUE's Key Innovation

This is what makes NNUE special. The accumulator is simply:

```
accumulator = bias + Σ feature_weights[active_feature]
```

It's a vector (size = `accumulator_size`) that sums up the weight rows of all active features.

```rust
// from src/network.rs — Accumulator::refresh()

pub fn refresh(&mut self, weights: &NnueWeights, stm_features: &[usize], nstm_features: &[usize]) {
    self.stm.copy_from_slice(&weights.feature_bias);   // start from bias
    self.nstm.copy_from_slice(&weights.feature_bias);

    for &feat in stm_features {
        simd::vec_add_i16(&mut self.stm, &weights.feature_weights[feat]);  // add each feature's row
    }
    for &feat in nstm_features {
        simd::vec_add_i16(&mut self.nstm, &weights.feature_weights[feat]);
    }
}
```

### Why is this fast?

In a game tree, when you make a move, typically only **1-2 features change** (one piece moves = one feature removed, one added). Instead of recomputing the entire accumulator, you just:

```
accumulator += weights[new_feature]
accumulator -= weights[old_feature]
```

This is **O(accumulator_size)** per changed feature, instead of the **O(active_features × accumulator_size)** cost of a full refresh (or **O(feature_size × accumulator_size)** for a naive dense layer).

```rust
// from src/network.rs — incremental update

fn apply_delta(acc: &mut [i16], weights: &NnueWeights, delta: &FeatureDelta) {
    for i in 0..delta.num_removed {
        simd::vec_sub_i16(acc, &weights.feature_weights[delta.removed[i]]);
    }
    for i in 0..delta.num_added {
        simd::vec_add_i16(acc, &weights.feature_weights[delta.added[i]]);
    }
}
```

### Concrete Example

Say `accumulator_size = 4` and we have 3 active features:

```
bias           = [10, 20, 30, 40]
weight[feat_0] = [ 1,  2,  3,  4]
weight[feat_5] = [ 5, -1,  0,  2]
weight[feat_9] = [-2,  3,  1, -1]

accumulator = [10, 20, 30, 40]   ← start from bias
            + [ 1,  2,  3,  4]   ← add feat_0
            + [ 5, -1,  0,  2]   ← add feat_5
            + [-2,  3,  1, -1]   ← add feat_9
            = [14, 24, 34, 45]
```

Now if feat_5 is removed and feat_7 is added:

```
accumulator = [14, 24, 34, 45]
            - [ 5, -1,  0,  2]   ← remove feat_5
            + [ 3,  0,  2,  1]   ← add feat_7
            = [12, 25, 36, 44]
```

Only 2 vector operations instead of rebuilding from scratch!
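
The same arithmetic as standalone Rust (plain loops here; noru routes these operations through its SIMD helpers):

```rust
fn vec_add(acc: &mut [i16; 4], row: &[i16; 4]) {
    for i in 0..4 { acc[i] = acc[i].saturating_add(row[i]); }
}
fn vec_sub(acc: &mut [i16; 4], row: &[i16; 4]) {
    for i in 0..4 { acc[i] = acc[i].saturating_sub(row[i]); }
}

fn main() {
    let bias   = [10, 20, 30, 40];
    let feat_5 = [ 5, -1,  0,  2];
    let feat_7 = [ 3,  0,  2,  1];

    let mut acc = bias;                  // full refresh: bias + all active rows
    vec_add(&mut acc, &[1, 2, 3, 4]);    // feat_0
    vec_add(&mut acc, &feat_5);
    vec_add(&mut acc, &[-2, 3, 1, -1]);  // feat_9
    assert_eq!(acc, [14, 24, 34, 45]);

    vec_sub(&mut acc, &feat_5);          // incremental: feat_5 leaves the board
    vec_add(&mut acc, &feat_7);          // feat_7 appears
    assert_eq!(acc, [12, 25, 36, 44]);
}
```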

---

## 5. Forward Pass — Computing the Evaluation

After the accumulator is ready, the rest of the network is a standard feedforward neural network.

### Training forward pass (FP32)

```
// from src/trainer.rs — simplified pseudocode

// 1. Apply the activation to the concatenated accumulator
//    CReLU:  clamp(x, 0, 1)
//    SCReLU: clamp(x, 0, 1)²
acc_activated = [activate(stm) | activate(nstm)]    // size: accumulator_size × 2

// 2. Hidden layers (each: linear transform + CReLU)
for each hidden layer k:
    raw[j]       = bias[k][j] + Σ(input[i] * weight[k][i][j])
    activated[j] = clamp(raw[j], 0, 1)

// 3. Output
output  = bias + Σ(last_hidden[j] * output_weight[j])
sigmoid = 1 / (1 + exp(-output))    // converts the score to a probability in [0, 1]
```

### Inference forward pass (i16 quantized)

The same computation, but using integer arithmetic for speed:

```
// from src/network.rs — forward(), simplified pseudocode

// 1. ClippedReLU on the accumulator (clamp each half to [0, 127])
simd::vec_clipped_relu(&mut prev[..acc_size], &acc.stm);
simd::vec_clipped_relu(&mut prev[acc_size..], &acc.nstm);

// 2. Hidden layers using SIMD dot products
for each hidden layer k:
    for each output neuron j:
        sum     = bias[j] * ACTIVATION_SCALE + simd::dot_i16_i32(input, weight_row[j])
        next[j] = clipped_relu(sum / ACTIVATION_SCALE)

// 3. Output
output = (bias * OUTPUT_SCALE + dot(hidden, output_weights)) / OUTPUT_SCALE
```

### Why two versions?

- **FP32 (training)**: Full precision, needed for gradients to flow correctly during backpropagation
- **i16 (inference)**: ~4× faster, good enough for evaluation. The small rounding errors don't matter in practice.
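
To make the FP32 path concrete, here is a standalone single-hidden-layer version with plain `Vec`s (an illustration of the math above, not noru's internal types):

```rust
fn crelu(x: f32) -> f32 {
    x.clamp(0.0, 1.0)
}

/// FP32 forward pass for one hidden layer.
/// `acc` is the concatenated [stm | nstm] accumulator (already summed),
/// `w_hidden` is input-major: w_hidden[input][output].
fn forward_f32(acc: &[f32], w_hidden: &[Vec<f32>], b_hidden: &[f32],
               w_out: &[f32], b_out: f32) -> f32 {
    // Steps 1 + 2: activate the accumulator, feed it through the dense layer.
    let mut hidden: Vec<f32> = b_hidden.to_vec();
    for (i, &x) in acc.iter().enumerate() {
        let a = crelu(x);
        for (j, h) in hidden.iter_mut().enumerate() {
            *h += a * w_hidden[i][j];
        }
    }
    for h in hidden.iter_mut() {
        *h = crelu(*h);
    }
    // Step 3: output neuron, then sigmoid to get a win probability.
    let output = b_out + hidden.iter().zip(w_out).map(|(h, w)| h * w).sum::<f32>();
    1.0 / (1.0 + (-output).exp())
}
```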

---

## 6. Activation Functions

### CReLU (Clipped ReLU)

```
CReLU(x) = clamp(x, 0, max)

 max             __________
                /
               /
 0 ___________/
              0   max
```

Simple: anything below 0 becomes 0, anything above max stays at max. This prevents values from exploding.

### SCReLU (Squared Clipped ReLU)

```
SCReLU(x) = clamp(x, 0, max)²

 max²            _________
                /
               /
 0 ________,--'
           0    max
```

The squaring gives the network more expressive power near zero (gentle curve instead of sharp corner). This helps learning converge better — Stockfish gained significant Elo by switching from CReLU to SCReLU.

In noru, **SCReLU is only applied to the first layer** (accumulator output). Deeper hidden layers always use CReLU. This follows the Stockfish pattern — applying SCReLU to narrow deep layers causes numerical issues in i16 quantized inference.

```rust
// from src/quant.rs

pub fn screlu_f32(val: f32, max: f32) -> f32 {
    let clamped = val.max(0.0).min(max);
    clamped * clamped
}
```
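
A few concrete values, using the training-time range max = 1.0:

```rust
assert_eq!(screlu_f32(-0.5, 1.0), 0.0);  // below 0: clamped to 0, squared → 0
assert_eq!(screlu_f32( 0.5, 1.0), 0.25); // in range: 0.5² = 0.25
assert_eq!(screlu_f32( 2.0, 1.0), 1.0);  // above max: clamped to 1, squared → 1
```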

---

## 7. Backpropagation — How the Network Learns

Training adjusts the weights so that the network's output gets closer to the target value.

### The Chain Rule

Backpropagation computes "how much does each weight contribute to the error?" by working backwards through the network:

```
Error at output
  → How much did each output weight contribute?
    → How much did each hidden neuron contribute?
      → How much did each hidden weight contribute?
        → How much did each accumulator value contribute?
          → How much did each feature weight contribute?
```

### Loss Functions

noru supports two loss functions:

**BCE (Binary Cross-Entropy)** — when the target is a win probability [0, 1]:
```rust
// Gradient at output = sigmoid(output) - target
pub fn backward(&self, sample, fwd, grad) {
    let d_output = fwd.sigmoid - sample.target;
    // ... propagate backwards
}
```
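
That compact gradient is why BCE pairs so well with a sigmoid output: writing `p = sigmoid(output)`, the chain rule collapses to

```
BCE(p, target) = -(target·ln(p) + (1 - target)·ln(1 - p))
dBCE/d(output) = p - target        // the sigmoid derivative p·(1 - p) cancels
```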

**MSE (Mean Squared Error)** — when the target is a raw score:
```rust
// Gradient at output = output - target
pub fn backward_mse(&self, sample, fwd, grad) {
    let d_output = fwd.output - sample.target;
    // ... propagate backwards
}
```

### Activation Derivatives

Gradients can only flow through activations that are in the "active" region:

- **CReLU derivative**: 1 if 0 < x < max, else 0 (gradient is killed at the boundaries)
- **SCReLU derivative**: 2x if 0 < x < max, else 0 (gradient scales with the input value)

```rust
// from src/quant.rs

pub fn crelu_grad_f32(val: f32, max: f32) -> f32 {
    if val > 0.0 && val < max { 1.0 } else { 0.0 }
}

pub fn screlu_grad_f32(val: f32, max: f32) -> f32 {
    if val > 0.0 && val < max { 2.0 * val } else { 0.0 }
}
```

### Sparse Feature Gradient

A key optimization: feature weights only receive gradients for **active features**. If feature 42 wasn't in the input, its weight row gets zero gradient — no computation needed.

```rust
// from src/trainer.rs — backward_inner() (simplified)

// Only update weights for features that were actually active
for &feat in &sample.stm_features {
    for i in 0..acc_size {
        grad.ft_weight[feat][i] += d_acc[i];
    }
}
```

---

## 8. Adam Optimizer

After computing gradients, Adam updates the weights:

```
// from src/trainer.rs — adam_step(param, grad, m, v, lr, bc1, bc2), simplified

m = 0.9 * m + 0.1 * grad             // momentum (smoothed gradient)
v = 0.999 * v + 0.001 * grad²        // velocity (smoothed squared gradient)
m_hat = m / bc1                      // bias correction
v_hat = v / bc2
param -= lr * m_hat / (√v_hat + ε)   // update
```

Why Adam over plain SGD?
- **Momentum (m)**: smooths out noisy gradients, prevents oscillation
- **Velocity (v)**: adapts the step size per parameter; dividing by √v̂ normalizes updates, so parameters with consistently small gradients still take meaningful steps
- **Bias correction**: compensates for the zero-initialization of m and v in early steps
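
As a self-contained illustration (a scalar sketch, not noru's exact signature), here is the same update in runnable Rust, including how `bc1` and `bc2` are derived from the step count:

```rust
// Scalar Adam step; conceptually this is applied element-wise to every parameter.
fn adam_step(param: &mut f32, grad: f32, m: &mut f32, v: &mut f32,
             lr: f32, bc1: f32, bc2: f32) {
    *m = 0.9 * *m + 0.1 * grad;                   // momentum
    *v = 0.999 * *v + 0.001 * grad * grad;        // velocity
    let m_hat = *m / bc1;                         // bias-corrected momentum
    let v_hat = *v / bc2;                         // bias-corrected velocity
    *param -= lr * m_hat / (v_hat.sqrt() + 1e-8); // 1e-8 is a typical choice for ε
}

fn main() {
    let (mut w, mut m, mut v) = (0.5_f32, 0.0_f32, 0.0_f32);
    for t in 1..=100 {
        let grad = 2.0 * w;                       // toy objective: minimize w²
        let bc1 = 1.0 - 0.9_f32.powi(t);          // bc1 = 1 - β₁ᵗ
        let bc2 = 1.0 - 0.999_f32.powi(t);        // bc2 = 1 - β₂ᵗ
        adam_step(&mut w, grad, &mut m, &mut v, 0.01, bc1, bc2);
    }
    println!("w after 100 steps: {w}");           // converges toward 0
}
```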

---

## 9. Quantization — FP32 to i16

After training, weights are converted from f32 to i16 for fast inference:

```rust
// from src/trainer.rs — quantize()

let scale = WEIGHT_SCALE as f32;  // 64

// Each f32 weight is multiplied by 64 and rounded to i16
weights.feature_weights[feat][i] = (row[i] * scale).round() as i16;
```

### Why 64?

A weight of `0.015625` (1/64) becomes `1` in i16. This gives us precision of ~0.016 per step, which is sufficient for evaluation. The scaling factors in inference (`ACTIVATION_SCALE = 256`, `OUTPUT_SCALE = 16`) compensate so the final result is correct despite integer rounding.
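
A quick round-trip shows the error bound:

```rust
// Hypothetical value, using WEIGHT_SCALE = 64 as above
let w: f32 = 0.30;
let q = (w * 64.0).round() as i16;  // 19
let restored = q as f32 / 64.0;     // 0.296875, within half a step (1/128) of w
```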

### Layout Transposition

An important detail: during quantization, hidden layer weights are **transposed** from input-major (good for training) to output-major (good for SIMD inference):

```
Training:  weights[input_idx][output_idx]   — iterate over inputs for backprop
Inference: weights[output_idx * in_size + input_idx]  — contiguous row per output neuron for dot product
```
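
As a sketch (the names here are hypothetical), the quantize-and-transpose step for one layer is just an index swap:

```rust
// `training` is input-major (training[input][output]); the result is output-major,
// one contiguous row of in_size weights per output neuron, quantized at scale 64.
fn quantize_transpose(training: &[Vec<f32>], in_size: usize, out_size: usize) -> Vec<i16> {
    let mut inference = vec![0i16; in_size * out_size];
    for o in 0..out_size {
        for i in 0..in_size {
            inference[o * in_size + i] = (training[i][o] * 64.0).round() as i16;
        }
    }
    inference
}
```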

---

## 10. SIMD — Hardware Acceleration

SIMD (Single Instruction, Multiple Data) processes multiple values in one CPU instruction:

```
Scalar:  a[0]*b[0], a[1]*b[1], a[2]*b[2], ... (one at a time)
AVX2:    a[0..16] * b[0..16]  (16 multiplications in ONE instruction)
NEON:    a[0..8] * b[0..8]    (8 multiplications in ONE instruction)
```

noru accelerates five operations with SIMD:

| Operation | What it does | Where it's used |
|-----------|-------------|-----------------|
| `vec_add_i16` | Saturating vector add | Accumulator refresh/update |
| `vec_sub_i16` | Saturating vector sub | Accumulator update (remove features) |
| `vec_clipped_relu` | Clamp to [0, 127] | Activation after accumulator |
| `dot_i16_i32` | Dot product → i32 | Hidden layer forward (CReLU) |
| `dot_screlu_i64` | Squared dot → i64 | Hidden layer forward (SCReLU) |

The dispatch is automatic:

```rust
// from src/simd/mod.rs

pub fn vec_add_i16(acc: &mut [i16], w: &[i16]) {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        unsafe { avx2::vec_add_i16(acc, w) }; return;
    }
    #[cfg(target_arch = "aarch64")]
    { unsafe { neon::vec_add_i16(acc, w) }; return; }
    scalar::vec_add_i16(acc, w);  // fallback
}
```
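
The scalar fallback is the reference that the SIMD paths must agree with; a plausible shape for it (a reconstruction, not the literal source):

```rust
// e.g. src/simd/scalar.rs — portable element-wise version
pub fn vec_add_i16(acc: &mut [i16], w: &[i16]) {
    for (a, &b) in acc.iter_mut().zip(w) {
        *a = a.saturating_add(b);  // saturating, like the AVX2/NEON versions
    }
}
```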

---

## 11. Binary Model Format

Trained models are saved in noru's v2 binary format:

```
┌──────────────────────────────────┐
│ Header                           │
│  magic: "NORU" (4 bytes)         │
│  version: 2 (4 bytes)            │
│  feature_size (4 bytes)          │
│  accumulator_size (4 bytes)      │
│  num_hidden_layers (4 bytes)     │
│  hidden_sizes[...] (4 each)      │
│  activation (1 byte)             │
├──────────────────────────────────┤
│ Feature weights (i16)            │
│ Feature bias (i16)               │
├──────────────────────────────────┤
│ Hidden layer 0 weights + bias    │
│ Hidden layer 1 weights + bias    │
│ ...                              │
├──────────────────────────────────┤
│ Output weights + bias (i16)      │
└──────────────────────────────────┘
```

The header includes the full network configuration, so a model file is self-describing:

```rust
// Loading auto-detects the format
let weights = NnueWeights::load_from_bytes(&data, None)?;  // reads config from header
```
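
Given the layout above, validating a file's first eight bytes is straightforward (a sketch that assumes little-endian integer fields):

```rust
fn looks_like_noru_v2(data: &[u8]) -> bool {
    data.len() >= 8
        && &data[0..4] == b"NORU"                                         // magic
        && u32::from_le_bytes([data[4], data[5], data[6], data[7]]) == 2  // version
}
```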

---

## 12. Putting It All Together

### Training Pipeline

```
1. Design features for your game
2. Generate training data (self-play, expert games, or distillation from a heuristic)
3. Create NnueConfig with desired architecture
4. TrainableWeights::init_random()
5. Loop:
   a. forward() → get prediction
   b. backward() → get gradients
   c. adam_update() → adjust weights
6. quantize() → convert to i16
7. save_to_bytes() → write model file
```
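
Roughly, in code (the type and method names are the ones this document has already shown, but the exact signatures are illustrative guesses, not noru's literal API):

```rust
// Sketch only: argument lists are illustrative.
let config = NnueConfig {
    feature_size: 768,
    accumulator_size: 256,
    hidden_sizes: &[64],
    activation: Activation::SCReLU,             // variant name assumed
};
let mut weights = TrainableWeights::init_random(&config);

for _epoch in 0..20 {
    for sample in &samples {                    // samples: Vec<TrainingSample>
        let fwd = weights.forward(sample);      // step 5a: prediction
        let grad = weights.backward(sample, &fwd); // step 5b: gradients
        weights.adam_update(&grad, 0.001);      // step 5c: Adam step
    }
}

let model = weights.quantize();                 // step 6: f32 → i16
std::fs::write("model.nnue", model.save_to_bytes())?; // step 7
```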

### Inference Pipeline (in a game engine)

```
1. load_from_bytes() → load model
2. Accumulator::new() → initialize from bias
3. At root position: refresh() with all active features
4. For each move in search tree:
   a. update_incremental() → fast accumulator update
   b. forward() → get evaluation score
   c. Use score in alpha-beta / minimax
   d. After undoing the move: update_incremental_undo()
```
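
The engine side looks like this (again a sketch; `load_from_bytes`, `Accumulator::new`, and `refresh` appear earlier in this document, the rest is illustrative):

```rust
// Sketch only: search integration details depend on your engine.
let data = std::fs::read("model.nnue")?;
let weights = NnueWeights::load_from_bytes(&data, None)?;  // self-describing v2 file

let mut acc = Accumulator::new(&weights);
acc.refresh(&weights, &stm_features, &nstm_features);      // root position

// Inside the search, per move:
//   acc.update_incremental(...);          // O(accumulator_size) per changed feature
//   let score = forward(&weights, &acc);  // quantized i16 evaluation
//   ...search deeper, then on unmake:
//   acc.update_incremental_undo(...);
```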

### Why NNUE Beats Handcrafted Evaluation

- **Learns patterns** that humans can't easily encode in rules
- **Adapts to any game** — just change the features and retrain
- **Fast enough for deep search** — the incremental update + SIMD makes it practical at millions of evaluations per second