![NORU logo](.github/logo.png)

# NORU

**N**NUE **O**n **RU**st — Zero-dependency NNUE training & inference library in pure Rust.

## What is NNUE?

[NNUE](https://www.chessprogramming.org/NNUE) (Efficiently Updatable Neural Network) is a neural network architecture designed for fast evaluation in game engines. Originally developed for Shogi and adopted by Stockfish, NNUE enables real-time neural network inference through incremental accumulator updates.
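
The incremental-update idea can be sketched in a few lines (a conceptual illustration, not noru's API): with a sparse 0/1 input, the first layer's output is just the sum of the active features' weight columns, so applying a move costs one column add or subtract instead of a full matrix-vector product.

```rust
// Conceptual sketch of NNUE's incremental accumulator update.
const ACC: usize = 4;   // accumulator width (tiny for illustration)
const FEATS: usize = 6; // total feature count

/// Full recomputation: sum the weight columns of all active features.
fn refresh(w: &[[f32; ACC]; FEATS], active: &[usize]) -> [f32; ACC] {
    let mut acc = [0.0; ACC];
    for &f in active {
        for i in 0..ACC {
            acc[i] += w[f][i];
        }
    }
    acc
}

/// Incremental update: add columns for new features, subtract removed ones.
fn update(acc: &mut [f32; ACC], w: &[[f32; ACC]; FEATS], added: &[usize], removed: &[usize]) {
    for &f in added {
        for i in 0..ACC {
            acc[i] += w[f][i];
        }
    }
    for &f in removed {
        for i in 0..ACC {
            acc[i] -= w[f][i];
        }
    }
}

fn main() {
    // Deterministic toy weights: w[f][i] = (f + 1) * (i + 1).
    let mut w = [[0.0; ACC]; FEATS];
    for f in 0..FEATS {
        for i in 0..ACC {
            w[f][i] = ((f + 1) * (i + 1)) as f32;
        }
    }
    // A "move": feature 1 is replaced by feature 5.
    let mut acc = refresh(&w, &[0, 1, 3]);
    update(&mut acc, &w, &[5], &[1]);
    // The cheap incremental path matches a full recomputation.
    assert_eq!(acc, refresh(&w, &[0, 3, 5]));
    println!("incremental update matches refresh: {:?}", acc);
}
```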

## What is NORU?

NORU is a **game-agnostic** NNUE library that provides both training and inference in a single, dependency-free Rust crate. Configure your network dimensions at runtime via `NnueConfig` — no recompilation needed.

### Why this library?

Most NNUE code in the wild lives inside a specific chess engine (Stockfish, Rapfi, …) and is hard-wired to that engine's feature layout in C++. Applying NNUE to a different game — Gomoku, Connect 4, a tactical hex-grid battler — traditionally means either forking one of those engines and rewriting its feature encoder, or re-implementing training from scratch in Python with PyTorch and then writing a separate C++ inference path for deployment.

NORU collapses that pipeline into one pure-Rust crate:

- **One crate for training and inference.** FP32 backprop with Adam for training, i16 quantized forward pass with SIMD acceleration for deployment. You don't leave the Rust toolchain.
- **No learned assumptions about chess.** `NnueConfig` decouples `feature_size`, `accumulator_size`, `hidden_sizes`, and the activation function from the binary layout, so the same crate serves a 4096-feature Gomoku encoder and a 138-feature hex-grid encoder without code changes.
- **No dependencies.** Including the RNG (xorshift64). `cargo add noru` just works on any platform Rust supports, including WebAssembly and ARM embedded targets.
- **Deployment-ready.** A `cdylib` build + the `noru::ffi` layer exposes the inference API to Unity, Godot, C#, and C++ so the same trained weights can ship into a game engine without a Python runtime.

The design target is game AI developers who want Stockfish-class evaluation quality for non-chess domains without paying the integration cost of the chess-engine ecosystem.

### Key Features

- **Multi-hidden-layer** — Arbitrary depth networks (e.g. `&[256, 32, 32]`)
- **CReLU + SCReLU** — Clipped ReLU plus Squared Clipped ReLU for a stronger accumulator activation
- **SIMD-accelerated inference** — AVX2 (x86_64), NEON (aarch64), with scalar fallback
- **Training + Inference** — FP32 backpropagation with Adam optimizer, i16 quantized inference
- **Zero dependencies** — Pure Rust, no PyTorch, no CUDA, no C bindings
- **Game-agnostic** — Runtime-configurable network dimensions via `NnueConfig`
- **Incremental updates** — Efficient accumulator add/remove for search trees
- **Quantization** — Automatic FP32 → i16 conversion for deployment
- **Binary format v2** — Versioned model serialization with auto-detection
- **C ABI / FFI layer** — `cdylib` build + `noru::ffi` for embedding in Unity, Godot, C#, C++

## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
noru = "2.1"
```

### Training

```rust
use noru::config::{NnueConfig, Activation};
use noru::trainer::{TrainableWeights, AdamState, Gradients, TrainingSample, SimpleRng};

// 1. Define your network dimensions
let config = NnueConfig::new_static(
    530,               // your game's feature count
    256,               // hidden accumulator neurons
    &[64],             // hidden layer sizes (multi-layer: &[256, 32, 32])
    Activation::CReLU, // or Activation::SCReLU
);

// 2. Initialize weights
let mut rng = SimpleRng::new(42);
let mut weights = TrainableWeights::init_random(config.clone(), &mut rng);
let mut adam = AdamState::new(config.clone());

// 3. Train on samples
let sample = TrainingSample {
    stm_features: vec![0, 42, 100],   // active feature indices (side-to-move)
    nstm_features: vec![10, 50, 200], // active feature indices (opponent)
    target: 0.8,                       // evaluation target
};

let fwd = weights.forward(&sample.stm_features, &sample.nstm_features);
let mut grad = Gradients::new(config);
weights.backward_bce(&sample, &fwd, &mut grad);  // BCE loss, target in [0, 1]
// or for raw eval regression:
// weights.backward_raw_mse(&sample, &fwd, &mut grad);
weights.adam_update(&grad, &mut adam, 0.001, 1.0);

// 4. Quantize for deployment
let inference_weights = weights.quantize(); // FP32 → i16
```

### Inference

```rust
use noru::config::{NnueConfig, Activation};
use noru::network::{NnueWeights, Accumulator, FeatureDelta, forward};

// Load quantized weights (v2 format auto-detected)
let weights = NnueWeights::load_from_bytes(&model_bytes, None)?;

// Or with legacy format (requires config)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);
let weights = NnueWeights::load_from_bytes(&model_bytes, Some(config))?;

// Evaluate a position
let mut acc = Accumulator::new(&weights.feature_bias);
acc.refresh(&weights, &stm_features, &nstm_features);
let eval: i32 = forward(&acc, &weights);

// Incremental update (for search trees)
let delta_stm = FeatureDelta::from_slices(&[new_feature], &[old_feature])?;
let delta_nstm = FeatureDelta::new();
acc.update_incremental(&weights, &delta_stm, &delta_nstm);
```

### Save / Load Models

```rust
// Save
let bytes = weights.save_to_bytes(); // v2 format with NORU header
std::fs::write("model.bin", &bytes)?;

// Load (auto-detects v2 header)
let data = std::fs::read("model.bin")?;
let weights = NnueWeights::load_from_bytes(&data, None)?;
```

## Examples

Runnable examples live in [`examples/`](examples/). Each is a self-contained
binary that runs without a separate game engine:

```sh
# Minimal training → quantization → inference round trip (4-feature toy problem)
cargo run --release --example xor

# Multi-hidden-layer network with SCReLU activation
cargo run --release --example multi_layer

# FP32 → i16 → save → load → inference, reports quantization audit metrics
cargo run --release --example quantize_roundtrip

# Mini board state -> sparse feature extraction -> training/inference loop
cargo run --release --example feature_loop
```

## Applications

NORU has been validated across three games of different branching factors and
feature encodings, which is the primary evidence that the runtime-configurable
design generalizes beyond chess:

- **Gomoku (15×15 Five-in-a-Row).** `figrid-board` v0.4.x ships a pbrain/Piskvork-compatible Gomocup engine (`pbrain-figrid-noru`) built on NORU. Feature set: 4096 (PS + LP-Rich + Compound threats + Density). Configuration: accumulator 512 → hidden 64 → output. Gomocup 2026 submission target. Repo: <https://github.com/nicotina04/figrid-board>.
- **Hex-grid tactical battler.** An auto-extraction RPG combat engine uses NORU for unit-placement evaluation. Feature set: 138 (position-independent, per-class + global). Configuration: accumulator 256 → hidden 64 → output. Demonstrates that non-board-game domains fit the same API.
- **Connect 4.** A minimal second game used as an ablation target to confirm generality; reaches ~45% win rate against a depth-matched heuristic after a few hours of training.

These three share the identical `noru` crate — only `NnueConfig` and the feature extractor differ per domain.
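
Concretely, the Gomoku and hex-grid deployments above differ only in the dimensions passed to `NnueConfig` (sizes taken from the list above; the activation choices shown here are assumptions):

```rust
use noru::config::{NnueConfig, Activation};

// Gomoku: 4096 features → 512 accumulator → 64 hidden → output
let gomoku = NnueConfig::new_static(4096, 512, &[64], Activation::CReLU);

// Hex-grid battler: 138 features → 256 accumulator → 64 hidden → output
let hex_battler = NnueConfig::new_static(138, 256, &[64], Activation::CReLU);
```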

Evidence notes and current public/private artifact status are tracked in
[documents/adoption_evidence.md](documents/adoption_evidence.md) and
[documents/benchmark_inventory.md](documents/benchmark_inventory.md).

## Architecture

```
Input (sparse features)
  ↓
Feature Transform: [feature_size] → [accumulator_size] (per perspective)
  ↓
CReLU or SCReLU
  ↓
Concat: [accumulator_size × 2] (STM + NSTM perspectives)
  ↓
Hidden Layer₁ → CReLU → Hidden Layer₂ → ... → Hidden Layerₙ → CReLU
  ↓
Output Layer → 1 (evaluation score)
```

All dimensions are configured at runtime:

```rust
// Simple (single hidden layer)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);

// Stockfish-style (multi-layer + SCReLU)
let config = NnueConfig::new_static(768, 1024, &[256, 32, 32], Activation::SCReLU);
```
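
The data flow in the diagram can be traced end to end with a tiny self-contained FP32 sketch (illustration only: all sizes and weights here are invented, and noru's real inference path is the quantized i16 one):

```rust
/// Clipped ReLU in FP32.
fn crelu(x: f32) -> f32 {
    x.clamp(0.0, 1.0)
}

/// Feature transform for one perspective: sum the active features'
/// weight columns on top of the bias.
fn feature_transform(cols: &[Vec<f32>], bias: &[f32], active: &[usize]) -> Vec<f32> {
    let mut acc = bias.to_vec();
    for &f in active {
        for (a, w) in acc.iter_mut().zip(&cols[f]) {
            *a += *w;
        }
    }
    acc
}

/// Dense layer with output-major weights (one row per output neuron).
fn dense(rows: &[Vec<f32>], bias: &[f32], input: &[f32]) -> Vec<f32> {
    rows.iter()
        .zip(bias)
        .map(|(row, b)| *b + row.iter().zip(input).map(|(w, x)| w * x).sum::<f32>())
        .collect()
}

fn main() {
    // feature_size = 4, accumulator_size = 2, hidden_sizes = [2], output = 1
    let cols = vec![vec![0.5, -0.5]; 4];
    let ft_bias = vec![0.1, 0.1];

    // One accumulator per perspective, then CReLU + concat.
    let stm = feature_transform(&cols, &ft_bias, &[0, 2]);
    let nstm = feature_transform(&cols, &ft_bias, &[1]);
    let concat: Vec<f32> = stm.iter().chain(&nstm).map(|&x| crelu(x)).collect();

    // Hidden layer → CReLU → scalar output.
    let hidden: Vec<f32> = dense(&vec![vec![0.25; 4]; 2], &[0.0, 0.0], &concat)
        .into_iter()
        .map(crelu)
        .collect();
    let eval = dense(&[vec![1.0; 2]], &[0.0], &hidden)[0];
    println!("eval = {eval}");
}
```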

## SIMD Acceleration

Inference is automatically accelerated on supported platforms:

| Platform | Instruction Set | Width | Auto-detected |
|----------|----------------|-------|---------------|
| x86_64 | AVX2 | 256-bit (16 × i16) | Runtime |
| aarch64 | NEON | 128-bit (8 × i16) | Compile-time |
| Other | Scalar | — | Fallback |

No configuration needed — the fastest available path is selected automatically.
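
The selection follows the usual dispatch pattern (a sketch of the pattern, not noru's actual code): probe AVX2 at runtime on x86_64, take NEON unconditionally on aarch64 where it is part of the baseline ISA, and fall back to scalar loops elsewhere.

```rust
/// Reports which code path this dispatch pattern would pick on the
/// current machine (illustrative; not a noru function).
fn simd_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // AVX2 is optional on x86_64, so it must be detected at runtime.
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on aarch64, so compile-time selection suffices.
        return "neon";
    }
    "scalar"
}

fn main() {
    println!("selected backend: {}", simd_backend());
}
```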

## API Reference

### `noru::config`

| Type | Description |
|------|-------------|
| `NnueConfig` | Network dimensions and activation type (borrowed or owned `hidden_sizes`) |
| `OwnedNnueConfig` | Runtime-constructible variant with `Vec<usize>` hidden sizes; convert via `.into_config()` |
| `Activation` | Activation function enum (`CReLU`, `SCReLU`) |

### `noru::ffi` (C ABI, optional)

NORU is built as a `cdylib` in addition to `rlib`, producing `libnoru.{so,dylib}` / `noru.dll`. The `noru::ffi` module exposes a C ABI surface for embedding in game engines and other non-Rust hosts:

- **Trainer**: `noru_trainer_new / free / forward / backward_bce / backward_raw_mse / zero_grad / adam_step`
- **Accumulator tree-search helpers**: `noru_accumulator_clone / copy_from / update_undo` for alpha-beta without snapshot allocation per node
- **Checkpoint**: `noru_trainer_save_fp32 / load_fp32` (FP32 weight serialization)
- **Quantize**: `noru_trainer_quantize` → `NoruWeights` for inference
- **Inference**: `noru_weights_load / save / free`, `noru_accumulator_new / refresh / update / swap / forward`
- **Errors**: `noru_last_error()` returns a thread-local C string for the most recent failure.

All FFI functions return an `i32` status code (`NORU_OK = 0`, negative values for errors) and catch panics at the boundary. See `src/ffi.rs` for the full surface.

### `noru::audit` (Quantization Drift)

| Type / Function | Description |
|-----------------|-------------|
| `AuditSample` | Borrowed feature lists for audit-only evaluation |
| `FeatureSet` | Trait for reusable STM/NSTM sample adapters |
| `QuantizationReport` | Aggregate sign/range/error metrics for FP32 vs i16 |
| `audit_quantized_model()` | Compare FP32 weights against a quantized model |
| `TrainableWeights::audit_quantization()` | Quantize and audit in one call |
| `NnueWeights::audit_against_fp32()` | Audit a saved/reloaded quantized model |
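
As a self-contained illustration of what such an audit measures (this sketch is not noru's implementation), evaluate the same sparse input through FP32 weights and through their i16-quantized counterparts, then report the drift quantization introduced:

```rust
const SCALE: f32 = 64.0; // plays the role of WEIGHT_SCALE

/// Round FP32 weights to i16 at a fixed scale, saturating out-of-range values.
fn quantize(w: &[f32]) -> Vec<i16> {
    w.iter()
        .map(|&x| (x * SCALE).round().clamp(i16::MIN as f32, i16::MAX as f32) as i16)
        .collect()
}

fn main() {
    let weights = [0.73_f32, -0.11, 0.05, -0.42];
    let active = [0usize, 2, 3]; // sparse active-feature indices

    // FP32 path vs dequantized i16 path over the same input.
    let exact: f32 = active.iter().map(|&i| weights[i]).sum();
    let q = quantize(&weights);
    let q_sum: i32 = active.iter().map(|&i| q[i] as i32).sum();
    let dequant = q_sum as f32 / SCALE;

    let drift = (exact - dequant).abs();
    println!("fp32 = {exact}, dequantized = {dequant}, drift = {drift}");
    // Each weight contributes at most half a quantization step of error.
    assert!(drift <= active.len() as f32 * 0.5 / SCALE);
}
```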

### `noru::network` (Inference, i16)

| Type / Function | Description |
|-----------------|-------------|
| `NnueWeights` | Quantized i16 weights for inference |
| `NnueWeights::load_from_bytes()` | Load weights from binary (v2 auto-detect) |
| `NnueWeights::save_to_bytes()` | Save weights to v2 binary format |
| `Accumulator` | Maintains per-perspective activation sums |
| `Accumulator::refresh()` | Full recomputation from feature list |
| `Accumulator::update_incremental()` | Efficient add/remove update |
| `Accumulator::swap()` | Swap STM/NSTM perspectives |
| `FeatureDelta` | Tracks added/removed features for incremental updates |
| `FeatureDelta::from_slices()` | Checked constructor that rejects overflow instead of truncating |
| `forward()` | Full forward pass: Accumulator → Hidden layers → Output |

### `noru::trainer` (Training, FP32)

| Type / Function | Description |
|-----------------|-------------|
| `TrainableWeights` | FP32 weights with training methods |
| `TrainableWeights::init_random()` | Kaiming initialization |
| `TrainableWeights::forward()` | FP32 forward pass with intermediate results |
| `TrainableWeights::backward_bce()` | Backpropagation (BCE loss) |
| `TrainableWeights::backward_raw_mse()` | Backpropagation (raw-output MSE loss) |
| `TrainableWeights::adam_update()` | Adam optimizer step |
| `TrainableWeights::quantize()` | FP32 → i16 for deployment |
| `AdamState` | Adam optimizer momentum/velocity state |
| `Gradients` | Gradient accumulation buffer |
| `TrainingSample` | Training data (features + target) |
| `SimpleRng` | Built-in xorshift64 RNG (no external dependency) |

## Development

Local checks:

```bash
cargo fmt --check
cargo test
cargo doc --no-deps
cargo package --allow-dirty --list
```

For contribution and support pathways, see [CONTRIBUTING.md](CONTRIBUTING.md),
[CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md), and [CITATION.cff](CITATION.cff).

## Publication

Draft software-paper materials live in [paper.md](paper.md),
[paper.bib](paper.bib), and
[documents/benchmark_inventory.md](documents/benchmark_inventory.md).

## Reproducibility

For reviewer-facing usage examples beyond the toy demos:

- [examples/feature_loop.rs](examples/feature_loop.rs) shows a small
  board-style feature extractor loop on top of NORU's training and inference
  APIs.
- [examples/ffi_embed.c](examples/ffi_embed.c) shows how a non-Rust host can
  call the C ABI directly.
- [documents/adoption_evidence.md](documents/adoption_evidence.md) summarizes
  verified public downstream evidence and clearly marks what is still local.

### `noru::simd`

| Function | Description |
|----------|-------------|
| `vec_add_i16()` | Saturating i16 vector addition |
| `vec_sub_i16()` | Saturating i16 vector subtraction |
| `vec_clipped_relu()` | ClippedReLU activation (clamp to 0..127) |
| `dot_i16_i32()` | i16 dot product with i32 accumulation |
| `dot_screlu_i64()` | SCReLU squared dot product with i64 accumulation |
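
Scalar reference implementations make the table's semantics concrete (assumptions: these mirror the lane-wise behavior only; the exact signatures in `noru::simd` may differ, and the SCReLU clamp range of 0..127 is assumed here):

```rust
/// Lane-wise saturating add: i16::MAX + 1 stays i16::MAX instead of wrapping.
fn vec_add_i16(a: &[i16], b: &[i16]) -> Vec<i16> {
    a.iter().zip(b).map(|(&x, &y)| x.saturating_add(y)).collect()
}

/// Lane-wise saturating subtract.
fn vec_sub_i16(a: &[i16], b: &[i16]) -> Vec<i16> {
    a.iter().zip(b).map(|(&x, &y)| x.saturating_sub(y)).collect()
}

/// ClippedReLU: clamp every lane to 0..127.
fn vec_clipped_relu(v: &[i16]) -> Vec<i16> {
    v.iter().map(|&x| x.clamp(0, 127)).collect()
}

/// i16 dot product with widening i32 accumulation.
fn dot_i16_i32(a: &[i16], b: &[i16]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// SCReLU dot product: square the clipped activation, accumulate in i64.
fn dot_screlu_i64(acc: &[i16], w: &[i16]) -> i64 {
    acc.iter()
        .zip(w)
        .map(|(&a, &wi)| {
            let c = a.clamp(0, 127) as i64;
            c * c * wi as i64
        })
        .sum()
}

fn main() {
    assert_eq!(vec_add_i16(&[i16::MAX, 1], &[1, 2]), vec![i16::MAX, 3]);
    assert_eq!(vec_sub_i16(&[i16::MIN, 5], &[1, 2]), vec![i16::MIN, 3]);
    assert_eq!(vec_clipped_relu(&[-5, 50, 300]), vec![0, 50, 127]);
    assert_eq!(dot_i16_i32(&[2, 3], &[4, 5]), 23);
    assert_eq!(dot_screlu_i64(&[10, -3], &[2, 5]), 200);
    println!("scalar reference semantics hold");
}
```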

### `noru::quant`

| Constant / Function | Description |
|---------------------|-------------|
| `WEIGHT_SCALE` (64) | FP32 → i16 quantization scale |
| `ACTIVATION_SCALE` (256) | Accumulator → Hidden scale |
| `OUTPUT_SCALE` (16) | Final output scale |
| `clipped_relu()` | ClippedReLU activation |
| `screlu_f32()` | Squared ClippedReLU (f32) |
| `saturate_i16()` | Safe i32 → i16 conversion |
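
A minimal sketch of how `WEIGHT_SCALE` and `saturate_i16` interact (the constant values match the table; where exactly noru applies each rescale in its pipeline is not shown here):

```rust
const WEIGHT_SCALE: i32 = 64; // FP32 → i16 quantization scale, per the table

/// Safe i32 → i16 conversion: clamp instead of wrapping on overflow.
fn saturate_i16(x: i32) -> i16 {
    x.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

/// Quantize one FP32 weight: scale, round, saturate.
fn quantize_weight(w: f32) -> i16 {
    saturate_i16((w * WEIGHT_SCALE as f32).round() as i32)
}

fn main() {
    assert_eq!(quantize_weight(0.5), 32);          // 0.5 * 64
    assert_eq!(quantize_weight(-1.25), -80);       // -1.25 * 64
    assert_eq!(quantize_weight(9999.0), i16::MAX); // saturates, never wraps
    assert_eq!(saturate_i16(-40_000), i16::MIN);

    // Round-trip error is at most half a quantization step.
    let w = 0.3_f32;
    let back = quantize_weight(w) as f32 / WEIGHT_SCALE as f32;
    assert!((w - back).abs() <= 0.5 / WEIGHT_SCALE as f32);
    println!("quantization round-trip within half a step");
}
```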

## Building

```bash
# Library
cargo build --release

# Run tests
cargo test

# Generate documentation
cargo doc --open
```

## Design Decisions

- **No GPU** — Designed for real-time game AI on CPU. NNUE's strength is being fast enough for depth-4+ search on consumer hardware.
- **No external dependencies** — Even the RNG is built-in (xorshift64). This means `cargo add noru` just works, everywhere.
- **SCReLU on first layer only** — Following the Stockfish pattern, SCReLU is applied to the accumulator output. Subsequent hidden layers always use CReLU to avoid numerical issues in narrow layers.
- **Output-major weight layout** — Hidden layer weights are stored transposed (output-major) for contiguous SIMD memory access in dot products.
- **Vec\<T\> over fixed arrays** — All weights use heap-allocated vectors for runtime flexibility. Slight overhead vs compile-time arrays, but enables one binary for any game.
- **Sparse feature input** — Features are passed as active index lists, not dense vectors. This matches NNUE's design for board games where most features are inactive.
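
To illustrate the sparse-input convention, here is a hypothetical encoder for a toy 3×3 board (the encoding scheme is invented for this sketch, not taken from any of the games above): only occupied cells produce feature indices, so the input stays short even when the feature space is large.

```rust
const CELLS: usize = 9; // toy 3×3 board

#[derive(Clone, Copy)]
enum Cell {
    Empty,
    Mine,
    Theirs,
}

/// Hypothetical piece-square encoding: feature index = cell * 2,
/// plus 1 if the stone belongs to the opponent. Empty cells emit nothing.
fn extract_features(board: &[Cell; CELLS]) -> Vec<usize> {
    board
        .iter()
        .enumerate()
        .filter_map(|(i, c)| match c {
            Cell::Empty => None,
            Cell::Mine => Some(i * 2),
            Cell::Theirs => Some(i * 2 + 1),
        })
        .collect()
}

fn main() {
    let mut board = [Cell::Empty; CELLS];
    board[0] = Cell::Mine;
    board[4] = Cell::Theirs;
    // Only 2 of the 18 possible features are active — the input is sparse.
    let features = extract_features(&board);
    assert_eq!(features, vec![0, 9]);
    println!("active features: {:?}", features);
}
```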

## License

Licensed under either of

- [MIT License](LICENSE-MIT)
- [Apache License, Version 2.0](LICENSE-APACHE)

at your option.

## Related Projects

- [Stockfish NNUE](https://github.com/official-stockfish/Stockfish) — The chess engine that popularized NNUE
- [bullet](https://github.com/jw1912/bullet) — GPU-accelerated NNUE training (Rust + CUDA)
- [Rapfi](https://github.com/dhbloo/rapfi) — Gomoku engine with advanced NNUE