
# NORU
**N**NUE **O**n **RU**st — Zero-dependency NNUE training & inference library in pure Rust.
## What is NNUE?
[NNUE](https://www.chessprogramming.org/NNUE) (Efficiently Updatable Neural Network) is a neural network architecture designed for fast evaluation in game engines. Originally developed for Shogi and adopted by Stockfish, NNUE enables real-time neural network inference through incremental accumulator updates.
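The incremental-update idea can be sketched with a toy one-column-per-feature accumulator. This is illustrative code only, not NORU's API — `refresh` and `update` here are made-up names:

```rust
// Toy illustration of NNUE's incremental accumulator update.
// A full refresh sums the weight columns of every active feature;
// when a move flips only a few features, the accumulator is patched
// by adding/subtracting just those columns instead of recomputing.

const ACC: usize = 4; // accumulator width (toy size)

fn refresh(weights: &[[i16; ACC]], active: &[usize]) -> [i16; ACC] {
    let mut acc = [0i16; ACC];
    for &f in active {
        for i in 0..ACC {
            acc[i] += weights[f][i];
        }
    }
    acc
}

fn update(acc: &mut [i16; ACC], weights: &[[i16; ACC]], added: &[usize], removed: &[usize]) {
    for &f in added {
        for i in 0..ACC {
            acc[i] += weights[f][i];
        }
    }
    for &f in removed {
        for i in 0..ACC {
            acc[i] -= weights[f][i];
        }
    }
}

fn main() {
    let w = vec![[1i16, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]];
    let mut acc = refresh(&w, &[0, 1]); // features {0, 1} active
    update(&mut acc, &w, &[2], &[0]);   // move: add feature 2, remove feature 0
    assert_eq!(acc, refresh(&w, &[1, 2])); // matches a full refresh of {1, 2}
}
```

Patching touches O(changed features × accumulator width) entries rather than O(all active features × width), which is what makes per-node evaluation cheap inside a search tree.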
## What is NORU?
NORU is a **game-agnostic** NNUE library that provides both training and inference in a single, dependency-free Rust crate. Configure your network dimensions at runtime via `NnueConfig` — no recompilation needed.
### Why this library?
Most NNUE code in the wild lives inside a specific chess engine (Stockfish, Rapfi, …) and is hard-wired to that engine's feature layout in C++. Applying NNUE to a different game — Gomoku, Connect 4, a tactical hex-grid battler — traditionally means either forking one of those engines and rewriting its feature encoder, or re-implementing training from scratch in Python with PyTorch and then writing a separate C++ inference path for deployment.
NORU collapses that pipeline into one pure-Rust crate:
- **One crate for training and inference.** FP32 backprop with Adam for training, i16 quantized forward pass with SIMD acceleration for deployment. You don't leave the Rust toolchain.
- **No baked-in assumptions about chess.** `NnueConfig` decouples `feature_size`, `accumulator_size`, `hidden_sizes`, and the activation function from the binary layout, so the same crate serves a 4096-feature Gomoku encoder and a 138-feature hex-grid encoder without code changes.
- **No dependencies.** Even the RNG (xorshift64) is built in. `cargo add noru` just works on any platform Rust supports, including WebAssembly and ARM embedded targets.
- **Deployment-ready.** A `cdylib` build + the `noru::ffi` layer exposes the inference API to Unity, Godot, C#, and C++ so the same trained weights can ship into a game engine without a Python runtime.
The design target is game AI developers who want Stockfish-class evaluation quality for non-chess domains without paying the integration cost of the chess-engine ecosystem.
### Key Features
- **Multi-hidden-layer** — Networks of arbitrary depth (e.g. `&[256, 32, 32]`)
- **CReLU + SCReLU** — Clipped ReLU plus Squared Clipped ReLU for a stronger accumulator activation
- **SIMD-accelerated inference** — AVX2 (x86_64), NEON (aarch64), with scalar fallback
- **Training + Inference** — FP32 backpropagation with Adam optimizer, i16 quantized inference
- **Zero dependencies** — Pure Rust, no PyTorch, no CUDA, no C bindings
- **Game-agnostic** — Runtime-configurable network dimensions via `NnueConfig`
- **Incremental updates** — Efficient accumulator add/remove for search trees
- **Quantization** — Automatic FP32 → i16 conversion for deployment
- **Binary format v2** — Versioned model serialization with auto-detection
- **C ABI / FFI layer** — `cdylib` build + `noru::ffi` for embedding in Unity, Godot, C#, C++
## Quick Start
Add to your `Cargo.toml`:
```toml
[dependencies]
noru = "2.0"
```
### Training
```rust
use noru::config::{NnueConfig, Activation};
use noru::trainer::{TrainableWeights, AdamState, Gradients, TrainingSample, SimpleRng};
// 1. Define your network dimensions
let config = NnueConfig::new_static(
    530,               // your game's feature count
    256,               // hidden accumulator neurons
    &[64],             // hidden layer sizes (multi-layer: &[256, 32, 32])
    Activation::CReLU, // or Activation::SCReLU
);
// 2. Initialize weights
let mut rng = SimpleRng::new(42);
let mut weights = TrainableWeights::init_random(config.clone(), &mut rng);
let mut adam = AdamState::new(config.clone());
// 3. Train on samples
let sample = TrainingSample {
    stm_features: vec![0, 42, 100],   // active feature indices (side-to-move)
    nstm_features: vec![10, 50, 200], // active feature indices (opponent)
    target: 0.8,                      // evaluation target
};
let fwd = weights.forward(&sample.stm_features, &sample.nstm_features);
let mut grad = Gradients::new(config);
weights.backward_bce(&sample, &fwd, &mut grad); // BCE loss, target in [0, 1]
// or for raw eval regression:
// weights.backward_raw_mse(&sample, &fwd, &mut grad);
weights.adam_update(&grad, &mut adam, 0.001, 1.0);
// 4. Quantize for deployment
let inference_weights = weights.quantize(); // FP32 → i16
```
### Inference
```rust
use noru::config::{NnueConfig, Activation};
use noru::network::{NnueWeights, Accumulator, FeatureDelta, forward};
// Load quantized weights (v2 format auto-detected)
let weights = NnueWeights::load_from_bytes(&model_bytes, None)?;
// Or with legacy format (requires config)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);
let weights = NnueWeights::load_from_bytes(&model_bytes, Some(config))?;
// Evaluate a position
let mut acc = Accumulator::new(&weights.feature_bias);
acc.refresh(&weights, &stm_features, &nstm_features);
let eval: i32 = forward(&acc, &weights);
// Incremental update (for search trees)
let delta_stm = FeatureDelta::from_slices(&[new_feature], &[old_feature])?;
let delta_nstm = FeatureDelta::new();
acc.update_incremental(&weights, &delta_stm, &delta_nstm);
```
### Save / Load Models
```rust
// Save
let bytes = weights.save_to_bytes(); // v2 format with NORU header
std::fs::write("model.bin", &bytes)?;
// Load (auto-detects v2 header)
let data = std::fs::read("model.bin")?;
let weights = NnueWeights::load_from_bytes(&data, None)?;
```
## Examples
Runnable examples live in [`examples/`](examples/). Each is a self-contained
binary that runs without a separate game engine:
```sh
# Minimal training → quantization → inference round trip (4-feature toy problem)
cargo run --release --example xor
# Multi-hidden-layer network with SCReLU activation
cargo run --release --example multi_layer
# FP32 → i16 → save → load → inference, reports quantization audit metrics
cargo run --release --example quantize_roundtrip
# Mini board state → sparse feature extraction → training/inference loop
cargo run --release --example feature_loop
```
## Applications
NORU has been validated across three domains with different branching factors and
feature encodings, which is the primary evidence that the runtime-configurable
design generalizes beyond chess:
- **Gomoku (15×15 Five-in-a-Row).** `figrid-board` v0.4.x ships a pbrain/Piskvork-compatible Gomocup engine (`pbrain-figrid-noru`) built on NORU. Feature set: 4096 (PS + LP-Rich + Compound threats + Density). Configuration: accumulator 512 → hidden 64 → output. Gomocup 2026 submission target. Repo: <https://github.com/nicotina04/figrid-board>.
- **Hex-grid tactical battler.** An auto-extraction RPG combat engine uses NORU for unit-placement evaluation. Feature set: 138 (position-independent, per-class + global). Configuration: accumulator 256 → hidden 64 → output. Demonstrates that non-board-game domains fit the same API.
- **Connect 4.** A minimal second game used as an ablation target to confirm generality; reaches ~45% win rate against a depth-matched heuristic after a few hours of training.
These three share the identical `noru` crate — only `NnueConfig` and the feature extractor differ per domain.
Evidence notes and current public/private artifact status are tracked in
[documents/adoption_evidence.md](documents/adoption_evidence.md) and
[documents/benchmark_inventory.md](documents/benchmark_inventory.md).
## Architecture
```
Input (sparse features)
↓
Feature Transform: [feature_size] → [accumulator_size] (per perspective)
↓
CReLU or SCReLU
↓
Concat: [accumulator_size × 2] (STM + NSTM perspectives)
↓
Hidden Layer₁ → CReLU → Hidden Layer₂ → ... → Hidden Layerₙ → CReLU
↓
Output Layer → 1 (evaluation score)
```
All dimensions are configured at runtime:
```rust
// Simple (single hidden layer)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);
// Stockfish-style (multi-layer + SCReLU)
let config = NnueConfig::new_static(768, 1024, &[256, 32, 32], Activation::SCReLU);
```
## SIMD Acceleration
Inference is automatically accelerated on supported platforms:
| Platform | SIMD | Vector width | Detection |
|----------|------|--------------|-----------|
| x86_64 | AVX2 | 256-bit (16 × i16) | Runtime |
| aarch64 | NEON | 128-bit (8 × i16) | Compile-time |
| Other | Scalar | — | Fallback |
No configuration needed — the fastest available path is selected automatically.
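The runtime dispatch on x86_64 can be sketched generically as below. This is not NORU's internal code, and the AVX2 branch is stubbed to the scalar path for brevity — it only shows where an intrinsics path would hang off the CPUID probe:

```rust
// Generic sketch of runtime SIMD dispatch on x86_64.
// `is_x86_feature_detected!` probes CPUID at runtime; on other
// architectures the cfg block compiles out and the scalar path runs.

fn add_i16_scalar(a: &[i16], b: &[i16], out: &mut [i16]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x.saturating_add(y); // saturating, matching quantized i16 math
    }
}

fn add_i16(a: &[i16], b: &[i16], out: &mut [i16]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // a real implementation would call an AVX2 intrinsics path here
            return add_i16_scalar(a, b, out);
        }
    }
    add_i16_scalar(a, b, out)
}

fn main() {
    let a = [1i16, 2, i16::MAX];
    let b = [10i16, 20, 1];
    let mut out = [0i16; 3];
    add_i16(&a, &b, &mut out);
    assert_eq!(out, [11, 22, i16::MAX]); // overflow clamps instead of wrapping
}
```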
## API Reference
### `noru::config`
| Item | Description |
|------|-------------|
| `NnueConfig` | Network dimensions and activation type (borrowed or owned `hidden_sizes`) |
| `OwnedNnueConfig` | Runtime-constructible variant with `Vec<usize>` hidden sizes; convert via `.into_config()` |
| `Activation` | Activation function enum (`CReLU`, `SCReLU`) |
### `noru::ffi` (C ABI, optional)
NORU is built as a `cdylib` in addition to `rlib`, producing `libnoru.{so,dylib}` / `noru.dll`. The `noru::ffi` module exposes a C ABI surface for embedding in game engines and other non-Rust hosts:
- **Trainer**: `noru_trainer_new / free / forward / backward_bce / backward_raw_mse / zero_grad / adam_step`
- **Accumulator tree-search helpers**: `noru_accumulator_clone / copy_from / update_undo` for alpha-beta without snapshot allocation per node
- **Checkpoint**: `noru_trainer_save_fp32 / load_fp32` (FP32 weight serialization)
- **Quantize**: `noru_trainer_quantize` → `NoruWeights` for inference
- **Inference**: `noru_weights_load / save / free`, `noru_accumulator_new / refresh / update / swap / forward`
- **Errors**: `noru_last_error()` returns a thread-local C string for the most recent failure.
All FFI functions return an `i32` status code (`NORU_OK = 0`, negative values for errors) and catch panics at the boundary. See `src/ffi.rs` for the full surface.
### `noru::audit` (Quantization Drift)
| Item | Description |
|------|-------------|
| `AuditSample` | Borrowed feature lists for audit-only evaluation |
| `FeatureSet` | Trait for reusable STM/NSTM sample adapters |
| `QuantizationReport` | Aggregate sign/range/error metrics for FP32 vs i16 |
| `audit_quantized_model()` | Compare FP32 weights against a quantized model |
| `TrainableWeights::audit_quantization()` | Quantize and audit in one call |
| `NnueWeights::audit_against_fp32()` | Audit a saved/reloaded quantized model |
### `noru::network` (Inference, i16)
| Item | Description |
|------|-------------|
| `NnueWeights` | Quantized i16 weights for inference |
| `NnueWeights::load_from_bytes()` | Load weights from binary (v2 auto-detect) |
| `NnueWeights::save_to_bytes()` | Save weights to v2 binary format |
| `Accumulator` | Maintains per-perspective activation sums |
| `Accumulator::refresh()` | Full recomputation from feature list |
| `Accumulator::update_incremental()` | Efficient add/remove update |
| `Accumulator::swap()` | Swap STM/NSTM perspectives |
| `FeatureDelta` | Tracks added/removed features for incremental updates |
| `FeatureDelta::from_slices()` | Checked constructor that rejects overflow instead of truncating |
| `forward()` | Full forward pass: Accumulator → Hidden layers → Output |
### `noru::trainer` (Training, FP32)
| Item | Description |
|------|-------------|
| `TrainableWeights` | FP32 weights with training methods |
| `TrainableWeights::init_random()` | Kaiming initialization |
| `TrainableWeights::forward()` | FP32 forward pass with intermediate results |
| `TrainableWeights::backward_bce()` | Backpropagation (BCE loss) |
| `TrainableWeights::backward_raw_mse()` | Backpropagation (raw-output MSE loss) |
| `TrainableWeights::adam_update()` | Adam optimizer step |
| `TrainableWeights::quantize()` | FP32 → i16 for deployment |
| `AdamState` | Adam optimizer momentum/velocity state |
| `Gradients` | Gradient accumulation buffer |
| `TrainingSample` | Training data (features + target) |
| `SimpleRng` | Built-in xorshift64 RNG (no external dependency) |
## Development
Local checks:
```bash
cargo fmt --check
cargo test
cargo doc --no-deps
cargo package --allow-dirty --list
```
For contribution and support pathways, see [CONTRIBUTING.md](CONTRIBUTING.md),
[CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md), and [CITATION.cff](CITATION.cff).
## Publication
Draft software-paper materials live in [paper.md](paper.md),
[paper.bib](paper.bib), and
[documents/benchmark_inventory.md](documents/benchmark_inventory.md).
## Reproducibility
For reviewer-facing usage examples beyond the toy demos:
- [examples/feature_loop.rs](examples/feature_loop.rs) shows a small
board-style feature extractor loop on top of NORU's training and inference
APIs.
- [examples/ffi_embed.c](examples/ffi_embed.c) shows how a non-Rust host can
call the C ABI directly.
- [documents/adoption_evidence.md](documents/adoption_evidence.md) summarizes
verified public downstream evidence and clearly marks what is still local.
### `noru::simd`
| Item | Description |
|------|-------------|
| `vec_add_i16()` | Saturating i16 vector addition |
| `vec_sub_i16()` | Saturating i16 vector subtraction |
| `vec_clipped_relu()` | ClippedReLU activation (clamp to 0..127) |
| `dot_i16_i32()` | i16 dot product with i32 accumulation |
| `dot_screlu_i64()` | SCReLU squared dot product with i64 accumulation |
### `noru::quant`
| Item | Description |
|------|-------------|
| `WEIGHT_SCALE` (64) | FP32 → i16 quantization scale |
| `ACTIVATION_SCALE` (256) | Accumulator → Hidden scale |
| `OUTPUT_SCALE` (16) | Final output scale |
| `clipped_relu()` | ClippedReLU activation |
| `screlu_f32()` | Squared ClippedReLU (f32) |
| `saturate_i16()` | Safe i32 → i16 conversion |
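The arithmetic these constants imply can be sketched as follows. These helpers are local reimplementations for illustration under the documented behavior (weights scaled by 64, activations clamped to 0..127, saturating i32 → i16), not calls into NORU itself:

```rust
// Sketch of the quantization arithmetic implied by the scales above.
// Locally reimplemented for illustration; not NORU's exports.

const WEIGHT_SCALE: f32 = 64.0;

// Saturate an i32 into i16 range instead of wrapping on overflow.
fn saturate_i16(x: i32) -> i16 {
    x.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

// FP32 weight → fixed-point i16: multiply by the scale, round, saturate.
fn quantize_weight(w: f32) -> i16 {
    saturate_i16((w * WEIGHT_SCALE).round() as i32)
}

// ClippedReLU in the quantized domain: clamp to the 0..127 activation range.
fn clipped_relu(x: i32) -> i32 {
    x.clamp(0, 127)
}

fn main() {
    assert_eq!(quantize_weight(0.5), 32);          // 0.5 × 64
    assert_eq!(quantize_weight(1000.0), i16::MAX); // saturates, never wraps
    assert_eq!(clipped_relu(-5), 0);
    assert_eq!(clipped_relu(300), 127);
}
```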
## Building
```bash
# Library
cargo build --release
# Run tests
cargo test
# Generate documentation
cargo doc --open
```
## Design Decisions
- **No GPU** — Designed for real-time game AI on CPU. NNUE's strength is being fast enough for depth-4+ search on consumer hardware.
- **No external dependencies** — Even the RNG is built-in (xorshift64). This means `cargo add noru` just works, everywhere.
- **SCReLU on first layer only** — Following the Stockfish pattern, SCReLU is applied to the accumulator output. Subsequent hidden layers always use CReLU to avoid numerical issues in narrow layers.
- **Output-major weight layout** — Hidden layer weights are stored transposed (output-major) for contiguous SIMD memory access in dot products.
- **Vec\<T\> over fixed arrays** — All weights use heap-allocated vectors for runtime flexibility. Slight overhead vs compile-time arrays, but enables one binary for any game.
- **Sparse feature input** — Features are passed as active index lists, not dense vectors. This matches NNUE's design for board games where most features are inactive.
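As an illustration of the sparse-input convention, a toy 3×3 board extractor might look like this. The board encoding and `extract_features` are hypothetical, not part of NORU; the point is that only active indices are emitted, never a dense 0/1 vector:

```rust
// Hypothetical sparse feature extraction for a tiny 3×3 board.
// Cell values: 0 = empty, 1..=2 = piece types. Each piece type gets
// its own 9-square "plane" of feature indices.

fn extract_features(board: &[[u8; 3]; 3]) -> Vec<usize> {
    let mut active = Vec::new();
    for (r, row) in board.iter().enumerate() {
        for (c, &cell) in row.iter().enumerate() {
            if cell != 0 {
                // feature index = piece-type plane offset + square index
                let plane = (cell as usize - 1) * 9;
                active.push(plane + r * 3 + c);
            }
        }
    }
    active
}

fn main() {
    let board = [[1, 0, 0], [0, 2, 0], [0, 0, 1]];
    // Three pieces on the board → three active indices, not a length-18 vector.
    assert_eq!(extract_features(&board), vec![0, 13, 8]);
}
```

The resulting index lists plug directly into `stm_features` / `nstm_features` on the training side and `Accumulator::refresh` on the inference side.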
## License
Licensed under either of
- [MIT License](LICENSE-MIT)
- [Apache License, Version 2.0](LICENSE-APACHE)
at your option.
## Related Projects
- [Stockfish NNUE](https://github.com/official-stockfish/Stockfish) — The chess engine that popularized NNUE
- [bullet](https://github.com/jw1912/bullet) — GPU-accelerated NNUE training (Rust + CUDA)
- [Rapfi](https://github.com/dhbloo/rapfi) — Gomoku engine with advanced NNUE