NORU

NNUE On RUst — Zero-dependency NNUE training & inference library in pure Rust.

What is NNUE?

NNUE (Efficiently Updatable Neural Network) is a neural network architecture designed for fast evaluation in game engines. Originally developed for Shogi and adopted by Stockfish, NNUE enables real-time neural network inference through incremental accumulator updates.

What is NORU?

NORU is a game-agnostic NNUE library that provides both training and inference in a single, dependency-free Rust crate. Configure your network dimensions at runtime via NnueConfig — no recompilation needed.

Why this library?

Most NNUE code in the wild lives inside a specific chess engine (Stockfish, Rapfi, …) and is hard-wired to that engine's feature layout in C++. Applying NNUE to a different game — Gomoku, Connect 4, a tactical hex-grid battler — traditionally means either forking one of those engines and rewriting its feature encoder, or re-implementing training from scratch in Python with PyTorch and then writing a separate C++ inference path for deployment.

NORU collapses that pipeline into one pure-Rust crate:

One crate for training and inference. FP32 backprop with Adam for training, i16 quantized forward pass with SIMD acceleration for deployment. You don't leave the Rust toolchain.
No learned assumptions about chess. NnueConfig decouples feature_size, accumulator_size, hidden_sizes, and the activation function from the binary layout, so the same crate serves a 4096-feature Gomoku encoder and a 138-feature hex-grid encoder without code changes.
No dependencies. Including the RNG (xorshift64). cargo add noru just works on any platform Rust supports, including WebAssembly and ARM embedded targets.
Deployment-ready. A cdylib build + the noru::ffi layer exposes the inference API to Unity, Godot, C#, and C++ so the same trained weights can ship into a game engine without a Python runtime.

The design target is game AI developers who want Stockfish-class evaluation quality for non-chess domains without paying the integration cost of the chess-engine ecosystem.

Key Features

Multi-hidden-layer — Arbitrary depth networks (e.g. &[256, 32, 32])
CReLU + SCReLU — Squared Clipped ReLU for stronger accumulator activation
SIMD-accelerated inference — AVX2 (x86_64), NEON (aarch64), with scalar fallback
Training + Inference — FP32 backpropagation with Adam optimizer, i16 quantized inference
Zero dependencies — Pure Rust, no PyTorch, no CUDA, no C bindings
Game-agnostic — Runtime-configurable network dimensions via NnueConfig
Incremental updates — Efficient accumulator add/remove for search trees
Quantization — Automatic FP32 → i16 conversion for deployment
Binary format v2 — Versioned model serialization with auto-detection
C ABI / FFI layer — cdylib build + noru::ffi for embedding in Unity, Godot, C#, C++

Quick Start

Add to your Cargo.toml:

[dependencies]
noru = "2.0"

Training

use noru::config::{NnueConfig, Activation};
use noru::trainer::{TrainableWeights, AdamState, Gradients, TrainingSample, SimpleRng};

// 1. Define your network dimensions
let config = NnueConfig::new_static(
    530,               // your game's feature count
    256,               // hidden accumulator neurons
    &[64],             // hidden layer sizes (multi-layer: &[256, 32, 32])
    Activation::CReLU, // or Activation::SCReLU
);

// 2. Initialize weights
let mut rng = SimpleRng::new(42);
let mut weights = TrainableWeights::init_random(config.clone(), &mut rng);
let mut adam = AdamState::new(config.clone());

// 3. Train on samples
let sample = TrainingSample {
    stm_features: vec![0, 42, 100],   // active feature indices (side-to-move)
    nstm_features: vec![10, 50, 200], // active feature indices (opponent)
    target: 0.8,                       // evaluation target
};

let fwd = weights.forward(&sample.stm_features, &sample.nstm_features);
let mut grad = Gradients::new(config);
weights.backward_bce(&sample, &fwd, &mut grad);  // BCE loss, target in [0, 1]
// or for raw eval regression:
// weights.backward_raw_mse(&sample, &fwd, &mut grad);
weights.adam_update(&grad, &mut adam, 0.001, 1.0);

// 4. Quantize for deployment
let inference_weights = weights.quantize(); // FP32 → i16

Inference

use noru::config::{NnueConfig, Activation};
use noru::network::{NnueWeights, Accumulator, FeatureDelta, forward};

// Load quantized weights (v2 format auto-detected)
let weights = NnueWeights::load_from_bytes(&model_bytes, None)?;

// Or with legacy format (requires config)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);
let weights = NnueWeights::load_from_bytes(&model_bytes, Some(config))?;

// Evaluate a position
let mut acc = Accumulator::new(&weights.feature_bias);
acc.refresh(&weights, &stm_features, &nstm_features);
let eval: i32 = forward(&acc, &weights);

// Incremental update (for search trees)
let delta_stm = FeatureDelta::from_slices(&[new_feature], &[old_feature])?;
let delta_nstm = FeatureDelta::new();
acc.update_incremental(&weights, &delta_stm, &delta_nstm);

Save / Load Models

// Save
let bytes = weights.save_to_bytes(); // v2 format with NORU header
std::fs::write("model.bin", &bytes)?;

// Load (auto-detects v2 header)
let data = std::fs::read("model.bin")?;
let weights = NnueWeights::load_from_bytes(&data, None)?;

Examples

Runnable examples live in examples/. Each is a self-contained binary you can clone and run without a separate game engine:

# Minimal training → quantization → inference round trip (4-feature toy problem)
cargo run --release --example xor

# Multi-hidden-layer network with SCReLU activation
cargo run --release --example multi_layer

# FP32 → i16 → save → load → inference, reports quantization audit metrics
cargo run --release --example quantize_roundtrip

# Mini board state -> sparse feature extraction -> training/inference loop
cargo run --release --example feature_loop

Applications

NORU has been validated across three games of different branching factors and feature encodings, which is the primary evidence that the runtime-configurable design generalizes beyond chess:

Gomoku (15×15 Five-in-a-Row). figrid-board v0.4.x ships a pbrain/Piskvork-compatible Gomocup engine (pbrain-figrid-noru) built on NORU. Feature set: 4096 (PS + LP-Rich + Compound threats + Density). Configuration: accumulator 512 → hidden 64 → output. Gomocup 2026 submission target. Repo: https://github.com/nicotina04/figrid-board.
Hex-grid tactical battler. An auto-extraction RPG combat engine uses NORU for unit-placement evaluation. Feature set: 138 (position-independent, per-class + global). Configuration: accumulator 256 → hidden 64 → output. Demonstrates that non-board-game domains fit the same API.
Connect 4. A minimal second game used as an ablation target to confirm generality; reaches ~45% win rate against a depth-matched heuristic after a few hours of training.

These three share the identical noru crate — only NnueConfig and the feature extractor differ per domain.

Evidence notes and current public/private artifact status are tracked in documents/adoption_evidence.md and documents/benchmark_inventory.md.

Architecture

Input (sparse features)
  ↓
Feature Transform: [feature_size] → [accumulator_size] (per perspective)
  ↓
CReLU or SCReLU
  ↓
Concat: [accumulator_size × 2] (STM + NSTM perspectives)
  ↓
Hidden Layer₁ → CReLU → Hidden Layer₂ → ... → Hidden Layerₙ → CReLU
  ↓
Output Layer → 1 (evaluation score)

All dimensions are configured at runtime:

// Simple (single hidden layer)
let config = NnueConfig::new_static(530, 256, &[64], Activation::CReLU);

// Stockfish-style (multi-layer + SCReLU)
let config = NnueConfig::new_static(768, 1024, &[256, 32, 32], Activation::SCReLU);

SIMD Acceleration

Inference is automatically accelerated on supported platforms:

Platform	Instruction Set	Width	Auto-detected
x86_64	AVX2	256-bit (16 × i16)	Runtime
aarch64	NEON	128-bit (8 × i16)	Compile-time
Other	Scalar	—	Fallback

No configuration needed — the fastest available path is selected automatically.

API Reference

`noru::config`

Type	Description
`NnueConfig`	Network dimensions and activation type (borrowed or owned `hidden_sizes`)
`OwnedNnueConfig`	Runtime-constructible variant with `Vec<usize>` hidden sizes; convert via `.into_config()`
`Activation`	Activation function enum (`CReLU`, `SCReLU`)

`noru::ffi` (C ABI, optional)

NORU is built as a cdylib in addition to rlib, producing libnoru.{so,dylib} / noru.dll. The noru::ffi module exposes a C ABI surface for embedding in game engines and other non-Rust hosts:

Trainer: noru_trainer_new / free / forward / backward_bce / backward_raw_mse / zero_grad / adam_step
Accumulator tree-search helpers: noru_accumulator_clone / copy_from / update_undo for alpha-beta without snapshot allocation per node
Checkpoint: noru_trainer_save_fp32 / load_fp32 (FP32 weight serialization)
Quantize: noru_trainer_quantize → NoruWeights for inference
Inference: noru_weights_load / save / free, noru_accumulator_new / refresh / update / swap / forward
Errors: noru_last_error() returns a thread-local C string for the most recent failure.

All FFI functions return an i32 status code (NORU_OK = 0, negative values for errors) and catch panics at the boundary. See src/ffi.rs for the full surface.

`noru::audit` (Quantization Drift)

Type / Function	Description
`AuditSample`	Borrowed feature lists for audit-only evaluation
`FeatureSet`	Trait for reusable STM/NSTM sample adapters
`QuantizationReport`	Aggregate sign/range/error metrics for FP32 vs i16
`audit_quantized_model()`	Compare FP32 weights against a quantized model
`TrainableWeights::audit_quantization()`	Quantize and audit in one call
`NnueWeights::audit_against_fp32()`	Audit a saved/reloaded quantized model

`noru::network` (Inference, i16)

Type / Function	Description
`NnueWeights`	Quantized i16 weights for inference
`NnueWeights::load_from_bytes()`	Load weights from binary (v2 auto-detect)
`NnueWeights::save_to_bytes()`	Save weights to v2 binary format
`Accumulator`	Maintains per-perspective activation sums
`Accumulator::refresh()`	Full recomputation from feature list
`Accumulator::update_incremental()`	Efficient add/remove update
`Accumulator::swap()`	Swap STM/NSTM perspectives
`FeatureDelta`	Tracks added/removed features for incremental updates
`FeatureDelta::from_slices()`	Checked constructor that rejects overflow instead of truncating
`forward()`	Full forward pass: Accumulator → Hidden layers → Output

`noru::trainer` (Training, FP32)

Type / Function	Description
`TrainableWeights`	FP32 weights with training methods
`TrainableWeights::init_random()`	Kaiming initialization
`TrainableWeights::forward()`	FP32 forward pass with intermediate results
`TrainableWeights::backward_bce()`	Backpropagation (BCE loss)
`TrainableWeights::backward_raw_mse()`	Backpropagation (raw-output MSE loss)
`TrainableWeights::adam_update()`	Adam optimizer step
`TrainableWeights::quantize()`	FP32 → i16 for deployment
`AdamState`	Adam optimizer momentum/velocity state
`Gradients`	Gradient accumulation buffer
`TrainingSample`	Training data (features + target)
`SimpleRng`	Built-in xorshift64 RNG (no external dependency)

Development

Local checks:

cargo fmt --check
cargo test
cargo doc --no-deps
cargo package --allow-dirty --list

For contribution and support pathways, see CONTRIBUTING.md, CODE_OF_CONDUCT.md, and CITATION.cff.

Publication

Draft software-paper materials live in paper.md, paper.bib, and documents/benchmark_inventory.md.

Reproducibility

For reviewer-facing usage examples beyond the toy demos:

examples/feature_loop.rs shows a small board-style feature extractor loop on top of NORU's training and inference APIs.
examples/ffi_embed.c shows how a non-Rust host can call the C ABI directly.
documents/adoption_evidence.md summarizes verified public downstream evidence and clearly marks what is still local.

`noru::simd`

Function	Description
`vec_add_i16()`	Saturating i16 vector addition
`vec_sub_i16()`	Saturating i16 vector subtraction
`vec_clipped_relu()`	ClippedReLU activation (clamp to 0..127)
`dot_i16_i32()`	i16 dot product with i32 accumulation
`dot_screlu_i64()`	SCReLU squared dot product with i64 accumulation

`noru::quant`

Constant / Function	Description
`WEIGHT_SCALE` (64)	FP32 → i16 quantization scale
`ACTIVATION_SCALE` (256)	Accumulator → Hidden scale
`OUTPUT_SCALE` (16)	Final output scale
`clipped_relu()`	ClippedReLU activation
`screlu_f32()`	Squared ClippedReLU (f32)
`saturate_i16()`	Safe i32 → i16 conversion

Building

# Library
cargo build --release

# Run tests
cargo test

# Generate documentation
cargo doc --open

Design Decisions

No GPU — Designed for real-time game AI on CPU. NNUE's strength is being fast enough for depth-4+ search on consumer hardware.
No external dependencies — Even the RNG is built-in (xorshift64). This means cargo add noru just works, everywhere.
SCReLU on first layer only — Following the Stockfish pattern, SCReLU is applied to the accumulator output. Subsequent hidden layers always use CReLU to avoid numerical issues in narrow layers.
Output-major weight layout — Hidden layer weights are stored transposed (output-major) for contiguous SIMD memory access in dot products.
Vec<T> over fixed arrays — All weights use heap-allocated vectors for runtime flexibility. Slight overhead vs compile-time arrays, but enables one binary for any game.
Sparse feature input — Features are passed as active index lists, not dense vectors. This matches NNUE's design for board games where most features are inactive.

License

Licensed under either of

at your option.

Related Projects

Stockfish NNUE — The chess engine that popularized NNUE
bullet — GPU-accelerated NNUE training (Rust + CUDA)
Rapfi — Gomoku engine with advanced NNUE

noru 2.0.0