§rnn
Quick links: Overview · Architecture · Format · FFI · Compatibility · Production
rnn is a low-level Rust neural-network core built around explicit memory control, binary model formats, and FFI interoperability.
It is designed for native/embedded-style workflows where you want to control:
- how model bytes are created,
- how buffers are allocated,
- how inference is executed,
- and how the same core is reused across Rust and non-Rust runtimes.
§Table of Contents
- What this project does
- Why the generated neural network exists
- Schema of the generated neural network
- Real drawing of the generated network (actual sample values)
- Conceptual schema of networks built by the library
- Conceptual schema of model construction
- How this network is created (exact pipeline)
- Binary dense format (RMD1) details
- RMD1 binary layout (concise spec)
- Runtime parser path (RNN\0) and format split
- Core inference execution model
- Complete module reference
- Public API groups (selected)
- Compatibility matrix
- Build and artifacts
- Generate and validate a sample .rnn
- FFI integration lifecycle
- Performance notes
- Security and safety notes
- Versioning and stability policy
- Project validation scripts
- Testing status
- FAQ
- Production checklist
- Contributing
- License
§What this project does
This project provides end-to-end building blocks to:
- Define dense network topology and layer specs
- Validate parameter counts and index ranges
- Serialize models into compact binary payloads
- Deserialize/validate payloads safely
- Run deterministic inference with caller-provided scratch buffers
- Expose the same runtime through a C ABI
In addition to dense flow, the crate includes modules for attention, KV cache, RoPE, MoE routing, quantization, sampling, beam search, convolutions, normalization, and profiling/runtime estimation.
§Why the generated neural network exists
The sample generator in examples/generate_sample_model.rs exists to provide a deterministic, minimal artifact used for:
- format validation,
- API smoke checks,
- FFI integration checks,
- cross-language consistency checks.
It generates a tiny dense model with:
- topology: [2, 1]
- weights: [2.0, -1.0]
- bias: [0.5]
- activation: Identity
So the output is:
$$ y = 2.0 \cdot x_0 - 1.0 \cdot x_1 + 0.5 $$
This tiny model is intentionally simple so behavior is easy to verify in every language binding.
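The computation can be reproduced in plain Rust, independent of the crate's APIs (a standalone sketch of the sample model, useful as a cross-language reference value):

```rust
// Standalone reproduction of the generated sample model:
// topology [2, 1], weights [2.0, -1.0], bias [0.5], Identity activation.
fn sample_model_forward(x0: f32, x1: f32) -> f32 {
    let weights = [2.0f32, -1.0];
    let bias = 0.5f32;
    // Dense neuron: z = w0*x0 + w1*x1 + b; Identity leaves z unchanged.
    weights[0] * x0 + weights[1] * x1 + bias
}

fn main() {
    // y = 2.0*1.0 - 1.0*2.0 + 0.5 = 0.5
    println!("{}", sample_model_forward(1.0, 2.0));
}
```

Any binding that loads the generated `.rnn` file should produce these same values bit-for-bit, which is what makes the artifact useful for consistency checks.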
§Schema of the generated neural network
flowchart LR
X0((x0)) --> N[Dense neuron]
X1((x1)) --> N
B((bias=0.5)) --> N
N --> Y((y))
Parameter mapping for this sample:
- w0 = 2.0 applied to x0
- w1 = -1.0 applied to x1
- b = 0.5
- activation = Identity
§Real drawing of the generated network (actual sample values)
This is the exact neuron-level network generated by examples/generate_sample_model.rs:
graph LR
x0((Input x0)) -- "w0 = +2.0" --> n1((Neuron n1))
x1((Input x1)) -- "w1 = -1.0" --> n1
b((Bias +0.5)) --> n1
n1 -- "Identity" --> y((Output y))
Operationally, the neuron computes:
$$ z = (2.0 \cdot x_0) + (-1.0 \cdot x_1) + 0.5 $$
Because the output activation is Identity, the final output is:
$$ y = z $$
So for this generated sample model:
$$ y = 2.0 \cdot x_0 - x_1 + 0.5 $$
§Conceptual schema of networks built by the library
Beyond the tiny sample model, the core dense path implemented by this crate is conceptually a feed-forward stack of dense layers:
flowchart LR
I[Input vector x] --> L1[Dense Layer 1\nW1 x + b1\nActivation a1]
L1 --> L2[Dense Layer 2\nW2 h1 + b2\nActivation a2]
L2 --> L3[... Optional hidden layers ...]
L3 --> O[Output layer\nWn h(n-1) + bn\nOutput activation]
Each dense layer is represented internally with:
- input_size
- output_size
- weight_offset
- bias_offset
- activation
Those descriptors are chained and validated before execution (LayerPlan::validate).
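The shape-chaining rule can be illustrated with a minimal sketch. The struct and function names here are illustrative, not the crate's actual `DenseLayerDesc`/`LayerPlan::validate` types, which also check offsets and activations:

```rust
// Illustrative layer descriptor; the crate's real descriptors also carry
// weight/bias offsets and an activation id.
struct LayerDim { input_size: usize, output_size: usize }

// Chain validation: each layer's input width must match the previous
// layer's output width, and no dimension may be zero.
fn validate_chain(layers: &[LayerDim]) -> Result<(), String> {
    for (i, l) in layers.iter().enumerate() {
        if l.input_size == 0 || l.output_size == 0 {
            return Err(format!("layer {i}: zero dimension"));
        }
        if i > 0 && layers[i - 1].output_size != l.input_size {
            return Err(format!("layer {i}: shape mismatch"));
        }
    }
    Ok(())
}

fn main() {
    let ok = [LayerDim { input_size: 2, output_size: 3 },
              LayerDim { input_size: 3, output_size: 1 }];
    assert!(validate_chain(&ok).is_ok());
}
```

Running this kind of check once, up front, is what lets the forward kernels index weight slices without per-element bounds surprises.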
§Conceptual schema of model construction
The dense model creation flow is explicit and deterministic:
flowchart TD
T[Topology\nexample: 2 -> 1 or 8 -> 16 -> 4] --> S[Build dense layer specs\ninput/output sizes + offsets + activations]
P[Weights + Biases] --> S
S --> V[Range/count validation\nweights_len and biases_len checks]
V --> E[Encode binary model\nRMD1 header + layer metadata + tensors]
E --> F[.rnn file payload]
F --> D[Decode + validate at runtime]
D --> R[Run inference with explicit scratch buffers]
Why this design:
- predictable memory behavior (no hidden runtime allocations in core path),
- strict structural checks before compute,
- straightforward interop with FFI consumers.
§How this network is created (exact pipeline)
§Step 1: Topology and parameters
- topology = [2, 1]
- user-provided weights, biases
§Step 2: Build layer specs
build_dense_specs_from_layers computes for each layer:
- input_size, output_size
- weight_offset, bias_offset
- activation choice (hidden vs output)
It also validates consistency with total weights/biases.
§Step 3: Encode binary payload
encode_dense_model_v1 writes:
- magic/version/header
- layer metadata
- packed weights
- packed biases
§Step 4: Persist bytes
The example writes the result as a .rnn file.
§Step 5: Runtime consumption
At inference time:
- rnn_required_dense_from_bytes_v1 inspects required counts
- decode_dense_model_v1 reconstructs layer specs/parameters
- forward_dense_plan executes with caller scratch buffers
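The offset bookkeeping in step 2 can be sketched independently of the crate. This hypothetical helper (not the actual `build_dense_specs_from_layers` signature) shows how per-layer offsets follow from a topology:

```rust
// For a topology [d0, d1, ..., dn], layer l consumes d_l * d_{l+1} weights
// and d_{l+1} biases; its offsets are running sums over preceding layers.
fn dense_offsets(topology: &[usize]) -> Vec<(usize, usize)> {
    let (mut w_off, mut b_off) = (0usize, 0usize);
    let mut out = Vec::new();
    for pair in topology.windows(2) {
        out.push((w_off, b_off)); // (weight_offset, bias_offset) for this layer
        w_off += pair[0] * pair[1];
        b_off += pair[1];
    }
    out
}

fn main() {
    // topology [2, 3, 1]: layer 0 starts at (0, 0),
    // layer 1 starts at weight offset 2*3 = 6, bias offset 3.
    assert_eq!(dense_offsets(&[2, 3, 1]), vec![(0, 0), (6, 3)]);
}
```

Because offsets are fully determined by the topology, the final offsets plus the last layer's sizes also yield the expected `weights_len`/`biases_len` totals that the validation step checks.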
§Binary dense format (RMD1) details
Dense format helpers are in src/model_format and src/rnn_api.
Key characteristics:
- Magic: RMD1
- Versioned header
- Layer metadata contains input/output sizes, offsets, activation id
- All critical ranges are validated before use
- Decode fails on truncation, bad version/magic, invalid offsets, or capacity mismatch
This gives a strict producer/consumer contract for dense models.
§RMD1 binary layout (concise spec)
Dense RMD1 payload layout used by model_format:
- Header (20 bytes total):
  - magic (4 bytes): RMD1
  - version (u16)
  - flags (u16, currently reserved)
  - layer_count (u32)
  - weights_len (u32)
  - biases_len (u32)
- Layer metadata array (layer_count entries, 20 bytes each):
  - input_size (u32)
  - output_size (u32)
  - weight_offset (u32)
  - bias_offset (u32)
  - activation (u8)
  - reserved (3 bytes)
- Weights payload (weights_len * 4 bytes, f32 little-endian)
- Biases payload (biases_len * 4 bytes, f32 little-endian)
Validation guarantees include:
- non-zero dimensions,
- checked offset arithmetic,
- bounds checks against tensor payload lengths,
- truncation/version/magic checks at decode.
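The header layout above can be sketched as a standalone encoder. This is an illustration, not the crate's `encode_dense_model_v1`; little-endian header fields are an assumption here, chosen to match the stated f32 payload encoding:

```rust
// Sketch of the 20-byte RMD1 header per the layout above:
// magic (4) + version (2) + flags (2) + layer_count (4)
// + weights_len (4) + biases_len (4) = 20 bytes.
fn encode_rmd1_header(layer_count: u32, weights_len: u32, biases_len: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(20);
    out.extend_from_slice(b"RMD1");              // magic (4 bytes)
    out.extend_from_slice(&1u16.to_le_bytes());  // version (assumed v1)
    out.extend_from_slice(&0u16.to_le_bytes());  // flags (reserved)
    out.extend_from_slice(&layer_count.to_le_bytes());
    out.extend_from_slice(&weights_len.to_le_bytes());
    out.extend_from_slice(&biases_len.to_le_bytes());
    out
}

fn main() {
    // For the sample model: 1 layer, 2 weights, 1 bias.
    let h = encode_rmd1_header(1, 2, 1);
    assert_eq!(h.len(), 20);
    assert_eq!(&h[0..4], b"RMD1");
}
```

A decoder would reverse this field-by-field, rejecting the payload on any length, magic, or version mismatch before touching the tensor data.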
§Runtime parser path (RNN\0) and format split
The repository also contains parser utilities in src/rnn_format with RNN\0 magic.
So there are two format domains in the project:
- Dense model serialization path (RMD1)
- Runtime blob parser path (RNN\0)
This is intentional in code, but requires clear pipeline discipline in production.
§Core inference execution model
Dense execution path is explicit and buffer-oriented:
- Validate plan and shape chain
- Compute scratch requirement from max width and batch size
- Use two alternating scratch lanes for layer-by-layer forward pass
- Copy final lane into output buffer
This avoids hidden execution state and keeps runtime behavior predictable.
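The alternating-lane pattern can be shown with a minimal self-contained sketch. Names and the nested-`Vec` weight layout are illustrative; the crate's `forward_dense_plan` instead works on packed slices addressed by layer offsets:

```rust
// Minimal two-lane scratch forward pass over a dense stack: each layer
// reads one lane and writes the other, then the lanes are swapped.
fn forward_two_lane(
    layers: &[(Vec<Vec<f32>>, Vec<f32>)], // (weights[out][in], biases[out]) per layer
    input: &[f32],
) -> Vec<f32> {
    let mut cur: Vec<f32> = input.to_vec();
    let mut next: Vec<f32> = Vec::new();
    for (w, b) in layers {
        next.clear();
        for (row, bias) in w.iter().zip(b.iter()) {
            // Matrix-vector product plus bias for one output unit.
            let z: f32 = row.iter().zip(cur.iter()).map(|(wi, xi)| wi * xi).sum();
            next.push(z + bias);
        }
        std::mem::swap(&mut cur, &mut next); // alternate lanes
    }
    cur // final lane holds the output; the engine copies it to the caller buffer
}

fn main() {
    // The sample model: one layer, W = [[2.0, -1.0]], b = [0.5].
    let layers = vec![(vec![vec![2.0, -1.0]], vec![0.5])];
    assert_eq!(forward_two_lane(&layers, &[1.0, 2.0]), vec![0.5]);
}
```

Only two lanes are ever needed regardless of depth, which is why scratch requirements depend on the widest layer rather than on layer count.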
§Complete module reference
§Core execution
- network: network-level checks and stats
- layers: layer descriptors, chaining/range validation, topology→spec conversion
- engine: dense forward kernels, scratch sizing, shape checks
- inference: batch forward wrappers, stable softmax and logits helpers
- runtime: memory/flops/throughput/budget estimators
- model_config: predefined config helpers
§Tensor and numerics
- tensor: tensor views, indexing, layout checks
- scratch: temporary memory helpers
- activations: activation kinds and vector application
- normalization: layer norm / RMS norm
- quantization: i8/f32 quant/dequant and mixed matmul
- math (in src/lib.rs): no-std-friendly approximations
§Training-adjacent
- losses: loss and reduction logic
- metrics: MSE/MAE/accuracy/argmax and running means
- gradients: norm, clipping, finite checks
- optimizers: optimizer update paths
- schedulers: LR scheduling
- trainer: SGD-oriented step helpers
- initializers: parameter count/init helpers
§Transformer-style blocks
- attention: scaled dot-product attention + masks/shapes
- kv_cache: KV cache views/errors
- rope: rotary position embedding application
- sampling: temperature/top-k/top-p sampling primitives
- beam_search: beam selection utilities
- moe: top-1 gating and routing
- embeddings: embedding gather and tied projection
- lora: LoRA delta application
§Spatial/specialized operators
- conv3d: 3D convolution and compatibility checks
- conv5d: 5D convolution forward/backward
- sphere5d: 5D sphere structures/helpers
- batching: padding and mask generation
§Formats and interop
- model_format: dense model encoding/decoding (RMD1)
- rnn_api: high-level dense lifecycle APIs
- rnn_format: runtime blob parser (RNN\0)
- ffi_api: C ABI implementation
- public_api: re-exported public surface
- crypto: hashing/integrity helpers
- profiler: operation counting helpers
§Legacy note
embedings exists as a legacy spelling path in repository history/structure.
§Public API groups (selected)
The crate re-exports many symbols through src/public_api.rs.
Examples by category:
- Dense lifecycle: rnn_required_dense_from_bytes_v1, rnn_pack_dense_v1, rnn_run_dense_v1
- Format: encode_dense_model_v1, decode_dense_model_v1, encoded_size_v1
- Inference ops: forward_dense_batch, scaled_dot_product_attention, apply_rope_in_place
- Optimization: dense_sgd_step, apply_optimizer_step, clip_by_global_norm
- Runtime estimates: estimate_runtime_memory, estimate_runtime_flops, check_runtime_budget
- FFI C API: model create/run/destroy + ABI checks in include/rnn_ffi.h
§Compatibility matrix
This project is designed to be compatible across all major desktop/server OSes:
| Platform | Rust crate build | FFI artifacts | Notes |
|---|---|---|---|
| Linux | Supported | Supported (.so, .a) | Primary native flow |
| macOS | Supported | Supported (.dylib, .a) | Standard clang/ld toolchain |
| Windows | Supported | Supported (.dll, .lib) | MSVC/MinGW depending on toolchain |
General requirements:
- Rust stable toolchain
- C/C++ toolchain when consuming FFI outputs
- Platform-specific linker/runtime setup for shared libraries
§Build and artifacts
Build:
cargo build
cargo build --release
With the current crate config, release builds can emit Rust + native artifacts according to platform/toolchain (rlib, cdylib, staticlib).
§Generate and validate a sample .rnn
Generate:
cargo run --example generate_sample_model -- /tmp/sample.rnn
Sanity-check:
ls -lh /tmp/sample.rnn
xxd -l 4 /tmp/sample.rnn
Expected dense header bytes correspond to RMD1.
§FFI integration lifecycle
C header: include/rnn_ffi.h
Recommended host flow:
- rnn_ffi_api_version / rnn_ffi_is_abi_compatible
- rnn_ffi_model_create_from_bytes_v1
- rnn_ffi_model_get_info
- rnn_ffi_model_run_dense or rnn_ffi_model_run_dense_batch
- rnn_ffi_model_destroy
§Performance notes
- Dense forward cost is dominated by matrix-vector products per layer.
- For dense stacks, per-sample compute is approximately proportional to:
$$ \sum_{l=1}^{L} (\text{in}_l \times \text{out}_l) $$
- Batch mode reuses the same plan and alternates scratch lanes for better locality.
- Scratch requirements scale with batch_size * max_layer_width * 2 in the current engine path.
- Quantization and runtime estimation modules can be used to pre-plan deployment budgets.
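The two formulas above translate directly into back-of-envelope helpers. These are illustrative, not the crate's `required_batch_scratch_len` or `estimate_runtime_flops`:

```rust
// Scratch requirement: two alternating lanes, each wide enough for the
// widest layer, per sample in the batch.
fn scratch_len(batch_size: usize, topology: &[usize]) -> usize {
    let max_width = topology.iter().copied().max().unwrap_or(0);
    batch_size * max_width * 2
}

// Per-sample compute: sum over layers of in_l * out_l multiply-accumulates.
fn per_sample_macs(topology: &[usize]) -> usize {
    topology.windows(2).map(|p| p[0] * p[1]).sum()
}

fn main() {
    // topology [8, 16, 4], batch of 4:
    assert_eq!(scratch_len(4, &[8, 16, 4]), 4 * 16 * 2);
    assert_eq!(per_sample_macs(&[8, 16, 4]), 8 * 16 + 16 * 4);
}
```

Estimates like these are enough to pre-size buffers and sanity-check deployment budgets before any model bytes are loaded.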
§Security and safety notes
- Never trust external model bytes by default.
- Always validate incoming payloads before inference (required_* and decode checks).
- Keep ABI checks enabled in cross-language hosts (rnn_ffi_is_abi_compatible).
- Treat model files as untrusted input in service contexts (sandbox, size limits, resource guards).
- Keep check_abi_contract.sh in CI if you publish FFI artifacts.
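Cheap guards can run before any structural decoding. This sketch is illustrative (the crate performs its own truncation/magic/version checks at decode); the size cap and helper name are assumptions for the example:

```rust
// Pre-decode guards for untrusted model bytes: enforce a size limit and
// check the magic before any parsing work is done.
fn precheck_model_bytes(bytes: &[u8], max_len: usize) -> Result<(), &'static str> {
    if bytes.len() > max_len {
        return Err("payload exceeds size limit");
    }
    if bytes.len() < 20 {
        return Err("payload shorter than header");
    }
    if &bytes[0..4] != b"RMD1" {
        return Err("bad magic");
    }
    Ok(())
}

fn main() {
    // 20 bytes, correct magic: passes the cheap checks.
    assert!(precheck_model_bytes(b"RMD1aaaaaaaaaaaaaaaa", 1024).is_ok());
    // Wrong magic: rejected before any decoding.
    assert!(precheck_model_bytes(b"XXXXaaaaaaaaaaaaaaaa", 1024).is_err());
}
```

Guards like this bound the work an attacker can force from a hostile payload; full validation still happens in the decode path.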
§Versioning and stability policy
- Rust crate API should follow semantic versioning for public surface changes.
- C ABI changes should be treated as compatibility-sensitive and version-gated.
- Model format changes (RMD1) should be versioned explicitly and decoded defensively.
- Breaking changes should be documented in release notes and migration guidance.
§Project validation scripts
- scripts/check_abi_contract.sh: validates expected ABI symbols
- scripts/prod_ready_check.sh: broad production-style checks
Note: prod_ready_check.sh references optional wrapper ecosystems (wrappers/python, wrappers/javascript, wrappers/java, wrappers/cpp) and related tooling.
§Subtleties and design constraints
These are important, non-obvious project subtleties:
- no_std core behavior: The crate is intentionally low-level and optimized for explicit runtime control.
- Dual format domain (RMD1 and RNN\0): Dense serialization and runtime blob parsing are separate concerns and must be selected deliberately per pipeline.
- Explicit scratch management: Inference APIs rely on caller-allocated buffers. This is by design for deterministic memory behavior.
- Strict range validation: Layer offsets, dimensions, and capacities are validated before execution to prevent unsafe indexing paths.
- FFI ABI contract stability matters: Any C ABI change must stay synchronized between src/ffi_api and include/rnn_ffi.h.
- Repository currently includes broad domain modules: The crate is not a tiny single-purpose dense runner; it is a wide NN systems toolbox.
§Testing status
As requested for this repository:
- no in-repo unit-test focus is currently documented here,
- a dedicated std wrapper crate is planned,
- all unit tests are intended to be centralized in that wrapper.
§FAQ
§Why no_std?
To keep the core deterministic and portable for constrained/native runtimes.
§Why both RMD1 and RNN\0 paths?
They represent two format domains in the repository (dense serialization vs runtime parser utilities). Keep pipeline usage explicit.
§Why a separate std wrapper for unit tests?
To keep this core focused on runtime/format/FFI behavior while enabling richer testing ergonomics in a host-friendly crate.
§Can I use this on Windows/Linux/macOS?
Yes. The crate and FFI flow are designed for all three platforms with standard Rust + native toolchains.
§Production checklist
- Build release artifacts (cargo build --release)
- Validate ABI contract (scripts/check_abi_contract.sh)
- Generate and verify sample model (examples/generate_sample_model.rs)
- Verify FFI lifecycle in your host runtime (create/run/destroy)
- Apply resource limits and input validation for model loading
- Track runtime budgets (memory/FLOPs/throughput) before deployment
§Contributing
Contributions are welcome.
Suggested local checks:
cargo fmt --all
cargo clippy --all-targets -- -D warnings
cargo build --release
For major changes, open an issue first with:
- scope,
- impacted modules,
- compatibility expectations.
§License
MIT.
See LICENSE.
Re-exports§
pub use crate::network::NeuralNetwork;
pub use crate::network::NetworkStats;
pub use crate::network::network_stats;
pub use crate::network::validate_network_parts;
pub use crate::tensor::TensorView;
pub use crate::tensor::tensor_fill;
pub use crate::tensor::tensor_scale_in_place;
pub use crate::tensor::tensor_add_in_place;
pub use crate::scratch::Scratch;
pub use crate::rnn_format::parse_rnn_from_bytes;
pub use crate::rnn_format::RnnHandle;
pub use crate::rnn_format::BlobMeta;
pub use crate::rnn_api::RnnApiError;
pub use crate::rnn_api::rnn_required_dense_from_topology;
pub use crate::rnn_api::rnn_required_dense_from_bytes_v1;
pub use crate::rnn_api::rnn_dense_required_buffers;
pub use crate::rnn_api::rnn_dense_required_infer_scratch_from_specs;
pub use crate::rnn_api::rnn_validate_dense_topology;
pub use crate::rnn_api::rnn_validate_dense_counts;
pub use crate::rnn_api::rnn_pack_dense_v1;
pub use crate::rnn_api::rnn_unpack_dense_v1;
pub use crate::rnn_api::rnn_run_dense_v1;
pub use crate::crypto::Sha256Ctx;
pub use crate::crypto::Sha512Ctx;
pub use crate::crypto::sha256_bytes;
pub use crate::crypto::sha512_bytes;
pub use crate::crypto::digest_to_hex_lower;
pub use crate::crypto::constant_time_eq;
pub use crate::crypto::verify_sha256;
pub use crate::crypto::verify_sha512;
pub use crate::conv3d::conv3d_forward;
pub use crate::conv3d::conv3d_layout_compatible;
pub use crate::conv3d::conv3d_is_compatible;
pub use crate::conv5d::conv5d_forward;
pub use crate::conv5d::conv5d_backward;
pub use crate::sphere5d::Sphere5D;
pub use crate::sphere5d::NeuronPoint;
pub use crate::sphere5d::SphereError;
pub use crate::activations::ActivationKind;
pub use crate::layers::LayerSpec;
pub use crate::layers::DenseLayerDesc;
pub use crate::layers::LayerPlan;
pub use crate::layers::LayerError;
pub use crate::engine::forward_dense_plan;
pub use crate::engine::forward_dense_plan_big_kernel;
pub use crate::engine::required_batch_scratch_len;
pub use crate::engine::ForwardError;
pub use crate::engine::required_single_infer_scratch;
pub use crate::engine::validate_forward_io;
pub use crate::model_format::encode_dense_model_v1;
pub use crate::model_format::decode_dense_model_v1;
pub use crate::model_format::encoded_size_v1;
pub use crate::model_format::DecodedCounts;
pub use crate::model_format::ModelFormatError;
pub use crate::losses::LossKind;
pub use crate::losses::LossError;
pub use crate::losses::loss_and_gradient;
pub use crate::losses::reduce_sum;
pub use crate::losses::reduce_mean;
pub use crate::metrics::MetricError;
pub use crate::metrics::mse;
pub use crate::metrics::mae;
pub use crate::metrics::argmax;
pub use crate::metrics::accuracy_top1_from_one_hot;
pub use crate::metrics::cross_entropy_from_probabilities;
pub use crate::metrics::RunningMean;
pub use crate::initializers::InitKind;
pub use crate::initializers::InitError;
pub use crate::initializers::expected_parameter_counts;
pub use crate::initializers::initialize_dense_parameters;
pub use crate::inference::InferenceError;
pub use crate::inference::softmax_stable;
pub use crate::inference::forward_dense_batch;
pub use crate::inference::normalize_logits_in_place;
pub use crate::inference::argmax_index;
pub use crate::trainer::DenseSgdConfig;
pub use crate::trainer::TrainError;
pub use crate::trainer::required_train_buffer_len;
pub use crate::trainer::dense_sgd_step;
pub use crate::optimizers::OptimizerKind;
pub use crate::optimizers::OptimizerError;
pub use crate::optimizers::optimizer_state_len;
pub use crate::optimizers::apply_optimizer_step;
pub use crate::schedulers::LrSchedule;
pub use crate::schedulers::ScheduleError;
pub use crate::schedulers::compute_learning_rate;
pub use crate::normalization::NormError;
pub use crate::normalization::layer_norm_in_place;
pub use crate::normalization::layer_norm;
pub use crate::normalization::rms_norm_in_place;
pub use crate::normalization::rms_norm;
pub use crate::attention::AttentionError;
pub use crate::attention::AttentionMask;
pub use crate::attention::AttentionShape;
pub use crate::attention::scaled_dot_product_attention;
pub use crate::quantization::QuantError;
pub use crate::quantization::quantize_i8_symmetric;
pub use crate::quantization::dequantize_i8_symmetric;
pub use crate::quantization::matmul_i8_f32;
pub use crate::model_config::TransformerConfig;
pub use crate::model_config::ConfigError;
pub use crate::model_config::tiny_transformer;
pub use crate::model_config::small_transformer;
pub use crate::model_config::base_transformer;
pub use crate::runtime::RuntimeProfile;
pub use crate::runtime::RuntimeEstimate;
pub use crate::runtime::RuntimeError;
pub use crate::runtime::RuntimeFlopsEstimate;
pub use crate::runtime::ThroughputEstimate;
pub use crate::runtime::BudgetFit;
pub use crate::runtime::estimate_runtime_memory;
pub use crate::runtime::estimate_runtime_flops;
pub use crate::runtime::estimate_tokens_per_second;
pub use crate::runtime::check_runtime_budget;
pub use crate::runtime::fit_from_estimate;
pub use crate::sampling::SamplingError;
pub use crate::sampling::softmax_temperature;
pub use crate::sampling::argmax_sample;
pub use crate::sampling::sample_from_cumulative;
pub use crate::sampling::top_k_mask;
pub use crate::sampling::top_p_cutoff;
pub use crate::kv_cache::KvCacheError;
pub use crate::kv_cache::KvCacheView;
pub use crate::rope::RopeError;
pub use crate::rope::apply_rope_in_place;
pub use crate::embeddings::EmbeddingError;
pub use crate::embeddings::gather_embeddings;
pub use crate::embeddings::tied_output_projection;
pub use crate::lora::LoraError;
pub use crate::lora::apply_lora_delta;
pub use crate::moe::MoeError;
pub use crate::moe::top1_gating;
pub use crate::moe::route_top1;
pub use crate::beam_search::BeamError;
pub use crate::beam_search::select_top_beams;
pub use crate::gradients::GradientError;
pub use crate::gradients::l2_norm;
pub use crate::gradients::clip_by_global_norm;
pub use crate::gradients::all_finite;
pub use crate::batching::BatchError;
pub use crate::batching::pad_sequences_u32;
pub use crate::batching::make_padding_mask;
pub use crate::profiler::OpCounter;