iron_learn 0.6.1

A simple Rust ML library with GPU-accelerated gradient descent. Supports tensors, complex numbers, linear/logistic regression, and CUDA optimization.

Iron Learn

A Rust machine learning library with optional GPU-accelerated optimization. Built for learning tensor operations, gradient-based algorithms, and numerical computing with an emphasis on type safety and correctness.

Features

  • GPU-Accelerated Training: CUDA kernels for Tensor operations; the cuda feature must be enabled explicitly (see the build instructions below).
  • Comprehensive Tensor Support: Two-dimensional arrays with generic numeric types
  • Complex Number Arithmetic: Native support for complex-valued computations
  • Zero-Copy Operations: Borrowing methods for efficient computation reuse

Network Output Sample

Let's take a look at what the library can do. In one of the proofs of concept, I used a ~99K-parameter vanilla neural network to put the Universal Approximation Theorem to work reconstructing an image. Below are a few snapshots from training on a very complex function (an image of Simba).

The random noise the network started with

Starting Random Noise

Reconstructed image after 200,000 epochs

Intermediate Reconstruction

Reconstructed image after 800,000 epochs

Intermediate Reconstruction

For comparison, the following is the original image fed to the network.

Original Image

Time-lapse of the regeneration, step by step

Time lapse of regeneration

You can find all the regenerated images in the image/images directory.

Quick Start

Installation

Add to your Cargo.toml:

[dependencies]
iron_learn = "0.6"
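To enable the GPU backend, add the cuda feature instead (the feature name matches the --features=cuda build flag used below):

[dependencies]
iron_learn = { version = "0.6", features = ["cuda"] }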

Basic Tensor Usage Example

use iron_learn::CpuTensor;
use iron_learn::Tensor;

// Create 2x2 matrices
let a: CpuTensor<f32> = CpuTensor::new(vec![2, 2], vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let b: CpuTensor<f32> = CpuTensor::new(vec![2, 2], vec![5.0, 6.0, 7.0, 8.0]).unwrap();

println!("Original A:");
a.print_matrix();

println!("Original B:");
b.print_matrix();

// Add tensors without move
let sum = a.add(&b).unwrap();
println!("Sum:");
sum.print_matrix();

// Subtract tensors without move
let difference = a.sub(&b).unwrap();
println!("Difference:");
difference.print_matrix();

// Multiply tensors (Matrix multiplication)
let product = a.mul(&b).unwrap();
println!("Product:");
product.print_matrix();

// Multiply tensors (Element wise multiplication)
let product = a.multiply(&b).unwrap();
println!("Hadamard Product:");
product.print_matrix();

// Divide tensors (Element wise division)
let result = a.div(&b).unwrap();
println!("Division:");
result.print_matrix();

// Transpose
let t = a.t().unwrap();
println!("Transpose of A:");
t.print_matrix();

For more detailed examples, check out the Examples section below.

How to build and run

Prerequisites:

  • Rust toolchain (stable or nightly depending on local setup): https://rustup.rs
  • For GPU builds: CUDA Toolkit and an NVIDIA GPU (driver + nvcc). If you don't plan to use GPU tensors, CPU-only build is fine.
  • Python 3.8+ for the scripts (optional); cupy is needed to run the scripts in python_scripts/.

The library has two modes: CPU and CUDA. CPU is the default; CUDA is an optional feature. If you have a CUDA environment set up, pass the --features=cuda flag when building and running.

Build the basic CPU based Rust library and examples:

cargo build --release

Run unit tests:

cargo test

Run the Rust demonstration runners (examples):

# Print comprehensive documentation of all the CLI flags

cargo run -- -h

# Run a neural network example

cargo run 

If you want to enable CUDA-backed tensors, ensure your environment has CUDA installed and visible to the linker. The workflow is the same (cargo build --features=cuda); the code initializes GPU devices at runtime when init_gpu() or init_context() is invoked.
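For example:

cargo build --release --features=cuda

cargo run --features=cuda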

Python scripts can be run directly:

python python_scripts/k-means.py

python anomaly_detection.py

python_scripts/check_cuda.py is a small utility to detect/validate CUDA availability from Python; useful for quick GPU checks.

Examples

Use GPU Tensor

use iron_learn::init_gpu; // Only available under the cuda feature.
use iron_learn::GpuTensor;
use iron_learn::Tensor;

init_gpu();
// Create 2x2 matrices
let a: GpuTensor<f32> = GpuTensor::new(vec![2, 2], vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let b: GpuTensor<f32> = GpuTensor::new(vec![2, 2], vec![5.0, 6.0, 7.0, 8.0]).unwrap();

println!("Original A:");
a.print_matrix();

println!("Original B:");
b.print_matrix();

// Add tensors without move
let sum = a.add(&b).unwrap();
println!("Sum:");
sum.print_matrix();

// Subtract tensors without move
let difference = a.sub(&b).unwrap();
println!("Difference:");
difference.print_matrix();

// Multiply tensors (Matrix multiplication)
let product = a.mul(&b).unwrap();
println!("Product:");
product.print_matrix();

// Multiply tensors (Element wise multiplication)
let product = a.multiply(&b).unwrap();
println!("Hadamard Product:");
product.print_matrix();

// Divide tensors (Element wise division)
let result = a.div(&b).unwrap();
println!("Division:");
result.print_matrix();

// Transpose
let t = a.t().unwrap();
println!("Transpose of A:");
t.print_matrix();

Use Neural Network

use iron_learn::CpuTensor;
use iron_learn::MeanSquaredErrorLoss;
use iron_learn::NeuralNet;
use iron_learn::NeuralNetBuilder;
use iron_learn::neural_network::DistributionType;
use iron_learn::neural_network::LayerType;
use iron_learn::Tensor;

let mut nn = NeuralNetBuilder::<CpuTensor<f32>>::new();
    
let x: CpuTensor<f32> = CpuTensor::new(vec![4, 2], vec![0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]).unwrap();
let y: CpuTensor<f32> = CpuTensor::new(vec![4, 1], vec![0.0, 1.0, 1.0, 0.0]).unwrap();

let monitor = |epoch: usize, err: f32, _current_lr: f32, _nn: &mut NeuralNet<CpuTensor<f32>>| {
    println!("Processing epoch: {epoch}, error: {err}");
};
	
nn.add_linear(2, 4, "Input", &DistributionType::Uniform);
nn.add_activation(LayerType::Tanh, "Activation Layer");
	
nn.add_linear(4, 1, "Hidden", &DistributionType::Uniform);
nn.add_activation(LayerType::Sigmoid, "Output");

let loss_function_instance = Box::new(MeanSquaredErrorLoss);

let mut net = nn.build(loss_function_instance, &"neural-net".to_string());    
net.fit(&x, &y, 10000, 0, 0.01, false, monitor, 1000);
	
let prediction = net.predict(&x).unwrap();
	
println!("\nPredicted value");
prediction.print_matrix();

High-level Overview of the components

  • Rust: Core Tensor abstraction implementing Tensors, numeric abstractions, optimization (gradient descent), neural network primitives, and optional CUDA-backed tensor implementations.
  • CUDA: kernels/ contains CUDA kernels used by the Rust cuda_tensor and gpu_context modules for accelerated matrix ops.
  • Python: python_scripts/ contains helper scripts, experiments and small neural-network examples used for prototyping and data preprocessing.
  • Data/Images: Example JSON metadata and image assets under data/ and image/ used by demos and scripts.

Architecture

Iron Learn System Architecture

Architecture diagram generated by Gemini (AI by Google).

The diagram above illustrates the relationship between the high-level Rust API, the Core Tensor abstractions, and the dual-backend (CPU/GPU) execution model.

Quick Module Reference

Tensor API

  • Purpose: Core trait Tensor<T> defines creation (new, zeroes, ones), shape/data accessors (get_shape, get_data), linear algebra (add, sub, mul, t, multiply, div, scale, clip) and reducers (sum). See src/tensor/mod.rs for the trait and basic docs.

NB: The library does not yet support broadcasting. I will soon introduce broadcasting logic.
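
For orientation, here is a minimal sketch exercising the creation and accessor methods listed above. The exact signatures of zeroes and ones are assumptions modeled on the new constructor shown earlier:

use iron_learn::CpuTensor;
use iron_learn::Tensor;

// Assumed: zeroes/ones take a shape vector like `new` and return a Result.
let z: CpuTensor<f32> = CpuTensor::zeroes(vec![2, 3]).unwrap();
let o: CpuTensor<f32> = CpuTensor::ones(vec![2, 3]).unwrap();

let s = o.add(&z).unwrap();
println!("{:?}", s.get_shape()); // [2, 3]
println!("{:?}", s.get_data());  // six 1.0 values in row-major order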

CpuTensor (CPU backend)

  • Purpose: CpuTensor<T> implements Tensor<T> for CPU operations with plain Rust Vec<T> storage and row-major layout. See src/cpu_tensor/mod.rs.
  • Highlights: explicit shape validation, element-wise math helpers, an inner _cpu_mul implementation optimized for clarity and basic SIMD-friendly loops, and a numerically stable sigmoid in element_op.
  • Limitations: Currently restricted to 1D/2D tensors.

GpuTensor (CUDA backend)

  • Purpose: GpuTensor<T> provides a CUDA-backed Tensor using device buffers, kernels from kernels/gpu_kernels.cu, and optional cublas-accelerated multiplication (_gpu_mul_cublas). See src/cuda_tensor/mod.rs and src/gpu_context.rs for initialization.
  • Highlights: device memory pooling, kernel launches via cust::launch!, functions for elementwise ops (element_op), clipping, transpose (naive), tiled matrix multiply, and column-wise reduce. Uses GPU_CONTEXT to find module/function handles.
  • Notes: GPU ops require init_gpu() to be called and a valid GpuContext. Multiplication defaults to the cublasSgemm path when available.
  • Limitations: Currently restricted to 1D/2D single-precision floating-point tensors. Other data types may produce incorrect results.

GPU Context

  • Purpose: src/gpu_context.rs exposes init_gpu(...) and GPU_CONTEXT (global OnceLock). It stores CUDA Module, Stream, CudaMemoryPool, and a CublasHandle used by GpuTensor.

Gradient Descent / Regression

  • Purpose: src/gradient_descent.rs implements gradient_descent, linear_regression, logistic_regression, and helpers predict_linear, predict_logistic. The functions expect Tensor<f64> with TensorMath<f64> support and follow standard batch gradient updates with optional logistic sigmoid.
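
For reference, a single batch update has the following shape. This is an illustrative sketch of the standard rule described above, not the library's internal code; in particular, scale is assumed to multiply every element by a scalar and return a Result like the other trait methods:

use iron_learn::CpuTensor;
use iron_learn::Tensor;

// One batch gradient-descent step: w <- w - (lr / m) * Xᵀ(Xw - y).
// Sketch only; `scale` semantics are assumed (element-by-scalar multiply).
fn gradient_step(
    x: &CpuTensor<f64>,
    y: &CpuTensor<f64>,
    w: &CpuTensor<f64>,
    lr: f64,
    m: f64,
) -> CpuTensor<f64> {
    let error = x.mul(w).unwrap().sub(y).unwrap();  // Xw - y
    let grad = x.t().unwrap().mul(&error).unwrap(); // Xᵀ(Xw - y)
    w.sub(&grad.scale(lr / m).unwrap()).unwrap()    // w - (lr / m) * grad
}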

Neural Network

  • Purpose: High-level NN abstraction in src/neural_network/. Use NeuralNetBuilder to add Linear layers (LinearLayer) and activations (ActivationLayer) and then build() a NeuralNet instance.
  • Model Persistence: ModelData (see src/neural_network/mod.rs) serializes model metadata:
    • name: model name
    • parameter_count: total parameters
    • layers: list of LayerData objects (layer_type, name, index, weights, shape)
    • epoch: last saved epoch, so training can resume where it left off.
    • saved_lr: learning rate saved with the model. Two learning-rate schedules are currently supported: None (constant learning rate) and cosine annealing.
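
Put together, a saved model file might look roughly like this (hypothetical values; the field layout is inferred from the list above):

{
  "name": "neural-net",
  "parameter_count": 4,
  "layers": [
    {
      "layer_type": "Linear",
      "name": "Input",
      "index": 0,
      "weights": [0.12, -0.31, 0.07, 0.44],
      "shape": [2, 2]
    }
  ],
  "epoch": 10000,
  "saved_lr": 0.01
}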

CUDA Kernels (quick API map)

  • Located at kernels/gpu_kernels.cu. Main exported kernels (extern "C"):
    • fill_value(float *out, int n, float value) — fill buffer
    • vector_arithmatic(const float *a, const float *b, float *out, int n, unsigned int op) — add/sub/mul/div
    • clip(const float *s, float *r, int n, float min, float max) — clip values
    • element_op(const float *s, float *r, int n, int op, float scale) — exp/sin/cos/tanh/sigmoid/log
    • compare_memory(const float *a, const float *b, size_t size, int *result) — compare arrays
    • transpose_naive(const float *A, float *B, int M, int N) — naive transpose
    • matrix_mul(const float *A, const float *B, float *C, int M, int N, int K) — tiled matmul using shared memory
    • column_reduce(const float *inputMatrix, float *outputSums, int numRows, int numCols) — column sums
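
For orientation, here is a minimal, standalone sketch of how a kernel such as fill_value could be launched with the cust crate. The library wires this up internally through GPU_CONTEXT; the PTX path, launch geometry, and error handling here are assumptions for illustration:

use cust::prelude::*;

fn main() -> Result<(), cust::error::CudaError> {
    // Set up a CUDA context, load the precompiled PTX, and grab the kernel.
    let _ctx = cust::quick_init()?;
    let ptx = std::fs::read_to_string("kernels/gpu_kernels.ptx").unwrap();
    let module = Module::from_ptx(&ptx, &[])?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;
    let fill = module.get_function("fill_value")?;

    // Allocate a device buffer and fill it with 1.0 using the kernel.
    let n = 1024usize;
    let out = DeviceBuffer::from_slice(&vec![0.0f32; n])?;
    let block = 256u32;
    let grid = (n as u32 + block - 1) / block;
    unsafe {
        launch!(fill<<<grid, block, 0, stream>>>(out.as_device_ptr(), n as i32, 1.0f32))?;
    }
    stream.synchronize()?;

    // Copy back to the host and verify.
    let mut host = vec![0.0f32; n];
    out.copy_to(&mut host)?;
    assert!(host.iter().all(|&v| v == 1.0));
    Ok(())
}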

Python scripts (where to look)

  • python_scripts/neural_net/* — prototype Python-side neural net builder and helpers used for experiments and model JSON creation.
  • anomaly_detection.py — feature-wise Gaussian estimation and F1-based threshold selection (useful as a standalone script).

Data & Model JSONs

  • Example dataset: see data/image.json (fields: m, n, m_test, x, y, x_test, y_test) — JSON contains flattened row-major arrays for x/y.
  • Example model file: see image/model.json — follows ModelData schema described above. Use read_file::deserialize_model() to load models into ModelData, and NeuralNetBuilder::build_from_model() to restore a runtime NeuralNet from ModelData.
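
Loading a saved model might look like this. This is a hypothetical sketch: only the function names come from the codebase, and both signatures are assumptions:

use iron_learn::CpuTensor;
use iron_learn::NeuralNetBuilder;
use iron_learn::read_file;

// Assumed: deserialize_model takes a path and returns ModelData;
// build_from_model consumes it and yields a runnable NeuralNet.
let model = read_file::deserialize_model("image/model.json");
let net = NeuralNetBuilder::<CpuTensor<f32>>::build_from_model(model);
// net can then be used with predict() as in the earlier example.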

Repository Structure

  • src/ — Rust library and binaries

    • tensor/ — Tensor trait and backend implementations (CPU/GPU implementations live in cpu_tensor/ and cuda_tensor/).
    • neural_network/ — High-level NN builder, layers, activations and loss functions.
    • gradient_descent.rs — CPU implementations of linear and logistic regression and helper routines (normalization, bias handling).
    • gpu_context.rs, cuda_tensor/ — GPU initialization and device-backed tensor types (CUDA interop, cublas wrappers, memory pools).
    • read_file.rs — Helpers for loading JSON model/data artifacts.
    • runners.rs — Small CLI-like routines for running run_linear, run_logistic, and run_neural_net examples.
  • kernels/ — CUDA device code

    • gpu_kernels.cu — Implementations for tiled matrix multiplication, elementwise ops, clipping, transpose and reductions.
    • gpu_kernels.ptx — Precompiled PTX shipped alongside the CUDA source.
  • python_scripts/ — Python utilities and experiments

    • Top-level scripts: k-means.py, check_cuda.py, plot_graph.py, etc.
    • neural_net/ — Small Python builder, activation, layers and helpers used for rapid prototyping and educational examples.
  • data/, image/ — Example datasets, model JSONs and saved weights used by demos.

What each major component does

Rust: src/

  • tensor (trait & implementations): Core abstraction exposing operations such as add, mul, transpose, elementwise math (sin, cos, sigmoid, etc.), and shape/data accessors. Backends implement TensorMath for mathematical functions.
  • cpu_tensor/: Pure CPU tensor implementation used for most algorithms and tests. Most of its loops are auto-vectorized by the compiler for data-parallel execution.
  • cuda_tensor/: CUDA-backed tensor implementation with device buffers and memory pooling. Integrates with gpu_context and uses kernels from kernels/.
  • neural_network/: Provides a NeuralNetBuilder and a runtime NeuralNet type. Layers implement a Layer<T> trait; LinearLayer handles weight matrices and updates while ActivationLayer applies elementwise activations (sigmoid/tanh/linear/sin). loss_functions.rs contains MeanSquaredErrorLoss and BinaryCrossEntropy used for backprop.
  • gradient_descent.rs: Implements linear_regression, logistic_regression, plus helper functions predict_linear, predict_logistic, and gradient_descent steps. Also contains normalize_features exposed from commons.
  • gpu_context.rs and cuda_tensor/*: Manage GPU initialization, cuBLAS/cuDNN handles (where applicable), custom device buffers, and a simple memory pool to reduce allocations when transferring data.
  • runners.rs: Provides convenience runners used by the CLI entrypoints (main.rs) to invoke example training runs.

CUDA: kernels/gpu_kernels.cu

  • Implements atomic-safe memory comparison, element-wise math device kernels (exp, sin, cos, sigmoid branch), vector arithmetic, clipping, tiled matrix multiplication (shared-memory, TILE_SIZE 16), transpose and column reductions. These kernels are used by cuda_tensor for operations like mul, transpose, col_reduce and elementwise transforms (sigmoid, ln, exp).

Python: python_scripts/

  • anomaly_detection.py: Implements Gaussian estimation and threshold selection (F1-based) for anomaly detection — a direct NumPy port of the typical ML course algorithms (estimate Gaussian and select threshold by F1 score).
  • neural_net/ folder: Small Python builder, activation, layers and helpers used for experimentation and generating JSON ModelData artifacts that can be consumed by the Rust read_file/builder logic.

Where to look for functionality you might extend or inspect

  • Implement new layers / activations: src/neural_network/ (add Layer impls and wire into NeuralNetBuilder).
  • Change tensor math: src/tensor/ traits and src/cpu_tensor / src/cuda_tensor implementations.
  • Add GPU kernels: kernels/gpu_kernels.cu and regenerate the PTX if required (a sample nvcc invocation follows this list); then extend cuda_tensor to call new kernels.
  • Add CLI runners: src/runners.rs and src/main.rs.
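
Regenerating the PTX after editing the kernels can be done with nvcc along these lines (add an -arch flag matching your GPU if needed):

nvcc -ptx kernels/gpu_kernels.cu -o kernels/gpu_kernels.ptx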

Notes

  • The Rust NeuralNetBuilder supports building from scratch or restoring ModelData (weights loaded from JSON). Layers expose get_parameters() for serialization.
  • ActivationLayer caches outputs; activations (sigmoid, tanh, sin, linear) are implemented in activations.rs using the TensorMath trait.
  • LinearLayer performs the forward pass via input.mul(&self.weights) and the backward pass by computing weights_grad = input.T * error, then updating self.weights = self.weights - lr * weights_grad. A bias-column trick is used to avoid broadcasting; broadcasting support is planned.
  • loss_functions.rs includes stable BinaryCrossEntropy (clipping predictions to avoid log/divide-by-zero) and an MSE implementation.
  • CUDA kernels implement a tiled matrix multiply and common elementwise ops; the kernels use shared memory and boundary checks for correctness.

Quick pointers for contributors

  • Run tests: cargo test
  • Format: cargo fmt
  • Lint: cargo clippy (may require installing clippy via rustup)
  • Keep Python examples and the Rust model importer (read_file.rs) in sync if you change JSON model formats.

Next steps

  • Add broadcasting support
  • Axis-wise reducers
  • Support for higher-dimensional tensors