Iron Learn
A Rust machine learning library with optional GPU-accelerated optimization. Built for learning tensor operations, gradient-based algorithms, and numerical computing with an emphasis on type safety and correctness.
Features
- GPU-Accelerated Training: CUDA kernels for tensor operations, behind an opt-in `cuda` feature (see the build instructions below).
- Comprehensive Tensor Support: Two-dimensional arrays with generic numeric types
- Complex Number Arithmetic: Native support for complex-valued computations
- Zero-Copy Operations: Borrowing methods for efficient computation reuse
Network Output Sample
Let's take a look at what the library can do. In one proof of concept, I used a ~99K-parameter vanilla neural network to exercise the Universal Approximation Theorem by reconstructing an image. Below are a few snapshots from training the network on a very complex function (an image of Simba).
The random noise the network started with

Reconstructed image after 200,000 epochs

Reconstructed image after 800,000 epochs

For comparison, the following is the original image fed to the network.
Original Image

Time lapse of regeneration step by step

You can find all the regenerated images in the `image/images` directory.
Quick Start
Installation
Add to your Cargo.toml:
```toml
[dependencies]
iron_learn = "0.6"
```
Basic Tensor Usage Example
```rust
use iron_learn::CpuTensor;
use iron_learn::Tensor;

// NOTE: module paths assume the `iron_learn` crate; shapes/values are
// illustrative, and `new(shape, data)` takes the shape first and the
// row-major data second (see src/tensor/mod.rs).

// Create 2x2 matrices
let a: CpuTensor<f64> = CpuTensor::new(vec![2, 2], vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let b: CpuTensor<f64> = CpuTensor::new(vec![2, 2], vec![5.0, 6.0, 7.0, 8.0]).unwrap();

println!("a =");
a.print_matrix();
println!("b =");
b.print_matrix();

// Add tensors without move (operands are borrowed)
let sum = a.add(&b).unwrap();
println!("a + b =");
sum.print_matrix();

// Subtract tensors without move
let difference = a.sub(&b).unwrap();
println!("a - b =");
difference.print_matrix();

// Multiply tensors (matrix multiplication)
let product = a.mul(&b).unwrap();
println!("a x b =");
product.print_matrix();

// Multiply tensors (element-wise multiplication)
let hadamard = a.multiply(&b).unwrap();
println!("a .* b =");
hadamard.print_matrix();

// Divide tensors (element-wise division)
let quotient = a.div(&b).unwrap();
println!("a ./ b =");
quotient.print_matrix();

// Transpose
let t = a.t().unwrap();
println!("a^T =");
t.print_matrix();
```
For detailed examples, check out the Examples section below.
How to build and run
Prerequisites:
- Rust toolchain (stable or nightly depending on local setup): https://rustup.rs
- For GPU builds: CUDA Toolkit and an NVIDIA GPU (driver + `nvcc`). If you don't plan to use GPU tensors, a CPU-only build is fine.
- Python 3.8+ for the scripts (optional); `cupy` is needed for running the `python_scripts`.
The library has two modes, CPU and CUDA. CPU is the default; CUDA is an optional feature. If you have a CUDA environment set up, you can pass the `--features=cuda` flag when building and running the code.
Build the CPU-only Rust library and examples, run the unit tests, and try the demonstration runners (examples).
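A sketch of the typical commands follows; the two runner invocations are assumptions (check the binary's `--help` output and `src/runners.rs` for the real flags):

```bash
# Build the CPU-only library and examples
cargo build

# Run unit tests
cargo test

# Get comprehensive documentation of all the CLI flags (entry point assumed)
cargo run -- --help

# Run a neural network example (runner name assumed; see src/runners.rs)
cargo run -- run_neural_net
```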
If you want to enable CUDA-backed tensors, ensure your environment has CUDA installed and visible to the linker. The workflow is the same (`cargo build --features=cuda`); the code initializes GPU devices at runtime when `init_gpu()` or `init_context()` is invoked.
Python scripts can be run directly with your Python interpreter. `python_scripts/check_cuda.py` is a small utility to detect and validate CUDA availability from Python, useful for quick GPU checks.
Examples
Use GPU Tensor
```rust
use iron_learn::init_gpu; // Only available under the `cuda` feature.
use iron_learn::init_context;
use iron_learn::GpuTensor;
use iron_learn::Tensor;

// NOTE: module paths assume the `iron_learn` crate; shapes/values are
// illustrative. GPU tensors are currently limited to single precision (f32).

// Initialize the GPU context before any tensor work.
init_gpu();

// Create 2x2 matrices
let a: GpuTensor<f32> = GpuTensor::new(vec![2, 2], vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let b: GpuTensor<f32> = GpuTensor::new(vec![2, 2], vec![5.0, 6.0, 7.0, 8.0]).unwrap();

println!("a =");
a.print_matrix();
println!("b =");
b.print_matrix();

// Add tensors without move
let sum = a.add(&b).unwrap();
println!("a + b =");
sum.print_matrix();

// Subtract tensors without move
let difference = a.sub(&b).unwrap();
println!("a - b =");
difference.print_matrix();

// Multiply tensors (matrix multiplication; uses cuBLAS when available)
let product = a.mul(&b).unwrap();
println!("a x b =");
product.print_matrix();

// Multiply tensors (element-wise multiplication)
let hadamard = a.multiply(&b).unwrap();
println!("a .* b =");
hadamard.print_matrix();

// Divide tensors (element-wise division)
let quotient = a.div(&b).unwrap();
println!("a ./ b =");
quotient.print_matrix();

// Transpose
let t = a.t().unwrap();
println!("a^T =");
t.print_matrix();
```
Use Neural Network
```rust
use iron_learn::CpuTensor;
use iron_learn::MeanSquaredErrorLoss;
use iron_learn::NeuralNet;
use iron_learn::NeuralNetBuilder;
use iron_learn::DistributionType;
use iron_learn::LayerType;
use iron_learn::Tensor;

// NOTE: module paths assume the `iron_learn` crate. The exact argument
// lists are elided here (shown as comments); see the Examples directory
// for complete, compiling programs.

let mut nn = NeuralNetBuilder::new(/* model name, etc. */);

let x: CpuTensor<f64> = CpuTensor::new(/* shape */, /* training inputs */).unwrap();
let y: CpuTensor<f64> = CpuTensor::new(/* shape */, /* training targets */).unwrap();
let monitor = /* training-progress monitor */;

// Two linear layers, each followed by an activation.
nn.add_linear(/* layer shape, DistributionType for weight init */);
nn.add_activation(/* LayerType, e.g. a sigmoid activation */);
nn.add_linear(/* layer shape, DistributionType */);
nn.add_activation(/* LayerType */);

let loss_function_instance = Box::new(MeanSquaredErrorLoss);
let mut net = nn.build(/* loss_function_instance, ... */);

net.fit(/* &x, &y, epochs, learning rate, monitor */);

let prediction = net.predict(/* &x */).unwrap();
println!("Prediction:");
prediction.print_matrix();
```
High-level Overview of the components
- Rust: core `Tensor` trait and its implementations, numeric abstractions, optimization (gradient descent), neural-network primitives, and optional CUDA-backed tensor implementations.
- CUDA: `kernels/` contains the CUDA kernels used by the Rust `cuda_tensor` and `gpu_context` modules for accelerated matrix ops.
- Python: `python_scripts/` contains helper scripts, experiments, and small neural-network examples used for prototyping and data preprocessing.
- Data/Images: example JSON metadata and image assets under `data/` and `image/`, used by demos and scripts.
Architecture
Architecture diagram generated by Gemini (AI by Google).
The diagram above illustrates the relationship between the high-level Rust API, the Core Tensor abstractions, and the dual-backend (CPU/GPU) execution model.
Quick Module Reference
Tensor API
- Purpose: the core trait `Tensor<T>` defines creation (`new`, `zeroes`, `ones`), shape/data accessors (`get_shape`, `get_data`), linear algebra (`add`, `sub`, `mul`, `t`, `multiply`, `div`, `scale`, `clip`), and reducers (`sum`). See `src/tensor/mod.rs` for the trait and basic docs.
NB: The library does not yet support broadcasting. I will soon introduce broadcasting logic.
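To make that surface area concrete, here is a hypothetical sketch of what the trait shape implies; the authoritative definition (exact signatures and error type) lives in `src/tensor/mod.rs` and may differ:

```rust
// Hypothetical sketch only; see src/tensor/mod.rs for the real trait.
pub trait Tensor<T>: Sized {
    fn new(shape: Vec<u32>, data: Vec<T>) -> Result<Self, String>;
    fn zeroes(shape: Vec<u32>) -> Result<Self, String>;
    fn ones(shape: Vec<u32>) -> Result<Self, String>;

    fn get_shape(&self) -> &[u32];
    fn get_data(&self) -> &[T];

    fn add(&self, rhs: &Self) -> Result<Self, String>;
    fn sub(&self, rhs: &Self) -> Result<Self, String>;
    fn mul(&self, rhs: &Self) -> Result<Self, String>;      // matrix multiplication
    fn multiply(&self, rhs: &Self) -> Result<Self, String>; // element-wise
    fn div(&self, rhs: &Self) -> Result<Self, String>;      // element-wise
    fn t(&self) -> Result<Self, String>;                    // transpose
    fn scale(&self, factor: T) -> Result<Self, String>;
    fn clip(&self, min: T, max: T) -> Result<Self, String>;
    fn sum(&self) -> Result<Self, String>;                  // reducer
}
```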
CpuTensor (CPU backend)
- Purpose: `CpuTensor<T>` implements `Tensor<T>` for CPU operations with plain Rust `Vec<T>` storage and a row-major layout. See `src/cpu_tensor/mod.rs`.
- Highlights: explicit shape validation, element-wise math helpers, an inner `_cpu_mul` implementation optimized for clarity with basic SIMD-friendly loops (sketched below), and a numerically stable sigmoid in `element_op`.
- Limitations: currently restricted to 1D/2D tensors.
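For intuition, a minimal sketch of a clarity-first, row-major matrix multiply in the spirit of `_cpu_mul` (this is not the library's actual implementation; the i-k-j loop order keeps the inner loop's memory accesses contiguous and SIMD-friendly):

```rust
/// Multiply an (m x k) matrix `a` by a (k x n) matrix `b`, both stored
/// row-major in flat slices, returning the (m x n) row-major result.
fn matmul_row_major(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p]; // hoisted scalar; inner loop has unit stride
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}
```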
GpuTensor (CUDA backend)
- Purpose: `GpuTensor<T>` provides a CUDA-backed `Tensor` using device buffers, kernels from `kernels/gpu_kernels.cu`, and optional cuBLAS-accelerated multiplication (`_gpu_mul_cublas`). See `src/cuda_tensor/mod.rs` and `src/gpu_context.rs` for initialization.
- Highlights: device memory pooling, kernel launches via `cust::launch!`, functions for element-wise ops (`element_op`), clipping, transpose (naive), tiled matrix multiply, and column-wise reduce. Uses `GPU_CONTEXT` to look up module/function handles.
- Notes: GPU ops require `init_gpu()` to have been called and a valid `GpuContext`. Multiplication defaults to the `cublasSgemm` path when available.
- Limitations: currently restricted to 1D/2D single-precision floating-point tensors; other data types may not produce the expected results.
GPU Context
- Purpose: `src/gpu_context.rs` exposes `init_gpu(...)` and `GPU_CONTEXT` (a global `OnceLock`). It stores the CUDA `Module`, `Stream`, `CudaMemoryPool`, and a `CublasHandle` used by `GpuTensor`.
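The global-context pattern described above can be sketched as follows; this is a hypothetical shape, while the real `GpuContext` bundles the CUDA `Module`, `Stream`, memory pool, and cuBLAS handle:

```rust
use std::sync::OnceLock;

// Placeholder for the real context, which bundles Module/Stream/pool/handle.
struct GpuContext {
    // module: Module, stream: Stream, pool: CudaMemoryPool, cublas: CublasHandle
}

static GPU_CONTEXT: OnceLock<GpuContext> = OnceLock::new();

fn init_gpu() -> &'static GpuContext {
    // The first caller pays for device/module initialization; subsequent
    // calls simply return the already-initialized context.
    GPU_CONTEXT.get_or_init(|| GpuContext {})
}
```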
Gradient Descent / Regression
- Purpose: `src/gradient_descent.rs` implements `gradient_descent`, `linear_regression`, and `logistic_regression`, plus the helpers `predict_linear` and `predict_logistic`. The functions expect `Tensor<f64>` with `TensorMath<f64>` support and follow standard batch gradient updates, with an optional logistic sigmoid. A worked sketch of the update rule follows.
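As a concrete reference for that update rule, here is a sketch on plain slices (not the library's `Tensor<f64>` API): the hypothesis is h = Xw, optionally squashed by a sigmoid for logistic regression, and the batch update is w ← w − α · Xᵀ(h − y)/m.

```rust
/// One batch gradient-descent step over rows `x` with targets `y`.
/// Illustrative only; the library's version operates on Tensor<f64>.
fn gradient_step(x: &[Vec<f64>], y: &[f64], w: &mut [f64], alpha: f64, logistic: bool) {
    let m = x.len() as f64;
    let mut grad = vec![0.0; w.len()];
    for (row, &target) in x.iter().zip(y) {
        // Hypothesis for this sample: h = row · w (plus sigmoid if logistic).
        let mut h: f64 = row.iter().zip(w.iter()).map(|(xi, wi)| xi * wi).sum();
        if logistic {
            h = 1.0 / (1.0 + (-h).exp());
        }
        let err = h - target;
        for (g, xi) in grad.iter_mut().zip(row) {
            *g += err * xi / m; // accumulate Xᵀ(h − y) / m
        }
    }
    for (wi, g) in w.iter_mut().zip(&grad) {
        *wi -= alpha * g; // w ← w − α∇
    }
}
```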
Neural Network
- Purpose: high-level NN abstraction in `src/neural_network/`. Use `NeuralNetBuilder` to add linear layers (`LinearLayer`) and activations (`ActivationLayer`), then `build()` a `NeuralNet` instance.
- Model Persistence: `ModelData` (see `src/neural_network/mod.rs`) serializes model metadata (sketched as structs below):
  - `name`: model name
  - `parameter_count`: total parameters
  - `layers`: list of `LayerData` objects (`layer_type`, `name`, `index`, `weights`, `shape`)
  - `epoch`: last saved epoch, so you can resume training where it left off
  - `saved_lr`: learning rate saved with the model. Two learning-rate schedules are currently supported: none (constant learning rate) and cosine annealing.
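For reference, that schema maps naturally onto serde-style structs. A hypothetical sketch (field types are assumptions; the real definitions live in `src/neural_network/mod.rs`):

```rust
use serde::{Deserialize, Serialize};

// Hypothetical sketch of the persisted schema; field types are assumptions.
#[derive(Serialize, Deserialize)]
struct LayerData {
    layer_type: String,
    name: String,
    index: usize,
    weights: Vec<f64>,
    shape: Vec<u32>,
}

#[derive(Serialize, Deserialize)]
struct ModelData {
    name: String,
    parameter_count: usize,
    layers: Vec<LayerData>,
    epoch: usize,  // last saved epoch, so training can resume here
    saved_lr: f64, // learning rate saved with the model
}
```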
CUDA Kernels (quick API map)
- Located at `kernels/gpu_kernels.cu`. Main exported kernels (`extern "C"`):
  - `fill_value(float *out, int n, float value)` — fill buffer
  - `vector_arithmatic(const float *a, const float *b, float *out, int n, unsigned int op)` — add/sub/mul/div
  - `clip(const float *s, float *r, int n, float min, float max)` — clip values
  - `element_op(const float *s, float *r, int n, int op, float scale)` — exp/sin/cos/tanh/sigmoid/log
  - `compare_memory(const float *a, const float *b, size_t size, int *result)` — compare arrays
  - `transpose_naive(const float *A, float *B, int M, int N)` — naive transpose
  - `matrix_mul(const float *A, const float *B, float *C, int M, int N, int K)` — tiled matmul using shared memory
  - `column_reduce(const float *inputMatrix, float *outputSums, int numRows, int numCols)` — column sums
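To illustrate how these kernels are reached from Rust, here is a hedged sketch of a `cust::launch!`-style call for `vector_arithmatic`; the buffer handling is simplified and the op-code encoding is an assumption, so treat it as a shape rather than the module's actual code:

```rust
use cust::prelude::*;

// Sketch: element-wise add of two f32 slices via the vector_arithmatic kernel.
fn add_on_gpu(module: &Module, stream: &Stream, a: &[f32], b: &[f32])
    -> cust::error::CudaResult<Vec<f32>>
{
    let n = a.len();
    let d_a = DeviceBuffer::from_slice(a)?;
    let d_b = DeviceBuffer::from_slice(b)?;
    let mut d_out = unsafe { DeviceBuffer::<f32>::uninitialized(n)? };

    let func = module.get_function("vector_arithmatic")?;
    let block = 256u32;
    let grid = (n as u32 + block - 1) / block;
    unsafe {
        launch!(func<<<grid, block, 0, stream>>>(
            d_a.as_device_ptr(),
            d_b.as_device_ptr(),
            d_out.as_device_ptr(),
            n as i32,
            0u32 // op code 0 assumed to mean "add"
        ))?;
    }
    stream.synchronize()?;

    let mut out = vec![0.0f32; n];
    d_out.copy_to(&mut out)?;
    Ok(out)
}
```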
Python scripts (where to look)
- `python_scripts/neural_net/*` — prototype Python-side neural-net builder and helpers used for experiments and model JSON creation.
- `anomaly_detection.py` — feature-wise Gaussian estimation and F1-based threshold selection (useful as a standalone script).
Data & Model JSONs
- Example dataset: see `data/image.json` (fields: `m`, `n`, `m_test`, `x`, `y`, `x_test`, `y_test`); the JSON contains flattened row-major arrays for `x`/`y`.
- Example model file: see `image/model.json`, which follows the `ModelData` schema described above. Use `read_file::deserialize_model()` to load a model into `ModelData`, and `NeuralNetBuilder::build_from_model()` to restore a runtime `NeuralNet` from `ModelData`.
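Putting the two together, a minimal load-and-restore sketch (function names from above; argument lists and error handling are assumptions):

```rust
// Sketch: load persisted ModelData, then rebuild a runtime NeuralNet from it.
// Exact signatures may differ; see read_file.rs and the builder for the truth.
let model_data = read_file::deserialize_model("image/model.json");
let net = NeuralNetBuilder::build_from_model(model_data);
```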
Repository Structure
- `src/` — Rust library and binaries
  - `tensor/` — `Tensor` trait and backend implementations (the CPU/GPU implementations live in `cpu_tensor/` and `cuda_tensor/`).
  - `neural_network/` — high-level NN builder, layers, activations, and loss functions.
  - `gradient_descent.rs` — CPU implementations of linear and logistic regression and helper routines (normalization, bias handling).
  - `gpu_context.rs`, `cuda_tensor/` — GPU initialization and device-backed tensor types (CUDA interop, cuBLAS wrappers, memory pools).
  - `read_file.rs` — helpers for loading JSON model/data artifacts.
  - `runners.rs` — small CLI-like routines for running the `run_linear`, `run_logistic`, and `run_neural_net` examples.
- `kernels/` — CUDA device code
  - `gpu_kernels.cu` — implementations for tiled matrix multiplication, element-wise ops, clipping, transpose, and reductions.
  - `gpu_kernels.ptx` — precompiled PTX shipped alongside the CUDA source.
- `python_scripts/` — Python utilities and experiments
  - Top-level scripts: `k-means.py`, `check_cuda.py`, `plot_graph.py`, etc.
  - `neural_net/` — small Python builder, activations, layers, and helpers used for rapid prototyping and educational examples.
- `data/`, `image/` — example datasets, model JSONs, and saved weights used by demos.
What each major component does
Rust: src/
- `tensor` (trait & implementations): core abstraction exposing operations such as `add`, `mul`, `transpose`, element-wise math (`sin`, `cos`, `sigmoid`, etc.), and shape/data accessors. Backends implement `TensorMath` for mathematical functions.
- `cpu_tensor/`: pure CPU tensor implementation used for most algorithms and tests; most operations are written to be auto-vectorized for parallel computation.
- `cuda_tensor/`: CUDA-backed tensor implementation with device buffers and memory pooling. Integrates with `gpu_context` and uses kernels from `kernels/`.
- `neural_network/`: provides a `NeuralNetBuilder` and a runtime `NeuralNet` type. Layers implement a `Layer<T>` trait; `LinearLayer` handles weight matrices and updates, while `ActivationLayer` applies element-wise activations (sigmoid/tanh/linear/sin). `loss_functions.rs` contains `MeanSquaredErrorLoss` and `BinaryCrossEntropy`, used for backprop.
- `gradient_descent.rs`: implements `linear_regression` and `logistic_regression`, plus the helper functions `predict_linear`, `predict_logistic`, and `gradient_descent` steps. Also contains `normalize_features`, exposed from `commons`.
- `gpu_context.rs` and `cuda_tensor/*`: manage GPU initialization, cuBLAS/cuDNN handles (where applicable), custom device buffers, and a simple memory pool to reduce allocations when transferring data.
- `runners.rs`: provides convenience runners used by the CLI entry points (`main.rs`) to invoke example training runs.
CUDA: kernels/gpu_kernels.cu
- Implements atomic-safe memory comparison, element-wise math device kernels (exp, sin, cos, a sigmoid branch), vector arithmetic, clipping, tiled matrix multiplication (shared memory, `TILE_SIZE` 16), transpose, and column reductions. These kernels are used by `cuda_tensor` for operations like `mul`, `transpose`, `col_reduce`, and element-wise transforms (sigmoid, ln, exp).
Python: python_scripts/
- `anomaly_detection.py`: implements Gaussian estimation and F1-based threshold selection for anomaly detection — a direct NumPy port of the typical ML-course algorithm (estimate a per-feature Gaussian, then pick the probability threshold that maximizes the F1 score).
- `neural_net/` folder: small Python builder, activations, layers, and helpers used for experimentation and for generating JSON `ModelData` artifacts that the Rust `read_file`/builder logic can consume.
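The estimation half of that algorithm is simple enough to restate here. A Rust sketch of feature-wise Gaussian estimation (per-column mean and population variance), hypothetical and independent of the Python script:

```rust
/// Per-feature mean and variance over the rows of `x`, as used to fit an
/// axis-aligned Gaussian for anomaly scoring.
fn estimate_gaussian(x: &[Vec<f64>]) -> (Vec<f64>, Vec<f64>) {
    let m = x.len() as f64;
    let n = x[0].len();
    let mut mu = vec![0.0; n];
    for row in x {
        for (j, v) in row.iter().enumerate() {
            mu[j] += v / m;
        }
    }
    let mut var = vec![0.0; n];
    for row in x {
        for (j, v) in row.iter().enumerate() {
            var[j] += (v - mu[j]).powi(2) / m;
        }
    }
    (mu, var)
}
```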
Where to look for functionality you might extend or inspect
- Implement new layers/activations: `src/neural_network/` (add `Layer` impls and wire them into `NeuralNetBuilder`).
- Change tensor math: the `src/tensor/` traits and the `src/cpu_tensor/` and `src/cuda_tensor/` implementations.
- Add GPU kernels: `kernels/gpu_kernels.cu` (regenerate the PTX if required), then extend `cuda_tensor` to call the new kernels.
- Add CLI runners: `src/runners.rs` and `src/main.rs`.
Notes
- The Rust `NeuralNetBuilder` supports building from scratch or restoring `ModelData` (weights loaded from JSON). Layers expose `get_parameters()` for serialization.
- `ActivationLayer` caches outputs; activations (sigmoid, tanh, sin, linear) are implemented in `activations.rs` using the `TensorMath` trait.
- `LinearLayer` performs the forward pass via `input.mul(&self.weights)` and the backward pass by computing `weights_grad = input.T * error`, then updating `self.weights = self.weights - lr * weights_grad`. A bias trick avoids the need for broadcasting (see the sketch below); broadcasting support is planned.
- `loss_functions.rs` includes a numerically stable `BinaryCrossEntropy` (clipping predictions to avoid log/divide-by-zero) and an MSE implementation.
- CUDA kernels implement a tiled matrix multiply and common element-wise ops; the kernels use shared memory and boundary checks for correctness.
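The bias trick mentioned above is worth spelling out: appending a constant 1.0 feature to every input row folds the bias into the weight matrix, so the forward pass needs no broadcast. A sketch on plain vectors (hypothetical helper, not library code):

```rust
/// Append a constant 1.0 column to each row so the last row of the weight
/// matrix acts as the bias vector, avoiding any broadcast in the forward pass.
fn append_bias_column(x: &[Vec<f64>]) -> Vec<Vec<f64>> {
    x.iter()
        .map(|row| {
            let mut r = row.clone();
            r.push(1.0); // bias input; its weight is the layer's bias term
            r
        })
        .collect()
}
```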
Quick pointers for contributors
- Run tests: `cargo test`
- Format: `cargo fmt`
- Lint: `cargo clippy` (may require installing `clippy` via rustup)
- Keep the Python examples and the Rust model importer (`read_file.rs`) in sync if you change the JSON model formats.
Next steps
- Add broadcasting support
- Axis-wise reducers
- Move to higher-dimensional tensors