# Volta ⚡

A PyTorch-like deep learning framework in pure Rust.
Volta is a minimal deep learning and automatic differentiation library built from scratch in pure Rust, heavily inspired by PyTorch. It provides a dynamic computation graph, NumPy-style broadcasting, and common neural network primitives.
This project is an educational endeavor to demystify the inner workings of modern autograd engines. It prioritizes correctness, clarity, and a clean API over raw performance, while still providing hooks for hardware acceleration.
## Key Features
- Dynamic Computation Graph: Build and backpropagate through graphs on the fly, just like PyTorch.
- Reverse-Mode Autodiff: Efficient reverse-mode automatic differentiation with topological sorting.
- Rich Tensor Operations: A comprehensive set of unary, binary, reduction, and matrix operations via an ergonomic `TensorOps` trait.
- Broadcasting: Full NumPy-style broadcasting support for arithmetic operations.
- Neural Network Layers: `Linear`, `Conv2d`, `ConvTranspose2d`, `MaxPool2d`, `Embedding`, `LSTMCell`, `PixelShuffle`, `Flatten`, `ReLU`, `Sigmoid`, `Tanh`, `Dropout`, `BatchNorm1d`, `BatchNorm2d`.
- Optimizers: `SGD` (momentum + weight decay), `Adam` (bias-corrected + weight decay), and the experimental `Muon`.
- External Model Loading: Load weights from PyTorch, HuggingFace, and other frameworks via `StateDictMapper`, with automatic weight transposition and key remapping. Supports the SafeTensors format.
- Named Layers: Human-readable state dict keys with the `Sequential::builder()` pattern for robust serialization.
- Multi-dtype Support: Initial support for f16, bf16, f32, f64, i32, i64, u8, and bool tensors.
- IO System: Save and load model weights (state dicts) via `bincode` or the SafeTensors format.
- BLAS Acceleration (macOS): Optional acceleration for matrix multiplication via Apple's Accelerate framework.
- GPU Acceleration: Experimental WGPU-based GPU support for core tensor operations (elementwise, matmul, reductions, movement ops) with automatic backward pass on GPU.
- Validation-Focused: Includes a robust numerical gradient checker to ensure the correctness of all implemented operations.
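To make the broadcasting rule above concrete, here is a dependency-free sketch of NumPy-style shape resolution. The function name `broadcast_shape` is illustrative and not part of Volta's API:

```rust
/// Compute the broadcast result shape of two shapes, NumPy-style:
/// align shapes from the trailing dimension; each pair of dims must be
/// equal or one of them must be 1 (missing leading dims count as 1).
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let ndim = a.len().max(b.len());
    let mut out = vec![0; ndim];
    for i in 0..ndim {
        // Walk both shapes from the back; absent dims behave like size 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        out[ndim - 1 - i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible shapes
        };
    }
    Some(out)
}

fn main() {
    // (3, 1) with (1, 4) broadcasts to (3, 4); (2, 3) with (4,) is incompatible.
    assert_eq!(broadcast_shape(&[3, 1], &[1, 4]), Some(vec![3, 4]));
    assert_eq!(broadcast_shape(&[8, 1, 6, 1], &[7, 1, 5]), Some(vec![8, 7, 6, 5]));
    assert_eq!(broadcast_shape(&[2, 3], &[4]), None);
    println!("broadcasting rules ok");
}
```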
## Project Status
This library is functional for training MLPs, CNNs, RNNs, GANs, VAEs, and other architectures on CPU. It features a verified autograd engine and correctly implemented im2col convolutions.
### ✅ What's Working
- Core Autograd: All operations verified with numerical gradient checking
- Layers: Linear, Conv2d, ConvTranspose2d, MaxPool2d, Embedding, LSTMCell, PixelShuffle, BatchNorm1d/2d, Dropout
- Optimizers: SGD (with momentum), Adam, Muon
- External Loading: PyTorch/HuggingFace model weights via SafeTensors with automatic transposition
- Named Layers: Robust serialization with human-readable state dict keys
- Loss Functions: MSE, Cross-Entropy, NLL, BCE, KL Divergence
- Examples: MNIST, CIFAR, character LM, VAE, DCGAN, super-resolution, LSTM time series
- GPU Training Pipeline: GPU-accelerated forward pass for Conv2d with device-aware layers and GPU optimizer state storage
- Benchmarking Suite: Comprehensive Criterion benchmarks with 3 categories (tensor_ops, neural_networks, gpu_comparison) and HTML reports
- Enhanced GPU Safety: GPU buffer pooling, command queue throttling, CPU cache invalidation, and early warning system
- Code Quality: All `indexing_slicing` clippy errors resolved; pedantic lints reduced from roughly 400 to ~223 remaining
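The numerical gradient checking mentioned above boils down to comparing analytic gradients against central differences. A generic one-variable illustration (not Volta's actual checker):

```rust
/// Central-difference numerical gradient: (f(x+h) - f(x-h)) / (2h),
/// accurate to O(h^2); used to cross-check analytic gradients.
fn numerical_grad(f: impl Fn(f64) -> f64, x: f64) -> f64 {
    let h = 1e-5;
    (f(x + h) - f(x - h)) / (2.0 * h)
}

fn main() {
    // f(x) = x^3 has analytic gradient f'(x) = 3x^2; check at x = 2 (f'(2) = 12).
    let f = |x: f64| x * x * x;
    let analytic = 3.0 * 2.0_f64.powi(2);
    let numeric = numerical_grad(f, 2.0);
    assert!((numeric - analytic).abs() < 1e-6);
    println!("analytic = {analytic}, numeric = {numeric:.6}");
}
```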
### ⚠️ What's in Progress
- Performance: Comprehensive benchmarking suite for performance tracking via `just bench` commands
- GPU Support: Experimental WGPU-based acceleration via the `gpu` feature:
  - ✅ Core ops on GPU: elementwise (unary/binary), matmul, reductions (sum/max/mean), movement ops (permute/expand/pad/shrink/stride)
  - ✅ GPU backward pass for autograd with lazy CPU↔GPU transfers
  - ✅ GPU-accelerated forward pass implemented for Conv2d
  - ⚠️ Neural network layer backward passes still being ported to GPU
  - ⚠️ Broadcasting preprocessing happens on CPU before GPU dispatch
### ❌ What's Missing
- Production-ready GPU integration, distributed training, learning-rate schedulers, attention/transformer layers
## Installation
Add Volta to your `Cargo.toml`:

```toml
[dependencies]
volta = "0.3.0"
```
### Enabling BLAS on macOS
For a significant performance boost in matrix multiplication on macOS, enable the accelerate feature:
```toml
[dependencies]
volta = { version = "0.3.0", features = ["accelerate"] }
```
### Enabling GPU Support
For experimental GPU acceleration via WGPU, enable the gpu feature:
```toml
[dependencies]
volta = { version = "0.3.0", features = ["gpu"] }
```
Or combine both for maximum performance:
```toml
[dependencies]
volta = { version = "0.3.0", features = ["accelerate", "gpu"] }
```
## Examples
### Training an MLP
Here's how to define a simple Multi-Layer Perceptron (MLP) with named layers, train it on synthetic data, and save the model.
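As a framework-free illustration of what such a training loop computes, here is a dependency-free sketch that fits a single linear neuron `y = w*x + b` to synthetic data using full-batch gradient descent with manually derived MSE gradients. All names are illustrative and this is not the Volta API:

```rust
/// Fit y = w*x + b to data generated from y = 2x + 1 using full-batch
/// gradient descent with manually derived MSE gradients.
fn fit() -> (f32, f32) {
    // Synthetic data for the target function y = 2x + 1.
    let xs: Vec<f32> = (0..20).map(|i| i as f32 / 10.0).collect();
    let ys: Vec<f32> = xs.iter().map(|x| 2.0 * x + 1.0).collect();

    let (mut w, mut b) = (0.0_f32, 0.0_f32);
    let lr = 0.1;

    for _epoch in 0..500 {
        let (mut gw, mut gb) = (0.0, 0.0);
        for (&x, &y) in xs.iter().zip(&ys) {
            let err = w * x + b - y;               // prediction error
            gw += 2.0 * err * x / xs.len() as f32; // d(MSE)/dw
            gb += 2.0 * err / xs.len() as f32;     // d(MSE)/db
        }
        w -= lr * gw; // gradient descent step
        b -= lr * gb;
    }
    (w, b)
}

fn main() {
    let (w, b) = fit();
    // Should converge close to the true parameters w = 2, b = 1.
    assert!((w - 2.0).abs() < 0.05 && (b - 1.0).abs() < 0.05);
    println!("w = {w:.3}, b = {b:.3}");
}
```

In Volta, the same loop is expressed with `Tensor` operations, a `Module` for the model, and an optimizer's `step()` replacing the manual updates.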
### LeNet-style CNN training on CPU
The following example uses the current API to define a training-ready CNN.
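As background for what a `Conv2d` layer computes, here is a dependency-free sketch of a valid 2D cross-correlation (what most deep learning frameworks call "convolution"), single channel, stride 1, no padding. The function name is illustrative:

```rust
/// Valid 2D cross-correlation, single channel, stride 1, no padding.
/// Output size is (H - kH + 1) x (W - kW + 1).
fn conv2d_valid(input: &[Vec<f32>], kernel: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (h, w) = (input.len(), input[0].len());
    let (kh, kw) = (kernel.len(), kernel[0].len());
    let mut out = vec![vec![0.0; w - kw + 1]; h - kh + 1];
    for i in 0..=h - kh {
        for j in 0..=w - kw {
            // Dot product of the kernel with the window anchored at (i, j).
            let mut acc = 0.0;
            for a in 0..kh {
                for b in 0..kw {
                    acc += input[i + a][j + b] * kernel[a][b];
                }
            }
            out[i][j] = acc;
        }
    }
    out
}

fn main() {
    // A 3x3 input with a 2x2 summing kernel yields a 2x2 output.
    let input = vec![
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
        vec![7.0, 8.0, 9.0],
    ];
    let kernel = vec![vec![1.0, 1.0], vec![1.0, 1.0]];
    let out = conv2d_valid(&input, &kernel);
    assert_eq!(out, vec![vec![12.0, 16.0], vec![24.0, 28.0]]);
    println!("{out:?}");
}
```

Volta's `Conv2d` generalizes this to batches, multiple channels, strides, and padding via the im2col transformation.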
### Loading External PyTorch Models
Volta can load weights from PyTorch, HuggingFace, and other frameworks using SafeTensors format with automatic weight mapping and transposition.
Weight Mapping Features:

- `rename(from, to)` - Rename individual keys
- `rename_prefix(old, new)` - Rename all keys with a prefix
- `strip_prefix(prefix)` - Remove a prefix from keys
- `transpose(key)` - Transpose 2D weight matrices (PyTorch compatibility)
- `transpose_pattern(pattern)` - Transpose all matching keys
- `select_keys(keys)` / `exclude_keys(keys)` - Filter the state dict
See `examples/load_external_mnist.rs` for a complete end-to-end example with validation.
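Conceptually, these mappings are pure transformations on a key-to-tensor dictionary. A minimal sketch of two of them, independent of Volta's `StateDictMapper` (function names and layouts are illustrative):

```rust
use std::collections::BTreeMap;

/// Remove a prefix from every key in a state dict (keys without the
/// prefix pass through unchanged).
fn strip_prefix(sd: BTreeMap<String, Vec<f32>>, prefix: &str) -> BTreeMap<String, Vec<f32>> {
    sd.into_iter()
        .map(|(k, v)| (k.strip_prefix(prefix).unwrap_or(&k).to_string(), v))
        .collect()
}

/// Transpose a row-major (rows x cols) matrix, as needed when converting a
/// PyTorch `nn.Linear` weight (out x in) to an (in x out) layout.
fn transpose(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0; data.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = data[r * cols + c];
        }
    }
    out
}

fn main() {
    let mut sd = BTreeMap::new();
    sd.insert("model.fc1.weight".to_string(), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);
    let sd = strip_prefix(sd, "model.");
    assert!(sd.contains_key("fc1.weight"));

    // A 2x3 matrix becomes 3x2 after transposition.
    let t = transpose(&sd["fc1.weight"], 2, 3);
    assert_eq!(t, vec![1.0, 4.0, 2.0, 5.0, 3.0, 6.0]);
    println!("remapping ok");
}
```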
### GPU Acceleration Example
## API Overview
The library is designed around a few core concepts:
- `Tensor`: The central data structure, an `Rc<RefCell<RawTensor>>`, which holds data, shape, gradient information, and device location. Supports multiple data types (f32, f16, bf16, f64, i32, i64, u8, bool).
- `TensorOps`: A trait implemented for `Tensor` that provides the ergonomic, user-facing API for all operations (e.g., `tensor.add(&other)`, `tensor.matmul(&weights)`).
- `nn::Module`: A trait for building neural network layers and composing them into larger models. Provides `forward()`, `parameters()`, `state_dict()`, `load_state_dict()`, and `to_device()` methods.
- `Sequential::builder()`: Builder pattern for composing layers with named parameters for robust serialization. Supports both `add_named()` for human-readable state dict keys and `add_unnamed()` for activation layers.
- Optimizers (`Adam`, `SGD`, `Muon`): Structures that take a list of model parameters and update their weights based on computed gradients during `step()`.
- `Device`: Abstraction for CPU/GPU compute. Tensors can be moved between devices with `to_device()`, and operations automatically dispatch to GPU kernels when available.
- External Model Loading: `StateDictMapper` provides transformations (rename, transpose, prefix handling) to load weights from PyTorch, HuggingFace, and other frameworks via the SafeTensors format.
- Vision Support: `Conv2d`, `ConvTranspose2d` (for GANs/VAEs), `MaxPool2d`, `PixelShuffle` (for super-resolution), `BatchNorm1d/2d`, and `Dropout`.
- Sequence Support: `Embedding` layers for discrete inputs, `LSTMCell` for recurrent architectures.
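To make the autograd flow concrete, here is a minimal scalar tape-based reverse-mode sketch. It is illustrative only: Volta's engine operates on tensors and stores nodes as `Rc<RefCell<RawTensor>>`, but the backward traversal follows the same topological-order principle:

```rust
// Minimal tape-based reverse-mode autodiff sketch (scalars only).
#[derive(Clone, Copy)]
enum Op {
    Leaf,
    Add(usize, usize),
    Mul(usize, usize),
}

struct Tape {
    vals: Vec<f64>,
    ops: Vec<Op>,
}

impl Tape {
    fn leaf(&mut self, v: f64) -> usize {
        self.vals.push(v);
        self.ops.push(Op::Leaf);
        self.vals.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.vals.push(self.vals[a] + self.vals[b]);
        self.ops.push(Op::Add(a, b));
        self.vals.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        self.vals.push(self.vals[a] * self.vals[b]);
        self.ops.push(Op::Mul(a, b));
        self.vals.len() - 1
    }
    /// Backward pass: tape order is already a topological order, so walking
    /// it in reverse propagates each node's gradient to its inputs.
    fn backward(&self, root: usize) -> Vec<f64> {
        let mut grad = vec![0.0; self.vals.len()];
        grad[root] = 1.0;
        for i in (0..=root).rev() {
            match self.ops[i] {
                Op::Leaf => {}
                Op::Add(a, b) => {
                    grad[a] += grad[i];
                    grad[b] += grad[i];
                }
                Op::Mul(a, b) => {
                    grad[a] += grad[i] * self.vals[b];
                    grad[b] += grad[i] * self.vals[a];
                }
            }
        }
        grad
    }
}

fn main() {
    // z = (x + y) * x with x = 2, y = 3  =>  dz/dx = 2x + y = 7, dz/dy = x = 2
    let mut t = Tape { vals: vec![], ops: vec![] };
    let x = t.leaf(2.0);
    let y = t.leaf(3.0);
    let s = t.add(x, y);
    let z = t.mul(s, x);
    let g = t.backward(z);
    assert_eq!(g[x], 7.0);
    assert_eq!(g[y], 2.0);
    println!("dz/dx = {}, dz/dy = {}", g[x], g[y]);
}
```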
## Running the Test Suite
Volta has an extensive test suite that validates the correctness of every operation and its gradient. To run the tests:

```sh
cargo test
```

To run tests with BLAS acceleration enabled (on macOS):

```sh
cargo test --features accelerate
```

To run tests with GPU support:

```sh
cargo test --features gpu
```

To run a specific test category, pass its name as a filter to `cargo test`.
## Available Examples
The examples/ directory contains complete working examples demonstrating various capabilities:
```sh
# Basic examples
# Vision tasks
# Generative models
# Sequence models
# External model loading
# GPU acceleration
# Regression
```
## Roadmap
The next major steps for Volta are focused on expanding its capabilities to handle more complex models and improving performance.
- Complete GPU Integration: Port remaining neural network layers (Linear, Conv2d) to GPU, optimize GEMM kernels with shared memory tiling.
- Performance Optimization: Implement SIMD for element-wise operations, optimize broadcasting on GPU, kernel fusion for composite operations.
- Transformer Support: Add attention mechanisms, positional encodings, layer normalization.
- Learning Rate Schedulers: Cosine annealing, step decay, warmup schedules.
## Outstanding Issues
- Conv2d Memory Inefficiency: The `im2col` implementation in `src/nn/layers/conv.rs` materializes the entire patch matrix in memory. Large batch sizes or high-resolution images can easily OOM even on high-end machines.
- GPU Kernel Efficiency: The current GPU matmul is a naive implementation without shared-memory tiling; optimized GEMM kernels would yield significant performance gains.
- Multi-dtype Completeness: While storage supports multiple dtypes (f16, bf16, f64, etc.), most operations still assume f32. Full dtype support requires operation kernels for each type.
- Single-threaded: Uses `Rc<RefCell>` instead of `Arc<Mutex>`, limiting CPU execution to a single thread.
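The im2col memory blow-up is easy to quantify: each output position stores a full kernel-sized patch, so the buffer grows roughly `k*k`-fold over the input. A sketch of the arithmetic (function name illustrative):

```rust
/// Element count of an im2col buffer for a batched, square-kernel conv:
/// each of the (batch * out_h * out_w) output positions stores a full
/// (in_channels * k * k) patch.
fn im2col_elems(batch: usize, in_c: usize, h: usize, w: usize, k: usize, stride: usize) -> usize {
    let out_h = (h - k) / stride + 1;
    let out_w = (w - k) / stride + 1;
    batch * out_h * out_w * in_c * k * k
}

fn main() {
    // A batch of 64 RGB 224x224 images with a 3x3, stride-1 kernel:
    let elems = im2col_elems(64, 3, 224, 224, 3, 1);
    let mib = elems * 4 / (1024 * 1024); // f32 = 4 bytes per element
    assert_eq!(elems, 64 * 222 * 222 * 27);
    println!("im2col buffer: {elems} f32 elements (~{mib} MiB)");
}
```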
## Contributing
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
## License
This project is licensed under the MIT License - see the LICENSE file for details.