unsloth-rs 1.0.1

Rust implementations of transformer building blocks for LLM inference
Documentation
# GPU Development Setup Guide

**Last Updated**: 2026-01-10  
**Target GPU**: NVIDIA GeForce RTX 5080 (Compute Capability 12.0)

## Current Status

✅ **GPU Available**: RTX 5080 detected  
✅ **NVIDIA Driver**: 590.48.01 installed  
❌ **CUDA Toolkit**: Not installed (required for compilation)

## Prerequisites

### Check GPU Status

```bash
nvidia-smi
# Should show: NVIDIA GeForce RTX 5080, 16303 MiB
```

### Check Driver Version

```bash
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Should show: 590.48.01 or newer
```

## Installing CUDA Toolkit

The CUDA toolkit is required to compile GPU kernels. It includes `nvcc` (NVIDIA C++ Compiler) and CUDA libraries.

### Option 1: System Package Manager (Recommended for Ubuntu/Debian)

```bash
# Add NVIDIA package repository (if not already added)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA Toolkit 12.x (matches RTX 5080 Compute Capability 12.0)
sudo apt-get install -y cuda-toolkit-12-6

# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify installation
nvcc --version
```

### Option 2: NVIDIA Runfile Installer

```bash
# Download CUDA 12.6 runfile
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_535.183.01_linux.run

# Install (driver already installed, so skip driver)
sudo sh cuda_12.6.0_535.183.01_linux.run --toolkit --silent

# Add to PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

### Option 3: Conda Environment

```bash
conda create -n cuda-env python=3.11
conda activate cuda-env
conda install -c nvidia cuda-toolkit=12.6
```

## Verifying Installation

After installing CUDA toolkit:

```bash
# Check nvcc is available
nvcc --version

# Should show something like:
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2024 NVIDIA Corporation
# Build cuda_12.6.r12.6/...

# Check CUDA libraries
ls /usr/local/cuda/lib64/libcudart.so*

# Test compilation
cd ~/Documents/projects/rust-ai/unsloth-rs
cargo build --features cuda
```

## Building with CUDA Support

Once CUDA toolkit is installed:

```bash
# Build with CUDA feature
cargo build --features cuda

# Run tests with CUDA
cargo test --features cuda

# Run benchmarks with CUDA
cargo bench --features cuda

# Profile with CUDA
./scripts/profile-gpu.sh all
```

## Troubleshooting

### Error: `nvcc` not found

**Problem**: CUDA toolkit not installed or not in PATH

**Solution**:
```bash
# Check if nvcc exists
which nvcc

# If not found, add to PATH
export PATH=/usr/local/cuda/bin:$PATH
```

### Error: Cannot find `-lcudart`

**Problem**: CUDA libraries not in library path

**Solution**:
```bash
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

### Error: Compute capability not supported

**Problem**: Installed CUDA version too old for RTX 5080

**Solution**: Install CUDA 12.0 or newer

### Error: Out of memory

**Problem**: Insufficient VRAM for operation

**Solution**:
- Reduce batch size in tests/benchmarks
- Use smaller sequence lengths
- Enable gradient checkpointing

## Environment Variables

For convenience, add these to `~/.bashrc` or `~/.zshrc`:

```bash
# CUDA Toolkit paths
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# Optional: Specify CUDA version for cudarc
export CUDARC_CUDA_VERSION=12.6

# Optional: Enable CUDA debugging
export CUDA_LAUNCH_BLOCKING=1  # For debugging only (slower)
export RUST_BACKTRACE=1
```

## Next Steps After Setup

Once CUDA toolkit is installed:

1. **Run GPU Tests**:
   ```bash
   cargo test --features cuda
   ```

2. **Run GPU Profiling**:
   ```bash
   ./scripts/profile-gpu.sh flash-attention
   ```

3. **Compare CPU vs GPU**:
   ```bash
   ./scripts/profile-gpu.sh compare
   ```

4. **Full Profiling Suite**:
   ```bash
   ./scripts/profile-gpu.sh all
   ```

## Disk Space Requirements

- CUDA Toolkit 12.6: ~3-4 GB
- Rust build artifacts (with CUDA): ~5-10 GB
- Total recommended: 20 GB free space

## Performance Monitoring Tools

### NVIDIA Nsight Systems (Optional but Recommended)

For detailed kernel profiling:

```bash
# Download from NVIDIA website
wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2024_6/nsight-systems-2024.6.1_2024.6.1.92-1_amd64.deb

# Install
sudo dpkg -i nsight-systems-*.deb

# Profile with nsys
./scripts/profile-gpu.sh nsys

# View results
nsys-ui target/profiling/nsys_profile.nsys-rep
```

### NVIDIA Nsight Compute (Optional)

For detailed kernel analysis:

```bash
# Download from NVIDIA website
wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-compute/2024_4/ncu_2024.4.0_linux.run

# Install
sudo sh ncu_2024.4.0_linux.run --silent

# Profile specific kernel
ncu --target-processes all cargo bench --features cuda
```

## CI/CD Considerations

GitHub Actions runners do not have GPU access. Per project design:
- GPU tests/benchmarks run **locally only**
- CI runs CPU-only tests
- Use `./scripts/local-build.sh cuda` for local GPU builds
- Results documented in BENCHMARKING.md

## References

- [CUDA Installation Guide]https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
- [RTX 5080 Specifications]https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/
- [CubeCL Documentation]https://github.com/tracel-ai/cubecl
- [Candle CUDA Backend]https://github.com/huggingface/candle/tree/main/candle-cuda

---

**Status**: Driver installed, toolkit installation needed  
**Next**: Install CUDA toolkit, then run GPU profiling