oxirs-cluster 0.2.4

# GPU Acceleration Setup Guide for OxiRS Cluster

**Version:** 0.2.3
**Last Updated:** 2026-03-05

## Overview

This guide covers setup and configuration of GPU acceleration for OxiRS Cluster, enabling 10-100x performance improvements for machine learning operations, load balancing, and data compression.

## Table of Contents

1. [Hardware Requirements](#hardware-requirements)
2. [NVIDIA CUDA Setup](#nvidia-cuda-setup)
3. [Apple Metal Setup](#apple-metal-setup)
4. [Configuration](#configuration)
5. [Performance Tuning](#performance-tuning)
6. [Troubleshooting](#troubleshooting)

---

## Hardware Requirements

### NVIDIA GPUs (CUDA)

**Minimum Requirements:**
- NVIDIA GPU with Compute Capability 6.0+
- CUDA Toolkit 11.0+
- 2GB VRAM minimum (8GB+ recommended)
- CUDA-compatible driver

**Recommended:**
- NVIDIA A100, V100, or RTX 4090
- CUDA Toolkit 12.0+
- 16GB+ VRAM
- NVLink for multi-GPU setups

**Compatibility Matrix:**

| GPU Model | Compute Capability | Recommended Use Case |
|-----------|-------------------|----------------------|
| RTX 4090 | 8.9 | Development, small-scale production |
| A100 | 8.0 | Large-scale production |
| V100 | 7.0 | Production workloads |
| T4 | 7.5 | Cloud deployments |
| GTX 1080 Ti | 6.1 | Development only |

### Apple Silicon (Metal)

**Supported:**
- M1, M1 Pro, M1 Max, M1 Ultra
- M2, M2 Pro, M2 Max, M2 Ultra
- M3, M3 Pro, M3 Max
- macOS 12.0 (Monterey) or later

**Recommended:**
- M2 Ultra or M3 Max for production
- 32GB+ unified memory
- macOS 13.0 (Ventura) or later

---

## NVIDIA CUDA Setup

### Step 1: Install CUDA Toolkit

**Ubuntu/Debian:**

```bash
# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA Toolkit
sudo apt-get install -y cuda-toolkit-12-3

# Install cuDNN (optional, for neural networks)
sudo apt-get install -y libcudnn8 libcudnn8-dev
```

**RHEL/CentOS:**

```bash
# Add NVIDIA repository
sudo dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

# Install CUDA Toolkit
sudo dnf install cuda-toolkit-12-3

# Install cuDNN
sudo dnf install libcudnn8 libcudnn8-devel
```

**Verify Installation:**

```bash
nvidia-smi
nvcc --version
```

Expected output:
```
CUDA Version: 12.3
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.3   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0    46W / 400W |      0MiB / 40960MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

### Step 2: Configure Environment

Add to `~/.bashrc` or `~/.zshrc`:

```bash
export PATH=/usr/local/cuda-12.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-12.3
```

### Step 3: Build OxiRS with CUDA Support

```bash
cd /path/to/oxirs/storage/oxirs-cluster

# Build with CUDA feature
cargo build --release --features cuda

# Or for development
cargo build --features cuda
```

### Step 4: Verify GPU Acceleration

```rust
use oxirs_cluster::gpu_acceleration::GpuAcceleratedCluster;

#[tokio::main]
async fn main() -> Result<()> {
    let cluster = GpuAcceleratedCluster::new().await?;

    if let Some(backend) = cluster.gpu_backend() {
        println!("GPU Backend: {:?}", backend);
        println!("GPU Available: true");
    } else {
        println!("GPU not available, using CPU fallback");
    }

    Ok(())
}
```

---

## Apple Metal Setup

### Step 1: System Requirements

Ensure you have:
- macOS 12.0 or later
- Xcode Command Line Tools
- Rust toolchain with aarch64-apple-darwin target

### Step 2: Install Xcode Command Line Tools

```bash
xcode-select --install
```

### Step 3: Build OxiRS with Metal Support

```bash
cd /path/to/oxirs/storage/oxirs-cluster

# Build with Metal feature
cargo build --release --features metal

# For development
cargo build --features metal
```

### Step 4: Verify Metal Acceleration

```bash
# Check Metal support
system_profiler SPDisplaysDataType | grep Metal

# Expected output:
# Metal: Supported, feature set macOS GPUFamily2 v1
```

Test in code:

```rust
use oxirs_cluster::gpu_acceleration::{GpuAcceleratedCluster, GpuBackend};

#[tokio::main]
async fn main() -> Result<()> {
    let cluster = GpuAcceleratedCluster::new().await?;

    match cluster.gpu_backend() {
        Some(GpuBackend::Metal) => {
            println!("✅ Metal GPU acceleration enabled");
        }
        Some(GpuBackend::ParallelCpu) => {
            println!("⚠️  GPU not available, using parallel CPU");
        }
        _ => {
            println!("❌ No acceleration available");
        }
    }

    Ok(())
}
```

---

## Configuration

### Basic GPU Configuration

**Cargo.toml:**

```toml
[dependencies]
oxirs-cluster = { version = "0.2.3", features = ["cuda"] }  # Or "metal"

[features]
default = []
cuda = ["oxirs-cluster/cuda"]
metal = ["oxirs-cluster/metal"]
```

### Runtime Configuration

**oxirs.toml:**

```toml
[gpu]
# Enable GPU acceleration
enabled = true

# Backend selection (auto, cuda, metal, cpu)
backend = "auto"

# Memory limit per GPU (MB)
memory_limit_mb = 8192

# Batch size for GPU operations
batch_size = 1000

# Enable mixed precision (FP16/FP32)
mixed_precision = true

# Number of GPUs to use (0 = all available)
gpu_count = 0

[gpu.replica_selection]
# Enable GPU-accelerated replica selection
enabled = true

# Feature weights for scoring
latency_weight = 0.3
connection_weight = 0.2
lag_weight = 0.2
cpu_weight = 0.1
memory_weight = 0.1
success_rate_weight = 0.1

[gpu.load_forecasting]
# Enable GPU-accelerated load forecasting
enabled = true

# History size for time series
history_size = 100

# Forecast horizon
forecast_steps = 10
```

### Programmatic Configuration

```rust
use oxirs_cluster::gpu_acceleration::{
    GpuAcceleratedCluster,
    GpuConfig,
    GpuBackend,
};

let config = GpuConfig {
    backend: GpuBackend::CUDA,
    memory_limit_mb: 8192,
    batch_size: 1000,
    mixed_precision: true,
    device_id: 0,
};

let cluster = GpuAcceleratedCluster::with_config(config).await?;
```

---

## Performance Tuning

### Optimal Batch Sizes

GPU operations have overhead. Use these guidelines for batch sizing:

| Operation | Recommended Batch Size | GPU Threshold |
|-----------|----------------------|---------------|
| Replica Selection | 50-500 | 10+ replicas |
| Load Forecasting | 100-1000 | 24+ datapoints |
| Compression | 1MB-10MB | 1MB+ data |
| Hash Computation | 1000-10000 | 100+ items |

```rust
// Adaptive batching based on data size
let batch_size = if data.len() < 1000 {
    // Use CPU for small batches
    data.len()
} else {
    // Use GPU with optimal batch size
    1000.min(data.len())
};
```

### Memory Management

**Prevent Out-of-Memory Errors:**

```rust
use scirs2_core::gpu::GpuContext;

let ctx = GpuContext::new()?;

// Check available memory before allocation
let available_mb = ctx.available_memory_mb()?;
if available_mb < required_mb {
    return Err(anyhow!("Insufficient GPU memory"));
}

// Allocate with explicit size limits
let buffer = GpuBuffer::with_capacity(&ctx, max_size)?;
```

**Memory Pooling:**

```rust
use scirs2_core::memory::BufferPool;

// Reuse GPU buffers
let pool = BufferPool::new();
let buffer = pool.acquire(size)?;

// ... use buffer ...

// Automatically returned to pool when dropped
```

### Multi-GPU Setup

**Distribute workload across GPUs:**

```rust
use rayon::prelude::*;

let gpu_count = get_gpu_count()?;
let chunks = data.chunks(data.len() / gpu_count);

let results: Vec<_> = chunks
    .enumerate()
    .par_bridge()
    .map(|(gpu_id, chunk)| {
        // Process on specific GPU
        process_on_gpu(gpu_id, chunk)
    })
    .collect();
```

### Benchmark Your Setup

```bash
# Run GPU benchmarks
cargo bench --features cuda --bench cluster_benchmarks

# Specific benchmark
cargo bench --features cuda gpu_replica_selection
```

Expected performance (NVIDIA A100):

```
GPU replica selection/10 replicas   time: [45.2 µs 46.1 µs 47.3 µs]
GPU replica selection/100 replicas  time: [82.5 µs 84.2 µs 86.1 µs]
GPU replica selection/500 replicas  time: [201 µs 205 µs 210 µs]

CPU replica selection/10 replicas   time: [156 µs 159 µs 163 µs]
CPU replica selection/100 replicas  time: [1.85 ms 1.89 ms 1.94 ms]
CPU replica selection/500 replicas  time: [9.12 ms 9.34 ms 9.58 ms]
```

Speedup: **10-46x** for replica selection

---

## Troubleshooting

### Issue: "CUDA driver version is insufficient"

**Cause:** Outdated NVIDIA driver

**Solution:**
```bash
# Check required version
cat /usr/local/cuda/version.txt

# Update driver (Ubuntu)
sudo apt-get install --reinstall nvidia-driver-535

# Reboot
sudo reboot
```

### Issue: "No CUDA-capable device detected"

**Diagnostic:**
```bash
nvidia-smi
lspci | grep -i nvidia
```

**Solutions:**
1. Verify GPU is properly seated
2. Check BIOS settings (PCIe config)
3. Install/reinstall NVIDIA driver
4. Verify GPU is not in compute-prohibited mode

### Issue: "Out of memory" errors

**Solutions:**

1. **Reduce batch size:**
```rust
let config = GpuConfig {
    batch_size: 500,  // Reduce from 1000
    ..Default::default()
};
```

2. **Enable memory pooling:**
```rust
use scirs2_core::memory::GlobalBufferPool;

let pool = GlobalBufferPool::new();
// Buffers automatically reused
```

3. **Clear GPU memory explicitly:**
```rust
cluster.clear_gpu_cache().await?;
```

### Issue: "Kernel launch failed"

**Cause:** Invalid grid/block dimensions or corrupted kernel

**Solutions:**
1. Verify CUDA installation: `nvcc --version`
2. Rebuild with clean: `cargo clean && cargo build --features cuda`
3. Check GPU compute capability matches code
4. Verify GPU is not overheating: `nvidia-smi -q -d TEMPERATURE`

### Issue: Metal acceleration not working on macOS

**Diagnostic:**
```bash
# Check Metal support
system_profiler SPDisplaysDataType | grep Metal

# Check for errors
log show --predicate 'process == "oxirs-cluster"' --last 5m | grep -i metal
```

**Solutions:**
1. Update to latest macOS: `softwareupdate -l`
2. Update Xcode Command Line Tools: `xcode-select --install`
3. Rebuild: `cargo clean && cargo build --features metal`

### Issue: Performance is worse with GPU

**Causes:**
- Batch size too small (overhead dominates)
- Data transfer bottleneck
- CPU-bound operations mixed with GPU

**Solutions:**

1. **Increase batch size:**
```rust
// Only use GPU for large batches
if replicas.len() >= 50 {
    select_replica_gpu(replicas)
} else {
    select_replica_cpu(replicas)
}
```

2. **Profile to find bottleneck:**
```rust
use scirs2_core::profiling::Profiler;

let mut profiler = Profiler::new();
profiler.start("gpu_transfer");
// ... transfer data to GPU ...
profiler.stop("gpu_transfer");

profiler.start("gpu_compute");
// ... GPU computation ...
profiler.stop("gpu_compute");

println!("{:#?}", profiler.generate_report());
```

3. **Use pinned memory for transfers:**
```rust
use scirs2_core::memory_efficient::PinnedMemory;

let pinned = PinnedMemory::new(size)?;
// Faster host-to-device transfers
```

---

## Performance Comparison

### Replica Selection (500 replicas)

| Hardware | Time | Speedup |
|----------|------|---------|
| CPU (Intel Xeon) | 9.34 ms | 1x |
| CPU (Apple M2 Max) | 6.12 ms | 1.5x |
| GPU (NVIDIA T4) | 1.2 ms | 7.8x |
| GPU (NVIDIA A100) | 205 µs | 45.6x |

### Load Forecasting (500 datapoints)

| Hardware | Time | Speedup |
|----------|------|---------|
| CPU | 45.2 ms | 1x |
| GPU (T4) | 8.5 ms | 5.3x |
| GPU (A100) | 3.2 ms | 14.1x |

### Compression (10MB data)

| Hardware | Time | Speedup |
|----------|------|---------|
| CPU (single-threaded) | 125 ms | 1x |
| CPU (8 threads) | 45 ms | 2.8x |
| GPU (T4) | 18 ms | 6.9x |
| GPU (A100) | 9 ms | 13.9x |

---

## Best Practices

### 1. Automatic Fallback

```rust
let cluster = GpuAcceleratedCluster::new().await?;

// Automatically falls back to CPU if GPU unavailable
let result = cluster.select_best_replica(replicas).await?;
```

### 2. Batch Operations

```rust
// ❌ Bad: Individual operations
for replica in replicas {
    process_single(replica).await?;
}

// ✅ Good: Batch processing
process_batch(replicas).await?;
```

### 3. Profile Before Optimizing

```bash
# Benchmark CPU vs GPU
cargo bench --features cuda

# Profile with perf (Linux)
perf record -g cargo run --release --features cuda
perf report
```

### 4. Monitor GPU Utilization

```bash
# Real-time monitoring
watch -n 1 nvidia-smi

# Log to file
nvidia-smi dmon -s mu -o T -f gpu_usage.log
```

### 5. Graceful Degradation

```rust
let gpu_available = check_gpu_availability();

let config = if gpu_available {
    ClusterConfig::with_gpu()
} else {
    ClusterConfig::with_parallel_cpu()
};
```

---

## Additional Resources

- [NVIDIA CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/)
- [Apple Metal Programming Guide](https://developer.apple.com/metal/)
- [SciRS2 GPU Integration Guide](./SCIRS2_INTEGRATION_GUIDE.md)
- [OxiRS Performance Tuning](./PERFORMANCE_TUNING.md)

---

**Last Updated:** 2026-01-06
**Maintainer:** OxiRS Team