# GPU Acceleration Setup Guide for OxiRS Cluster
**Version:** 0.2.3
**Last Updated:** 2026-03-05
## Overview
This guide covers setup and configuration of GPU acceleration for OxiRS Cluster, enabling 10-100x performance improvements for machine learning operations, load balancing, and data compression.
## Table of Contents
1. [Hardware Requirements](#hardware-requirements)
2. [NVIDIA CUDA Setup](#nvidia-cuda-setup)
3. [Apple Metal Setup](#apple-metal-setup)
4. [Configuration](#configuration)
5. [Performance Tuning](#performance-tuning)
6. [Troubleshooting](#troubleshooting)
---
## Hardware Requirements
### NVIDIA GPUs (CUDA)
**Minimum Requirements:**
- NVIDIA GPU with Compute Capability 6.0+
- CUDA Toolkit 11.0+
- 2GB VRAM minimum (8GB+ recommended)
- CUDA-compatible driver
**Recommended:**
- NVIDIA A100, V100, or RTX 4090
- CUDA Toolkit 12.0+
- 16GB+ VRAM
- NVLink for multi-GPU setups
**Compatibility Matrix:**
| RTX 4090 | 8.9 | Development, small-scale production |
| A100 | 8.0 | Large-scale production |
| V100 | 7.0 | Production workloads |
| T4 | 7.5 | Cloud deployments |
| GTX 1080 Ti | 6.1 | Development only |
### Apple Silicon (Metal)
**Supported:**
- M1, M1 Pro, M1 Max, M1 Ultra
- M2, M2 Pro, M2 Max, M2 Ultra
- M3, M3 Pro, M3 Max
- macOS 12.0 (Monterey) or later
**Recommended:**
- M2 Ultra or M3 Max for production
- 32GB+ unified memory
- macOS 13.0 (Ventura) or later
---
## NVIDIA CUDA Setup
### Step 1: Install CUDA Toolkit
**Ubuntu/Debian:**
```bash
# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install CUDA Toolkit
sudo apt-get install -y cuda-toolkit-12-3
# Install cuDNN (optional, for neural networks)
sudo apt-get install -y libcudnn8 libcudnn8-dev
```
**RHEL/CentOS:**
```bash
# Add NVIDIA repository
sudo dnf config-manager --add-repo \
https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
# Install CUDA Toolkit
sudo dnf install cuda-toolkit-12-3
# Install cuDNN
sudo dnf install libcudnn8 libcudnn8-devel
```
**Verify Installation:**
```bash
nvidia-smi
nvcc --version
```
Expected output:
```
CUDA Version: 12.3
+-----------------------------------------------------------------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P0 46W / 400W | 0MiB / 40960MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
```
### Step 2: Configure Environment
Add to `~/.bashrc` or `~/.zshrc`:
```bash
export PATH=/usr/local/cuda-12.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-12.3
```
### Step 3: Build OxiRS with CUDA Support
```bash
cd /path/to/oxirs/storage/oxirs-cluster
# Build with CUDA feature
cargo build --release --features cuda
# Or for development
cargo build --features cuda
```
### Step 4: Verify GPU Acceleration
```rust
use oxirs_cluster::gpu_acceleration::GpuAcceleratedCluster;
#[tokio::main]
async fn main() -> Result<()> {
let cluster = GpuAcceleratedCluster::new().await?;
if let Some(backend) = cluster.gpu_backend() {
println!("GPU Backend: {:?}", backend);
println!("GPU Available: true");
} else {
println!("GPU not available, using CPU fallback");
}
Ok(())
}
```
---
## Apple Metal Setup
### Step 1: System Requirements
Ensure you have:
- macOS 12.0 or later
- Xcode Command Line Tools
- Rust toolchain with aarch64-apple-darwin target
### Step 2: Install Xcode Command Line Tools
```bash
xcode-select --install
```
### Step 3: Build OxiRS with Metal Support
```bash
cd /path/to/oxirs/storage/oxirs-cluster
# Build with Metal feature
cargo build --release --features metal
# For development
cargo build --features metal
```
### Step 4: Verify Metal Acceleration
```bash
# Check Metal support
system_profiler SPDisplaysDataType | grep Metal
# Expected output:
# Metal: Supported, feature set macOS GPUFamily2 v1
```
Test in code:
```rust
use oxirs_cluster::gpu_acceleration::{GpuAcceleratedCluster, GpuBackend};
#[tokio::main]
async fn main() -> Result<()> {
let cluster = GpuAcceleratedCluster::new().await?;
match cluster.gpu_backend() {
Some(GpuBackend::Metal) => {
println!("✅ Metal GPU acceleration enabled");
}
Some(GpuBackend::ParallelCpu) => {
println!("⚠️ GPU not available, using parallel CPU");
}
_ => {
println!("❌ No acceleration available");
}
}
Ok(())
}
```
---
## Configuration
### Basic GPU Configuration
**Cargo.toml:**
```toml
[dependencies]
oxirs-cluster = { version = "0.2.3", features = ["cuda"] } # Or "metal"
[features]
default = []
cuda = ["oxirs-cluster/cuda"]
metal = ["oxirs-cluster/metal"]
```
### Runtime Configuration
**oxirs.toml:**
```toml
[gpu]
# Enable GPU acceleration
enabled = true
# Backend selection (auto, cuda, metal, cpu)
backend = "auto"
# Memory limit per GPU (MB)
memory_limit_mb = 8192
# Batch size for GPU operations
batch_size = 1000
# Enable mixed precision (FP16/FP32)
mixed_precision = true
# Number of GPUs to use (0 = all available)
gpu_count = 0
[gpu.replica_selection]
# Enable GPU-accelerated replica selection
enabled = true
# Feature weights for scoring
latency_weight = 0.3
connection_weight = 0.2
lag_weight = 0.2
cpu_weight = 0.1
memory_weight = 0.1
success_rate_weight = 0.1
[gpu.load_forecasting]
# Enable GPU-accelerated load forecasting
enabled = true
# History size for time series
history_size = 100
# Forecast horizon
forecast_steps = 10
```
### Programmatic Configuration
```rust
use oxirs_cluster::gpu_acceleration::{
GpuAcceleratedCluster,
GpuConfig,
GpuBackend,
};
let config = GpuConfig {
backend: GpuBackend::CUDA,
memory_limit_mb: 8192,
batch_size: 1000,
mixed_precision: true,
device_id: 0,
};
let cluster = GpuAcceleratedCluster::with_config(config).await?;
```
---
## Performance Tuning
### Optimal Batch Sizes
GPU operations have overhead. Use these guidelines for batch sizing:
| Replica Selection | 50-500 | 10+ replicas |
| Load Forecasting | 100-1000 | 24+ datapoints |
| Compression | 1MB-10MB | 1MB+ data |
| Hash Computation | 1000-10000 | 100+ items |
```rust
// Adaptive batching based on data size
let batch_size = if data.len() < 1000 {
// Use CPU for small batches
data.len()
} else {
// Use GPU with optimal batch size
1000.min(data.len())
};
```
### Memory Management
**Prevent Out-of-Memory Errors:**
```rust
use scirs2_core::gpu::GpuContext;
let ctx = GpuContext::new()?;
// Check available memory before allocation
let available_mb = ctx.available_memory_mb()?;
if available_mb < required_mb {
return Err(anyhow!("Insufficient GPU memory"));
}
// Allocate with explicit size limits
let buffer = GpuBuffer::with_capacity(&ctx, max_size)?;
```
**Memory Pooling:**
```rust
use scirs2_core::memory::BufferPool;
// Reuse GPU buffers
let pool = BufferPool::new();
let buffer = pool.acquire(size)?;
// ... use buffer ...
// Automatically returned to pool when dropped
```
### Multi-GPU Setup
**Distribute workload across GPUs:**
```rust
use rayon::prelude::*;
let gpu_count = get_gpu_count()?;
let chunks = data.chunks(data.len() / gpu_count);
let results: Vec<_> = chunks
.enumerate()
.par_bridge()
.map(|(gpu_id, chunk)| {
// Process on specific GPU
process_on_gpu(gpu_id, chunk)
})
.collect();
```
### Benchmark Your Setup
```bash
# Run GPU benchmarks
cargo bench --features cuda --bench cluster_benchmarks
# Specific benchmark
cargo bench --features cuda gpu_replica_selection
```
Expected performance (NVIDIA A100):
```
GPU replica selection/10 replicas time: [45.2 µs 46.1 µs 47.3 µs]
GPU replica selection/100 replicas time: [82.5 µs 84.2 µs 86.1 µs]
GPU replica selection/500 replicas time: [201 µs 205 µs 210 µs]
CPU replica selection/10 replicas time: [156 µs 159 µs 163 µs]
CPU replica selection/100 replicas time: [1.85 ms 1.89 ms 1.94 ms]
CPU replica selection/500 replicas time: [9.12 ms 9.34 ms 9.58 ms]
```
Speedup: **10-46x** for replica selection
---
## Troubleshooting
### Issue: "CUDA driver version is insufficient"
**Cause:** Outdated NVIDIA driver
**Solution:**
```bash
# Check required version
cat /usr/local/cuda/version.txt
# Update driver (Ubuntu)
sudo apt-get install --reinstall nvidia-driver-535
# Reboot
sudo reboot
```
### Issue: "No CUDA-capable device detected"
**Diagnostic:**
```bash
nvidia-smi
**Solutions:**
1. Verify GPU is properly seated
2. Check BIOS settings (PCIe config)
3. Install/reinstall NVIDIA driver
4. Verify GPU is not in compute-prohibited mode
### Issue: "Out of memory" errors
**Solutions:**
1. **Reduce batch size:**
```rust
let config = GpuConfig {
batch_size: 500, ..Default::default()
};
```
2. **Enable memory pooling:**
```rust
use scirs2_core::memory::GlobalBufferPool;
let pool = GlobalBufferPool::new();
```
3. **Clear GPU memory explicitly:**
```rust
cluster.clear_gpu_cache().await?;
```
### Issue: "Kernel launch failed"
**Cause:** Invalid grid/block dimensions or corrupted kernel
**Solutions:**
1. Verify CUDA installation: `nvcc --version`
2. Rebuild with clean: `cargo clean && cargo build --features cuda`
3. Check GPU compute capability matches code
4. Verify GPU is not overheating: `nvidia-smi -q -d TEMPERATURE`
### Issue: Metal acceleration not working on macOS
**Diagnostic:**
```bash
# Check Metal support
system_profiler SPDisplaysDataType | grep Metal
# Check for errors
**Solutions:**
1. Update to latest macOS: `softwareupdate -l`
2. Update Xcode Command Line Tools: `xcode-select --install`
3. Rebuild: `cargo clean && cargo build --features metal`
### Issue: Performance is worse with GPU
**Causes:**
- Batch size too small (overhead dominates)
- Data transfer bottleneck
- CPU-bound operations mixed with GPU
**Solutions:**
1. **Increase batch size:**
```rust
if replicas.len() >= 50 {
select_replica_gpu(replicas)
} else {
select_replica_cpu(replicas)
}
```
2. **Profile to find bottleneck:**
```rust
use scirs2_core::profiling::Profiler;
let mut profiler = Profiler::new();
profiler.start("gpu_transfer");
profiler.stop("gpu_transfer");
profiler.start("gpu_compute");
profiler.stop("gpu_compute");
println!("{:#?}", profiler.generate_report());
```
3. **Use pinned memory for transfers:**
```rust
use scirs2_core::memory_efficient::PinnedMemory;
let pinned = PinnedMemory::new(size)?;
```
---
## Performance Comparison
### Replica Selection (500 replicas)
| CPU (Intel Xeon) | 9.34 ms | 1x |
| CPU (Apple M2 Max) | 6.12 ms | 1.5x |
| GPU (NVIDIA T4) | 1.2 ms | 7.8x |
| GPU (NVIDIA A100) | 205 µs | 45.6x |
### Load Forecasting (500 datapoints)
| CPU | 45.2 ms | 1x |
| GPU (T4) | 8.5 ms | 5.3x |
| GPU (A100) | 3.2 ms | 14.1x |
### Compression (10MB data)
| CPU (single-threaded) | 125 ms | 1x |
| CPU (8 threads) | 45 ms | 2.8x |
| GPU (T4) | 18 ms | 6.9x |
| GPU (A100) | 9 ms | 13.9x |
---
## Best Practices
### 1. Automatic Fallback
```rust
let cluster = GpuAcceleratedCluster::new().await?;
// Automatically falls back to CPU if GPU unavailable
let result = cluster.select_best_replica(replicas).await?;
```
### 2. Batch Operations
```rust
// ❌ Bad: Individual operations
for replica in replicas {
process_single(replica).await?;
}
// ✅ Good: Batch processing
process_batch(replicas).await?;
```
### 3. Profile Before Optimizing
```bash
# Benchmark CPU vs GPU
cargo bench --features cuda
# Profile with perf (Linux)
perf record -g cargo run --release --features cuda
perf report
```
### 4. Monitor GPU Utilization
```bash
# Real-time monitoring
watch -n 1 nvidia-smi
# Log to file
nvidia-smi dmon -s mu -o T -f gpu_usage.log
```
### 5. Graceful Degradation
```rust
let gpu_available = check_gpu_availability();
let config = if gpu_available {
ClusterConfig::with_gpu()
} else {
ClusterConfig::with_parallel_cpu()
};
```
---
## Additional Resources
- [NVIDIA CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/)
- [Apple Metal Programming Guide](https://developer.apple.com/metal/)
- [SciRS2 GPU Integration Guide](./SCIRS2_INTEGRATION_GUIDE.md)
- [OxiRS Performance Tuning](./PERFORMANCE_TUNING.md)
---
**Last Updated:** 2026-01-06
**Maintainer:** OxiRS Team