# GPU Alignment Bug - FIXED

**Status:** IMPLEMENTED (Fix 1 applied)

---

# GPU Alignment Bug - Original Proposed Fixes

## Bug Summary

**Location:** `src/render/gpu/memory.rs:308`
**Symptom:** `cast_slice>TargetAlignmentGreaterAndInputNotAligned` panic
**Cause:** The `BufferView` returned by wgpu's buffer mapping is not guaranteed to start at an address aligned for the target type `T`
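
To make the failure condition concrete, the following std-only sketch checks whether a byte slice starts at an address suitable for a target type. `Aligned16`, `is_aligned_for`, and `alignment_demo` are hypothetical names used for illustration only; they are not part of ruviz, wgpu, or bytemuck.

```rust
use std::mem::align_of;

/// Hypothetical 16-byte-aligned backing store, so slice offsets map
/// directly onto alignment (offset 0 is aligned, offset 1 is not).
#[repr(align(16))]
struct Aligned16([u8; 32]);

/// Returns true if `bytes` starts at an address suitable for `T`.
fn is_aligned_for<T>(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % align_of::<T>() == 0
}

/// Checks the same 16 bytes of payload at two different starting offsets,
/// using `f32` as the target type.
fn alignment_demo() -> (bool, bool) {
    let store = Aligned16([0u8; 32]);
    let at_offset_0 = is_aligned_for::<f32>(&store.0[..16]);
    let at_offset_1 = is_aligned_for::<f32>(&store.0[1..17]);
    (at_offset_0, at_offset_1)
}
```

A slice for which this check fails is the kind of input that makes a strict cast to `&[T]` panic.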

---

## Fix 1: Use `try_cast_slice` with Manual Copy Fallback (Recommended)

**Approach:** Try an in-place cast first; fall back to an element-by-element unaligned read if the mapped data is misaligned.

```rust
// In memory.rs, replace line 308:

// OLD (panics on unaligned):
let result_data: Vec<T> = cast_slice(&mapped_data[..element_count * element_size]).to_vec();

// NEW (safe with fallback):
let byte_slice = &mapped_data[..element_count * element_size];
let result_data: Vec<T> = match bytemuck::try_cast_slice::<u8, T>(byte_slice) {
    Ok(aligned_slice) => {
        // Fast path: data is aligned, so cast in place and copy out once
        aligned_slice.to_vec()
    }
    Err(_) => {
        // Slow path: unaligned, so read one element at a time.
        // `pod_read_unaligned` accepts bytes at any address. A `[u8; N]`
        // scratch array passed to `from_bytes` would only be 1-byte aligned
        // and could hit the same panic (and `size_of::<T>()` cannot be used
        // as an array length for a generic `T`).
        byte_slice
            .chunks_exact(element_size)
            .map(|chunk| bytemuck::pod_read_unaligned::<T>(chunk))
            .collect()
    }
};
```

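**Pros:**
- One bulk copy when aligned (common case); the cast itself is zero-copy
- Safe fallback when unaligned
- No API changes required

**Cons:**
- Slower fallback path
- Adds a second code path to test (no extra trait bound is needed: `Pod` already implies `Copy`, and therefore `Clone`)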
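
The core idea of the fallback is to decode each element from its bytes without ever forming a reference to misaligned memory. It can be sketched with the standard library alone for one concrete element type. `read_f32_unaligned` is a hypothetical helper, not part of the crate, and it assumes native-endian `f32` data; the generic fix relies on bytemuck instead.

```rust
/// Hypothetical helper: decode native-endian `f32` values from a byte slice.
/// There is no alignment requirement on `bytes`.
fn read_f32_unaligned(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4) // trailing bytes that do not fill an f32 are ignored
        .map(|chunk| f32::from_ne_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
        .collect()
}
```

Because each value is rebuilt from a by-value `[u8; 4]`, the address of the source bytes does not affect the result.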

---

## Fix 2: Pre-aligned Buffer Allocation

**Approach:** Ensure staging buffer is aligned at creation time.

```rust
// In memory.rs, modify create_buffer_empty_bytes:

pub fn create_buffer_empty_bytes(
    &self,
    size: u64,
    usage: wgpu::BufferUsages,
    label: Option<&str>,
) -> Result<GpuBuffer> {
    // Round the size up to a multiple of the largest expected type alignment (16 bytes for SIMD)
    const MAX_ALIGN: u64 = 16;
    let aligned_size = ((size + MAX_ALIGN - 1) / MAX_ALIGN) * MAX_ALIGN;

    let buffer = self.device.create_buffer(&wgpu::BufferDescriptor {
        label,
        size: aligned_size,
        usage,
        mapped_at_creation: false,
    });

    Ok(GpuBuffer::new_from_raw(buffer, aligned_size, usage, label))
}
```

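**Pros:**
- Changes only buffer creation; the read path stays as is
- No runtime overhead

**Cons:**
- May waste some memory (up to 15 bytes per buffer)
- Rounding up the size does not change the address at which the buffer is mapped, so the mapped region may still be misaligned (GPU driver dependent)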
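
The rounding arithmetic and the worst-case waste quoted above can be checked in isolation. `align_up` and `padding` are hypothetical helper names introduced for this sketch. They operate on the buffer length only and say nothing about the address of the mapped memory.

```rust
/// Round `size` up to the next multiple of `align` (`align` must be non-zero).
fn align_up(size: u64, align: u64) -> u64 {
    ((size + align - 1) / align) * align
}

/// Bytes of padding added by rounding `size` up to a multiple of `align`.
fn padding(size: u64, align: u64) -> u64 {
    align_up(size, align) - size
}
```

With a 16-byte alignment, a 17-byte request becomes 32 bytes (15 bytes of padding), while a 16-byte request is left unchanged.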

---

## Fix 3: Use `bytemuck::allocation` with Aligned Vec

**Approach:** Copy to an aligned Vec before casting.

```rust
// Add dependency: bytemuck = { version = "1.23", features = ["extern_crate_alloc"] }

use bytemuck::allocation::zeroed_vec;

// In read_buffer, replace the problematic cast:
let byte_slice = &mapped_data[..element_count * element_size];

// Allocate aligned destination
let mut result_data: Vec<T> = zeroed_vec(element_count);

// Copy bytes (handles alignment automatically)
let dest_bytes: &mut [u8] = bytemuck::cast_slice_mut(&mut result_data);
dest_bytes.copy_from_slice(byte_slice);
```

**Pros:**
- Simple and clean
- Destination is a `Vec<T>`, so its alignment is guaranteed
- Uses bytemuck's allocation features

**Cons:**
- Zero-fills the destination before the copy overwrites it
- Requires `extern_crate_alloc` feature
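
For comparison, the same copy-into-an-aligned-destination technique can be written with the standard library alone for a concrete element type. This is only a sketch: `copy_into_aligned_f32` is a hypothetical name, and it needs `unsafe` where the bytemuck version above does not.

```rust
/// Hypothetical helper: copy raw bytes into a freshly allocated `Vec<f32>`.
/// The destination is allocated as `f32`, so it is aligned by construction;
/// the source may start at any address. Trailing bytes that do not fill an
/// `f32` are ignored.
fn copy_into_aligned_f32(bytes: &[u8]) -> Vec<f32> {
    let elem = std::mem::size_of::<f32>();
    let count = bytes.len() / elem;
    let mut out = vec![0.0f32; count];
    // SAFETY: `out` owns `count * elem` writable bytes, `bytes` has at least
    // that many readable bytes, the two allocations do not overlap, and every
    // bit pattern is a valid `f32`.
    unsafe {
        std::ptr::copy_nonoverlapping(
            bytes.as_ptr(),
            out.as_mut_ptr() as *mut u8,
            count * elem,
        );
    }
    out
}
```

The byte-wise copy never reads the source through a `*const f32`, which is why the source alignment does not matter.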

---

## Fix 4: Offset-based Aligned Access

**Approach:** Find aligned offset within the buffer and adjust copy accordingly.

```rust
// In read_buffer:
let mapped_data = buffer_slice.get_mapped_range();
let byte_len = element_count * element_size;
let align = std::mem::align_of::<T>();

// Check whether the mapped pointer is already aligned for T
let misalignment = (mapped_data.as_ptr() as usize) % align;

// Bind the result outside both branches so it stays in scope afterwards
let result_data: Vec<T> = if misalignment != 0 {
    // Unaligned: copy into a scratch buffer with `align` spare bytes,
    // starting at the first aligned address inside it
    let mut aligned_buffer: Vec<u8> = vec![0u8; byte_len + align];
    let base = aligned_buffer.as_mut_ptr();
    let pad = (align - (base as usize) % align) % align;

    unsafe {
        // SAFETY: pad < align, so the destination range fits in aligned_buffer;
        // the source holds byte_len readable bytes and does not overlap it.
        let aligned_start = base.add(pad);
        std::ptr::copy_nonoverlapping(mapped_data.as_ptr(), aligned_start, byte_len);

        // SAFETY: aligned_start is aligned for T, the bytes are initialized,
        // and T: Pod means every bit pattern is a valid T.
        std::slice::from_raw_parts(aligned_start as *const T, element_count).to_vec()
    }
} else {
    // Already aligned: cast in place, then copy out
    cast_slice(&mapped_data[..byte_len]).to_vec()
};
```

**Pros:**
- No scratch buffer when already aligned (the cast itself is zero-copy)
- Full control over memory layout

**Cons:**
- Complex and error-prone
- Uses unsafe code
- Hard to maintain
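
The offset computation at the heart of this approach is plain modular arithmetic and can be checked on its own. `padding_to_align` is a hypothetical helper name; the addresses in the usage note are made-up values chosen for illustration.

```rust
/// Bytes to skip from `addr` to reach the next multiple of `align`.
/// `align` must be non-zero (Rust alignments are always powers of two).
fn padding_to_align(addr: usize, align: usize) -> usize {
    (align - (addr % align)) % align
}
```

For example, an address of `0x1001` needs 3 bytes of padding to reach a 4-byte boundary, while `0x1000` needs none.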

---

## Fix 5: Platform-specific Mapping Hints

**Approach:** Use wgpu features to request aligned mapping.

```rust
// When creating staging buffer, use mapped_at_creation with proper setup.
// `aligned_size` is the requested size rounded up as in Fix 2.
let staging_buffer = self.device.create_buffer(&wgpu::BufferDescriptor {
    label: Some("GPU Readback Staging (Aligned)"),
    size: aligned_size,
    usage: wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
    mapped_at_creation: true,  // Map at creation for alignment control
});

// Get mapped data immediately (should be aligned)
{
    let mapped = staging_buffer.slice(..).get_mapped_range_mut();
    // The buffer is now mapped and should be aligned
}
staging_buffer.unmap();

// Later, when reading:
// Re-map and use - the address should be consistent
```

**Pros:**
- Works with GPU driver's alignment
- No extra copies in ideal case

**Cons:**
- Platform-dependent behavior
- May not solve the issue on all drivers

---

## Recommended Implementation

**Use Fix 1** as the primary solution because:
1. It's safe and handles all cases
2. No added cost over the original cast when aligned (common case)
3. Minimal code changes
4. No new dependencies (bytemuck is already in use)

Here's the complete implementation:

```rust
// src/render/gpu/memory.rs

/// Read data back from GPU buffer (alignment-safe version)
pub fn read_buffer<T: Pod>(&self, buffer: &GpuBuffer) -> GpuResult<Vec<T>> {
    if !buffer.usage().contains(wgpu::BufferUsages::COPY_SRC) {
        return Err(GpuError::OperationFailed(
            "Buffer was not created with COPY_SRC usage".to_string(),
        ));
    }

    let element_size = std::mem::size_of::<T>();
    let element_count = (buffer.size() as usize) / element_size;

    // Create staging buffer for readback
    let staging_buffer = self
        .create_buffer_empty_bytes(
            buffer.size(),
            wgpu::BufferUsages::MAP_READ | wgpu::BufferUsages::COPY_DST,
            Some("GPU Readback Staging"),
        )
        .map_err(|e| GpuError::BufferCreationFailed(format!("{}", e)))?;

    // Copy from GPU buffer to staging buffer
    let mut encoder = self
        .device
        .create_command_encoder(&wgpu::CommandEncoderDescriptor {
            label: Some("GPU Buffer Copy"),
        });

    encoder.copy_buffer_to_buffer(
        buffer.buffer(),
        0,
        staging_buffer.buffer(),
        0,
        buffer.size(),
    );

    let submission = self.queue.submit(Some(encoder.finish()));

    // Map and read staging buffer
    let buffer_slice = staging_buffer.buffer().slice(..);
    let (sender, receiver) = futures_intrusive::channel::shared::oneshot_channel();
    buffer_slice.map_async(wgpu::MapMode::Read, move |result| {
        sender.send(result).ok();
    });

    // Wait for mapping to complete
    self.device
        .poll(wgpu::Maintain::WaitForSubmissionIndex(submission));

    pollster::block_on(receiver.receive())
        .ok_or_else(|| GpuError::OperationFailed("Buffer mapping failed".to_string()))?
        .map_err(|e| GpuError::OperationFailed(format!("Buffer mapping error: {:?}", e)))?;

    // Copy data with alignment-safe method
    let mapped_data = buffer_slice.get_mapped_range();
    let byte_slice = &mapped_data[..element_count * element_size];

    let result_data = Self::cast_slice_safe::<T>(byte_slice, element_count);

    // Unmap buffer
    drop(mapped_data);
    staging_buffer.buffer().unmap();

    Ok(result_data)
}

/// Alignment-safe slice casting with fallback
fn cast_slice_safe<T: Pod>(bytes: &[u8], element_count: usize) -> Vec<T> {
    let element_size = std::mem::size_of::<T>();

    // Fast path: the input is aligned, so cast in place and copy out once
    if let Ok(aligned) = bytemuck::try_cast_slice::<u8, T>(bytes) {
        return aligned.to_vec();
    }

    // Fallback: read each element with an unaligned load.
    // `pod_read_unaligned` has no alignment requirement on its input, unlike
    // `from_bytes`, which panics when the source bytes are misaligned for T.
    // A `Vec<u8>` scratch buffer is only guaranteed 1-byte alignment, so it
    // cannot be relied on to satisfy `from_bytes`.
    let mut result = Vec::with_capacity(element_count);

    for i in 0..element_count {
        let offset = i * element_size;
        let element_bytes = &bytes[offset..offset + element_size];
        result.push(bytemuck::pod_read_unaligned::<T>(element_bytes));
    }

    result
}
```

---

## Testing the Fix

```rust
#[cfg(test)]
mod alignment_tests {
    use super::*;

    #[test]
    fn test_cast_slice_safe_aligned() {
        // Aligned data
        let data: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
        let bytes: &[u8] = bytemuck::cast_slice(&data);

        let result = GpuMemoryPool::cast_slice_safe::<f32>(bytes, 4);
        assert_eq!(result, vec![1.0, 2.0, 3.0, 4.0]);
    }

    #[test]
    fn test_cast_slice_safe_unaligned() {
        // Simulate unaligned by adding offset byte
        let mut bytes = vec![0u8]; // 1 byte offset
        bytes.extend_from_slice(bytemuck::cast_slice::<f32, u8>(&[1.0f32, 2.0, 3.0]));

        // Slice from offset 1 (unaligned for f32)
        let unaligned = &bytes[1..];

        // Should work via fallback path
        let result = GpuMemoryPool::cast_slice_safe::<f32>(unaligned, 3);
        assert_eq!(result, vec![1.0, 2.0, 3.0]);
    }
}
```
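
The unaligned test above relies on the `Vec<u8>` allocation itself starting at a 4-byte boundary, which allocators typically provide but `Vec<u8>` does not guarantee. A fixture with an explicit alignment removes that assumption. The sketch below is std-only; `MisalignedF32s` is a hypothetical name, and the intended use is to pass `fixture.bytes()` to `cast_slice_safe::<f32>` in place of the offset `Vec<u8>` slice.

```rust
/// Hypothetical test fixture: a 16-byte-aligned store whose payload begins at
/// byte offset 1, so the payload is misaligned for any type with align > 1.
/// `repr(C)` pins `raw` at offset 0 of the struct.
#[repr(C, align(16))]
struct MisalignedF32s {
    raw: [u8; 32],
    len: usize, // payload length in bytes
}

impl MisalignedF32s {
    /// Packs up to 7 native-endian values after a single padding byte.
    fn new(values: &[f32]) -> Self {
        assert!(values.len() <= 7, "fixture holds at most 7 f32 values");
        let mut raw = [0u8; 32];
        for (i, v) in values.iter().enumerate() {
            let start = 1 + i * 4;
            raw[start..start + 4].copy_from_slice(&v.to_ne_bytes());
        }
        Self { raw, len: values.len() * 4 }
    }

    /// The payload bytes, starting one byte past a 16-byte boundary.
    fn bytes(&self) -> &[u8] {
        &self.raw[1..1 + self.len]
    }
}
```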