aprender-compute 0.32.0

High-performance SIMD compute library with GPU support, LLM inference engine, and GGUF model loading (was: trueno)
Documentation
# PTX Best Practices

This document covers PTX assembly generation best practices learned from
development and debugging of trueno-gpu CUDA kernels.

## Register Types

### U8 Registers Are Not Supported

**Issue**: PTX does not support 8-bit register types (`.u8`, `.s8`).

**Incorrect**:
```ptx
.reg .u8 %rs<1>;  // ERROR: Invalid register type
ld.global.u8 %rs0, [%rd0];
```

**Correct**:
```ptx
.reg .u16 %rh<1>;  // Minimum register size is 16-bit
ld.global.u8 %rh0, [%rd0];  // Load zero-extends to 16-bit
```

The `ld.global.u8` instruction is valid, but it must store into a 16-bit
or larger register. The loaded byte is zero-extended.

## Half-Precision (F16) Operations

### Loading F16 Values

**Issue**: PTX uses `.b16` (binary 16-bit) for half-precision loads, not `.f16`.

**Incorrect**:
```ptx
ld.global.f16 %h0, [%rd0];  // ERROR: Invalid type for load
```

**Correct**:
```ptx
ld.global.b16 %h0, [%rd0];  // Load 16-bit binary value
```

### F16 to F32 Conversion

**Issue**: Converting from f16 to f32 is exact and does NOT require a rounding modifier.

**Incorrect**:
```ptx
cvt.rn.f32.f16 %f0, %h0;  // ERROR: Illegal rounding modifier
```

**Correct**:
```ptx
cvt.f32.f16 %f0, %h0;  // No rounding needed (exact conversion)
```

Note: The reverse conversion (f32 → f16) DOES require a rounding modifier:
```ptx
cvt.rn.f16.f32 %h0, %f0;  // Correct: rounding needed for narrowing
```

## Bitwise Operations

### AND, OR, XOR Types

**Issue**: PTX requires `.b32` (binary) type for bitwise operations, not `.u32`.

**Incorrect**:
```ptx
and.u32 %r2, %r0, %r1;  // ERROR: Invalid type
or.u32 %r2, %r0, %r1;   // ERROR: Invalid type
```

**Correct**:
```ptx
and.b32 %r2, %r0, %r1;  // Use .b32 for bitwise ops
or.b32 %r2, %r0, %r1;
xor.b32 %r2, %r0, %r1;
```

## Warp Shuffle Operations

### Shuffle Width Parameter

**Issue**: The `width` parameter in `shfl.sync.idx` must be a power of 2 (1, 2, 4, 8, 16, or 32).

**Incorrect**:
```ptx
shfl.sync.idx.b32 %f0, %f1, 0, 31, 0xFFFFFFFF;  // ERROR: 31 is not power of 2
```

**Correct**:
```ptx
shfl.sync.idx.b32 %f0, %f1, 0, 32, 0xFFFFFFFF;  // 32 is valid
```

### Warp Participation

**Issue**: `shfl.sync` with mask `0xFFFFFFFF` requires ALL 32 threads in the warp
to execute the instruction simultaneously.

If some threads exit early (e.g., via `@%p bra exit`), the remaining threads
cannot perform shuffles.

**Solution**: Use address clamping to ensure all threads access valid memory,
then skip only the final store for out-of-bounds threads:

```ptx
// Clamp addresses for all threads
min.u32 %r_clamped_row, %r_global_row, %r_m_minus_1;
min.u32 %r_clamped_col, %r_global_col, %r_n_minus_1;

// All threads participate in computation and shuffles
// ...shuffle reduction code...

// Only in-bounds threads store
@%p_row_oob bra exit;
@%p_col_oob bra exit;
st.global.f32 [%rd_out], %f_result;
exit:
    ret;
```

## Memory Alignment

### 4-Byte Alignment for U32 Loads

**Issue**: `ld.global.u32` requires the address to be 4-byte aligned.

**Incorrect**:
```ptx
// If header has 2-byte f16 scale at offset 0, and we try to read
// another u32 at offset 2, it will be misaligned
add.u64 %rd1, %rd0, 2;
ld.global.u32 %r0, [%rd1];  // ERROR: Misaligned access
```

**Correct**: Use smaller loads for misaligned data:
```ptx
ld.global.b16 %rh0, [%rd0];  // Load 2-byte aligned data
```

## Testing PTX

Always validate generated PTX with `ptxas`:

```bash
ptxas -arch=sm_89 -v kernel.ptx -o kernel.cubin
```

Use `compute-sanitizer` for runtime memory access checking:

```bash
compute-sanitizer --tool memcheck ./your_program
```

## References

- [PTX ISA Reference]https://docs.nvidia.com/cuda/parallel-thread-execution/
- [GitHub Issue #67]https://github.com/paiml/trueno/issues/67 - U8 register bug
- [GitHub Issue #68]https://github.com/paiml/trueno/issues/68 - F16 load/convert bug